The other day I got halfway through writing a very irate support ticket to AWS, stopped to do some fact checking, and learned something deeply annoying.
One of the teams I work with manages a bunch of services. One of these “services” is some common Amazon Route53 infrastructure that is set up using AWS CloudFormation. Over the history of the project, the deployment in the development account the team uses has been a little flaky. Each time we hit a deployment problem, the cause turned out to be rate limiting. It never happened in production, so the flakiness didn’t quite get the attention that it could have.
Rate limiting in Route53
Some background: Route 53 has a hard limit of 5 requests per second per AWS account. For most folks, this is fine. However, it wasn’t working well for this team.
The team raised support tickets and got advice like “CloudFormation will attempt to create your resources in parallel. One option to avoid rate limiting is to add DependsOn links to serialize the resource creation.” We weren’t super happy with that answer, but we needed something in the interim. Granted, there is a CloudFormation roadmap item to fix this properly, and in the meantime the workaround worked … mostly.
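To make the DependsOn advice concrete, here is a minimal sketch of what serializing the resources means: each record set depends on the one before it, so CloudFormation has to create them one at a time. The record names and values are made up for illustration, and the helper serialize_with_depends_on is hypothetical, not something from our actual stack.

```python
import json


def serialize_with_depends_on(resources):
    """Return a copy of `resources` where each resource depends on the previous one."""
    chained = {}
    previous = None
    for logical_id, resource in resources.items():
        resource = dict(resource)  # shallow copy so the input is left untouched
        if previous is not None:
            # DependsOn is a resource attribute, a sibling of Type and Properties.
            resource["DependsOn"] = previous
        chained[logical_id] = resource
        previous = logical_id
    return chained


# Hypothetical record sets; every name and value here is invented for the example.
records = {
    name: {
        "Type": "AWS::Route53::RecordSet",
        "Properties": {
            "HostedZoneName": "example.com.",
            "Name": f"{name.lower()}.example.com.",
            "Type": "CNAME",
            "TTL": "300",
            "ResourceRecords": ["target.example.com."],
        },
    }
    for name in ("Api", "Web", "Docs")
}

template = {"Resources": serialize_with_depends_on(records)}
print(json.dumps(template, indent=2))
```

The first resource has no DependsOn, and every later one points at its predecessor. That removes the parallelism inside one stack, but it does nothing about other callers sharing the same account-wide limit.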
On this day a deployment had failed again, and after chasing the usual suspects and making sure that the resources were properly serialized, I got very irate. I was halfway through writing a support ticket:
We are still encountering Route53 rate limits and our CloudFormation stack deployment / updates are intermittently failing, sometimes after only two resources are created. There are no Route53 API calls being made by our applications, only through CloudFormation.
We are quite frustrated at this point and would like to request a session with a solution architect to help us understand how we should be doing this and …
… and I paused.
Check yourself before you wreck yourself
“There is no point in using the word ‘impossible’ to describe something that has clearly happened.” — Douglas Adams, Dirk Gently’s Holistic Detective Agency
“Is it true that there are no other Route53 API calls being made?” I asked myself. A quick jaunt into AWS CloudTrail told me the answer, and also opened a gaping pit beneath my feet.
There were 436 Route53 API calls made in the 2-minute period surrounding our CloudFormation failure. If you do the math, that’s 3.6 requests per second on average, so it’s not at all surprising that we maybe tipped over the limit of 5 at some point in there.
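The arithmetic is worth spelling out, because an average below the limit says nothing about bursts. Below is a small sketch: the first part is the calculation from the real numbers above, and the second part uses made-up timestamps (not our CloudTrail data) to show how a window can average under 5 requests per second and still exceed 5 in a single second.

```python
from collections import Counter
from datetime import datetime, timedelta

LIMIT_PER_SECOND = 5  # Route 53 API limit per account, as described above


def average_rate(call_count, window_seconds):
    """Average number of calls per second over the window."""
    return call_count / window_seconds


def peak_per_second(timestamps):
    """Largest number of calls that landed in the same one-second bucket."""
    buckets = Counter(ts.replace(microsecond=0) for ts in timestamps)
    return max(buckets.values(), default=0)


# The real numbers: 436 calls in a 2-minute window.
print(round(average_rate(436, 120), 2))  # 3.63

# Made-up timestamps: a steady 3 calls per second for 10 seconds,
# plus a burst of 4 extra calls that all land in second number 4.
start = datetime(2000, 1, 1)
steady = [start + timedelta(seconds=s) for s in range(10) for _ in range(3)]
burst = [start + timedelta(seconds=4, milliseconds=100 * i) for i in range(4)]
timestamps = steady + burst

print(average_rate(len(timestamps), 10))  # 3.4, comfortably under the limit
print(peak_per_second(timestamps))        # 7, over the limit
```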
“But where are these coming from?” was my immediate question, and it was immediately answered.
Virtually all of these requests were being made by an EC2 instance that was part of an Amazon Elastic Kubernetes Service (EKS) cluster.
The production account doesn’t have the EKS cluster, so it’s not overwhelmed with Route53 API calls, which explains why deployment never failed there.
I wanted to decommission the cluster immediately, but unfortunately some teams still need it, so I wasn’t able to.
The source of the calls turned out to be external-dns, a controller running in the cluster that manages DNS records. The external-dns documentation says that one workaround for the issue of the controller eating your entire Route53 request budget is to extend the interval that the controller reconciliation loop runs at. In this particular cluster, the reconciliation loop was running every minute (the default!) to reconcile a set of records that change approximately never. I followed the instructions, set the interval to a week, and settled in to see what happened.
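For reference, here is a rough sketch of how that setting can be expressed. As far as I know, external-dns takes it as an --interval flag in Go duration syntax, which has no unit for days or weeks, so a week has to be written in hours. The --source and --provider values below are assumptions about a typical AWS setup, not a copy of this cluster's configuration, and both helper functions are hypothetical.

```python
from datetime import timedelta


def go_duration_hours(delta):
    """Format a timedelta as whole hours in Go duration syntax, e.g. '168h'."""
    total_seconds = int(delta.total_seconds())
    if total_seconds <= 0 or total_seconds % 3600:
        raise ValueError("expected a positive whole number of hours")
    return f"{total_seconds // 3600}h"


def external_dns_args(interval):
    """Build container arguments for external-dns with a custom sync interval."""
    return [
        "--source=service",  # assumption: records come from Service resources
        "--provider=aws",    # assumption: Route 53 is the DNS provider
        f"--interval={go_duration_hours(interval)}",
    ]


print(external_dns_args(timedelta(weeks=1)))
```

The list would go into the args of the external-dns container. A week is probably too long for clusters whose records change regularly; it only made sense here because these records change approximately never.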
The first thing I noticed was that the calls to Route53 stopped immediately. Not surprising, but great to get confirmation. Several hours after the change, there were still no calls from the previously misbehaving cluster.
All is well now, and I get to put away my detective hat for another day.
“The light works,” he said, indicating the window, “the gravity works,” he said, dropping a pencil on the floor. “Anything else we have to take our chances with.” — Douglas Adams, Dirk Gently’s Holistic Detective Agency
What I learned
First, CloudTrail was instrumental here. I’m still a novice, but I’m learning how powerful a tool it is. Once I knew what to look for, it was immediately obvious what the source of the rate limiting was. The events in CloudTrail identified the EC2 instance and even made it clear that the source of the requests was in an EKS cluster.
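If you want to do the same kind of digging, the core of it is just grouping events by caller. Here is a minimal sketch that assumes you have already exported CloudTrail records as a list of dictionaries. The field names (eventSource, eventName, userIdentity.arn) follow the CloudTrail record format as I understand it, while the sample records, account ID, role names, and instance ID are all invented for illustration.

```python
from collections import Counter


def route53_calls_by_caller(records):
    """Count Route 53 API calls per calling identity, busiest caller first."""
    counts = Counter()
    for record in records:
        if record.get("eventSource") != "route53.amazonaws.com":
            continue  # ignore calls to other services
        caller = record.get("userIdentity", {}).get("arn", "unknown")
        counts[caller] += 1
    return counts.most_common()


# Invented sample records; none of these identifiers are real.
NODE = "arn:aws:sts::123456789012:assumed-role/example-node-role/i-0123456789abcdef0"
DEPLOY = "arn:aws:sts::123456789012:assumed-role/example-deploy-role/cloudformation"

sample = (
    [{"eventSource": "route53.amazonaws.com",
      "eventName": "ListResourceRecordSets",
      "userIdentity": {"arn": NODE}}] * 5
    + [{"eventSource": "route53.amazonaws.com",
        "eventName": "ChangeResourceRecordSets",
        "userIdentity": {"arn": DEPLOY}}] * 2
    + [{"eventSource": "ec2.amazonaws.com",
        "eventName": "DescribeInstances",
        "userIdentity": {"arn": NODE}}]
)

for caller, count in route53_calls_by_caller(sample):
    print(count, caller)
```

For calls made with an instance role, the last segment of the assumed-role ARN is typically the EC2 instance ID, which is what makes the offending instance easy to spot.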
Second, I was reminded that Kubernetes is not a get-out-of-ops-free card. There is a lot of expertise involved in running Kubernetes well, even when you’re using a managed service like EKS. I knew this before, but this was an example of a cluster I didn’t even know existed (don’t worry: someone more responsible did know!) having side effects way outside its scope.
Got any detective stories you’d like to share? Comment below, I’d love to hear them!