It’s common for infrastructure engineers to put their web servers in a private subnet to protect them from the outside world. An ELB sits in a public subnet and acts as a middleman for the traffic headed to the web servers. The web servers’ responses return to the browser through the HTTP connection the ELB established. If the web servers were in a public subnet, but still had an ELB in front of them, the interaction wouldn’t be much different.
But there’s an important aspect to keep in mind: if the web servers in a private subnet have to initiate a request (to S3, for instance), their request will have to be routed to the outside world through a NAT. And that can have a big impact on performance.
We were fortunate to have the ability to run some significant simulations and compare the performance of “public” web servers and “private” web servers. Our use case requires us to download a file from S3, decrypt it, and return it to the browser. We collect metrics all over the place, but the one we’re focused on most in this simulation is what we call localization time: the time it takes us to issue a GET request to S3 and save the file to disk.
It’s fairly well known that Amazon doesn’t guarantee that every file will be downloaded lickety-split from S3. We know that sometimes, some files will appear to come to us slowly. So we track that: how often does it take more than 1 second to localize a file?
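The localization metric described above can be sketched as a simple timing wrapper. This is a minimal illustration, not our production code: the names (`localize`, `count_slow`, `SLOW_THRESHOLD_S`) are invented for this sketch, and the injected `fetch` callable stands in for the real S3 GET (which in practice would go through something like boto3).

```python
import time

# Hypothetical threshold from the article: a localization over 1 second is "slow".
SLOW_THRESHOLD_S = 1.0

def localize(fetch, path):
    """Time a single localization: fetch the bytes and save them to disk.

    `fetch` is a stand-in for the real S3 GET request; injecting it keeps
    this sketch self-contained and testable without network access.
    Returns the elapsed wall-clock time in seconds.
    """
    start = time.monotonic()
    data = fetch()
    with open(path, "wb") as f:
        f.write(data)
    return time.monotonic() - start

def count_slow(durations, threshold=SLOW_THRESHOLD_S):
    """How many localizations exceeded the threshold?"""
    return sum(1 for d in durations if d > threshold)
```

The point of tracking a threshold count rather than only an average is that it surfaces the long tail: a handful of multi-second downloads can hide inside a healthy-looking mean.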
Notice in Figure 1 that the public subnet frequently has intervals in which no file took more than 1 second to download, while in the private subnet such slow downloads occur almost constantly.
Here in Figure 2 we can see that the average amount of time it takes to localize a file is noticeably different between the private and public subnets. First, there is significantly more variation in localization times in the private subnet than in the public subnet. Second, if we trimmed off the outliers from the private subnet, its localization times would still be higher than in the public subnet.
Now, does this matter?
To us, it’s an emphatic yes! Our bread-and-butter is all about serving up files that live in S3 (which for security reasons we can’t expose directly to our users). These delays are more than we are able to tolerate. To you, the answer might be no. Either way, it is valuable to know that your NAT server can introduce noticeable performance degradations.
EDIT: March 24, 2015 at 1:30pm
Some folks on reddit brought up the question of how the NAT server itself is performing. It’s a fantastic question. I have a two-fold response.
First, that’s pretty much the question I hope people ask when they read this article … not just about our NAT server, but more importantly, theirs. The NAT server(s) need just as much TLC as any other server. The fact that AWS offers to set it up for you increases the likelihood that you’re going to forget about it.
Second, I totally should have put our NAT server stats in here. So I’m adding them now. Our NAT servers are m3.large instances.
It’s clear from this graph that our NAT servers are able to support considerably more traffic than our simulation is creating. So it’s not network throughput that’s slowing us down.
So how about CPU load? The 1-minute CPU load average didn’t exceed 0.25 during the time the simulation was running.
And memory usage? Memory usage stayed static at 750 megabytes used per box, with 3 gigabytes free.
No disk activity was happening during the simulation. So iowait won’t show us anything useful; neither will inode usage, or any other disk measurement for that matter.
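The checks above (1-minute load average, memory usage) can be scripted directly on the NAT box. This is a minimal standard-library sketch, not what we actually ran: the function names and the load threshold are invented for illustration, and the `/proc/meminfo` parsing assumes a Linux host.

```python
import os

def load_ok(max_load=0.25):
    """Return the 1-minute load average and whether it is under a threshold.

    os.getloadavg() reads the kernel's load averages (Unix only). The
    0.25 default mirrors the value observed during our simulation.
    """
    one_min, _, _ = os.getloadavg()
    return one_min, one_min <= max_load

def meminfo_mb():
    """Parse MemTotal and MemFree out of /proc/meminfo (Linux only).

    Values in /proc/meminfo are reported in kB; convert to megabytes.
    """
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])
    return fields["MemTotal"] // 1024, fields["MemFree"] // 1024
```

In practice you would feed these numbers into whatever monitoring you already have (CloudWatch, Graphite, etc.) rather than polling them by hand; the point is simply that the NAT instance deserves the same instrumentation as the servers behind it.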
So, in summary …
- CPU, not the culprit
- Memory, not the culprit
- Disk, not the culprit
- On-NAT Networking, not the culprit