Building an RPM repository in AWS (Part 2)

This is the second installment of a multi-part series. You can start at the beginning with Building an RPM repository in AWS.

What just happened? Turns out you were relying on a repository that pruned really old packages. And they just pruned one that your application relies on. So what are you going to do? Try like hell to find it somewhere, anywhere, please-God-let-me-be-able-to-find-it. Good luck with that.

Now, we’re going to talk about how you can prevent this from happening again.

Let’s outline our guiding principles, and then describe an architecture that will help us make those principles real.

  • I will host locally all packages that my infrastructure and application rely on. I will not rely on someone else to keep my packages available.
  • I want to support having more than one repository server, all with the exact same packages on them. (This might be useful, for instance, if you have some web servers in the US and others in the UK. In this case, you can have a repository server in the US and an identical one in the UK.)
  • I want to be able to publish new packages. This includes updated custom packages as well as updated base packages.
  • My repository servers will be built according to the same fundamental principles as all other servers in my infrastructure. This means using CloudFormation, Chef, etc.

With those in mind, let’s take a look at a high-level architecture that will deliver a usable solution.

An architecture diagram
A diagram that shows one way to architect an infrastructure in support of an AWS-hosted RPM server.

Let’s walk through this.

  • First, you’re going to need to download all the packages from a remote repository. (Think rsync.) They’ll live (only temporarily) on a development server.
  • Second, let’s upload all those packages to an S3 bucket. Why there? First, I like the storage reliability that S3 provides. Second, when we have updated packages, we’ll already have a place to put them. We could keep all of our packages in EBS, but I’d be afraid of accidentally deleting the volume. To increase durability, we could back up the EBS volume into a new image. But then, each time we updated our packages, we’d have to make a new image.
  • Third, let’s update our CloudFormation stacks and templates (not shown in diagram) to build out our new repo servers. As part of the build, we should download the files from S3 and create the repository.
  • Fourth, we’ll modify our Chef scripts to tell our other servers to use the new, local repository server.
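
The four steps above can be sketched as a single script. It’s shown here in dry-run form: the `run` wrapper just prints each command instead of executing it (drop the wrapper to run it for real), and the bucket name, repo id, and paths are all assumptions to replace with your own.

```shell
# Dry-run sketch of the mirror-and-publish flow described above.
run() { echo "+ $*"; }

BUCKET="my-rpm-mirror"            # hypothetical S3 bucket
REPO_ID="base"                    # hypothetical yum repo id
MIRROR_DIR="/tmp/repo-mirror"
SERVE_DIR="/var/www/html/repo"

# 1. Mirror every package from the upstream repository onto a dev box.
run reposync --repoid="$REPO_ID" --download_path="$MIRROR_DIR"

# 2. Upload the packages to S3 for durable storage.
run aws s3 sync "$MIRROR_DIR/$REPO_ID" "s3://$BUCKET/$REPO_ID/"

# 3. On each new repo server (e.g. from CloudFormation user data),
#    pull the packages back down and build the repo metadata.
run aws s3 sync "s3://$BUCKET/$REPO_ID/" "$SERVE_DIR/"
run createrepo "$SERVE_DIR/"
```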

In the next installment, we’ll dive into the details and share some scripts to help make all this possible. To be continued.


AWS and NAT: How private subnets can affect your app’s performance

It’s common for infrastructure engineers to put their web servers in a private subnet to protect them from the outside world. An ELB sits in a public subnet and acts as a middleman for traffic headed to the web servers. The web servers’ responses return to the browser through the HTTP connection the ELB established. If the web servers were in a public subnet, but still had an ELB in front of them, the interaction wouldn’t be much different.

But there’s an important aspect to keep in mind: if the web servers in a private subnet have to initiate a request (to S3, for instance), their request will have to be routed to the outside world through a NAT. And that can have a big impact on performance.
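
The difference comes down to a single route. In a hypothetical CloudFormation fragment (all resource names here are made up), the only thing separating a “public” subnet from a “private” one is where the default route points: the internet gateway, or a NAT instance that every outbound request must funnel through.

```json
{
  "PublicDefaultRoute": {
    "Type": "AWS::EC2::Route",
    "Properties": {
      "RouteTableId": { "Ref": "PublicRouteTable" },
      "DestinationCidrBlock": "0.0.0.0/0",
      "GatewayId": { "Ref": "InternetGateway" }
    }
  },
  "PrivateDefaultRoute": {
    "Type": "AWS::EC2::Route",
    "Properties": {
      "RouteTableId": { "Ref": "PrivateRouteTable" },
      "DestinationCidrBlock": "0.0.0.0/0",
      "InstanceId": { "Ref": "NatInstance" }
    }
  }
}
```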

We were fortunate to have the ability to run some significant simulations and compare the performance of “public” web servers and “private” web servers. Our use case requires us to download a file from S3, decrypt it, and return it to the browser. We collect metrics all over the place, but the one we’re focused on most in this simulation is what we call localization time: the time it takes us to issue a GET request to S3 and save the file to disk.

It’s fairly well known that Amazon doesn’t guarantee that every file will be downloaded lickety-split from S3. We know that sometimes, some files will appear to come to us slowly. So we track that: how often does it take more than 1 second to localize a file?
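
That check is easy to reproduce. Here’s a sketch using curl’s built-in timing; the S3 URL is hypothetical, and the numbers below are sample data standing in for real measurements.

```shell
# One real measurement would look like this (URL is hypothetical):
# curl -o /dev/null -s -w '%{time_total}\n' \
#   "https://my-bucket.s3.amazonaws.com/some-file" >> times.log

# Sample localization times, in seconds, standing in for real data:
printf '%s\n' 0.42 1.73 0.88 2.10 0.95 > times.log

# How often did localization take more than 1 second?
awk '$1 > 1 { slow++ } END { printf "%d/%d slow (%.0f%%)\n", slow, NR, 100*slow/NR }' times.log
# -> 2/5 slow (40%)
```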

Graph showing how frequently files localize too slowly
Figure 1: Web servers in a private subnet have more slow localizations than web servers in a public subnet.

Notice in Figure 1 that there are many intervals in which no files took more than 1 second to download in the public subnet, while slow downloads almost always occur in the private subnet.

Graph showing average localization times
Figure 2: Web servers in a private subnet have a higher average localization time than web servers in a public subnet.

Here in Figure 2 we can see that the average amount of time it takes to localize a file is noticeably different between the private and public subnets. First, there is significantly more variation in localization times in the private subnet than in the public subnet. Second, if we trimmed off the outliers from the private subnet, its localization times would still be higher than in the public subnet.

Now, does this matter?

To us, it’s an emphatic yes! Our bread-and-butter is all about serving up files that live in S3 (which for security reasons we can’t expose directly to our users). These delays are more than we are able to tolerate. To you, the answer might be no. Either way, it is valuable to know that your NAT server can introduce noticeable performance degradations.

EDIT: March 24, 2015 at 1:30pm

Some folks on reddit brought up the question of how the NAT server itself is performing. It’s a fantastic question. I have a two-fold response.

First, that’s pretty much the question I hope people ask when they read this article … not just about our NAT server, but more importantly, theirs. NAT servers need just as much TLC as any other server. The fact that AWS offers to set one up for you only increases the likelihood that you’re going to forget about it.

Second, I totally should have put our NAT server stats in here. So I’m adding them now. Our NAT servers are m3.large instances.

Graph showing NAT network traffic
Figure 3: The NATs are able to support much more network traffic than our simulation is creating.

It’s clear from this graph that our NAT servers are able to support considerably more traffic than our simulation is creating. So it’s not network throughput that’s slowing us down.

So how about CPU load? The 1-minute CPU load average didn’t exceed 0.25 during the time the simulation was running.

And memory usage? Memory usage stayed static at 750 megabytes used per box, with 3 gigabytes free.

No disk activity was happening during the simulation. So iowait won’t show us anything useful; neither will inode usage, or any other disk measurement for that matter.

So, in summary …

  • CPU, not the culprit
  • Memory, not the culprit
  • Disk, not the culprit
  • On-NAT Networking, not the culprit

Building an RPM repository in AWS (Part 1)

This is the first of a multi-part series. Check back for future installments.

When you have a SaaS application that the FDA cares about, there are certain things you have to do. Usually there are really good reasons. Take for instance the regulatory requirement that when you install your application on a web server, the server must pass an installation qualification. In layman’s terms, that means you have to prove one way or another that what you want installed on that box is actually what’s installed on that box.

Maybe your application is written in PHP, and your installation qualification document says that you need the following packages installed (among others, of course). Why does it say that? Because you’ve gone through an extensive validation process to all but prove mathematically that your application works the way you say it will.

  • php-5.3.3-40.el6_6.x86_64
  • php-symfony-2.3.9-1.el6.noarch
  • php-pdo-5.3.3-40.el6_6.x86_64
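
That proof can be as simple as diffing the IQ’s package list against what rpm reports. Here’s a minimal sketch with sample data; on a real server, installed.txt would come from `rpm -qa | sort`.

```shell
# Expected package list from the IQ (sample data for illustration):
cat > expected.txt <<'EOF'
php-5.3.3-40.el6_6.x86_64
php-pdo-5.3.3-40.el6_6.x86_64
php-symfony-2.3.9-1.el6.noarch
EOF

# On a real server:  rpm -qa | sort > installed.txt
cat > installed.txt <<'EOF'
php-5.3.3-40.el6_6.x86_64
php-pdo-5.3.3-40.el6_6.x86_64
EOF

sort expected.txt > e.sorted
sort installed.txt > i.sorted

# Anything printed here is required by the IQ but missing from the box.
comm -23 e.sorted i.sorted
# -> php-symfony-2.3.9-1.el6.noarch
```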

You did things the modern way: you got yourself an AWS account and used CloudFormation, Autoscaling, Chef and other great tools to create a stack of web servers that work perfectly. You go the extra mile and make sure that all the packages listed in your IQ are pinned in Chef. Good for you.

Time passes, and Amazon shuts down some of your instances. No surprise there. They come back up fine. Just like you designed it. You knew it would work correctly.

A couple more months pass. Your architect has suggested that you consider upgrading your version of PHP to, you know, something WRITTEN THIS DECADE. The quality assurance team is happy to play their role, but boss man grits his teeth when QA tells him that it’ll take 2 months to redo validation and update the installation qualification. Man, that’s expensive. But you’re also thinking about upgrading MySQL, and that’ll require re-validation and an updated IQ, and hell … you might as well only go through that process once. The team decides to upgrade both PHP and MySQL at the same time. No problem. You guys are good.

Two weeks into the project, your alerting tools start letting you know there’s a problem with production. When increased traffic hit your web server stack just after lunch, your two new web servers didn’t come up correctly. Checking your logs, you see:

No package php-5.3.3-40.el6_6 available.

You have a problem.

To be continued …

Read part two of Building an RPM repository in AWS.


Command Line Magic: Graphite, and Sending an email

Let’s take a look at sending an email from the Linux command line. How about we send a one-line email to ourselves? Note: The -s flag specifies the email’s subject, and you@example.com stands in for your own address.

$> echo "Hi friend!" | mail -s "This is an email" you@example.com

Outside of this being really simple, there’s not much use to it, is there? Let’s do something more interesting: attaching an image to the email.

$> ( echo "Hi friend!" ; uuencode myimage.jpg myimage.jpg ) | mail -s "This is an email" you@example.com

So what’s going on here? Not only are we placing our greeting (“Hi friend!”) within the body of the email, we’re also putting a uuencoded image in there. The way we’re using uuencode takes two parameters. The first myimage.jpg tells uuencode to encode the local file myimage.jpg. (This overrides the default behavior of reading from standard input.) The second myimage.jpg will be the name of the attachment within the email.

Is this more interesting? Yes. More useful? Maybe not. But let’s see if we can fix that.

How about we send an image from our monitoring solution (like Graphite)? In the commands below, $GRAPHITE_URL stands in for your Graphite render URL and $RECIPIENT for your email address.

$> curl "$GRAPHITE_URL" --compressed > cpuUsage.png
$> ( echo "Recent CPU usage for `date`" ; uuencode cpuUsage.png cpuUsage.png ) | mail -s "CPU Usage for `date`" "$RECIPIENT"

Throw that in cron, and you can get Graphite graphs emailed to you whenever you want.
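
For instance, a crontab entry like this (the script path is a made-up example; it would wrap the commands above) mails you the graph every morning at 7:00:

```
0 7 * * * /usr/local/bin/email-cpu-graph.sh
```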

Note: The command shown above probably doesn’t do everything you need it to. Think of it as inspiration. I mean, you’re going to need to log in to Graphite, because your monitoring isn’t exposed to the outside world. (Right?) Also, you may not want to retrieve a cached image. The final command, in all of its glory, is below ($GRAPHITE_LOGIN_URL, $GRAPHITE_RENDER_URL, and $RECIPIENT are placeholders for your own values).

$> cd /tmp; export AUTH=`curl -sv -u username:password "$GRAPHITE_LOGIN_URL" 2>&1 | grep "Authorization: Basic" | awk '{print $4;}'`
$> curl "$GRAPHITE_RENDER_URL" -H 'Pragma: no-cache' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept: */*' -H "Authorization: Basic ${AUTH}" -H 'Connection: keep-alive' -H 'Cache-Control: no-cache' --compressed > /tmp/cpuUsage.png
$> ( echo "Here is cpuUsage at `date`" ; uuencode cpuUsage.png cpuUsage.png ) | mail -s "CPU Usage - `date`" "$RECIPIENT"

Now, credit is owed where credit is due. I really wish I could claim credit for putting all of this together. But I can’t. This came from a co-worker who is known as our “command line guru”. I would link to his blog, if he had one.


A favorite interview question: more than meets the eye

I love being in on hiring interviews, especially interviews for technical candidates. One of my favorite questions, whether we’re hiring a systems administrator, software engineer, DevOps specialist, or QA engineer, is deceptive. At first glance, it appears so easy. And on the surface, it is. But once you dive into the gory details, you realize that the question touches on so many important concepts.

What does this do:

$> wget

Continue reading “A favorite interview question: more than meets the eye”


What is your read:write ratio in MySQL?

In a recent post, I mentioned that we needed a MySQL clustering solution that scaled well with both reads and writes. I was fairly sure we handled more reads than writes. So I asked around: what percentage of our operations were reads? Nobody knew. Not content to leave the question unanswered, I spent 5 minutes figuring it out.

Do you know how to figure out what percentage of your MySQL operations are reads vs. writes?
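
One common way to approximate it (not necessarily the exact method from the full post) uses MySQL’s Com_* status counters: reads are roughly Com_select, and writes are Com_insert plus Com_update plus Com_delete. The counter values below are sample data standing in for a real server’s output.

```shell
# On a real server you would capture live counters with:
#   mysql -e "SHOW GLOBAL STATUS LIKE 'Com_%'" > status.txt
# Sample counter values standing in for real data:
cat > status.txt <<'EOF'
Com_select 91000
Com_insert 4000
Com_update 3000
Com_delete 2000
EOF

# Reads are Com_select; writes are Com_insert + Com_update + Com_delete.
awk '$1 == "Com_select" { r = $2 }
     $1 ~ /^Com_(insert|update|delete)$/ { w += $2 }
     END { printf "reads: %d  writes: %d  read pct: %.0f%%\n", r, w, 100*r/(r+w) }' status.txt
# -> reads: 91000  writes: 9000  read pct: 91%
```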

Continue reading “What is your read:write ratio in MySQL?”


Web Applications and Security

Verizon recently came out with a great analysis of thousands of security breaches. Here’s what fascinated me most:

The universe of threats may seem limitless, but 92% of the 100,000 incidents we’ve analyzed from the last 10 years can be described by just nine basic patterns.

– Verizon’s 2014 Data Breach Investigations Report

Those nine patterns are:

  • Point-of-sale intrusions
  • Web app attacks
  • Insider and privilege misuse
  • Physical theft and loss
  • Miscellaneous errors
  • Crimeware
  • Payment card skimmers
  • Denial of service
  • Cyber-espionage

Within the realm of web application security, Verizon has done a nice job of highlighting some important basic controls that can improve your application’s security.

  • Single-password fail
  • Rethink CMS
  • Validate inputs
  • Enforce lockout policies
  • Monitor outbound connections

You can download the report without registering. Pour yourself some coffee, sit down, and read it. We’re all better off if our technology is more secure.


Researching High Availability MySQL: How we chose Percona’s XtraDB

Our application relies on MySQL, and we’ve got a simple master-slave setup. That solution, though it works for us now, cannot follow us to AWS. It’s not reliable enough. We have to plan on any given availability zone going down, and taking all of its instances with it. When this happens to us, we want it to be a zero-downtime event.

So we put on our thinking caps, and came up with a number of solutions that could work for us:

  1. Master-Slave
  2. Master-Master
  3. Master-Master-Fan
  4. Active-Passive Shared Storage
  5. NDB
  6. Galera or Percona XtraDB Cluster

Many of these solutions were described quite nicely in this slide deck by Oracle.

Which was right for us? (And would it be right for you?)

Continue reading “Researching High Availability MySQL: How we chose Percona’s XtraDB”
