Fubaredness Is Contagious

Dmitriy Samovskiy’s Blog

Extending EC2 API – ec2-describe-ipaddress-ranges

September 1st, 2010 · No Comments

Do you remember how we used to programmatically consume services on the web before the proliferation of APIs? That’s right – scraping! And do you know what prevents us from using this technique now, when some piece of data you need for your application is not available via an API? That’s right – absolutely nothing!

I recently came across something that was not available via the EC2 API – lists of IP address ranges for each EC2 region. The AWS team maintains such lists in their forums – currently at http://developer.amazonwebservices.com/connect/ann.jspa?annID=735 (I say “currently” because annID has already changed at least once). I asked for an API, but in the meantime, since I needed this now, I wrote a simple Python script to scrape and parse the data – http://gist.github.com/559397. Enjoy! By the way, who would have thought that one would need to resort to scraping AWS, the very pioneers of infrastructure APIs, to get this simple bit of information?
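For readers who want a feel for the scrape-and-parse approach, here is a minimal sketch. It is not the gist linked above. The page layout it assumes (a line naming a region, followed by lines containing CIDR blocks), the region names in `REGION_RE`, and the function name `parse_ip_ranges` are all illustrative assumptions. Fetching the page is left out so the parsing logic stands on its own.

```python
import ipaddress
import re
from collections import defaultdict

# Assumed layout: a line naming a region, then lines containing CIDR blocks.
# The region names listed here are an assumption, not taken from the real page.
REGION_RE = re.compile(r"\b(US East|US West|EU|Asia Pacific)\b[^:\n]*", re.I)
CIDR_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}/\d{1,2})\b")


def parse_ip_ranges(text):
    """Return {region name: [ip_network, ...]} parsed from announcement text."""
    ranges = defaultdict(list)
    region = None
    for line in text.splitlines():
        cidrs = CIDR_RE.findall(line)
        if not cidrs:
            # A line without CIDR blocks may start a new region section.
            match = REGION_RE.search(line)
            if match:
                region = match.group(0).strip()
            continue
        if region is None:
            continue  # CIDR block seen before any region heading; skip it
        for cidr in cidrs:
            try:
                ranges[region].append(ipaddress.ip_network(cidr))
            except ValueError:
                pass  # looks like a CIDR block but is not a valid network
    return dict(ranges)


if __name__ == "__main__":
    # Placeholder text with made-up private ranges, not real EC2 data.
    sample = "US East (Northern Virginia):\n10.1.0.0/16\nEU (Ireland):\n10.3.0.0/17\n"
    for name, networks in parse_ip_ranges(sample).items():
        print(name, [str(n) for n in networks])
```

Because the forum page can change its layout (and its annID) at any time, a parser like this would likely need adjusting whenever the source page does.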

→ No Comments

Tags: cloud computing · python

The Biggest Challenge for Infrastructure as Code

August 17th, 2010 · 6 Comments

What do you do when you come across a piece of open source software that you’d like to try? You could download its source code tarball, extract the files, build and install it following the rules and conventions for a given programming language (./configure && make && make install, ruby setup.rb build, python setup.py install, perl Makefile.PL) – and you end up with a usable product. This simple fact is at the very core of the entire open source ecosystem – without an easy and reliable way to transform source code into runnable software, open source potentially would not even exist.

I think that the biggest challenge for Infrastructure as Code today is its current lack of anything resembling a Makefile – a relatively simple description of how input could be transformed into output ready for use end to end, given a set of basic tools or a preset build environment (for example, for a project written in C it would be apt-get install build-essential on Debian and its derivatives). If you want an example, please take a look at deployment instructions for openstack/nova (“on the cloud controller, do this… on volume node, do that…”). While it is indeed infrastructure code, its end-to-end build and deployment instructions are provided in textual form, not as code.

Why is it a problem, you may ask? First and foremost, build/deploy instructions provided in textual form can’t be easily consumed by a machine – it feels like we are back in the dark ages without APIs, where all work must be performed manually.

Secondly, because they are not fully formalized, they can’t be as easily shared – there could be many uncaptured context requirements that could lead to different people transforming identical inputs to outputs that would not function identically. And if they are not shared, the same functionality ends up being worked on by many separate teams at the same time, which leads to incompatible, sometimes competing implementations and wastes effort by preventing code reuse.

Thirdly, since they are not code, they are not as easy to test and verify test coverage for, or to fork and merge, or to port to other platforms.

My point is that while individual parts or steps of an infrastructure deployment could be automated, the whole thing rarely is, especially when a system is to be deployed to multiple hosts connected over the network. This would be similar to a software project with various directories, each with its own Makefile but without a top-level Makefile – such that you’d have to follow a HOWTO telling you which arguments to pass to make in each directory and in which order to run the commands.

What to do? I call on all infrastructure projects to make every attempt to ship deployment instructions not as textual step-by-step howto documents, but as code – be it Chef cookbooks, Puppet manifests, shell scripts, Fabric/Capistrano scripts and so on, or a combination of any of the above. Please consider providing cloud images (in at least one region of at least one public cloud) with your canonical build environment (your equivalent of build-essential). Please consider including canonical network topologies for your deployment – since you can’t predict the IP addresses each user is going to allocate, all configuration files will need to be autogenerated or built from templates.

I am well aware it’s easier said than done, but if we do this, I hope a tentative consensus on best practices for infrastructure as code deployments could emerge over time which could then facilitate creation of a common “infrastructure make” tool.
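To make the point about configuration files built from templates a bit more concrete, here is a minimal sketch using only Python’s standard library. The role names, addresses, config keys and the `render_config` helper are all hypothetical; they are not taken from openstack/nova or any other project, and a real deployment would likely use a proper tool such as Chef or Puppet rather than this.

```python
from string import Template

# Hypothetical topology: role name -> IP address the user happened to allocate.
TOPOLOGY = {"controller": "10.0.0.10", "volume": "10.0.0.20"}

# Hypothetical config template; $-placeholders are filled in from the topology.
VOLUME_CONF = Template(
    "controller_host = $controller\n"
    "listen_address = $volume\n"
)


def render_config(template, topology):
    """Fill a config template from the topology.

    Template.substitute raises KeyError when a role is missing, so an
    incomplete topology fails loudly instead of producing a broken file.
    """
    return template.substitute(topology)


if __name__ == "__main__":
    print(render_config(VOLUME_CONF, TOPOLOGY))
```

The same topology dictionary could drive every config file in a deployment, which is the kind of single top-level description that a HOWTO cannot provide.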

→ 6 Comments

Tags: devops · infrastructure development

Parallelize Your EC2 API Calls with Python, Boto and Threading

August 3rd, 2010 · No Comments

I started a small new project on Github – http://github.com/somic/ec2-multiregion. It includes several small tools that facilitate EC2 API operations that involve multiple regions at the same time.

I quickly discovered that querying each endpoint one after another would take too long. Therefore, I created a small helper class called BotoWorkerPool (in lib/boto_worker_pool.py), which wraps Python’s standard threading module around calls to boto – this helps achieve some amount of parallelism without introducing the significant complexity of dealing with and sharing data among multiple processes. It also leaves the door open to migrating to the processing or multiprocessing libraries in the future, which offer threading-like interfaces for a multi-process model.
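The general idea can be sketched as follows. This is not the actual BotoWorkerPool code from the repository; the class name `WorkerPool`, its `run` method and the stand-in `fake_describe` function are assumptions made for illustration, and the sketch is written for current Python 3 with no boto dependency. Each task is a zero-argument callable, which in practice could wrap one API call against one region.

```python
import queue
import threading


class WorkerPool:
    """Run independent callables on a fixed number of threads."""

    def __init__(self, num_workers=4):
        self.num_workers = num_workers

    def run(self, tasks):
        """tasks: {key: zero-argument callable}.

        Returns {key: result}, or {key: exception} for calls that raised.
        """
        pending = queue.Queue()
        for item in tasks.items():
            pending.put(item)
        results = {}
        lock = threading.Lock()

        def worker():
            while True:
                try:
                    key, func = pending.get_nowait()
                except queue.Empty:
                    return  # nothing left to do
                try:
                    outcome = func()
                except Exception as exc:  # keep one bad region from hiding the rest
                    outcome = exc
                with lock:
                    results[key] = outcome

        threads = [threading.Thread(target=worker) for _ in range(self.num_workers)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return results


def fake_describe(region):
    """Stand-in for a per-region API call (hypothetical)."""
    return "result for " + region


if __name__ == "__main__":
    regions = ["us-east-1", "us-west-1", "eu-west-1", "ap-southeast-1"]
    pool = WorkerPool(num_workers=len(regions))
    print(pool.run({r: (lambda r=r: fake_describe(r)) for r in regions}))
```

Threads fit here because the work is dominated by waiting on network responses, so the calls overlap even though only one thread runs Python code at a time.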

There are 2 tools at the moment.

onesnapshot.py creates new snapshots for all volumes that already have one snapshot marked with the “__onesnapshot__” token. The rationale for this tool came in part from the following statement about durability on the AWS main page for EBS:

The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot.

imageequiv.py takes an AMI ID, kernel ID or ramdisk ID and finds equivalent IDs in all regions, based on matching name or manifest file location. This tool is a response to the following tweet of mine:

wanted – equivalence lists for kernel and ramdisk images (aki-, ari-) across all ec2 regions
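The matching rule behind imageequiv.py can be illustrated with a small sketch over plain dictionaries. This is not the tool’s actual code: the `find_equivalents` function, the dictionary keys, and the decision to compare only the file name part of the manifest location (on the assumption that bucket names differ between regions) are all mine, and the image IDs below are made up.

```python
def manifest_file(image):
    """File name part of the manifest location, or '' if there is none.

    Assumption: bucket names differ per region, so only the file name
    at the end of the location is compared.
    """
    return (image.get("location") or "").rsplit("/", 1)[-1]


def find_equivalents(target, images_by_region):
    """Return {region: [image id, ...]} for images matching target.

    An image matches when it has the same name as the target or the
    same manifest file name.
    """
    matches = {}
    for region, images in images_by_region.items():
        for image in images:
            same_name = bool(target.get("name")) and image.get("name") == target.get("name")
            same_manifest = bool(manifest_file(target)) and (
                manifest_file(image) == manifest_file(target)
            )
            if same_name or same_manifest:
                matches.setdefault(region, []).append(image["id"])
    return matches


if __name__ == "__main__":
    # Made-up IDs and locations, for illustration only.
    target = {"id": "aki-aaaa0001", "name": "vmlinuz-example",
              "location": "bucket-us/kernels/vmlinuz-example.manifest.xml"}
    catalog = {
        "eu-west-1": [{"id": "aki-bbbb0002", "name": None,
                       "location": "bucket-eu/kernels/vmlinuz-example.manifest.xml"}],
        "us-west-1": [{"id": "aki-dddd0004", "name": "vmlinuz-example",
                       "location": "bucket-usw/other.manifest.xml"}],
    }
    print(find_equivalents(target, catalog))
```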

Hope these are useful to someone.

→ No Comments

Tags: cloud computing · python

Russell’s Paradox and Cloud Computing

July 20th, 2010 · 1 Comment

I am sure you’ve heard of Bertrand Russell’s paradox and one of its more widely known versions – the Barber paradox. But let me rephrase the Wikipedia article:

Suppose there is a town with just one public IaaS cloud provider, and that every business in the town runs their own IT: some by hosting it on-premises, some by running it in the cloud. It seems reasonable to imagine that the cloud provider obeys the following rule: it runs IT for all and only those businesses that do not run their IT on-premises.

Under this scenario, we can ask the following question: is the cloud provider’s own IT in the cloud or on-premises? (Remember that the cloud provider itself is also a business.)

Of course this “faux” paradox is not a paradox at all. As I suggested on Twitter, “if you build a cloud, to your customers it will indeed look like a cloud; but to you it will look like a regular datacenter.”

→ 1 Comment

Tags: cloud computing · fun

Are You a Responsible Owner of Your Availability?

July 6th, 2010 · 1 Comment

Last month AWS released the Reduced Redundancy Storage (RRS) feature of S3. There were several aspects of this announcement that appealed to different people, but I especially appreciated one part – S3 now offers a choice of less availability for a lower price.

Availability of your system, just as any other part of your service, is a feature. Just as with anything else, one needs to invest time, effort and resources in building it out. And whatever you dedicate to availability (such as development time) can’t be used for other features – this is what’s known as opportunity cost. If you could put the same resources to better use somewhere else, investing them in availability may not be the optimal decision. Additionally, availability draws from your complexity budget, which is going to impact other areas – HA systems tend to be more complex and hence require more effort to develop, maintain and improve over time. Availability, just as any other feature, has a price tag that you will have to pay to get it. Because you own your site’s availability, it’s up to you to decide how much availability you want AND can afford to build.

The last point is very important. Our daily lives are filled with points of failure – home appliances (can break), a usual route you take to work (could be impacted by road construction), your regular coffee place (your favorite barista could transfer to a different location). Do you maintain 2 different non-overlapping routes to work? Or do you frequent 2 coffee shops in order to have an alternative if one shop drops from your list? In other words, in our lives we regularly forgo availability when it doesn’t make sense – why shouldn’t we follow the same rule in our professional lives?

Availability is not a binary option. You could have an all-active N-tuple, you could have an active-active pair, you could have an active-passive pair with automatic failover, or the same active-passive pair with manual failover. And finally, in today’s cloudy world, you could also have just a single resource with the ability to replace this resource quickly if it goes down. Options include geographic redundancy, vendor/provider diversity, and so on. Availability could be as simple as hosting your systems at a very reliable provider. Or at the very least – being able to detect when there is a problem and to restore the system within a preset amount of time. Different levels of availability obviously don’t cost the same – pick one that you want and can afford.
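A rough way to compare these options is the standard steady-state estimate: availability equals mean time between failures divided by the sum of mean time between failures and mean time to repair. The sketch below applies it to a single quickly replaceable resource and to a redundant pair. All the numbers are made up for illustration, the function names are mine, and the redundant case assumes replicas fail independently, which real systems often violate.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time a resource is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant(single, replicas=2):
    """Availability of a group where any one replica is enough.

    Assumes replicas fail independently of each other.
    """
    return 1 - (1 - single) ** replicas


def downtime_hours_per_year(avail):
    """Expected downtime over a 365-day year."""
    return (1 - avail) * 365 * 24


if __name__ == "__main__":
    # Made-up numbers: a host that fails about every 1000 hours.
    slow_repair = availability(1000, 8)     # restored by hand in 8 hours
    fast_replace = availability(1000, 0.5)  # replaced automatically in 30 minutes
    pair = redundant(slow_repair, replicas=2)
    for label, value in [("slow repair", slow_repair),
                         ("fast replace", fast_replace),
                         ("redundant pair", pair)]:
        print(label, round(downtime_hours_per_year(value), 2), "hours down per year")
```

Putting a price next to each line of output is one way to make the "want and can afford" decision explicit.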

Secondly, if your overall service consists of multiple smaller parts, you are free to choose different levels of availability for individual parts. Anything which responds to synchronous calls (a call that expects a reply immediately) – like the web front door – may have one level of availability (higher), while background jobs may have a lower level. Designing each subsystem with an appropriate level of availability will reduce your costs and most likely will let you save some of your complexity budget for other things.

Thirdly, while availability is a single metric, problems that impact it are not. Some problems could be frequent and easy to deal with, other problems could be rare and catastrophic. Do you want to build your service to withstand a failure of a host, all hosts, your entire ISP, the entire Internet? It’s all about the tradeoffs between costs, the severity of each type of problem and the probability of these problems occurring.

Fourthly, remember that availability measures that you build are your defenses against problems. A particular type of problem that you want to protect against requires an availability measure targeted at this very problem – matching it by functionality, size and cost (a single defense measure may work against multiple threats). Imbalance in any of these three categories between your defenses and the problems they are meant to prevent will lead to suboptimal results. After all, you don’t use a shield to defend against a cannon and you don’t duplicate your entire operation into a second datacenter just to protect against a router failure.

And finally, beware of peer pressure. If your web front door’s availability costs $1m per month and it’s bringing in $10m per month worth of revenue, it can be a no-brainer. But if you are investing 50% of your complexity budget in availability just because everybody else is doing it, I think it could be a problem.

Going back to AWS and putting my amateur behavioral economist’s hat on, I am curious how many people decided to take advantage of the lower price for lower availability of RRS. And even more interestingly, if S3 initially were at RRS availability and AWS announced better availability for a higher price, would we end up with the same distribution of people using higher and lower availability?

→ 1 Comment

Tags: devops · infrastructure development