<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fubaredness Is Contagious &#187; infrastructure development</title>
	<atom:link href="http://somic.org/category/infrastructure-development/feed/" rel="self" type="application/rss+xml" />
	<link>http://somic.org</link>
	<description>Dmitriy Samovskiy's Blog</description>
	<lastBuildDate>Wed, 01 Sep 2010 07:55:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The Biggest Challenge for Infrastructure as Code</title>
		<link>http://somic.org/2010/08/17/the-biggest-challenge-for-infrastructure-as-code/</link>
		<comments>http://somic.org/2010/08/17/the-biggest-challenge-for-infrastructure-as-code/#comments</comments>
		<pubDate>Tue, 17 Aug 2010 08:07:15 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[devops]]></category>
		<category><![CDATA[infrastructure development]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1637</guid>
		<description><![CDATA[What do you do when you come across a piece of open source software that you&#8217;d like to try? You could download its source code tarball, extract the files, build and install it following the rules and conventions for a given programming language (./configure &#38;&#38; make &#38;&#38; make install, ruby setup.rb build, python setup.py install, [...]]]></description>
			<content:encoded><![CDATA[<p>What do you do when you come across a piece of open source software that you&#8217;d like to try? You could download its source code tarball, extract the files, build and install it following the rules and conventions for a given programming language (./configure &amp;&amp; make &amp;&amp; make install, ruby setup.rb build, python setup.py install, perl Makefile.PL) &#8211; and you end up with a usable product. This simple fact is at the very core of the entire open source ecosystem &#8211; without an easy and reliable way to transform source code into runnable software, open source might not even exist.</p>
<p><strong>I think that the biggest challenge for <a href="http://stochasticresonance.wordpress.com/2009/07/12/infrastructure-renaissance/">Infrastructure as Code</a> today is its current lack of anything resembling a Makefile &#8211; a relatively simple description of how input could be transformed into output ready for use end to end, given a set of basic tools or a preset build environment </strong>(for example, for a project written in C it would be <em>apt-get install build-essential</em> on Debian and its derivatives). If you want an example, please take a look at the deployment instructions for <a href="http://nova.openstack.org/getting.started.html">openstack/nova</a> (&#8220;on the cloud controller, do this&#8230; on volume node, do that&#8230;&#8221;). <strong>While it is indeed infrastructure code, its end-to-end build and deployment instructions are provided in textual form, not as code.</strong></p>
<p>Why is it a problem, you may ask? First and foremost, build/deploy instructions provided in textual form can&#8217;t be easily consumed by a machine &#8211; it feels like we are back in the dark ages without APIs, where all work must be performed manually.</p>
<p>Secondly, because they are not fully formalized, they can&#8217;t be shared as easily &#8211; many uncaptured context requirements could lead different people to transform identical inputs into outputs that do not function identically. And if they are not shared, the same functionality is worked on by many separate teams at the same time, which leads to incompatible, sometimes competing implementations and wastes effort by preventing code reuse.</p>
<p>Thirdly, since they are not code, they are not as easy to test and verify test coverage for, or to fork and merge, or to port to other platforms.</p>
<p>My point is that while individual parts or steps of an infrastructure deployment could be automated, the whole thing rarely is, especially when a system is to be deployed to multiple hosts connected over the network. This would be similar to a software project with various directories, each with its own Makefile but without a top-level Makefile &#8211; such that you&#8217;d have to follow a HOWTO telling you which arguments to pass to make in each directory and in which order to run the commands.</p>
<p><strong>What to do? I call on all infrastructure projects to make every attempt to ship deployment instructions not as textual step-by-step howto documents, but as code &#8211; be it Chef cookbooks, Puppet manifests, shell scripts, Fabric/Capistrano scripts and so on, or a combination of any of the above. Please consider providing cloud images (in at least one region of at least one public cloud) with your canonical build environment (your equivalent of <em>build-essential</em>). Please consider including canonical network topologies for your deployment &#8211; since you can&#8217;t predict the IP addresses each user is going to allocate, all configuration files will need to be autogenerated or built from templates.</strong></p>
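<p>To make the template idea concrete, here is a minimal sketch in Python, not taken from any particular project. It assumes a hypothetical two-role topology (one controller, one volume node) and made-up config keys; only <em>string.Template</em> from the standard library is real. The point is that addresses live in one topology description and every config file is generated from it.</p>

```python
from string import Template

# Hypothetical per-node config template; the keys are illustrative only.
NODE_CONFIG = Template(
    "role = $role\n"
    "listen_address = $address\n"
    "controller_address = $controller\n"
)


def render_configs(topology):
    """Return {hostname: config text} for every host in the topology.

    Assumes exactly one host has the role "controller"; without one,
    next() raises StopIteration.
    """
    controller = next(
        host["address"]
        for host in topology.values()
        if host["role"] == "controller"
    )
    return {
        name: NODE_CONFIG.substitute(
            role=host["role"],
            address=host["address"],
            controller=controller,
        )
        for name, host in topology.items()
    }


if __name__ == "__main__":
    # Example addresses; each user would supply their own allocation.
    topology = {
        "ctl1": {"role": "controller", "address": "10.0.0.10"},
        "vol1": {"role": "volume", "address": "10.0.0.20"},
    }
    for name, text in render_configs(topology).items():
        print("# " + name)
        print(text)
```

<p>Changing an IP address then means editing the topology and regenerating, rather than hunting through hand-edited files on each host.</p>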
<p>I am well aware it&#8217;s easier said than done, but if we do this, I hope a tentative consensus on best practices for infrastructure as code deployments could emerge over time which could then facilitate creation of a common &#8220;infrastructure make&#8221; tool.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/08/17/the-biggest-challenge-for-infrastructure-as-code/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Are You a Responsible Owner of Your Availability?</title>
		<link>http://somic.org/2010/07/06/are-you-a-responsible-owner-of-your-availability/</link>
		<comments>http://somic.org/2010/07/06/are-you-a-responsible-owner-of-your-availability/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 07:59:10 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[devops]]></category>
		<category><![CDATA[infrastructure development]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1500</guid>
		<description><![CDATA[Last month AWS released Reduced  Redundancy Storage feature of S3. There were several aspects of  this announcement that appeal to different people, but I especially  appreciated one part &#8211; S3 now offers a choice of less  availability for a lower price.
Availability of your system, just as any other part of your [...]]]></description>
			<content:encoded><![CDATA[<p>Last month AWS released the <a href="http://aws.typepad.com/aws/2010/05/new-amazon-s3-reduced-redundancy-storage-rrs.html">Reduced Redundancy Storage</a> feature of S3. There were several aspects of this announcement that appeal to different people, but I especially appreciated one part &#8211; S3 now offers a <strong>choice of less availability for a lower price</strong>.</p>
<p>Availability of your system, just as any other part of your service, is a feature. Just as with anything else, one needs to invest time, effort and resources in building it out. And whatever you dedicate to availability (such as development time) can&#8217;t be used for other features &#8211; this is what&#8217;s known as <a href="http://en.wikipedia.org/wiki/Opportunity_cost">opportunity cost</a>. If you could put the same resources to a better use somewhere else, investing them in availability may not be the optimal decision. Additionally, availability draws from your <a href="http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/">complexity budget</a>, which is going to impact other areas &#8211; HA systems tend to be more complex and hence require more effort to develop, maintain and improve over time. <strong>Availability, just as any other feature, has a price tag that you will have to pay to get it.</strong> <strong>Because you own your site&#8217;s availability, it&#8217;s up to you to decide how much availability you want AND can afford to build.</strong></p>
<p>The last point is very important. Our daily lives are filled with points of failure &#8211; home appliances (can break), a usual route you take to work (could be impacted by road construction), your regular coffee place (your favorite barista could transfer to a different location). Do you maintain 2 different non-overlapping routes to work? Or do you frequent 2 coffee shops in order to have an alternative if one shop drops from your list? In other words, in our lives we regularly forgo availability when it doesn&#8217;t make sense &#8211; why shouldn&#8217;t we follow the same rule in our professional lives?</p>
<p>Availability is not a binary option. You could have an all-active N-tuple, you could have an active-active pair, you could have an active-passive pair with automatic failover, or the same active-passive pair with manual failover. And finally, in today&#8217;s cloudy world, you could also have just a single resource with the ability to replace this resource quickly if it goes down. Options include geographic redundancy, vendor/provider diversity, and so on. Availability could be as simple as hosting your systems at a very reliable provider. Or at the very least &#8211; be able to detect when there is a problem and be able to restore the system within a preset amount of time. <strong>Different levels of availability obviously don&#8217;t cost the same &#8211; pick one that you want and can afford.</strong></p>
<p>Secondly, if your overall service consists of multiple smaller parts, you are free to choose different levels of availability for individual parts. Anything which responds to synchronous calls (a call that expects a reply immediately) &#8211; like a web front door &#8211; may have one level of availability (higher), while background jobs may have a lower level. Designing each subsystem with an appropriate level of availability will reduce your costs and most likely will let you save some of your complexity budget for other things.</p>
<p>Thirdly, while availability is a single metric, problems that impact it are not. Some problems could be frequent and easy to deal with, other problems could be rare and catastrophic. Do you want to build your service to withstand a failure of a host, all hosts, all of your ISP, the entire Internet? It&#8217;s all about the tradeoffs between costs, the severity of each type of problem and the probability of each problem occurring.</p>
<p>Fourthly, remember that availability measures that you build are your defenses against problems. A particular type of problem that you want to protect against requires an availability measure targeted at this very problem &#8211; matching it by functionality, size and cost (a single defense measure may work against multiple threats). Imbalance in any of these three categories between your defenses and the problems they are meant to prevent will lead to suboptimal results. After all, you don&#8217;t use a shield to defend against a cannon and you don&#8217;t duplicate your entire operation into a second datacenter just to protect against a router failure.</p>
<p>And finally, beware of peer pressure. If your web front door&#8217;s availability costs $1m per month and it&#8217;s bringing in $10m per month worth of revenues, it can be a no-brainer. But if you are investing 50% of your complexity budget in availability just because everybody else is doing it, I think it could be a problem.</p>
<p>Going back to AWS and putting my amateur behavioral economist&#8217;s hat on, I am curious how many people decided to take advantage of lower price for lower availability of RRS. And even more interestingly, if S3 initially were at RRS availability and AWS announced better availability for higher price, would we end up with the same distribution of people using higher and lower availability?</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/07/06/are-you-a-responsible-owner-of-your-availability/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Rise of DevOps</title>
		<link>http://somic.org/2010/03/02/the-rise-of-devops/</link>
		<comments>http://somic.org/2010/03/02/the-rise-of-devops/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 09:00:57 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[devops]]></category>
		<category><![CDATA[infrastructure development]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1277</guid>
		<description><![CDATA[If you are in IT, you probably noticed that most of the industry&#8217;s technical buzz lately has been centered around one of three huge areas &#8211; cloud computing, nosql and devops. Unlike Web 2.0 or Social Web, which are about content generation and content consumption models on the Internet, these three are actually about how [...]]]></description>
			<content:encoded><![CDATA[<p>If you are in IT, you probably noticed that most of the industry&#8217;s technical buzz lately has been centered around one of three huge areas &#8211; cloud computing, nosql and devops. Unlike Web 2.0 or Social Web, which are about content generation and content consumption models on the Internet, these three are actually about how software systems are built and operated &#8211; it is &#8220;engineering&#8221; vs &#8220;product.&#8221;</p>
<p>DevOps is on the rise as a newly re-defined standalone discipline, as <a href="http://stochasticresonance.wordpress.com/2009/07/12/infrastructure-renaissance/">evidenced</a> by the <a href="http://dev2ops.org/blog/2010/2/22/what-is-devops.html">increased</a> <a href="http://www.kartar.net/2010/02/what-devops-means-to-me/">number</a> of <a href="http://jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/">good</a> <a href="http://verveguy.blogspot.com/2010/02/devops-about-time.html">articles</a> about it around the blogosphere. In this post, I am going to take a stab at outlining what DevOps means to me.</p>
<p>I&#8217;ve got some devops cred. Before joining <a href="http://cohesiveft.com/">my current employer</a> where my role morphed over time away from devops, for over 2 years I had been at Orbitz.com where I was in a group in charge of monitoring and automating a hugely distributed multi-datacenter custom airfare search application, running on many hundreds of machines, with several times as many separate entities and processes that needed to be coordinated, restarted, tweaked and so on (our group was in charge of everything above hardware, OS and basic network services such as connectivity, DNS and DHCP). Before that I had had various sysadmin roles, which all involved a large degree of coding beyond the level of simple shell or perl scripts.</p>
<p>To me, devops is a distinct discipline at the border between software engineering and ops, which focuses on developing software for the infrastructure on top of which end-user-facing software is running. It&#8217;s sometimes referred to as development of infrastructure software and includes release deployment. Devops has the following distinguishing characteristics.</p>
<p><strong>1. Ability to write code beyond simple scripts<br />
</strong></p>
<p>Obvious necessary condition.</p>
<p><strong>2. Focus on stability and uptime</strong></p>
<p>Stability and uptime in devops almost always trump features.</p>
<p><strong>3. Extra focus on moving between states</strong></p>
<p>In dev land, I have often observed situations when the end result of a particular feature was analyzed on its own merit, without taking into consideration how a system can be moved from its current state to its desired future state. Devops pays extra attention to this problematic area.</p>
<p><strong>4. Different angle on business revenue</strong></p>
<p>While developers usually work on things that are meant to increase or sustain business revenues, devops often work on things that are meant to prevent or reduce loss of business revenues. This is somewhat similar to defense vs offense in team sports. The key word is &#8220;balance.&#8221;</p>
<p><strong>5. In devops, we are users of our own software</strong></p>
<p>This is one of the most important distinctions. Unlike developers who create software to be used by someone else (internal customers, end users, site visitors, etc), devops is about developing software for internal needs. For example, you can certainly get sloppy in logging that error, but it&#8217;s <em>you</em>, not someone else, who&#8217;s going to suffer the consequences of having to waste extra time to find necessary information.</p>
<p><strong>6. Architect, developer, tester, product manager, project manager &#8211; all in one.</strong></p>
<p>My personal experience in devops is that I (or my team) get an area of responsibility and it&#8217;s up to me (or us) to make it happen. Assigning priorities, figuring out dependencies, reacting to unexpected changes, managing resources &#8211; all of these functions are performed in devops by the same group of individuals.</p>
<p><strong>7. Awareness of normal accidents</strong></p>
<p>I have an entire blog post dedicated to this &#8211; <a href="http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/">check it out</a>.</p>
<p><strong>8. QA in production</strong></p>
<p>Some tasks in devops can&#8217;t be adequately tested in smaller synthetic environments. Lack of scale, lack of unique hardware, lack of sufficient capacity in vendor&#8217;s test environment, lack of sufficient connectivity from test site to vendor&#8217;s systems &#8211; all could be factors. Phased deployment and other techniques designed to reduce the risk of a complete meltdown are (or should be) used extensively in such scenarios, but the truth is &#8211; from time to time in devops I had no other way but to actually run a test system in a live production environment.</p>
<p><strong>9. Manual first, then automate</strong></p>
<p>In my experience, a devops task is more likely to start out as something done manually, and be automated later. In dev land, tasks rarely go through a manual phase before being coded up and shipped in a release.</p>
<p><strong>10. Almost always distributed or <a href="http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/">hyper-distributed</a></strong></p>
<p><strong>Conclusion</strong></p>
<p>Devops is on the rise primarily due to the realization that there is a big gap between developing end-user systems and bare-bones systems administration, in large part due to the fast growth of IaaS <a href="/category/cloud-computing">cloud computing</a>. Devops originated at places where relatively few sysadmins were in charge of many hundreds or even thousands of hosts &#8211; where doing their job without automation was impossible. As time goes on, I expect devops will further solidify its role as a first-class citizen and make inroads into non-cloudy companies as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/03/02/the-rise-of-devops/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Normal Accidents in Complex IT Systems</title>
		<link>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/</link>
		<comments>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 02:01:14 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[distributed]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[operations]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1043</guid>
		<description><![CDATA[Designing a fully-automated or nearly-fully-automated computer system with many moving parts and dependencies is tricky, whether a system is distributed, hyper distributed or otherwise. Failures happen and must be dealt with. After a while, most folks grow up from &#8220;failures are rare and can be ignored&#8221; to &#8220;failures are not that rare and can not [...]]]></description>
			<content:encoded><![CDATA[<p>Designing a fully-automated or nearly-fully-automated computer system with many moving parts and dependencies is tricky, whether a system is distributed, <a href="/2009/08/18/the-concept-of-hyper-distributed-application/">hyper distributed</a> or otherwise. Failures happen and must be dealt with. After a while, most folks grow up from &#8220;failures are rare and can be ignored&#8221; to &#8220;failures are not that rare and cannot be ignored&#8221; to &#8220;failures are common and should be taken into consideration&#8221; to &#8220;failures are frequent and must be planned for.&#8221; The last of these seems to represent the current prevailing point of view.</p>
<p>But here is the kicker &#8211; it&#8217;s not the end. I saw <a href="http://twitter.com/benjaminblack/status/5662514947">this tweet</a>, read <a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/">this post</a> and checked out a book by Charles Perrow titled &#8220;Normal Accidents&#8221; from the library. Published in 1984, the book is not about IT, but its material fits our field nicely. And boy, was I enlightened!</p>
<p>The book&#8217;s main point: <strong>no matter how much thought is put into the system design, or how many safeguards are implemented, a sufficiently complex system sooner or later will experience a significant breakdown that was impossible to foresee beforehand, principally due to unexpected interaction between components, tight coupling or bizarre coincidence. For us in IT, it translates to &#8220;no matter how much planning you do or how many safeguards you implement, failures will still happen.&#8221;</strong></p>
<p>There are at least 3 common themes that are present in multiple illustrations in the book:</p>
<ol>
<li>A big failure was usually a result of multiple smaller failures; these smaller failures were often not even related</li>
<li>Operators (people or systems) were frequently misled by inaccurate monitoring data</li>
<li>In a lot of cases, human operators were used to a given set of circumstances, and their thinking and analysis were misled by their habits and expectations (&#8220;when X happens, we always do Y and it comes back&#8221; &#8211; except for this one time, when it didn&#8217;t)</li>
</ol>
<p>I have had my share of outages and downtimes, and I can attest that I have seen these 3 factors play a big role in tech ops. Some were bugs in management and monitoring code, some were human error, some were a bizarre set of dependencies, but all were a combination of multiple factors. For example, who would have thought that with a failure of the primary DNS resolution server, the VIP would not fail over to the secondary; and even though hosts had more than one &#8220;nameserver&#8221; line in /etc/resolv.conf, the application timed out waiting for DNS to respond before getting to ask the second nameserver; without name resolution, multiple load balancers independently thought that there was no capacity behind them (because management code calculated capacity in near real-time relying on worker hosts&#8217; names) and disabled themselves, thus taking down the entire farm &#8211; now I know of course&#8230;</p>
<p>It turns out we can&#8217;t eliminate normal accidents altogether, but here are several techniques that I have been using to speed up detection and response in order to reduce the downtime.</p>
<p><strong>Complexity budget</strong>. <a href="http://blog.b3k.us/complexity_budget.html">Described by Benjamin Black</a>, this is a technique to allocate complexity among components beforehand and strictly follow the allocation during implementation phase. It helps avoid unnecessary fanciness and leads to simpler code, which tends to be easier to troubleshoot and recover after a failure.</p>
<p><strong>Control knobs/switches for individual components</strong>. <a href="http://www.slideshare.net/jallspaw/velocity2008-capacity-management1-484676/51">As John Allspaw shows on this slide</a>, you need to be able to turn off any component in an emergency, or throttle it up or down. Planning this feature and building it in from the very beginning is very important.</p>
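<p>As a rough illustration of what such a knob could look like in code, here is a minimal Python sketch. The class name <em>ControlKnob</em> and its methods are hypothetical, not from the linked slides or any library; a real system would also need the state to be changeable at runtime from outside the process.</p>

```python
import random


class ControlKnob:
    """Hypothetical on/off switch plus throttle for a single component."""

    def __init__(self, enabled=True, throttle=1.0, rng=None):
        self.enabled = enabled
        # Fraction of work allowed through, from 0.0 (none) to 1.0 (all).
        self.throttle = throttle
        self._rng = rng or random.Random()

    def switch_off(self):
        self.enabled = False

    def switch_on(self):
        self.enabled = True

    def set_throttle(self, fraction):
        if fraction > 1.0 or 0.0 > fraction:
            raise ValueError("throttle must be between 0.0 and 1.0")
        self.throttle = fraction

    def allows(self):
        """Decide whether the next unit of work may proceed."""
        if not self.enabled:
            return False
        # random() is in [0.0, 1.0), so 1.0 always passes and 0.0 never does.
        return self.throttle > self._rng.random()


if __name__ == "__main__":
    knob = ControlKnob()
    knob.set_throttle(0.25)  # let roughly a quarter of the work through
    handled = sum(1 for _ in range(1000) if knob.allows())
    print("handled", handled, "of 1000 requests")
    knob.switch_off()  # emergency: stop the component entirely
    print("allowed after switch_off:", knob.allows())
```

<p>The component checks <em>allows()</em> before each unit of work, so an operator can turn it down or off without a redeploy.</p>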
<p><strong>Accuracy of monitoring data</strong>. Ensure your alarms are as accurate as possible. No matter how much chaos is going on inside the system during a severe failure, the last thing you can afford is misleading the operators with wrong information. If you tried to ping host A and didn&#8217;t get a response, your alarm should not say &#8220;host A is down&#8221; because that&#8217;s not knowledge you obtained &#8211; it&#8217;s an assumption that you made. It should say &#8220;failed to ping host A from host B&#8221; &#8211; maybe it was the network on host B that was the issue when the ping attempt was made, how do you know?</p>
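<p>One way to keep alarm text honest is to build it only from what the probe actually observed. The following Python sketch assumes a hypothetical <em>ProbeResult</em> record filled in by whatever performs the ping; the probing itself is left out.</p>

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """What a single ping attempt observed (hypothetical record type)."""
    source: str      # host the ping was sent from
    target: str      # host the ping was sent to
    succeeded: bool  # whether a reply came back


def alarm_text(result):
    """Describe the observation itself; return None when there is nothing to report."""
    if result.succeeded:
        return None
    # State the fact (a ping from source to target failed), not the
    # assumption that the target is down.
    return "failed to ping host {} from host {}".format(
        result.target, result.source
    )


if __name__ == "__main__":
    print(alarm_text(ProbeResult(source="B", target="A", succeeded=False)))
```

<p>Because the source host is part of the message, an operator can immediately ask whether host B saw failures for other targets as well.</p>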
<p><strong>Availability of monitoring data</strong>. There is a reason the first thing the military tries to do when attacking is to disrupt the enemy&#8217;s means of communication &#8211; it&#8217;s that important, and the same applies in our case. You should either design your systems so that you can get monitoring data even during the worst outage imaginable (ideally from more than one source), or at least get an alarm when such monitoring data is missing (it&#8217;s a very weak substitute though).</p>
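<p>The fallback of alarming on missing data can be sketched as a simple staleness check. This Python example is only an illustration: the function name <em>stale_sources</em>, the source names and the threshold are all assumptions, and it presumes something else records the time of the last sample from each source.</p>

```python
import time


def stale_sources(last_seen, max_age_seconds, now=None):
    """Return the names of monitoring sources whose latest sample is too old.

    last_seen maps a source name to the UNIX timestamp of its last sample.
    """
    if now is None:
        now = time.time()
    return sorted(
        name
        for name, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


if __name__ == "__main__":
    current = time.time()
    # Hypothetical sources: one reported 10 seconds ago, one 5 minutes ago.
    last_seen = {"collector-east": current - 10, "collector-west": current - 300}
    for name in stale_sources(last_seen, max_age_seconds=60, now=current):
        print("no monitoring data received from {} in over 60 seconds".format(name))
```

<p>Note that the alarm wording again states what was observed (no data received), not a guess about why.</p>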
<p>All in all, to everybody in IT, I highly recommend the Normal Accidents book as well as this <a href="http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf">whitepaper</a> (linked from <a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/">John Allspaw&#8217;s blog</a>).</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cloud Overlay Networks Demystified &#8211; Holiday Edition</title>
		<link>http://somic.org/2009/12/18/cloud-overlay-networks-demystified-holiday-edition/</link>
		<comments>http://somic.org/2009/12/18/cloud-overlay-networks-demystified-holiday-edition/#comments</comments>
		<pubDate>Fri, 18 Dec 2009 17:22:56 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[cohesiveft]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[failover]]></category>
		<category><![CDATA[overlay]]></category>
		<category><![CDATA[overlay network]]></category>
		<category><![CDATA[vpncubed]]></category>

		<guid isPermaLink="false">http://somic.org/?p=999</guid>
		<description><![CDATA[As most of you probably know, I work at CohesiveFT where I focus on VPN-Cubed product. In short, it&#8217;s a solution to build overlay networks in third-party clouds. Overlay networks in this case are based on redundant encrypted point-to-point connections from your regular servers to your VPN-Cubed servers called &#8220;managers&#8221; (that you run in the [...]]]></description>
			<content:encoded><![CDATA[<p>As most of you probably know, I work at <a href="http://cohesiveft.com/">CohesiveFT</a> where I focus on the <a href="http://cohesiveft.com/vpncubed">VPN-Cubed</a> product. In short, it&#8217;s a solution to build overlay networks in third-party clouds. Overlay networks in this case are based on redundant encrypted point-to-point connections from your regular servers to your VPN-Cubed servers called &#8220;managers&#8221; (that you run in the cloud); managers then act as virtual switches and routers of this overlay, which essentially sits above your physical network. In other words, an overlay network effectively gives a customer a LAN-like network where the servers can be located pretty much anywhere, including in the cloud.</p>
<p>However, not all people know what an overlay network is or what its benefits and strengths are. This holiday season, as we were putting up our outdoor decorations and holiday lighting, I realized that what my wife and I were doing was essentially building an overlay network. Let&#8217;s follow the similarities.</p>
<p>Imagine a regular house with a front yard where for the holidays you want to set up a bunch of lighted Christmas trees, deer and other holiday figures. All of them require electricity &#8211; but there is no power installed in the ground (<span style="color: #0000ff;"><em>parallel with VPN-Cubed overlay network: you are deploying servers to a third-party cloud and want to continue using your IP addressing schemes, want to ensure that all communications are encrypted &#8211; but the provider doesn&#8217;t offer any of these services out of the box</em></span>).</p>
<p>You don&#8217;t need power out on your front yard all year round &#8211; so there is usually no point in investing money in installing it. <span style="color: #0000ff;"><em>Cloud computing is all about elasticity. As a complement to clouds, VPN-Cubed is easy to set up and take down if necessary for an experiment, or it can be running for long periods of time.</em></span></p>
<p>There are several outdoor outlets on the front wall so you decide to power your decorations from these outlets (<span style="color: #0000ff;"><em>you have VPN devices installed on the edge of your network &#8211; you will use them to offer connectivity to your servers from your network using VPN</em></span>). The first obvious solution is to run a power cord from each piece towards an outlet. While it&#8217;s possible in theory, it will turn out ugly in practice. Firstly, a lot of long outdoor power cords are expensive. Secondly, it will create a cabling mess near the outlet. Thirdly, if a cord goes bad, you need to trace where exactly it&#8217;s plugged in and replace it. Fourthly, the more stuff you have to power up, the more difficult this octopus made of power cords is going to be. <span style="color: #0000ff;"><em>Absolutely the same problems apply in our parallel use case.</em></span></p>
<p>So you come up with optimization #1 &#8211; you go out and buy several outdoor power strips with several outlets each. By placing these power strips where your lighted trees and deer are, you reduce cabling issues, gain the ability to use shorter power cords and most likely save money on power cords. <span style="color: #0000ff;"><em>That&#8217;s your VPN-Cubed manager server instance. When you place it next to your cloud-based servers, you reduce latency for your endpoints and cut down on VPN connections from the edge of your network that you need to build and maintain.</em></span></p>
<p>If you are well prepared (i.e., have enough of everything), your composition will drive how many power cords and strips you will need and how long your cords need to be, not the other way around. <span style="color: #0000ff;"><em>Same with VPN-Cubed &#8211; you mold it to fit your use case, your desired topology or application &#8211; you don&#8217;t adjust your application to be able to work within VPN-Cubed overlay network.</em></span></p>
<p>Outdoor power strips have additional protection to let them function outdoors in low temperatures. <span style="color: #0000ff;"><em>And so are VPN-Cubed manager instances &#8211; they are running a hardened OS, with minimal set of enabled services, behind firewall protection.</em></span> You can grab a regular switch and make it work outdoors &#8211; but why waste your time when these things don&#8217;t cost that much? <span style="color: #0000ff;"><em>Same with VPN-Cubed.</em></span></p>
<p>But power strips may fail &#8211; and if they do, an entire section of your composition will be turned off. So you keep a cold standby sitting in your garage in case a primary goes out. Or better &#8211; you install 2 power strips next to each other, connect them and spread your endpoints evenly between them. If one goes out, you switch all connections to the other strip and everything is back on. <em><span style="color: #0000ff;">VPN-Cubed allows you to deploy a hot spare with automatic failover capability, which can help balance the load as well.</span> </em>Your outdoor lighted Christmas tree is connected to one power strip at any given time, but if one fails it can be reconnected to another within a power cord distance. <span style="color: #0000ff;"><em>Same with VPN-Cubed &#8211; your servers are connected to a single manager at any given time, but if a manager becomes unavailable, your servers can automatically re-connect to another manager.</em></span></p>
<p>And what happens if one of your outlets goes bad? Moving a handful of cables to another outlet is much easier than moving a whole lot. <span style="color: #0000ff;"><em>Same with VPN-Cubed &#8211; if your network loses one entry point, you just re-connect VPN-Cubed to another.</em></span></p>
<p>There are many more parallels between the two. Most of us have been building overlay networks of decorations for quite some time. Building overlay networks for the cloud may be new, but the CohesiveFT VPN-Cubed product makes it easy and fun. Don&#8217;t be stuck with long power cords &#8211; <a href="http://www.cohesiveft.com/vpncubed/">get</a> <a href="http://www.cohesiveft.com/Cube/VPN/VPN-Cubed_IPsec_to_Cloud/">yourself</a> <a href="http://www.cohesiveft.com/Cube/VPN/VPN-Cubed_SSL/">some</a> <a href="http://www.cohesiveft.com/Cube/VPN/VPN-Cubed_Custom_Enterprise_Configurations/">nice</a> outdoor power strips. And enjoy the holidays!</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/12/18/cloud-overlay-networks-demystified-holiday-edition/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Punching UDP Holes in Amazon EC2</title>
		<link>http://somic.org/2009/11/02/punching-udp-holes-in-amazon-ec2/</link>
		<comments>http://somic.org/2009/11/02/punching-udp-holes-in-amazon-ec2/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 04:24:16 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[firewall]]></category>
		<category><![CDATA[security groups]]></category>

		<guid isPermaLink="false">http://somic.org/?p=876</guid>
		<description><![CDATA[Disclaimer 1: Despite its possibly ominous name, this is NOT a network vulnerability or an attack that could lead to unauthorized access. UDP hole punching requires cooperation between two hosts, and hence can&#8217;t be easily used as an attack by itself (in other words, in order to run it, you most likely must already  have [...]]]></description>
			<content:encoded><![CDATA[<p><em>Disclaimer 1: Despite its possibly ominous name, this is NOT a network vulnerability or an attack that could lead to unauthorized access. UDP hole punching requires cooperation between two hosts, and hence can&#8217;t easily be used as an attack by itself (in other words, in order to run it, you most likely must have already gained access to the hosts).<br />
</em></p>
<p><em>Disclaimer 2: Conclusions reached at the end of this post are my educated guesses, and may turn out to be wrong. They are based on my observations and not on actual knowledge of how EC2 internals are designed or implemented.<br />
</em></p>
<p>I was once working on a setup in Amazon EC2 and came across an oddity which, when coupled with <a href="/2009/09/21/security-groups-most-underappreciated-feature-of-amazon-ec2/">my interest in the EC2 security groups mechanism</a>, turned into this post.</p>
<p>UDP hole punching, in a nutshell, is a technique which allows two cooperating hosts, potentially located behind NAT and/or firewalls, to establish a peer-to-peer UDP communication channel directly to each other. It&#8217;s a technique used by Skype, for example &#8211; you can read more about it in a <a href="http://en.wikipedia.org/wiki/UDP_hole_punching">Wikipedia article</a>. If two hosts start sending UDP packets to each other on pre-agreed ports, the bi-directional flow of packets leads NAT devices and firewalls to treat all these packets as part of an established communication channel.</p>
<p>EC2 allows a lighter form of this technique because EC2 NAT never rewrites the source port of an outgoing packet (recall that in EC2, NAT is always 1-to-1, so port rewriting isn&#8217;t necessary). We know with 100% certainty that a packet we send with a given source port X will be seen by the remote instance with the same source port.</p>
<p>I wrote a small Python tool (available at <a href="http://gist.github.com/224795">http://gist.github.com/224795</a>) to test UDP hole punching and set out to discover if it could work in EC2. My expectation was that it should work. Unless explicitly noted, I used ports above 45,000 and none of the security groups explicitly allowed UDP traffic on these ports.</p>
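<p>To make the mechanics concrete, here is a minimal sketch of the lighter form described above, assuming both hosts already know each other&#8217;s IP address and have agreed on ports. The function name <code>punch_hole</code>, the payload and the retry defaults are illustrative assumptions; this is a simplified illustration, not a complete or hardened tool.</p>

```python
import socket
import time


def punch_hole(local_port, peer_ip, peer_port, attempts=10, interval=0.5,
               bind_ip="0.0.0.0"):
    """Try to open a bi-directional UDP path to a cooperating peer.

    Both hosts run this at roughly the same time, each pointing at the other.
    Returns True once a packet from the peer arrives, False otherwise.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Bind to the pre-agreed port so the peer sees the source port it expects.
        sock.bind((bind_ip, local_port))
        sock.settimeout(interval)
        peer = (peer_ip, peer_port)
        for _ in range(attempts):
            # Each outgoing packet may create state in the firewalls on the path.
            sock.sendto(b"punch", peer)
            try:
                _data, sender = sock.recvfrom(1024)
            except socket.timeout:
                continue  # nothing arrived yet, send again
            except OSError:
                # Some platforms surface ICMP errors here; wait and retry.
                time.sleep(interval)
                continue
            if sender == peer:
                # Answer once more so the peer hears us even if our
                # earlier packets were dropped.
                sock.sendto(b"punch", peer)
                return True
        return False
    finally:
        sock.close()
```

<p>On each of the two hosts you would call something like <code>punch_hole(45000, peer_ip, 45000)</code> with the other host&#8217;s address; the hole counts as punched only if both sides return <code>True</code>.</p>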
<p><strong>I was able to easily punch UDP holes between any two instances using each instance&#8217;s public IP address</strong> &#8211; in line with my expectation. <strong>But I hit a major snag when using private IP addresses of 2 instances in the same region (I used EC2-US) &#8211; I couldn&#8217;t get it to work no matter what I tried</strong>: same availability zone, different availability zones, same security groups, different security groups, same AWS account, different AWS accounts. I even tried punching a hole over port 53 (all EC2 instances support DNS name resolution which happens over this port without an explicit corresponding rule in security groups) &#8211; no luck (EC2 DNS servers are not located on 10.0.0.0/8 where all instances reside).</p>
<p><strong>The only way I could get it to work using private IPs was to allow my UDP port in security groups of <em>at least one</em> of the instances.</strong> When I did this, both hosts reported success.</p>
<p>This observation leads to several thoughts that might help uncover some aspects of the EC2 firewall&#8217;s internal design (these are all more or less educated guesses):</p>
<ul>
<li>You can punch a UDP hole between any 2 instances using their public IPs, even if your security groups do not allow such communication.</li>
<li>Private IP traffic is treated totally differently than traffic over public IPs.</li>
<li>You can punch a UDP hole on port X using private IP addresses of 2 instances in the same region only if at least one of the instances allows port X in its security groups (this can be used as a test if you don&#8217;t have access to query the EC2 API endpoint).</li>
<li>The EC2 firewall somehow implements more logic than &#8220;all outgoing packets are allowed&#8221; when dealing with traffic over private IPs (if it were not the case, hole punching should have worked &#8211; see below).</li>
<li>If we assume that security group rules are applied at an instance&#8217;s dom0 (as makes at least some sense and as <a href="http://blog.laststation.net/2009/10/11/amazon-ec2-still-vulnerable-to-udp-flood-attacks/">this research</a> implies), <strong>I now suspect that all dom0 hosts have a full view of all security groups in the region and get real-time updates when a rule is added or deleted</strong> (modification of rules is currently not supported). This in fact was contrary to my expectation &#8211; initially I thought each dom0 &#8220;subscribes&#8221; to updates for only those security groups which correspond to instances running on this dom0, and I thought this was the reason why dynamic group membership changes were not possible (say I want to move an instance from the &#8220;db&#8221; security group to the &#8220;webapp&#8221; security group).</li>
</ul>
<p>To clarify: under the above assumption, in order for hole punching to NOT work, an outgoing packet from instance A must not reach dom0 of instance B &#8211; and the only way that&#8217;s possible under an &#8220;all outgoing packets are allowed&#8221; policy is if dom0 of instance A knows that dom0 of instance B will block this packet and somehow takes this into consideration &#8211; which in the general case can only happen if all dom0 hosts have a full view of all security groups and permissions in the region.</p>
<p>I would love to hear your thoughts on what could possibly explain this behavior; please let me know in the comments below.</p>
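<p>The reasoning above can be sketched as a toy model. This is purely an illustration of the guess, not a description of how EC2 actually works: the function <code>simulate</code>, the instance names and the notion of per-flow state are all assumptions made up for this example.</p>

```python
def simulate(allow_a, allow_b, sender_side_filtering, rounds=2):
    """Return the set of instances that received a packet from their peer.

    allow_a / allow_b: True if that instance's security groups allow the port.
    sender_side_filtering: if True, the sender's dom0 drops a packet that the
    receiver would reject and records no state for it (the guess in this
    post). If False, every outgoing packet creates state at the sender, as a
    plain "all outgoing packets are allowed" policy would.
    """
    allows = {"A": allow_a, "B": allow_b}
    state = set()   # (src, dst) flows that a dom0 has recorded as sent
    heard = set()   # instances that received at least one packet
    for _ in range(rounds):
        for src, dst in (("A", "B"), ("B", "A")):
            # The receiver accepts a packet if its groups allow the port,
            # or if it looks like a reply to a flow the receiver started.
            accepted = allows[dst] or (dst, src) in state
            if accepted or not sender_side_filtering:
                state.add((src, dst))
            if accepted:
                heard.add(dst)
    return heard
```

<p>In this model, <code>simulate(False, False, sender_side_filtering=False)</code> lets both instances hear each other, which is what a plain stateful firewall would allow. With <code>sender_side_filtering=True</code> the same call returns an empty set, while allowing the port on just one side is enough for both instances to hear each other, which matches what I saw over private IPs.</p>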
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/11/02/punching-udp-holes-in-amazon-ec2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Standalone Web Front Door a Must in EC2?</title>
		<link>http://somic.org/2009/10/13/standalone-web-front-door-a-must-in-ec2/</link>
		<comments>http://somic.org/2009/10/13/standalone-web-front-door-a-must-in-ec2/#comments</comments>
		<pubDate>Tue, 13 Oct 2009 15:24:45 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[ec2]]></category>

		<guid isPermaLink="false">http://somic.org/?p=877</guid>
		<description><![CDATA[Most of you have probably heard about a recent outage at BitBucket. In a nutshell, their systems hosted at AWS came under a UDP flood DDoS attack, which led to significantly increased traffic, which led to saturation of their local network interface, which led to their being unable to connect to their data stored on [...]]]></description>
			<content:encoded><![CDATA[<p>Most of you have probably heard about a <a href="http://blog.bitbucket.org/2009/10/04/on-our-extended-downtime-amazon-and-whats-coming/">recent outage at BitBucket</a>. In a nutshell, their systems hosted at AWS came under a UDP flood DDoS attack, which led to significantly increased traffic, which led to saturation of their local network interface, which led to their being unable to connect to their data stored on EBS, which led to their application becoming unresponsive.</p>
<p>This outage shed more light on some internal designs of EC2 itself, as described <a href="http://blog.laststation.net/2009/10/11/amazon-ec2-still-vulnerable-to-udp-flood-attacks/">here</a>. It might have also showcased our over-confidence in EC2&#8217;s ability to detect and defeat certain types of network attacks. But this post is about something else.</p>
<p><strong>BitBucket was running their web front door and their backend application on the same instance</strong>. The front door is the part of the system which faces the Internet, and its task is to accept connections from clients. For obvious reasons, the front door runs on the service&#8217;s discoverable IP address &#8211; whether they used Elastic IP or not, bitbucket.org resolved to that IP. Note that the front door (usually) doesn&#8217;t need EBS.</p>
<p>The backend, however, is what needs EBS for disk persistence. At the same time, the backend does not need to be publicly discoverable &#8211; as long as the front door knows where its backend worker(s) is/are running, the app should function just fine.</p>
<p><strong>With the front door and the backend running on different instances, a UDP flood would have saturated only the former&#8217;s network interface and would have had no impact on the backend and its EBS.</strong></p>
<p>I know that AWS reportedly fixed the flood issue, but it looks to me like <strong>separating the front door and the application backend may still be a good preventive measure</strong> &#8211; after all, it&#8217;s considered a good practice for a reason.</p>
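<p>As a rough illustration of this split, here is a minimal sketch of a front door that only accepts client connections and relays GET requests to a backend running elsewhere. It uses only the Python standard library; the helper name <code>make_front_door</code> and any addresses you pass to it are hypothetical, and a real deployment would more likely use a dedicated reverse proxy or load balancer.</p>

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen


def make_front_door(listen_addr, backend_url):
    """Build an HTTP server that relays GET requests to backend_url."""

    class FrontDoorHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            try:
                # The backend address is known only to the front door;
                # clients never learn it.
                with urlopen(backend_url + self.path, timeout=5) as resp:
                    status, body = resp.status, resp.read()
            except HTTPError as err:
                # The backend answered with an error status: pass it through.
                status, body = err.code, err.read()
            except OSError:
                # The backend is unreachable or timed out.
                status, body = 502, b"backend unavailable\n"
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(listen_addr, FrontDoorHandler)
```

<p>On the public-facing instance you would run something like <code>make_front_door(("0.0.0.0", 8080), backend_url).serve_forever()</code>, where <code>backend_url</code> points at the private address of the instance that owns the EBS volume.</p>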
<p>Please note that I am not trying to accuse BitBucket of running a bad architecture and causing their own outage. All I am doing is trying to learn a lesson.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/10/13/standalone-web-front-door-a-must-in-ec2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Capistrano Auth Trick</title>
		<link>http://somic.org/2009/10/07/capistrano-auth-trick/</link>
		<comments>http://somic.org/2009/10/07/capistrano-auth-trick/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 21:40:17 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://somic.org/?p=857</guid>
		<description><![CDATA[This past summer, we needed to automate testing of several failure scenarios for VPN-Cubed. Having asked the LazyWeb about any frameworks that could help us and having gotten no response, our dev team had a short chat in the office. We decided that ultimately we were going to have to roll out our own system [...]]]></description>
			<content:encoded><![CDATA[<p>This past summer, we needed to automate testing of several failure scenarios for VPN-Cubed. Having <a href="http://twitter.com/somic/status/2804598299">asked</a> the LazyWeb about any frameworks that could help us and having gotten no response, our dev team had a short chat in the office. We decided that ultimately we were going to have to roll our own system based on SSH. <a href="http://www.capify.org/">Capistrano</a> was the obvious choice, because it&#8217;s essentially a higher-level wrapper around the Net::SSH module (if you prefer Python, you may take a look at <a href="http://fabfile.org/">fabric</a> or <a href="http://www.lag.net/paramiko/">paramiko</a>).</p>
<p>One obstacle was that because we were emulating various failures, at times our local capistrano process, which was driving the tests, had to lose SSH connectivity to its target servers. We quickly discovered that this resulted in an exception and the cap process would die.</p>
<p>To work around this, I added yet another level on top of cap which uses GNU make (one of my all time <a href="http://twitter.com/somic/status/3543470903">favorites</a>). In a nutshell, the user controls the testing process via make, and make starts cap. In this case, it&#8217;s OK for the cap process to occasionally exit.</p>
<p>But then &#8211; and we are finally getting to the point of this post &#8211; another issue came up: I didn&#8217;t want to keep typing the password into cap each time it was started by make. Here is how I ended up implementing it to avoid re-typing the password.</p>
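<p><code style="font-size:11px"><br />
# in Makefile<br />
# read -s is a bash feature, so don't rely on make's default /bin/sh<br />
SHELL := /bin/bash<br />
# := makes the prompt run once, when make parses the Makefile<br />
USER_PASS := $(shell read -r -s -p "[make] user's password: " P; echo "$$P" )<br />
export USER_PASS</code><br />
<code style="font-size:11px"><br />
all: set_password<br />
# do something here</code><br />
<code style="font-size:11px"><br />
# fail early if no password was entered<br />
set_password:<br />
 &nbsp; &nbsp; @test -n "$$USER_PASS"</code></p>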
<p><code style="font-size:11px"><br />
# in Capfile<br />
set :password, lambda { ENV['USER_PASS'] ||<br />
CLI.password_prompt("[cap] #{user}'s password: ") }<br />
</code></p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/10/07/capistrano-auth-trick/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
