<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fubaredness Is Contagious &#187; distributed</title>
	<atom:link href="http://somic.org/category/distributed/feed/" rel="self" type="application/rss+xml" />
	<link>http://somic.org</link>
	<description>Dmitriy Samovskiy's Blog</description>
	<lastBuildDate>Wed, 01 Sep 2010 07:55:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Normal Accidents in Complex IT Systems</title>
		<link>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/</link>
		<comments>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 02:01:14 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[distributed]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[operations]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1043</guid>
		<description><![CDATA[Designing a fully-automated or nearly-fully-automated computer system with many moving parts and dependencies is tricky, whether a system is distributed, hyper distributed or otherwise. Failures happen and must be dealt with. After a while, most folks grow up from &#8220;failures are rare and can be ignored&#8221; to &#8220;failures are not that rare and can not [...]]]></description>
			<content:encoded><![CDATA[<p>Designing a fully-automated or nearly-fully-automated computer system with many moving parts and dependencies is tricky, whether a system is distributed, <a href="/2009/08/18/the-concept-of-hyper-distributed-application/">hyper distributed</a> or otherwise. Failures happen and must be dealt with. After a while, most folks grow up from &#8220;failures are rare and can be ignored&#8221; to &#8220;failures are not that rare and can not be ignored&#8221; to &#8220;failures are common and should be taken into consideration&#8221; to &#8220;failures are frequent and must be planned for.&#8221; The latter seems to represent the current prevailing point of view.</p>
<p>But here is a kicker &#8211; it&#8217;s not the end. I saw <a href="http://twitter.com/benjaminblack/status/5662514947">this tweet</a>, read <a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/">this post</a> and checked out a book by Charles Perrow titled &#8220;Normal Accidents&#8221; from the library. Published in 1984, the book is not about IT, but its material fits our field nicely. And boy, was I enlightened!</p>
<p>The book&#8217;s main point: <strong>no matter how much thought is put into the system design, or how many safeguards are implemented, a sufficiently complex system sooner or later will experience a significant breakdown that was impossible to foresee beforehand, principally due to unexpected interaction between components, tight coupling or bizarre coincidence. For us in IT, it translates to &#8220;no matter how much planning you do or how many safeguards you implement, failures will still happen.&#8221;</strong></p>
<p>There are at least 3 common themes that are present in multiple illustrations in the book:</p>
<ol>
<li>A big failure was usually a result of multiple smaller failures; these smaller failures were often not even related</li>
<li>Operators (people or systems) were frequently misled by inaccurate monitoring data</li>
<li>In a lot of cases, human operators were used to a given set of circumstances, and their thinking and analysis were misled by their habits and expectations (&#8220;when X happens, we always do Y and it comes back&#8221; &#8211; except for this one time, when it didn&#8217;t)</li>
</ol>
<p>I have had my share of outages and downtimes, and I can attest that I have seen these 3 factors play a big role in tech ops. Some were bugs in management and monitoring code, some were human error, some were a bizarre set of dependencies, but all were a combination of multiple factors. For example, who would have thought that with a failure of the primary DNS resolution server, the VIP would not fail over to the secondary; and even though hosts had more than one &#8220;nameserver&#8221; line in /etc/resolv.conf, the application timed out waiting for DNS to respond before getting to ask the second nameserver; without name resolution, multiple load balancers independently thought that there was no capacity behind them (because management code calculated capacity in near real-time relying on worker hosts&#8217; names) and disabled themselves, thus taking down the entire farm &#8211; now I know of course&#8230;</p>
<p>It turns out we can&#8217;t eliminate normal accidents altogether, but here are several techniques that I have been using to speed up detection and response in order to reduce the downtime.</p>
<p><strong>Complexity budget</strong>. <a href="http://blog.b3k.us/complexity_budget.html">Described by Benjamin Black</a>, this is a technique to allocate complexity among components beforehand and strictly follow the allocation during implementation phase. It helps avoid unnecessary fanciness and leads to simpler code, which tends to be easier to troubleshoot and recover after a failure.</p>
<p><strong>Control knobs/switches for individual components</strong>. <a href="http://www.slideshare.net/jallspaw/velocity2008-capacity-management1-484676/51">As John Allspaw shows on this slide</a>, you need to be able to turn off any component in an emergency, or throttle it up or down. Planning this feature and building it in from the very beginning is very important.</p>
<p><strong>Accuracy of monitoring data</strong>. Ensure your alarms are as accurate as possible. No matter how much chaos is going on inside the system during a severe failure, the last thing you can afford is to mislead the operators with wrong information. If you tried to ping host A and didn&#8217;t get a response, your alarm should not say &#8220;host A is down&#8221; because that is not knowledge you obtained &#8211; it&#8217;s an assumption that you made. It should say &#8220;failed to ping host A from host B&#8221; &#8211; maybe it was the network on host B that was the issue when the ping attempt was made, how do you know?</p>
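<p>A minimal Python sketch of an alarm that reports only the observation. The function names are hypothetical, and the <code>-c</code> and <code>-W</code> flags assume a Linux-style <code>ping</code> binary; adjust them for other platforms.</p>

```python
import subprocess


def describe_ping_failure(observer, target):
    """Build an alarm text that states only what was actually observed."""
    return f"failed to ping {target} from {observer}"


def ping_alarm(observer, target, timeout_s=2):
    """Return an alarm string if one ping fails, or None if it succeeds.

    Assumes a Linux-style ping binary where -c sets the packet count and
    -W sets the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        return None
    return describe_ping_failure(observer, target)
```

<p>Naming the observer in the message keeps the assumption (&#8220;host A is down&#8221;) out of the alarm and leaves that conclusion to the operator.</p>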
<p><strong>Availability of monitoring data</strong>. There is a reason that the first thing the military tries to do when attacking is to disrupt the enemy&#8217;s means of communication &#8211; it&#8217;s that important, and the same applies to our case. You either design your systems to be able to get monitoring data even during the worst outage imaginable (ideally from more than one source), or you at least should be getting an alarm about lack of such monitoring data (it&#8217;s a very weak substitute though).</p>
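<p>The fallback alarm about missing monitoring data could look roughly like the following Python sketch. It assumes each monitoring source records the Unix timestamp of its latest sample; the function names and the threshold are illustrative only.</p>

```python
import time


def stale_sources(last_seen, max_age_s, now=None):
    """Return the names of sources whose newest sample is older than max_age_s.

    last_seen maps a source name to the Unix timestamp of its latest sample.
    """
    if now is None:
        now = time.time()
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age_s)


def monitoring_blackout(last_seen, max_age_s, now=None):
    """True when every known source is stale, i.e. operators are flying blind."""
    stale = stale_sources(last_seen, max_age_s, now)
    return len(last_seen) > 0 and len(stale) == len(last_seen)
```

<p>For this check to be of any use during an outage, it would have to run somewhere that does not share fate with the monitoring pipeline it watches.</p>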
<p>All in all, to everybody in IT, I highly recommend the Normal Accidents book as well as this <a href="http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf">whitepaper</a> (linked from <a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/">John Allspaw&#8217;s blog</a>).</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Concept of Hyper Distributed Application</title>
		<link>http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/</link>
		<comments>http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 14:38:31 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[hyper distributed application]]></category>

		<guid isPermaLink="false">http://somic.org/?p=754</guid>
		<description><![CDATA[Most folks in the industry are familiar with &#8220;distributed applications.&#8221; If app components are running on multiple hosts and need to communicate with each other using network, the app is said to be distributed.
Distributed applications are known for complexity of assuring all components are on the same page as to what&#8217;s going on around them. [...]]]></description>
			<content:encoded><![CDATA[<p>Most folks in the industry are familiar with &#8220;distributed applications.&#8221; If app components are running on multiple hosts and need to communicate with each other using network, the app is said to be distributed.</p>
<p>Distributed applications are known for the complexity of ensuring that all components are on the same page as to what&#8217;s going on around them. Hardware failures, network failures and operator errors can all cause chaos; distributed applications foresee these exceptional situations and attempt to deal with them.</p>
<p>Up until now, the network piece of the puzzle has been usually under application owner&#8217;s control &#8211; it could be a LAN, or it could be a leased line to remote datacenter. Occasionally, a VPN would be used to provide a dedicated communication channel between locations over public Internet but its use was rarely focused on important stuff &#8211; a mission critical application would usually get a leased line.</p>
<p>With the advance of public clouds such as Amazon EC2 and Google AppEngine, however, these notions are changing. One day you may decide to leverage each cloud&#8217;s strengths and distinct features to build your app, or may want to avoid cloud lock-in or provide redundancy. In short, you may want to multi-source your infrastructure.</p>
<p>Your multi-sourced infrastructure will of course be a distributed application. But there is a significant difference between this and old-style distributed apps &#8211; this time you no longer have network connectivity under your control. And as a result, you will face 3 significant phenomena that substantially complicate using today&#8217;s distributed algorithms &#8211; uneven bandwidth, uneven latency and increased probability of connectivity loss (I blogged about the latter <a href="/2008/10/10/crash-vs-connectivity-loss-in-distributed-applications/">here</a>).</p>
<p>And this is what I call a hyper distributed application. In other words, <strong>a hyper distributed application is a distributed app which runs on a network with uneven bandwidth, uneven latencies and increased probability of connectivity loss (as measured against that on a regular LAN), usually outside of application owner&#8217;s control (for example, Internet).</strong></p>
<p>One example of a hyper distributed application is <a href="http://cohesiveft.com/vpncubed">VPN-Cubed</a> that we at CohesiveFT created to address emerging needs to multisource infrastructure. By the very nature of functionality it provides, its components (we call them VPN-Cubed Managers &#8211; they act as virtual routers and switches) are sometimes distributed over LAN, sometimes distributed over WAN, sometimes both. Communications between manager 1 and manager 2 can be fast and reliable, but between manager 1 and 3 slow and less reliable, with more frequent resets. Or manager 3 may simply disappear (as seen by its peers) &#8211; no, it doesn&#8217;t have to be down due to crash; it can simply mean that its network connection to the outside world was down, possibly temporarily.</p>
<p>Hyper distributed applications are relatively rare, because most architects tend to avoid this if they can. For example, Amazon EC2 has 2 regions &#8211; US and EU. Each region is a distinct EC2 system, with its own API endpoint, its own AMI IDs, kernel IDs, security groups, keypairs. There is no replication or conflict resolution between the regions &#8211; they are totally independent of each other. Why? Because it would be quite difficult to interconnect them into a single entity over public Internet. (I won&#8217;t be surprised if it gets implemented in the future though.)</p>
<p>Another example showing that hyper distributed applications are a distinct breed comes from Facebook Engineering blog post titled <a href="http://www.facebook.com/note.php?note_id=23844338919">Scaling Out</a>:</p>
<blockquote><p>This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.</p>
<p>When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.</p></blockquote>
<p>It nicely illustrates how hyper distributed nature of the application adds complexity on top of what a plain distributed app already has.</p>
<p>In conclusion, I would like to propose to separate a category of distributed applications that run on top of networks with uneven bandwidth and uneven latencies into their own (I don&#8217;t care much if they end up being called hyper distributed or something else), and start building up research and practical approaches focusing specifically on this area.</p>
<p>P.S. Also consider the future: when we reach inter-planet or inter-galactic communications, you know that latencies and bandwidth in space would not be (initially) the same as on our planet Earth. Better start working on this research now in order to be prepared&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Eliminating Single Points of Failure &#8211; One, Two, Many</title>
		<link>http://somic.org/2009/04/09/eliminating-single-points-of-failure-one-two-many/</link>
		<comments>http://somic.org/2009/04/09/eliminating-single-points-of-failure-one-two-many/#comments</comments>
		<pubDate>Thu, 09 Apr 2009 13:48:27 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[distributed]]></category>
		<category><![CDATA[CAP theorem]]></category>
		<category><![CDATA[HA]]></category>
		<category><![CDATA[high availability]]></category>
		<category><![CDATA[single point of failure]]></category>
		<category><![CDATA[SPOF]]></category>

		<guid isPermaLink="false">http://somic.org/?p=423</guid>
		<description><![CDATA[I recently reached an interesting conclusion. When you are trying to eliminate a single point of failure from your architecture, it&#8217;s almost always beneficial to first go with a 2-way redundant solution (active-passive or active-active pair, whichever is easiest to implement) and only then go to N-way, N &#62; 2, only if necessary.
One huge difference [...]]]></description>
			<content:encoded><![CDATA[<p>I recently reached an interesting conclusion. When you are trying to eliminate a single point of failure from your architecture, it&#8217;s almost always beneficial to first go with a 2-way redundant solution (active-passive or active-active pair, whichever is easiest to implement) and only then go to N-way, N &gt; 2, only if necessary.</p>
<p>One huge difference between a pair and N-way (N&gt;2) is how difficult it is to detect partitioning (of <a href="http://www.infoq.com/presentations/availability-consistency">CAP Theorem</a> fame &#8211; you can simultaneously achieve only two properties from the following three: data <strong>C</strong>onsistency, high <strong>A</strong>vailability and <strong>P</strong>artition tolerance). Assuming symmetrical communications (A can talk to B if and only if B can talk to A), partitioning detection in a pair is trivial, because there can be only one option &#8211; system A can&#8217;t talk to system B. With N&gt;2 however, there are way more scenarios to deal with: A can&#8217;t talk to B while both A and B can talk to C, A can&#8217;t talk to B or C, etc. Additionally, communications may be restored in some random order &#8211; A may first be able to talk to B, and only some time later get its visibility to C back.</p>
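<p>To get a rough feel for how quickly the scenarios multiply, the Python sketch below counts the up/down combinations across all links under the same symmetry assumption. This is a raw count of link states, not of distinct partitions, so treat it as an upper bound on the cases a detector has to reason about.</p>

```python
from itertools import combinations


def link_count(n):
    """Number of node pairs (links) among n nodes."""
    return len(list(combinations(range(n), 2)))


def connectivity_states(n):
    """Number of up/down combinations across all links, assuming symmetric links."""
    return 2 ** link_count(n)


for n in (2, 3, 4, 5):
    print(n, link_count(n), connectivity_states(n))
```

<p>A pair has one link and two states (connected or not); three nodes already have three links and eight states.</p>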
<p>Interestingly, also from personal experience, if you manage to build a 3-way redundancy, building 4-way or even 5-way is not that difficult by comparison.</p>
<p>There are also a couple of purely practical aspects that make a 2-way redundancy an attractive option, even if it&#8217;s going to be an intermediate step before N-way is achieved. First, 2-way can serve as a working prototype &#8211; you can observe it, learn and analyze its failure scenarios and make sure your response to each is optimal. This can validate your approach before you sink all this time in partitioning detection for N-way.</p>
<p>And secondly, after you build an easier 2-way, you might well discover that you don&#8217;t need an N-way redundancy. If a pair meets your goal (say a given percentage of service availability), you can save a lot of time and effort.</p>
<p>My advice &#8211; don&#8217;t skip two on your way from one to many.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/04/09/eliminating-single-points-of-failure-one-two-many/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My Comment on Open Federated Clouds</title>
		<link>http://somic.org/2009/03/27/my-comment-on-open-federated-clouds/</link>
		<comments>http://somic.org/2009/03/27/my-comment-on-open-federated-clouds/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 14:34:50 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[federated clouds]]></category>
		<category><![CDATA[open clouds]]></category>

		<guid isPermaLink="false">http://somic.org/?p=507</guid>
		<description><![CDATA[I left the following comment at CloudAve yesterday, on a post titled Open Federated Clouds And Sun&#8217;s Cloud Announcement.
Interesting. Looks to me it all depends on how you look at different clouds &#8211; as infrastructure providers or as software platforms.
The former case is roughly similar to buying Internet connectivity for your office from 2 different [...]]]></description>
			<content:encoded><![CDATA[<p><em>I left the following comment at CloudAve yesterday, on a post titled <a href="http://www.cloudave.com/link/open-federated-clouds-and-the-sun-cloud-announcement">Open Federated Clouds And Sun&#8217;s Cloud Announcement</a>.</em></p>
<p>Interesting. Looks to me it all depends on how you look at different clouds &#8211; as infrastructure providers or as software platforms.</p>
<p>The former case is roughly similar to buying Internet connectivity for your office from 2 different ISPs for redundancy.</p>
<p>The latter case, however, is roughly similar to a process of selecting platform for a project &#8211; say between Weblogic and JBoss. For a new project, a single platform is usually selected &#8211; I don&#8217;t think there are many cases when an app is built on top of both for better resiliency or to increase capacity (even though I admit that it&#8217;s not impossible).</p>
<p>In both cases, products are very similar or nearly identical to a certain extent, but the way you look at them makes you select 2 in one case and only 1 in another.</p>
<p>Right now, I think choosing a cloud is akin to selecting a software platform. So one will choose only one. However, the future may very well change this trend like you said, especially as interop gets better and each cloud gets its strengths and weaknesses better defined.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/03/27/my-comment-on-open-federated-clouds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Identification Friend or Foe (IFF) in IaaS Clouds</title>
		<link>http://somic.org/2009/01/05/identification-friend-or-foe-iff-in-iaas-clouds/</link>
		<comments>http://somic.org/2009/01/05/identification-friend-or-foe-iff-in-iaas-clouds/#comments</comments>
		<pubDate>Mon, 05 Jan 2009 15:51:32 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[ec2]]></category>

		<guid isPermaLink="false">http://somic.org/?p=272</guid>
		<description><![CDATA[I was recently building a distributed system which will run in Amazon EC2 cloud. It consisted of several instances of the same AMI that were going to communicate with each other using private IP addresses assigned by EC2.
One interesting scenario popped up in my head. What if, after initial discovery of each peer&#8217;s internal IP [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently building a distributed system which will run in <a href="http://aws.amazon.com/ec2">Amazon EC2</a> cloud. It consisted of several instances of the same AMI that were going to communicate with each other using private IP addresses assigned by EC2.</p>
<p>One interesting scenario popped up in my head. What if, after initial discovery of each peer&#8217;s internal IP address, one of the instances goes down (let&#8217;s say it was at IP1) and at least one other instance fails to notice this fact and continues to communicate with IP1? EC2 assigns IP addresses dynamically, and as far as I can tell, IP1 can get assigned to someone else&#8217;s instance within the same minute. So my instance will be unknowingly communicating with someone else&#8217;s instance &#8211; not something that I want to allow.</p>
<p>A solution can be something like what the military calls Identification Friend or Foe (IFF). You can read about it in <a href="http://en.wikipedia.org/wiki/Identification_friend_or_foe">Wikipedia</a> or <a href="http://www.globalsecurity.org/military/systems/aircraft/systems/iff.htm">here</a>. Note that you may want to consider an IFF anytime you are running applications in any IaaS cloud that assigns IP addresses dynamically and/or has no way of predicting which IP address your next host is going to get.</p>
<p><strong>My Basic IFF Solution</strong></p>
<p>First of all, my instances do not have access to AWS credentials (here is <a href="http://twitter.com/somic/status/1084357906">why</a>). Secondly, I set up a requirement that all instances that needed to communicate with each other were to be launched with the same user-data (from a running instance, you can obtain user-data from http://169.254.169.254/latest/user-data).</p>
<p>I then created 2 checksums (SHA1 or MD5) &#8211; 4633e65fce4cf3b40648f574f4b60070 was a checksum of user-data plus some file in the AMI (say /usr/share/doc/coreutils/NEWS.gz) and 7a66a9361b14e95c14d98522502b9487 was a checksum of user-data plus another file in the AMI (say /bin/rmdir). Note that if user-data on each instance are the same, these checksums will be the same, because the files I selected are the same.</p>
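<p>One way such a checksum could be computed is sketched below in Python. How the user-data and the file are combined is not spelled out above, so plain concatenation is an assumption here, as are the function names; MD5 is used only because the example values are 32 hex characters long.</p>

```python
import hashlib


def iff_checksum(user_data, file_bytes):
    """MD5 hex digest of the user-data followed by the contents of one file."""
    digest = hashlib.md5()
    digest.update(user_data)
    digest.update(file_bytes)
    return digest.hexdigest()


def iff_checksum_from_path(user_data, path):
    """Same as iff_checksum, reading the file at path from the local AMI."""
    with open(path, "rb") as handle:
        return iff_checksum(user_data, handle.read())
```

<p>Calling <code>iff_checksum_from_path</code> with the same user-data and two different file paths would yield the two values, one for the URI and one for the realm.</p>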
<p>Then in Apache, I have the following configuration:</p>
<pre style="background-color:#DDDDDD">&lt;Location /4633e65fce4cf3b40648f574f4b60070&gt;
AuthType basic
AuthName 7a66a9361b14e95c14d98522502b9487
AuthUserFile /etc/apache2/users
Require valid-user
&lt;/Location&gt;</pre>
<p>Before establishing communications to a peer instance (and regularly afterwards), I set up my instances to get HTTP headers from the above location (without actually submitting an HTTP auth username and password), check the WWW-Authenticate header and look for the second checksum there. Easy and efficient. If both checksums match, the other instance is a friend. If not, a foe. In this case I also assume that if I didn&#8217;t get a response, it&#8217;s not a friend &#8211; an instance might have gone down, or it might not have Apache listening on port 80, or its web server might not know what to do with my URI.</p>
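<p>The interrogation step might be sketched in Python as follows. This is an illustration under assumptions rather than the original implementation: it issues a plain GET with the standard library, expects a 401 challenge whose realm carries the second checksum, and uses hypothetical function names.</p>

```python
import re
import urllib.error
import urllib.request


def realm_from_header(www_authenticate):
    """Extract the realm value from a Basic WWW-Authenticate header, or None."""
    if not www_authenticate:
        return None
    match = re.search(r'realm="?([^",]+)"?', www_authenticate, re.IGNORECASE)
    return match.group(1) if match else None


def is_friend(peer_ip, path_checksum, realm_checksum, timeout_s=3):
    """Interrogate a peer; any error or mismatch is treated as a foe."""
    url = f"http://{peer_ip}/{path_checksum}"
    try:
        urllib.request.urlopen(url, timeout=timeout_s)
    except urllib.error.HTTPError as err:
        # A friend answers 401 and names the second checksum as the realm.
        if err.code != 401:
            return False
        return realm_from_header(err.headers.get("WWW-Authenticate")) == realm_checksum
    except Exception:
        # No response, refused connection, timeout, malformed reply: not a friend.
        return False
    # A success answer without an auth challenge is not what a friend would send.
    return False
```

<p>A peer that answers 404, answers 200, or does not answer at all is classified as a foe, which matches the conservative assumption described above.</p>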
<p>You can further enhance this solution by creating new checksums every N minutes &#8211; this should work reliably for as long as EC2 infrastructure has no trouble keeping the clock accurate. You can also embed the timestamp in data used to generate checksums. Furthermore, if you monitor your access logs for bad interrogations (for example, old checksum or wrong checksum), you might be able to easily detect attacks against your IFF system.</p>
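<p>A time-windowed variant could be sketched in Python like this. The window arithmetic, the separator byte and the decision to also accept the previous window (to tolerate small clock differences between peers) are assumptions made for illustration.</p>

```python
import hashlib
import time


def window_index(now, period_s):
    """Index of the time window that contains the Unix timestamp now."""
    return int(now) // period_s


def rotating_checksum(shared_secret, now, period_s):
    """MD5 hex digest of the shared bytes plus the current window index."""
    stamp = str(window_index(now, period_s)).encode("ascii")
    return hashlib.md5(shared_secret + b":" + stamp).hexdigest()


def acceptable_checksums(shared_secret, period_s, now=None):
    """Checksums for the current and previous window, to tolerate small clock skew."""
    if now is None:
        now = time.time()
    return {
        rotating_checksum(shared_secret, now, period_s),
        rotating_checksum(shared_secret, now - period_s, period_s),
    }
```

<p>Here <code>shared_secret</code> stands for whatever bytes all friendly instances share, for example the user-data combined with a file from the AMI.</p>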
<p><strong>A More Scalable IFF</strong></p>
<p>Peer-to-peer IFF implementation that I described above may not work well for large deployments or for enterprise. If you are within either of these categories, I can recommend that you take a look at <a href="http://www.cohesiveft.com/vpncubed/">VPN-Cubed</a>, a product offered by my employer <a href="http://cohesiveft.com">CohesiveFT</a>. Its features essentially serve as a scalable encrypted IFF.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/01/05/identification-friend-or-foe-iff-in-iaas-clouds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Crash vs Connectivity Loss in Distributed Applications</title>
		<link>http://somic.org/2008/10/10/crash-vs-connectivity-loss-in-distributed-applications/</link>
		<comments>http://somic.org/2008/10/10/crash-vs-connectivity-loss-in-distributed-applications/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 23:24:24 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://somic.org/?p=121</guid>
		<description><![CDATA[Designing a distributed application to be fault tolerant is one of my favorite things that I often get to do at work. First of all, it should never fail under normal circumstances. Don&#8217;t believe people who tell you that circumstances are never normal &#8211; if it&#8217;s the case, a fault-tolerant design is the least of [...]]]></description>
			<content:encoded><![CDATA[<p>Designing a distributed application to be fault tolerant is one of my favorite things that I often get to do at work. First of all, it should never fail under normal circumstances. Don&#8217;t believe people who tell you that circumstances are never normal &#8211; if it&#8217;s the case, a fault-tolerant design is the least of your worries and you need to get overall environment to be at least somewhat stable first. But then, circumstances don&#8217;t remain unchanged for too long &#8211; something will happen sooner or later. So you want to expect as many possible failure scenarios as you can think of, try to anticipate how the event will impact your application, how the app will find out that the event occurred, and what to do about it.</p>
<p>But it&#8217;s not what I wanted to write about. As you might imagine, I read a lot on the subject &#8211; learning from other people&#8217;s mistakes and experiences in distributed systems world has never been easier, thanks to blogging and general tendency towards openness and disclosure. In all this stream of data that I get, the most frequent failure scenarios can typically be categorized as a &#8220;hardware crash&#8221; or &#8220;software crash.&#8221; Something was running fine, and then &#8211; BAM! &#8211; it crashed. It no longer exists. Nothing can talk to it anymore. Nothing can ask it how it&#8217;s doing, or what was the last thing it did. It crashed. Died. Disappeared.</p>
<p>But is crash the worst that could happen? Unfortunately not. <strong>Connectivity loss is way more tricky to deal with. </strong>Your Nagios thinks your web server crashed because it&#8217;s not responding? Can&#8217;t tell &#8211; not enough information. All you know is that Nagios could not connect to the web server. It doesn&#8217;t mean that the latter crashed. Or you can&#8217;t connect to your messaging backend &#8211; did it crash? Not necessarily, all you know at the moment is that connectivity between you and the remote end is broken.</p>
<p>So why do I say that connectivity loss is way worse than a crash?</p>
<ol>
<li>Crash is the same crash to all clients. All clients will fail to connect. Connectivity loss however can impact only a fraction of your client base. So half of your clients are failing over to the secondary, while the other half are still attached to primary. And you neglected to implement an alarm for that &#8211; and now your customers see only half of your inventory on the site? Oops.</li>
<li>Crash is usually a terminal state, as in your application can&#8217;t easily leave a crash state on its own. And what about connectivity? Oh, not at all &#8211; connectivity can be restored without your direct intervention. It can range from route convergence after a backup link comes up, to easing network congestion after a spike in traffic. Are you going to be prepared?</li>
</ol>
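<p>The first point suggests probing a service from more than one vantage point before drawing conclusions. A minimal Python sketch, assuming probe results have already been collected per observer (the function name and message wording are illustrative):</p>

```python
def classify_reachability(probe_results):
    """Summarize probes of one service from several vantage points.

    probe_results maps an observer name to True (connected) or False (failed).
    """
    if not probe_results:
        return "no data"
    failed = sorted(name for name, ok in probe_results.items() if not ok)
    if not failed:
        return "reachable from all observers"
    if len(failed) == len(probe_results):
        return "unreachable from all observers (crash or total connectivity loss)"
    return "partial connectivity loss, failing from: " + ", ".join(failed)
```

<p>Even when every observer fails, the summary still cannot tell a crash from a total loss of connectivity, so the message says so instead of guessing.</p>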
<p>And here is yet another twist. No matter what your position is on cloud computing, it is here to stay. And it is only a matter of time before many more services on which you rely for your operations will be scattered all over the world (or space, but that&#8217;s later). Connectivity loss will be occurring way more often than crashes, and unless you start approaching it as a different problem, you might be in for a big surprise.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2008/10/10/crash-vs-connectivity-loss-in-distributed-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
