<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fubaredness Is Contagious &#187; software engineering</title>
	<atom:link href="http://somic.org/category/software-engineering/feed/" rel="self" type="application/rss+xml" />
	<link>http://somic.org</link>
	<description>Dmitriy Samovskiy's Blog</description>
	<lastBuildDate>Wed, 01 Sep 2010 07:55:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Developing API Server &#8211; Practical Rules of Thumb</title>
		<link>http://somic.org/2010/05/04/developing-api-server-practical-rules-of-thumb/</link>
		<comments>http://somic.org/2010/05/04/developing-api-server-practical-rules-of-thumb/#comments</comments>
		<pubDate>Tue, 04 May 2010 08:30:57 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[software engineering]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[server]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1042</guid>
		<description><![CDATA[I have been doing a lot of reading lately on how one would go about developing an API server. It&#8217;s an interesting topic, with various established schools of thought and multiple real-world implementations to compare against. In this post, I am going to summarize my findings, for my own reference as well as for anyone [...]]]></description>
			<content:encoded><![CDATA[<p>I have been doing a lot of reading lately on how one would go about developing an API server. It&#8217;s an interesting topic, with various established schools of thought and multiple real-world implementations to compare against. In this post, I am going to summarize my findings, for my own reference as well as for anyone who may find themselves in a similar position. These are my rules of thumb geared towards practicality. I may very well be wrong on these &#8211; if your experience tells you this makes no sense, I would love to hear your thoughts in the comments. Most examples and references below are from IaaS space.</p>
<p><strong>Query API vs REST API</strong></p>
<p>To start, one should read <a href="http://gehrcke.de/2009/06/aws-about-api/">this blog post</a> by Jan-Philip Gehrcke about various types of AWS APIs and the differences between RESTful and query APIs, and <a href="http://stage.vambenepe.com/archives/863">this blog post</a> by William Vambenepe where he analyzes various IaaS API implementations (it&#8217;s a series of 3 posts). Then read the description of the <a href="http://martinfowler.com/articles/richardsonMaturityModel.html">Richardson Maturity Model by Martin Fowler</a>.</p>
<p>In a nutshell, I think that from a practical standpoint, if one&#8217;s domain maps easily to a set of entities (nouns) and API operations on these entities are primarily CRUD, one&#8217;s best bet is to go with at least Level 2 REST. If either condition doesn&#8217;t hold, I&#8217;d go with Level 0 REST, which is essentially what a query API is.</p>
<p>My main reason for not going with Level 0 when entities and operations do map is that I hate to see this metadata go to waste when it costs almost nothing to include.</p>
<p>Between Level 2 REST and Level 3 REST, I think Level 2 is more practical. According to Fowler, &#8220;Level 3 introduces discoverability, providing a way of making a protocol more self-documenting.&#8221; It&#8217;s certainly a nice feature but I am not sure this added benefit justifies extra development effort and slightly increased complexity (some might argue it may actually reduce complexity though).</p>
<p><strong>API frontend vs API methods implementation</strong></p>
<p>Keep implementation of your API methods separate from whatever frontend you are deploying (REST, SOAP, etc). API methods are probably going to be the same no matter how they are called, so they should be frontend-independent. This will make it easier for you to introduce new frontends (AMQP, for example) and should facilitate code maintenance.</p>
<p><strong>HTTP verbs</strong></p>
<p>Read and delete operations are easy &#8211; they map to GET and DELETE.</p>
<p>Create and update are trickier. Canonical description of HTTP verbs can be found in <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html">Section 9 of RFC 2616</a> and I use the table <a href="http://rest.blueoxen.net/cgi-bin/wiki.pl?HttpMethods">here</a> as an addendum. In short, for both create and update, if an operation is idempotent and URI of entity on which this operation is being performed is known, use PUT. Otherwise, use POST (it is often used on entities representing &#8220;factories&#8221; &#8211; say a factory of new postings; you don&#8217;t know URI of a posting before you create it, so you POST to a factory which will create a new entity at a new URI; note that POST is not idempotent).</p>
<p>Note the RFC definition of idempotent methods (9.1.2) &#8211; it&#8217;s not defined as &#8220;multiple invocations must lead to the same result as a single invocation.&#8221; It&#8217;s &#8220;(aside from error or expiration issues) the side-effects of N &gt; 0 identical requests is the same as for a single request.&#8221;</p>
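<p>To make the PUT/POST distinction concrete, here is a rough sketch using a hypothetical in-memory <code>PostingStore</code> (not taken from any real framework). Repeating the same PUT leaves the store in the same state as a single PUT, even though the returned status may differ between the first and later calls; repeating the same POST keeps creating new entities.</p>

```python
import itertools

class PostingStore:
    """In-memory stand-in for a collection of postings (hypothetical)."""

    def __init__(self):
        self.postings = {}
        self._ids = itertools.count(1)

    def put(self, uri, body):
        """PUT: the client knows the URI; repeating the call changes nothing more."""
        created = uri not in self.postings
        self.postings[uri] = body
        return 201 if created else 200

    def post(self, body):
        """POST to the factory: the server picks a new URI on every call."""
        uri = "/postings/%d" % next(self._ids)
        self.postings[uri] = body
        return 201, uri
```

<p>Calling <code>put("/postings/draft", "hello")</code> twice leaves exactly one posting behind, while calling <code>post("hello")</code> twice leaves two, each at its own URI.</p>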
<p><strong>HTTP return codes</strong></p>
<p><a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html">Section 10 of RFC 2616</a> is a canonical description of HTTP status codes.</p>
<p>Successful completion should be signaled as HTTP 200 OK and, if it&#8217;s important for client to know that an entity was created as a part of operation, HTTP 201 Created. The latter may be redundant &#8211; code that handles 200 and 201 most likely will be identical or very similar.</p>
<p>Speaking of errors, I don&#8217;t think it&#8217;s practical to map each type of error to its own HTTP error code. Unexpected server-side errors (frontend exceptions or uncaught exceptions raised by your API methods) could be HTTP 500 Internal Server Error. If a resource is not found, it should be HTTP 404 Not Found. If your API server uses an external service to perform certain operations and the upstream service did not respond or returned an unknown error, I would signal this fact with HTTP 502 Bad Gateway.</p>
<p>The rest of the errors are all client-side, and I like to classify them into 2 categories. When something is wrong with submitted request (missing header, missing argument, argument of wrong type), I think server should return HTTP 400 Bad Request. This way server is telling the client that no matter how many times this request will be submitted, it won&#8217;t work and will produce identical response.</p>
<p>I then group all other client-side errors together and think they should lead to HTTP 403 Forbidden. It means request by itself is fine, but something is preventing server from executing it &#8211; such as a missing prerequisite. Re-submitting the request may work in this case, because by the time the request is re-submitted, something might have happened and prerequisite is already in place.</p>
<p>Error response could include application-level exception and its description &#8211; this way you are letting the client know exactly what was wrong. Whether processing these ends up automated or not &#8211; it&#8217;s up to the client.</p>
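<p>The classification above could be sketched roughly as follows. The exception classes and the <code>call_api_method</code> wrapper are hypothetical names invented for this example; the point is only that a handful of application-level exception types map onto a handful of status codes, and the response body carries the exception name and description.</p>

```python
class BadRequest(Exception):
    """The request itself is malformed; retrying it unchanged will not help."""

class PrerequisiteMissing(Exception):
    """The request is fine, but something prevents execution right now."""

class NotFound(Exception):
    """The requested resource does not exist."""

class UpstreamError(Exception):
    """An external service did not respond or returned an unknown error."""

STATUS_BY_EXCEPTION = [
    (BadRequest, 400),
    (PrerequisiteMissing, 403),
    (NotFound, 404),
    (UpstreamError, 502),
]

def call_api_method(method, *args):
    """Run an API method and return a (status, body) pair."""
    try:
        return 200, {"result": method(*args)}
    except Exception as exc:
        for exc_type, status in STATUS_BY_EXCEPTION:
            if isinstance(exc, exc_type):
                break
        else:
            status = 500  # anything unexpected is a server-side error
        return status, {"error": type(exc).__name__, "description": str(exc)}
```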
<p><strong>Message formats</strong></p>
<p>I can&#8217;t easily justify this one, but I feel that bodies of request and response should be in the same format (there could be exceptions &#8211; for example, when client must upload a binary artifact). vCloud does it this way &#8211; request body is XML, and response is XML. EC2 API sends request arguments in query string (because all requests are GET, since it&#8217;s query API) and response is XML. OCCI API defines request body as form-urlencoded (application/x-www-form-urlencoded) and response is XML as well (all of the above might support JSON as well).</p>
<p>I have 2 weak justifications for this.</p>
<p>Firstly, it somewhat mimics our regular human behavior. If 2 people are communicating in real time, they usually use same medium and same format. It&#8217;s rare when one person is on IM speaking English, while the other is on the phone speaking French &#8211; not saying it&#8217;s impossible but relatively rare.</p>
<p>Secondly, in the future I foresee a greater use of <a href="http://www.rabbitmq.com">messaging</a> in API operations (read <a href="http://broadcast.oreilly.com/2010/02/towards-event-driven-cloud-apis.html">this post</a> by George Reese). Notions of request/response come from HTTP, in messaging it doesn&#8217;t matter &#8211; the same message could be response to one message and request to another. For example, a message requesting server start may lead to a message saying &#8220;server started&#8221; to the client. At the same time, the same &#8220;server started&#8221; message may go to an internal billing system, where it would be a request to start billing.</p>
<p>Having these messages in the same format might be beneficial.</p>
<p><strong>Command line tool</strong></p>
<p>AWS set the bar with EC2 here. For every API call, they ship a command line tool to perform said call. Whether or not you think this is the right approach, I think every provider should match it. It&#8217;s a good practice after all &#8211; when someone is about to try an API, it&#8217;s much easier to get going using command line tools instead of embedding API calls straight into the application.</p>
<p>Instead of EC2 practice of one command line tool per API call however (even though inside they still call ec2-cmd), I favor Sun Cloud&#8217;s approach &#8211; they were planning a <a href="http://kenai.com/projects/suncloudapis/pages/CloudCommandClient">single unified tool</a> where an API call would be identified by an option or a subcommand.</p>
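<p>As an illustration of the single-tool approach, here is a minimal sketch built on the standard <code>argparse</code> module. The tool name <code>cloud</code> and the subcommands <code>start-server</code> and <code>list-servers</code> are made up for this example and do not correspond to any real provider&#8217;s tool; a real implementation would call the API where this one just returns a string.</p>

```python
import argparse

def build_parser():
    """One tool; each API call is a subcommand."""
    parser = argparse.ArgumentParser(prog="cloud")
    sub = parser.add_subparsers(dest="command", required=True)

    start = sub.add_parser("start-server", help="start a server")
    start.add_argument("server_id")

    sub.add_parser("list-servers", help="list all servers")
    return parser

def main(argv=None):
    """Parse arguments and dispatch; returns a description of the call."""
    args = build_parser().parse_args(argv)
    if args.command == "start-server":
        return "would call StartServer(%s)" % args.server_id
    return "would call ListServers()"
```

<p>With this layout, <code>main(["start-server", "i-123"])</code> dispatches to the start call, and adding a new API call means adding one more subparser rather than shipping one more executable.</p>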
<p><strong>Conclusion</strong></p>
<p>As <a href="http://www.python.org/dev/peps/pep-0020/">the Zen of Python</a> goes, &#8220;[...] practicality beats purity.&#8221; This should be your main guiding principle when designing API server side.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/05/04/developing-api-server-practical-rules-of-thumb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Dangers of Prematurely Making API Public</title>
		<link>http://somic.org/2010/02/06/on-dangers-of-prematurely-making-api-public/</link>
		<comments>http://somic.org/2010/02/06/on-dangers-of-prematurely-making-api-public/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 11:00:55 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[api]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1204</guid>
		<description><![CDATA[From time to time, I come across a statement that every service on the Internet must have an API, or people behind this service are doing it wrong. This phrase usually applies specifically to publicly available API.
As a user who stands to benefit from increased number of services allowing third-party applications and mashups, I certainly [...]]]></description>
			<content:encoded><![CDATA[<p>From time to time, I come across a statement that every service on the Internet must have an API, or people behind this service are doing it wrong. This phrase usually applies specifically to publicly available API.</p>
<p>As a user who stands to benefit from increased number of services allowing third-party applications and mashups, I certainly tend to agree. But as a developer, I realize that prematurely making API public may be a disaster.</p>
<p><strong>Publication of API represents a long-term commitment</strong>. You as a developer are committing to supporting this API for some non-trivial amount of time (at least 12 months I would imagine) and are essentially inviting other developers to create new functionality against this API. No one likes to spend their time developing against a given API just to discover shortly that API changed, or some functionality that used to be offered is no longer available.</p>
<p><strong>By making your API public you are signaling that this part of your system is very stable, its functionality well established, understood and developed, usage patterns well thought out.</strong> Or at least that&#8217;s how I as a third-party developer interpret your action.</p>
<p>If you know your audience well enough and are pretty confident that they won&#8217;t mind your tweaking things after initial publication, you may take a risk. Twitter famously launched their API very early, and in the end it proved a huge success for them. (So had they followed my advice in this post, they would have been worse off.)</p>
<p>But not all developer audiences may be as agile and forgiving as Twitter&#8217;s. I can imagine a very conservative big user of your API that will very strongly object to your changing the API. What do you do next? Maintain 2 versions? But what if underlying database schema changes make old API incompatible with what you are trying to do in the future? Fork and host 2 different systems, old and new? I can&#8217;t honestly imagine a worse scenario.</p>
<p>My advice &#8211; before publishing your new API, make sure you are not going to force yourself into a corner down the road. <strong>Only publish API for those parts of your systems that are very stable (both operationally and from perspective of internal mechanics and functionality) and where usage patterns are well researched and predictable to a certain extent. Don&#8217;t rush it.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/02/06/on-dangers-of-prematurely-making-api-public/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Normal Accidents in Complex IT Systems</title>
		<link>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/</link>
		<comments>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/#comments</comments>
		<pubDate>Tue, 12 Jan 2010 02:01:14 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[distributed]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[operations]]></category>

		<guid isPermaLink="false">http://somic.org/?p=1043</guid>
		<description><![CDATA[Designing a fully-automated or nearly-fully-automated computer system with many moving parts and dependencies is tricky, whether a system is distributed, hyper distributed or otherwise. Failures happen and must be dealt with. After a while, most folks grow up from &#8220;failures are rare and can be ignored&#8221; to &#8220;failures are not that rare and can not [...]]]></description>
			<content:encoded><![CDATA[<p>Designing a fully-automated or nearly-fully-automated computer system with many moving parts and dependencies is tricky, whether a system is distributed, <a href="/2009/08/18/the-concept-of-hyper-distributed-application/">hyper distributed</a> or otherwise. Failures happen and must be dealt with. After a while, most folks grow up from &#8220;failures are rare and can be ignored&#8221; to &#8220;failures are not that rare and can not be ignored&#8221; to &#8220;failures are common and should be taken into consideration&#8221; to &#8220;failures are frequent and must be planned for.&#8221; The latter seems to represent the current prevailing point of view.</p>
<p>But here is a kicker &#8211; it&#8217;s not the end. I saw <a href="http://twitter.com/benjaminblack/status/5662514947">this tweet</a>, read <a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/">this post</a> and checked out a book by Charles Perrow titled &#8220;Normal Accidents&#8221; from the library. Published in 1984, the book is not about IT, but its material fits our field nicely. And boy, was I enlightened!</p>
<p>The book&#8217;s main point: <strong>no matter how much thought is put into the system design, or how many safeguards are implemented, a sufficiently complex system sooner or later will experience a significant breakdown that was impossible to foresee beforehand, principally due to unexpected interaction between components, tight coupling or bizarre coincidence. For us in IT, it translates to &#8220;no matter how much planning you do or how many safeguards you implement, failures will still happen.&#8221;</strong></p>
<p>There are at least 3 common themes that are present in multiple illustrations in the book:</p>
<ol>
<li>A big failure was usually a result of multiple smaller failures; these smaller failures were often not even related</li>
<li>Operators (people or systems) were frequently misled by inaccurate monitoring data</li>
<li>In a lot of cases, human operators were used to a given set of circumstances, and their thinking and analysis were misled by their habits and expectations (&#8220;when X happens, we always do Y and it comes back&#8221; &#8211; except for this one time, when it didn&#8217;t)</li>
</ol>
<p>I have had my share of outages and downtimes, and I can attest that I have seen these 3 factors play a big role in tech ops. Some were bugs in management and monitoring code, some were human error, some were bizarre sets of dependencies, but all were a combination of multiple factors. For example, who would have thought that with a failure of the primary DNS resolution server, the VIP would not fail over to the secondary; and even though hosts had more than one &#8220;nameserver&#8221; line in /etc/resolv.conf, the application timed out waiting for DNS to respond before getting to ask the second nameserver; without name resolution, multiple load balancers independently thought that there was no capacity behind them (because management code calculated capacity in near real-time relying on worker hosts&#8217; names) and disabled themselves, thus taking down the entire farm &#8211; now I know of course&#8230;</p>
<p>It turns out we can&#8217;t eliminate normal accidents altogether, but here are several techniques that I have been using to speed up detection and response in order to reduce the downtime.</p>
<p><strong>Complexity budget</strong>. <a href="http://blog.b3k.us/complexity_budget.html">Described by Benjamin Black</a>, this is a technique to allocate complexity among components beforehand and strictly follow the allocation during implementation phase. It helps avoid unnecessary fanciness and leads to simpler code, which tends to be easier to troubleshoot and recover after a failure.</p>
<p><strong>Control knobs/switches for individual components</strong>. <a href="http://www.slideshare.net/jallspaw/velocity2008-capacity-management1-484676/51">As John Allspaw shows on this slide</a>, you need to be able to turn off any component in an emergency, or throttle it up or down. Planning this feature and building it in from the very beginning is very important.</p>
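<p>One very simple shape such a knob could take is sketched below. The <code>Knob</code> class is hypothetical and deliberately naive: it combines an on/off switch with a throttle expressed as the fraction of work to admit, and it assumes the component asks the knob before processing each unit of work.</p>

```python
import random

class Knob:
    """Runtime control for one component: on/off plus a throttle fraction."""

    def __init__(self, enabled=True, throttle=1.0):
        self.enabled = enabled
        self.throttle = throttle  # fraction of work to admit, 0.0 to 1.0

    def admit(self, rng=random.random):
        """Return True if the next unit of work should be processed."""
        if not self.enabled:
            return False
        return rng() < self.throttle
```

<p>In an emergency an operator would set <code>enabled</code> to <code>False</code> to turn the component off, or lower <code>throttle</code> to shed part of the load.</p>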
<p><strong>Accuracy of monitoring data</strong>. Ensure your alarms are as accurate as possible. No matter how much chaos is going on inside the system during a severe failure, the last thing you can afford is misleading the operators with wrong information. If you tried to ping host A and didn&#8217;t get a response, your alarm should not say &#8220;host A is down&#8221; because that is not knowledge you obtained &#8211; it&#8217;s an assumption that you made. It should say &#8220;failed to ping host A from host B&#8221; &#8211; maybe the network on host B was the issue when the ping attempt was made, how do you know?</p>
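<p>A tiny sketch of an alarm that reports the observation rather than the conclusion. The function name <code>ping_alarm</code> and its parameters are hypothetical; the actual ping is assumed to happen elsewhere, and only its outcome is passed in.</p>

```python
import socket

def ping_alarm(target, ping_succeeded, observer=None):
    """Return an alarm string describing what was observed, or None if all is well."""
    if ping_succeeded:
        return None
    # Name the host that made the attempt; it may be the one with the problem.
    observer = observer or socket.gethostname()
    return "failed to ping %s from %s" % (target, observer)
```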
<p><strong>Availability of monitoring data</strong>. There is a reason the first thing the military tries to do when attacking is to disrupt the enemy&#8217;s means of communication &#8211; it&#8217;s that important, and this applies to our case as well. You either design your systems to be able to get monitoring data even during the worst outage imaginable (ideally from more than one source), or you at least should be getting an alarm about lack of such monitoring data (it&#8217;s a very weak substitute though).</p>
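<p>The fallback option, an alarm on missing monitoring data, could look roughly like this. <code>StalenessWatch</code> is a hypothetical helper that remembers when each source last reported and lists the sources that have been silent for longer than a chosen maximum age.</p>

```python
import time

class StalenessWatch:
    """Track when each monitoring source last reported."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.last_seen = {}

    def record(self, source, now=None):
        """Note that a source has just delivered monitoring data."""
        self.last_seen[source] = time.time() if now is None else now

    def stale_sources(self, now=None):
        """Return sources that have been silent longer than max_age."""
        now = time.time() if now is None else now
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_age
        )
```

<p>Note that the watcher itself must run somewhere that survives the outage, which is part of why this remains a weak substitute.</p>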
<p>All in all, to everybody in IT, I highly recommend the Normal Accidents book as well as this <a href="http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf">whitepaper</a> (linked from <a href="http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/">John Allspaw&#8217;s blog</a>).</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2010/01/11/normal-accidents-in-complex-it-systems/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Standalone Web Front Door a Must in EC2?</title>
		<link>http://somic.org/2009/10/13/standalone-web-front-door-a-must-in-ec2/</link>
		<comments>http://somic.org/2009/10/13/standalone-web-front-door-a-must-in-ec2/#comments</comments>
		<pubDate>Tue, 13 Oct 2009 15:24:45 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[ec2]]></category>

		<guid isPermaLink="false">http://somic.org/?p=877</guid>
		<description><![CDATA[Most of you have probably heard about a recent outage at BitBucket. In a nutshell, their systems hosted at AWS came under a UDP flood DDoS attack, which led to significantly increased traffic, which led to saturation of their local network interface, which led to their being unable to connect to their data stored on [...]]]></description>
			<content:encoded><![CDATA[<p>Most of you have probably heard about a <a href="http://blog.bitbucket.org/2009/10/04/on-our-extended-downtime-amazon-and-whats-coming/">recent outage at BitBucket</a>. In a nutshell, their systems hosted at AWS came under a UDP flood DDoS attack, which led to significantly increased traffic, which led to saturation of their local network interface, which led to their being unable to connect to their data stored on EBS, which led to their application becoming unresponsive.</p>
<p>This outage shed more light on some internal designs of EC2 itself, as described <a href="http://blog.laststation.net/2009/10/11/amazon-ec2-still-vulnerable-to-udp-flood-attacks/">here</a>. It might have also showcased our over-confidence in EC2&#8217;s ability to detect and defeat certain types of network attacks. But this post is about something else.</p>
<p><strong>BitBucket was running their web front door and their backend application on the same instance</strong>. Front door is a part of the system which is facing the Internet and its task is to accept connections from clients. For obvious reasons, front door is running on the service&#8217;s discoverable IP address &#8211; whether they used Elastic IP or not, bitbucket.org resolved to that IP. Note that front door (usually) doesn&#8217;t need EBS.</p>
<p>Backend, however, is what needs EBS for disk persistence. At the same time, backend does not need to be publicly discoverable &#8211; as long as front door knows where its backend worker(s) is/are running, the app should be functioning just fine.</p>
<p><strong>With front door and backend running on different instances, UDP flood would have saturated only the former&#8217;s network interface and would have had no impact on the backend and its EBS.</strong></p>
<p>I know that AWS reportedly fixed the flood issue, but looks to me like <strong>separating front door and application backend may still be a good preventive measure</strong> &#8211; after all, it&#8217;s considered a good practice for a reason.</p>
<p>Please note that I am not trying to accuse BitBucket of running a bad architecture and causing their own outage. All I am doing is trying to learn a lesson.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/10/13/standalone-web-front-door-a-must-in-ec2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Concept of Hyper Distributed Application</title>
		<link>http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/</link>
		<comments>http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 14:38:31 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[hyper distributed application]]></category>

		<guid isPermaLink="false">http://somic.org/?p=754</guid>
		<description><![CDATA[Most folks in the industry are familiar with &#8220;distributed applications.&#8221; If app components are running on multiple hosts and need to communicate with each other using network, the app is said to be distributed.
Distributed applications are known for complexity of assuring all components are on the same page as to what&#8217;s going on around them. [...]]]></description>
			<content:encoded><![CDATA[<p>Most folks in the industry are familiar with &#8220;distributed applications.&#8221; If app components are running on multiple hosts and need to communicate with each other using network, the app is said to be distributed.</p>
<p>Distributed applications are known for complexity of assuring all components are on the same page as to what&#8217;s going on around them. Hardware failures, network failures, operator errors can all cause chaos; distributed applications foresee these exception situations and attempt to know how to deal with them.</p>
<p>Up until now, the network piece of the puzzle has been usually under application owner&#8217;s control &#8211; it could be a LAN, or it could be a leased line to remote datacenter. Occasionally, a VPN would be used to provide a dedicated communication channel between locations over public Internet but its use was rarely focused on important stuff &#8211; a mission critical application would usually get a leased line.</p>
<p>With the advent of public clouds such as Amazon EC2 and Google AppEngine, however, these notions are changing. One day you may decide to leverage each cloud&#8217;s strengths and distinct features to build your app, or may want to avoid cloud lock-in or provide redundancy. In short, you may want to multi-source your infrastructure.</p>
<p>Your multi-sourced infrastructure will of course be a distributed application. But there is a significant difference between this and old-style distributed apps &#8211; this time you no longer have network connectivity under your control. And as a result, you will face 3 significant phenomena that substantially complicate using today&#8217;s distributed algorithms &#8211; uneven bandwidth, uneven latency and increased probability of connectivity loss (I blogged about the latter <a href="/2008/10/10/crash-vs-connectivity-loss-in-distributed-applications/">here</a>).</p>
<p>And this is what I call a hyper distributed application. In other words, <strong>a hyper distributed application is a distributed app which runs on a network with uneven bandwidth, uneven latencies and increased probability of connectivity loss (as measured against that on a regular LAN), usually outside of application owner&#8217;s control (for example, Internet).</strong></p>
<p>One example of a hyper distributed application is <a href="http://cohesiveft.com/vpncubed">VPN-Cubed</a> that we at CohesiveFT created to address emerging needs to multisource infrastructure. By the very nature of functionality it provides, its components (we call them VPN-Cubed Managers &#8211; they act as virtual routers and switches) are sometimes distributed over LAN, sometimes distributed over WAN, sometimes both. Communications between manager 1 and manager 2 can be fast and reliable, but between manager 1 and 3 slow and less reliable, with more frequent resets. Or manager 3 may simply disappear (as seen by its peers) &#8211; no, it doesn&#8217;t have to be down due to crash; it can simply mean that its network connection to the outside world was down, possibly temporarily.</p>
<p>Hyper distributed applications are relatively rare, because most architects tend to avoid this if they can. For example, Amazon EC2 has 2 regions &#8211; US and EU. Each region is a distinct EC2 system, with its own API endpoint, its own AMI IDs, kernel IDs, security groups, keypairs. There is no replication or conflict resolution between the regions &#8211; they are totally independent of each other. Why? Because it would be quite difficult to interconnect them into a single entity over public Internet. (I won&#8217;t be surprised if it gets implemented in the future though.)</p>
<p>Another example showing that hyper distributed applications are a distinct breed comes from Facebook Engineering blog post titled <a href="http://www.facebook.com/note.php?note_id=23844338919">Scaling Out</a>:</p>
<blockquote><p>This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.</p>
<p>When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.</p></blockquote>
<p>It nicely illustrates how hyper distributed nature of the application adds complexity on top of what a plain distributed app already has.</p>
<p>In conclusion, I would like to propose to separate a category of distributed applications that run on top of networks with uneven bandwidth and uneven latencies into their own (I don&#8217;t care much if they end up being called hyper distributed or something else), and start building up research and practical approaches focusing specifically on this area.</p>
<p>P.S. Also consider the future: when we reach inter-planet or inter-galactic communications, you know that latencies and bandwidth in space would not be (initially) the same as on our planet Earth. Better start working on this research now in order to be prepared&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/08/18/the-concept-of-hyper-distributed-application/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Electrical and Plumbing Analogies in Application Monitoring</title>
		<link>http://somic.org/2009/08/10/electrical-and-plumbing-analogies-in-application-monitoring/</link>
		<comments>http://somic.org/2009/08/10/electrical-and-plumbing-analogies-in-application-monitoring/#comments</comments>
		<pubDate>Tue, 11 Aug 2009 03:26:36 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[software engineering]]></category>
		<category><![CDATA[monitoring]]></category>

		<guid isPermaLink="false">http://somic.org/?p=739</guid>
		<description><![CDATA[Water and electricity are two components without which a modern home can&#8217;t function well. Both are provided as a utility, and both have strictly defined access points from which they can be consumed &#8211; taps for water and outlets for electricity.
But there are also differences. Every child knows that electric shock can cause injury even [...]]]></description>
			<content:encoded><![CDATA[<p>Water and electricity are two components without which a modern home can&#8217;t function well. Both are provided as a utility, and both have strictly defined access points from which they can be consumed &#8211; taps for water and outlets for electricity.</p>
<p>But there are also differences. Every child knows that an electric shock can cause injury even after a short exposure &#8211; hence most people perceive electricity as a powerful force. This force, however, has a binary control attached to it, in the form of switches, circuit breakers and a distribution board. Turn it on &#8211; electricity is flowing, turn it off &#8211; it&#8217;s not. When off, electricity can&#8217;t leak by design.</p>
<p>Water, on the other hand, is not perceived as such a great force because damage from a short exposure is unlikely to be severe. Additionally, indoor plumbing has no binary on-off switches &#8211; a tap is set to some degree of &#8220;open&#8221; or &#8220;closed&#8221;, &#8220;hot&#8221; or &#8220;cold.&#8221; As a result, leaks can and do occur from time to time. And it&#8217;s these leaks that have the potential to do costly damage over time, yet they are still not perceived as dangerous enough to warrant immediate attention.</p>
<p>There are many things in software applications that are binary in nature &#8211; a web server daemon is either up or down, for example. We all take these all-or-nothing components seriously, because when it&#8217;s nothing, the app is down.</p>
<p>But we have our fair share of potentially leaky stuff as well &#8211; memory leaks, file descriptor leaks, network connection leaks, and so on. In other words, things that don&#8217;t happen instantaneously but build up over time, often hidden behind some other, bigger component. Some of us don&#8217;t take these issues seriously enough because they don&#8217;t seem capable of causing significant damage quickly. And that&#8217;s a mistake.</p>
<p>When monitoring a component of the &#8220;electricity&#8221; type, the most common test is to send a probe &#8211; if it returns OK, the component is up (&#8220;active monitoring&#8221;, &#8220;active polling&#8221; or simply &#8220;polling&#8221;). But this doesn&#8217;t work when monitoring a component of the &#8220;plumbing&#8221; type &#8211; if water is flowing, it doesn&#8217;t mean there is no leak. In this case, a set of alarms instrumented into the component itself would be a better fit.</p>
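<p>To make the contrast concrete, here is a minimal Python sketch, not a production monitor. The names <em>probe</em> and <em>LeakAlarm</em>, the window size and the growth threshold are all assumptions made for illustration; the gauge could be a count of open file descriptors or connections that the component reports about itself.</p>

```python
def probe(check):
    """Active polling: call the check and report up or down, nothing more."""
    try:
        return bool(check())
    except Exception:
        return False


class LeakAlarm:
    """Instrumented alarm (hypothetical): the component records a gauge and
    the alarm fires when the gauge grew at every step of the last window."""

    def __init__(self, window=5, min_growth=1):
        self.window = window          # number of recent samples to compare
        self.min_growth = min_growth  # growth per step that counts as a leak
        self.samples = []

    def record(self, value):
        """Store one sample; return True if the recent window only grew."""
        self.samples.append(value)
        recent = self.samples[-self.window:]
        if len(recent) != self.window:
            return False  # not enough history yet
        steps = zip(recent, recent[1:])
        return all(b - a >= self.min_growth for a, b in steps)
```

<p>A component can keep passing <em>probe</em> while <em>LeakAlarm.record</em> starts returning True, which is exactly the &#8220;water is flowing but there is a leak&#8221; situation.</p>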
<p>The sooner we recognize the different nature of the various components of our applications and the need to monitor them differently, the higher the uptime we are going to achieve.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/08/10/electrical-and-plumbing-analogies-in-application-monitoring/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Developer&#8217;s Attempt to Define Cloud Computing</title>
		<link>http://somic.org/2009/07/06/developers-attempt-to-define-cloud-computing/</link>
		<comments>http://somic.org/2009/07/06/developers-attempt-to-define-cloud-computing/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 14:20:22 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[software engineering]]></category>

		<guid isPermaLink="false">http://somic.org/?p=661</guid>
		<description><![CDATA[I have been closely following cloud computing for many months now. As a developer, I get often frustrated by lack of clear and widely accepted definition of what cloud computing actually is. This is a problem, because without a definition, every imaginable operation performed over the Internet all of a sudden became a &#8220;cloud.&#8221; It [...]]]></description>
			<content:encoded><![CDATA[<p>I have been closely following cloud computing for many months now. As a developer, I often get frustrated by the lack of a clear and widely accepted definition of what cloud computing actually is. This is a problem, because without a definition, every imaginable operation performed over the Internet all of a sudden becomes a &#8220;cloud.&#8221; It dilutes the value and obscures the innovation that the cloud computing concept used to stand for in its early days.</p>
<p>The term &#8220;cloud computing&#8221; consists of 2 words &#8211; &#8220;cloud&#8221; and &#8220;computing.&#8221;</p>
<p><strong>Cloud</strong></p>
<p>Traditionally, an image of a cloud is used on network diagrams to denote an opaque network entity (for example, the Internet or an MPLS cloud). Opaque in this case means that to an enduser it&#8217;s a black box &#8211; you hook up inputs and outputs as directed, and you get functionality. In addition to opaqueness, there are other, less obvious properties that clouds on network diagrams usually possess:</p>
<ul>
<li>a cloud is multi-tenant (many endusers use the same one)</li>
<li>cloud resources (links, bandwidth) are not dedicated (each enduser gets to use some, up to their quota; if user A no longer uses a resource, the cloud can assign it to user B)</li>
<li>a cloud is outside of the enduser&#8217;s full control</li>
</ul>
<p><strong>Computing</strong></p>
<p>First, allow me to note that I strongly disagree with a purely linguistic approach here &#8211; to linguists, &#8220;computing&#8221; and &#8220;computer&#8221; are derived from the same root, such that &#8220;computing&#8221; is an action which involves a &#8220;computer.&#8221; I disagree with it because it&#8217;s too general to be useful for our case.</p>
<p>I define &#8220;computing&#8221; as running user-provided software. It doesn&#8217;t have to be developed by the user &#8211; one can download it from the web and run it. But it&#8217;s still the user who provides the software in this case. In contrast, if you use a web site to perform a certain operation, you also use software &#8211; but in this case, it&#8217;s software developed and operated by the web site, hence it&#8217;s a service, not computing.</p>
<p><strong>My Definition of Cloud Computing</strong></p>
<p><span style="text-decoration: underline;"><strong>Cloud computing is a form of using opaque multi-tenant networks of computers outside of the enduser&#8217;s full control with the primary goal of running software provided by the enduser, in which computational resources are allocated dynamically (as opposed to being permanently assigned).</strong></span></p>
<p><strong>Examples and Caveats</strong></p>
<ul>
<li>If we take a well-known SPI model (Software as a Service, Platform as a Service, Infrastructure as a Service), contrary to current mainstream thinking, only IaaS can be cloud computing when enduser provides the software to run.</li>
</ul>
<ul>
<li>I added a clause about &#8220;primary goal&#8221; to eliminate things like Google Spreadsheet from cloud computing &#8211; even though a spreadsheet program may run macros (which are software code) and such macros could be provided by enduser, it&#8217;s still not cloud computing, because the primary goal of a spreadsheet program is number crunching, not running macros.</li>
</ul>
<ul>
<li>Programming frameworks (such as Hadoop for example) can be both: Hadoop can be cloud computing when enduser provides their map and reduce functions; but if enduser ends up running defaults or functions that ship with Hadoop distribution, there is no software supplied by enduser so it&#8217;s not cloud computing.</li>
</ul>
<ul>
<li>Things like storage as a service, backup as a service are all &#8220;cloudy,&#8221; but they are not computing. There is already a term for this &#8211; Internet. Therefore, I consider &#8220;cloudy&#8221; by itself to be a redundant term.</li>
</ul>
<ul>
<li>Google AppEngine (GAE) is a cloud computing platform. Many don&#8217;t put it into the IaaS category because it doesn&#8217;t provide customers with access to low-level hypervisor-based VMs. But this alone doesn&#8217;t make it non-IaaS from a developer&#8217;s standpoint &#8211; after all, a VM in the hypervisor model is one thing, and a VM in the language interpreter model is another (JVM, Erlang VM, Python VM, etc.), but it&#8217;s still a VM in the sense that it encapsulates running code and proxies all system-level requests through its abstraction layer. GAE provides access to its BigTable infrastructure and its memcache infrastructure, so to me it&#8217;s very much an IaaS system and satisfies my definition of &#8220;cloud computing.&#8221;</li>
</ul>
<ul>
<li>In my opinion, multi-tenancy is a necessary condition of a cloud computing platform. The tenants don&#8217;t have to be different companies &#8211; they can be different business units or different departments. The key is that there must be dynamic allocation of resources and scarcity. If all resources are dedicated to one organization and simply switched between applications, it&#8217;s not cloud computing &#8211; it is simply an infrastructure controlled via API.</li>
</ul>
<ul>
<li>Same thing about on-premises server farms with cloudy features &#8211; they are not cloud computing, because they are not opaque to enduser and they are under enduser&#8217;s full control.</li>
</ul>
<p><strong>Conclusion</strong></p>
<p>All in all, I hope this blog post gets us closer to finally figuring out once and for all what &#8220;cloud computing&#8221; is and what it isn&#8217;t.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/07/06/developers-attempt-to-define-cloud-computing/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Full Data vs Incremental Data in Messaging</title>
		<link>http://somic.org/2009/06/25/full-data-vs-incremental-data-in-messaging/</link>
		<comments>http://somic.org/2009/06/25/full-data-vs-incremental-data-in-messaging/#comments</comments>
		<pubDate>Thu, 25 Jun 2009 18:25:54 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[messaging]]></category>

		<guid isPermaLink="false">http://somic.org/?p=642</guid>
		<description><![CDATA[My recent experiments with messaging for a distributed application led to a realization that I would like to share with you in this post. It&#8217;s not an earth shaking discovery but you may still find it interesting.
Do you remember an old Unix command to create tape backups called dump? Remember its concept of levels? To [...]]]></description>
			<content:encoded><![CDATA[<p>My recent experiments with messaging for a distributed application led to a realization that I would like to share with you in this post. It&#8217;s not an earth shaking discovery but you may still find it interesting.</p>
<p>Do you remember an old Unix command to create tape backups called <a href="http://linux.die.net/man/8/dump"><em>dump</em></a>? Remember its concept of levels? To refresh your memory, in a nutshell, level 0 (full backup) includes all files on the filesystem, and any other level corresponds to an incremental backup where only files modified since the last backup at a lower level are included.</p>
<p>It turns out that a somewhat similar concept applies to messaging, specifically to the contents of the messages themselves.</p>
<p>A message in general is some piece of information that one system passes to another. On one hand, a publisher may make an observation, extract information from it, package the entire current state into a blob, and send it out as a message. The same sequence of operations is performed at regular intervals. Examples of this model include sending a message about processes currently running on the system, clients currently connected to a server, current usage of RAM, etc. This model roughly corresponds to dump&#8217;s level 0 &#8211; the consumer needs just a single message to obtain all the information that the publisher sent; there is no need for the consumer to accumulate and merge a series of messages to get the full picture.</p>
<p>On the other hand, a publisher can send a message that contains information about a single event. For example, a new client connected, a new job got submitted to the backend, a hard disk failed. This model is more like an incremental backup &#8211; a message contains only a delta; its payload doesn&#8217;t carry the entire state.</p>
<p>Each of these models has its good and bad sides. In the full data model, a single message is sufficient to transfer all knowledge about the current state from producer to consumer, and a consumer can start reading messages at any point in the queue &#8211; by design it will catch up once it receives and processes at least one message. The downsides of this model are wasted bandwidth and processing power (if there are no changes, the same contents will be transferred over and over again) and the fact that the delta must be calculated by the consumer (for example, having received 2 &#8220;ps auxww&#8221; outputs, the consumer would have to diff them and parse the result).</p>
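<p>The consumer-side delta calculation in the full data model can be sketched in a few lines of Python. The function name <em>snapshot_delta</em> is made up for this example, and it assumes each line of a snapshot is a stable identifier (say, a pid plus a command); raw &#8220;ps auxww&#8221; output carries changing columns such as CPU time, so it would need to be trimmed first.</p>

```python
def snapshot_delta(old_snapshot, new_snapshot):
    """Return (added, removed) lines between two full-state snapshots.

    Assumption: every line identifies one item and does not change
    while that item is alive.
    """
    old_lines = set(old_snapshot.splitlines())
    new_lines = set(new_snapshot.splitlines())
    added = sorted(new_lines - old_lines)
    removed = sorted(old_lines - new_lines)
    return added, removed
```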
<p>The incremental data model clearly provides an easy delta and is less wasteful of resources, but it requires the consumer to merge multiple messages to get the entire picture and as a result is sensitive to the point from which a consumer starts reading the queue.</p>
<p>A potential solution is to do what dump does &#8211; send full data once in a while, followed by deltas. This way a consumer will catch up eventually &#8211; once it gets a full data message (which will come sooner or later). Another caveat is that a consumer does not always need the full picture &#8211; in a classic scalability scenario with a supervisor-workers model, workers rarely need more than the contents of their current job, contained in an incremental message.</p>
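<p>Here is a rough Python sketch of this &#8220;full data once in a while, deltas in between&#8221; idea. It is an illustration under assumptions, not tied to any particular broker: the state is modeled as a set (for example, ids of connected clients), and <em>publish</em>, <em>Consumer</em> and <em>full_every</em> are hypothetical names.</p>

```python
def publish(states, full_every=3):
    """Yield ("full", state) every `full_every` messages, deltas otherwise.

    `states` is an iterable of sets observed by the publisher over time.
    """
    previous = None
    for i, state in enumerate(states):
        state = set(state)
        if previous is None or i % full_every == 0:
            yield ("full", set(state))
        else:
            delta = {"added": state - previous, "removed": previous - state}
            yield ("delta", delta)
        previous = state


class Consumer:
    """Rebuilds the publisher state; may start reading at any message."""

    def __init__(self):
        self.state = None  # unknown until the first full message arrives

    def handle(self, message):
        kind, payload = message
        if kind == "full":
            self.state = set(payload)
        elif self.state is not None:
            self.state = (self.state | payload["added"]) - payload["removed"]
        # A delta seen before any full message is dropped: there is
        # nothing to merge it into yet.
        return self.state
```

<p>A consumer that joins in the middle of the stream reports no state until the next full message and is in sync from then on; <em>full_every</em> trades bandwidth against how long a late consumer has to wait.</p>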
<p>But that&#8217;s not the end of it. While working on a problem, I realized that usually I as a developer don&#8217;t even get to choose which model to use &#8211; it&#8217;s dictated to me by the nature of the information I am trying to pass from one system to another. Some data can be easily obtained in full and is very difficult to obtain incrementally, some vice versa. For example, a list of current processes on Linux is trivial to obtain in full (ps auxww) and quite difficult to obtain incrementally (I would need a notification when each process starts and dies). Or in the case of incoming jobs &#8211; it&#8217;s easy to obtain a delta (one job) but quite difficult to know the current status of all jobs.</p>
<p>My conclusion here is that there are 2 main factors to think about:</p>
<ol>
<li>can my publisher get data in full or incremental form?</li>
<li>does my consumer need data in full or incremental form?</li>
</ol>
<p>If the answers to the above questions are the same, you are good to go. But if they are different, you need to understand the potential issues discussed above and analyze further. I hope to be able to provide more practical thoughts on this in the future &#8211; stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/06/25/full-data-vs-incremental-data-in-messaging/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why I Sometimes Prefer Shell To Ruby or Python</title>
		<link>http://somic.org/2009/06/11/why-i-sometimes-prefer-shell-to-ruby-or-python/</link>
		<comments>http://somic.org/2009/06/11/why-i-sometimes-prefer-shell-to-ruby-or-python/#comments</comments>
		<pubDate>Thu, 11 Jun 2009 16:38:02 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[python]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[shell]]></category>

		<guid isPermaLink="false">http://somic.org/?p=620</guid>
		<description><![CDATA[Shell was among the first things I got familiar with when I was introduced to Linux. It&#8217;s not a typical programming language, primarily due to lack of easy-to-use high-level data structures such as hashes and arrays (anticipating your objection to this &#8211; note I said &#8220;easy-to-use&#8221;). This may explain why I often get funny looks [...]]]></description>
			<content:encoded><![CDATA[<p>Shell was among the first things I got familiar with when I was introduced to Linux. It&#8217;s not a typical programming language, primarily due to lack of easy-to-use high-level data structures such as hashes and arrays (anticipating your objection to this &#8211; note I said &#8220;easy-to-use&#8221;). This may explain why I often get funny looks from folks when I mention that I use shell quite a bit, often in quite non-trivial systems.</p>
<p>And here are my reasons.</p>
<p><strong>Memory Management</strong></p>
<p>Shell scripts are excellent at managing their memory, and one has to try really hard to cause a shell script to leak memory. This makes shell a very convenient tool for long-running processes, supervisors in multiple-workers models, daemons and so on. There is an easy explanation for this. In shell, there are only a handful of built-in primitives &#8211; everything else is an external command, which gets started and then finishes before giving control back to your script. If there is a memory leak in that command, it won&#8217;t damage your calling script and will usually be insignificant because the command exits quickly.</p>
<p><strong>No Exceptions</strong></p>
<p>This is a double-edged sword, and you need to be careful how you exploit this &#8220;weakness.&#8221; This feature allows me to write compact code which is easy to understand without enclosing every single command in &#8220;try&#8230; except&#8221;. For naysayers, I would like to point out that a strict mode exists (<em>set -e</em>), where a failing command is treated as fatal and causes the script to exit, with a few exceptions such as commands tested in a conditional.</p>
<p>In general, not all unforeseen error conditions warrant a crash, like you get in Python or Ruby when an unhandled exception gets propagated all the way to the top. If a problem is transient, it may be better to ignore it temporarily.</p>
<p>To ensure a Ruby or Python script doesn&#8217;t crash on some unforeseen transient problem, many people end up enclosing their entire program in a wildcard try&#8230; except block to catch any exception &#8211; but to me this approach is dangerous, even though I sometimes end up using it myself.</p>
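<p>For comparison, a narrower variant of that pattern might look like the Python sketch below. It is only an illustration: which exceptions count as transient is an assumption you would have to make per program, and <em>run_loop</em> and <em>TRANSIENT</em> are made-up names.</p>

```python
import time

# Assumption: only these error types are considered transient here.
TRANSIENT = (ConnectionError, TimeoutError)


def run_loop(action, iterations, delay=0.0, sleep=time.sleep):
    """Call action() `iterations` times, skipping transient failures.

    Returns the number of skipped failures. Any exception that is not
    listed in TRANSIENT still propagates and stops the loop.
    """
    skipped = 0
    for _ in range(iterations):
        try:
            action()
        except TRANSIENT as exc:
            skipped += 1
            print(f"transient error ignored: {exc!r}")
        sleep(delay)
    return skipped
```

<p>The loop survives the errors it was told to expect, while a genuine bug still surfaces instead of being swallowed by a catch-all block.</p>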
<p>If you are writing a daemon process to perform some action in a loop, shell is often by far the most stable alternative.</p>
<p><strong>When Not To Use Shell</strong></p>
<p>My personal rule of thumb is: don&#8217;t use shell when you expect to need high-level data structures like hashes or arrays beyond what a <em>for</em> loop can give you, or when you can see potential for code reuse following <a href="http://en.wikipedia.org/wiki/Object-oriented_programming">OOP</a> patterns like inheritance, or when your program needs to participate in orchestration schemes that go beyond creating and removing files on the filesystem.</p>
<p><strong>Conclusion</strong></p>
<p>I wouldn&#8217;t overlook shell if I were you.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/06/11/why-i-sometimes-prefer-shell-to-ruby-or-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Branching In Git When Working On Big New Features</title>
		<link>http://somic.org/2009/05/31/branching-git-when-working-on-big-features/</link>
		<comments>http://somic.org/2009/05/31/branching-git-when-working-on-big-features/#comments</comments>
		<pubDate>Sun, 31 May 2009 20:35:55 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[software engineering]]></category>
		<category><![CDATA[git]]></category>

		<guid isPermaLink="false">http://somic.org/?p=607</guid>
		<description><![CDATA[A note to self.
When starting to work on a new big feature, always set up 2 branches for it. Say FEATURE_work and FEATURE_integration. Do your regular development in FEATURE_work committing as often as you want. When you reach certain milestones (but entire feature is still not ready yet), squash merge FEATURE_work into FEATURE_integration. When entire [...]]]></description>
			<content:encoded><![CDATA[<p>A note to self.</p>
<p>When starting to work on a big new feature, always set up 2 branches for it, say FEATURE_work and FEATURE_integration. Do your regular development in FEATURE_work, committing as often as you want. When you reach certain milestones (but the entire feature is not ready yet), squash merge FEATURE_work into FEATURE_integration. When the entire feature is finished, merge FEATURE_integration into master.</p>
<p>This gives you a much nicer history of commits, lets you group changes by milestone, and allows you to keep a big feature as multiple commits in master.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/05/31/branching-git-when-working-on-big-features/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
