Fubaredness Is Contagious

Dmitriy Samovskiy’s Blog

Building Erlang R13B02-1

October 21st, 2009 · Comments Off

This is a quick note in case anyone is having the same issue.

When building Erlang R13B02-1 on a 64-bit non-SMP machine (not sure if it matters), “make -j 2” somehow resulted in an error which I could not work around. Reverting to plain make (without -j 2) and restarting the compilation from the very beginning fixed it.

Also, after final make install, I could not start erl – it was complaining about “start.boot not found”. The solution is to symlink boot files like this:


cd /usr/lib/erlang/bin
ln -s /usr/lib/erlang/releases/R13B02/start.boot .
ln -s /usr/lib/erlang/releases/R13B02/start_clean.boot .
ln -s /usr/lib/erlang/releases/R13B02/start_sasl.boot .

I configured it with “./configure --prefix=/usr --disable-x --enable-threads --enable-kernel-poll --disable-hipe”.

Tags: erlang · rabbitmq

Standalone Web Front Door a Must in EC2?

October 13th, 2009 · 4 Comments

Most of you have probably heard about a recent outage at BitBucket. In a nutshell, their systems hosted at AWS came under a UDP flood DDoS attack, which led to significantly increased traffic, which led to saturation of their local network interface, which led to their being unable to connect to their data stored on EBS, which led to their application becoming unresponsive.

This outage shed more light on some internal designs of EC2 itself, as described here. It might have also showcased our over-confidence in EC2’s ability to detect and defeat certain types of network attacks. But this post is about something else.

BitBucket was running their web front door and their backend application on the same instance. Front door is a part of the system which is facing the Internet and its task is to accept connections from clients. For obvious reasons, front door is running on the service’s discoverable IP address – whether they used Elastic IP or not, bitbucket.org resolved to that IP. Note that front door (usually) doesn’t need EBS.

Backend, however, is what needs EBS for disk persistence. At the same time, backend does not need to be publicly discoverable – as long as front door knows where its backend worker(s) is/are running, the app should be functioning just fine.

With front door and backend running on different instances, UDP flood would have saturated only the former’s network interface and would have had no impact on the backend and its EBS.
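
To make the split concrete, here is a minimal sketch of a standalone front door in Python. It is not how BitBucket or anyone else actually runs things: the backend addresses, the port numbers and the names make_backend_picker, FrontDoorHandler and run_front_door are all made up for illustration, and it relays only GET requests. The point is simply that the process holding the public IP talks to backends over private addresses and never touches EBS itself.

```python
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical private addresses of backend instances (the ones that use EBS).
BACKENDS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]


def make_backend_picker(backends):
    """Return a function that hands out backends in round-robin order."""
    cycle = itertools.cycle(backends)
    return lambda: next(cycle)


class FrontDoorHandler(BaseHTTPRequestHandler):
    """Accepts public GET requests and relays them to a private backend."""

    pick_backend = staticmethod(make_backend_picker(BACKENDS))

    def do_GET(self):
        url = self.pick_backend() + self.path
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                status, body = resp.status, resp.read()
        except OSError:
            # Covers refused connections, timeouts and backend HTTP errors alike.
            self.send_error(502, "backend unavailable")
            return
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run_front_door(port=8000):
    """Serve forever on the public interface; only this instance needs a public IP."""
    HTTPServer(("0.0.0.0", port), FrontDoorHandler).serve_forever()
```

Calling run_front_door() on the public-facing instance would start the relay. A flood aimed at that instance can still take the front door down, but the backends and their EBS traffic sit on other network interfaces.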

I know that AWS reportedly fixed the flood issue, but looks to me like separating front door and application backend may still be a good preventive measure – after all, it’s considered a good practice for a reason.

Please note that I am not trying to accuse BitBucket of running a bad architecture and causing their own outage. All I am doing is trying to learn a lesson.

Tags: cloud computing · infrastructure development · software engineering

Capistrano Auth Trick

October 7th, 2009 · 2 Comments

This past summer, we needed to automate testing of several failure scenarios for VPN-Cubed. Having asked the LazyWeb about any frameworks that could help us and having gotten no response, our dev team had a short chat in the office. We decided that ultimately we were going to have to roll out our own system based on SSH. Capistrano was the obvious choice, because it’s essentially a higher-level wrapper around Net::SSH module (if you prefer python, you may take a look at fabric or paramiko).

One obstacle was that because we were emulating various failures, at times our local capistrano process, which was driving the tests, had to lose SSH connectivity to its target servers. We quickly discovered that this resulted in an exception, and the cap process would die.

To work around this, I added yet another level on top of cap which uses GNU make (one of my all time favorites). In a nutshell, user controls the testing process via make, and make starts cap. In this case, it’s ok for cap process to occasionally exit.

But then – and we are finally getting to the point of this post – another issue came up: I didn’t want to keep typing password into cap each time it was started by make. Here is how I ended up implementing it to avoid re-typing password.


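# in Makefile
# read -s and -p are bash extensions, so don't let make fall back to plain /bin/sh
SHELL := /bin/bash
# := runs the prompt once, when make parses the file
USER_PASS := $(shell read -r -s -p "[make] user's password: " P; echo "$$P")
export USER_PASS


all: set_password
# do something here


# fails early if no password was entered (the recipe line must start with a tab)
set_password:
	@test "$(USER_PASS)"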


# in Capfile
set :password, lambda { ENV['USER_PASS'] ||
CLI.password_prompt("[cap] #{user}'s password: ") }
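
If your driver is written in Python instead (fabric or paramiko, as mentioned above), the same "environment first, prompt second" idea could look roughly like this. It is only a sketch: get_password is a name I made up, and USER_PASS is simply the variable exported by the Makefile above.

```python
import getpass
import os


def get_password(user, env_var="USER_PASS"):
    """Return the password exported by make, prompting only if it is unset or empty."""
    return os.environ.get(env_var) or getpass.getpass("[py] %s's password: " % user)
```

As with the Capfile lambda, the prompt only shows up when the script is started by hand, without make having exported the variable.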

Tags: infrastructure development · ruby

Security Groups – Most Underappreciated Feature of Amazon EC2

September 21st, 2009 · 5 Comments

Having been developing software to run on Amazon EC2 for over a year now, I find security groups to be among its least understood and appreciated features.

Basic Usage

In short, an EC2 security group (SG) is a set of ACCEPT firewall rules for incoming packets that can apply to TCP, UDP or ICMP. When an instance is launched with a given SG, firewall rules from this group are activated for this instance in EC2’s internal distributed firewall (it’s not the same as iptables on your instance!).

A common misconception is that SG rules apply only to traffic from the Internet into EC2. This is incorrect: SGs apply to all traffic coming to your instance.

SG can be thought of as a security profile or a security role – it promotes good practice of managing firewall by role, not by machine. For example, you could say servers with “webapp” role must be able to connect to servers with “mysql” role on port 3306. Going further with security profile analogy, an instance can be launched with multiple SGs – similar to a server with multiple roles. Because all rules in SG are ACCEPT rules, it’s trivially easy to combine them (more on this down in my future features wishlist).

Each rule in SG (called “permission”) must specify the source of packets to be allowed. It can be either a subnet anywhere on the Internet (in CIDR notation, with 0.0.0.0/0 being entire Internet) or another security group, which once again promotes managing firewall by role. Interestingly, in the latter case, the source SG does not necessarily have to belong to your AWS account – it can be anyone’s. This makes it easy to grant selective access to your instances from instances run by your friends, partners and vendors. It works only if their instances are running in the same EC2 region (US or EU), because this functionality works only using EC2 private IP addresses.

Specifying rules with other SGs as source helps you deal with dynamic IP addressing in EC2. Without this feature, each time a new instance was launched, you would have to adjust the SGs. It could become a mess if the application you are running in EC2 is very dynamic (scales up or down frequently). In general, if you are using IP address instead of SG name in a rule that allows certain communications to your instance from another EC2 instance in the same region, you are doing it wrong (you should be using source instance’s SG, not its IP).
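
The matching logic described so far can be modeled in a few lines of Python. This is a toy model for illustration, not EC2's actual implementation: Rule and is_allowed are names I made up, a rule covers a single port rather than a port range, and ICMP types are ignored. It shows the two kinds of source (CIDR block or group name) and why combining groups is trivial when every rule is an ACCEPT rule.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str  # "tcp", "udp" or "icmp"
    port: int      # simplification: one port instead of a range
    source: str    # CIDR block such as "0.0.0.0/0", or the name of a group


def is_allowed(dest_groups, rules_by_group, protocol, port, src_ip, src_groups=()):
    """Return True if any ACCEPT rule in any of the destination's groups matches."""
    for group in dest_groups:
        for rule in rules_by_group.get(group, ()):
            if rule.protocol != protocol or rule.port != port:
                continue
            if "/" in rule.source:
                # CIDR source: match on the sender's IP address.
                if ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source):
                    return True
            elif rule.source in src_groups:
                # Group source: match on the sender's role, whatever its IP is.
                return True
    return False


rules = {
    "mysql": [Rule("tcp", 3306, "webapp")],
    "webapp": [Rule("tcp", 80, "0.0.0.0/0")],
}
```

With these rules, is_allowed(["mysql"], rules, "tcp", 3306, "10.1.2.3", ["webapp"]) is True no matter which private IP the webapp instance happens to get, which is exactly why the rule never needs adjusting when instances come and go.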

To allow traffic from any EC2 instance in the same region, create a rule with source as 10.0.0.0/8 (all private IPs in EC2 so far are from this block, so this rule will not affect public IP traffic). To allow traffic from another region, you can easily find out public IPs of EC2 US and EC2 EU by launching an instance and looking up its IP in ARIN or RIPE Whois databases (note there may be multiple blocks of public IPs in use by each region).

A list of security groups with which your instance is currently running is available from inside the instance, using EC2 meta-data service (ec2-metadata -s). You can use this functionality to do some on-boot customizations based on which role this instance has or doesn’t have. Be sure to run such on-boot scripts after networking has been set up and eth0 interface is up.
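
For on-boot scripts written in Python, the same list can be read straight from the meta-data service over HTTP. The sketch below assumes the plain, unauthenticated meta-data endpoint; newer instance configurations may require a session token first, which this does not handle. The function names are mine.

```python
import urllib.request

# Link-local address of the EC2 meta-data service, reachable only from inside an instance.
METADATA_URL = "http://169.254.169.254/latest/meta-data/security-groups"


def parse_groups(body):
    """The service returns one group name per line."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def fetch_security_groups(timeout=2.0):
    """Ask the meta-data service which groups this instance was launched with."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return parse_groups(resp.read().decode("utf-8"))


def has_role(groups, role):
    """True if the instance runs with the given group, treated as a role."""
    return role in groups
```

An on-boot script could then do something like: if has_role(fetch_security_groups(), "webapp"), start the web server. The short timeout keeps the script from hanging if networking is not up yet.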

Advanced Usage

I know that many folks are used to running their datacenter-based servers without local firewalls relying on protection of the network perimeter. There is no out-of-the-box perimeter in Amazon EC2 (shameless plug – third-party solutions are available). I personally highly recommend the use of local firewall in conjunction with SGs, because SGs can’t do everything (see my wishlist #1 and #3 below). Two levels of protection, instead of one, won’t hurt and should reduce probability of operator error in one of the layers leading to drastic consequences.

SG can be modified at any time using API, and modifications take immediate effect on all instances that are running with this SG. It works great for connectivity that is required occasionally. For example, you probably don’t need to have SSH open on your instances at all times. When you are about to SSH in, you can open tcp/22 and when you are done – close it. This trivially easy method will keep your instances more secure.
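
The open-then-close routine is easy to forget halfway, so it helps to wrap it in something that always closes the port again. Here is a sketch as a Python context manager. It assumes a client object shaped like boto3's EC2 client (its authorize_security_group_ingress and revoke_security_group_ingress calls); temporary_ingress itself, the group ID and the CIDR in the usage note are made up.

```python
from contextlib import contextmanager


@contextmanager
def temporary_ingress(ec2, group_id, cidr, port=22, protocol="tcp"):
    """Open one port on a security group and close it again, even on errors."""
    rule = dict(GroupId=group_id, IpProtocol=protocol,
                FromPort=port, ToPort=port, CidrIp=cidr)
    ec2.authorize_security_group_ingress(**rule)
    try:
        yield
    finally:
        # Runs whether the body finished normally or raised.
        ec2.revoke_security_group_ingress(**rule)
```

Usage would be along the lines of: with temporary_ingress(boto3.client("ec2"), "sg-0123", "198.51.100.7/32"): run your SSH session. When the block exits, tcp/22 is closed again.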

Additionally, note that you don’t need to have any access to your instance to adjust SGs – all SG operations are performed against EC2 API endpoint, not through your instance. There is absolutely no way to irreversibly lock yourself out of your instance – a hugely positive side effect for anyone who has ever cut off their access while trying to fix a problem.

A common task is to allow certain functionality to be called only by instances running in a specific security group. For example, check out this thread on EC2 forum. Short of re-writing the app in question to be EC2-aware, SGs offer an elegant solution. Enable requested functionality on a special network port and allow only instances from specific security group to connect to that port. Problem solved!

My Future Features Wishlist

There are several things that I would like SGs to do that they currently don’t do:

  • As of today, all outgoing and “related” packets are implicitly allowed. I hope SGs will provide some control over these in the future.
  • As of today, you can’t attach or detach an SG to/from a running instance – a list of SGs is set at instance launch time and remains unchanged until the instance is shut down (you can add or remove rules in groups at any time, but can’t modify SG membership for your instances once they are launched) – I hope this can be added in the future.
  • As of today, all rules in SGs are ACCEPT rules. Being able to use REJECT or DROP rules would be nice. Yes, I realize that combining multiple SGs would become tricky (because the order matters), but I think this difficulty could be addressed similar to Order directive in Apache HTTPD.
  • As of today, if a packet gets dropped due to SG, there is no way to find out about it – I hope something can be done about logging this information and making it available via some new API call, possibly something like A6.
  • Interesting things could be done if EC2 meta-data service could provide more information about other members of SGs that current instance has – I hope this could be added for easier discovery.

Conclusion

Firewall is an important subsystem of an Infrastructure as a Service cloud. With the bar set this high by Amazon EC2, I am looking forward to what other IaaS cloud implementations are planning to deliver.

Tags: cloud computing

On Cloud Lock-In

September 15th, 2009 · Comments Off

I left this comment on today’s post by Randy Bias titled VMWare vs Amazon… ROUND ONE… FIGHT!:

Functionality is more important, imho. As a hypothetical example, say there exists an EC2-like cloud where security groups span all regions (in EC2, as we all know, security groups are confined to a single region). Switching between EC2 and this new cloud and back for operations (start, stop, status) would be relatively easy, with help of abstraction libraries; but once you set up your architecture to use global security groups and rely on this fact when writing your app, it won’t be as easy to switch back and forth.

In other words, cloud lock-in via functionality is harder to overcome than cloud lock-in via API.

Tags: cloud computing

Shiny Cloud APIs – Necessary But Not Sufficient

September 8th, 2009 · Comments Off

In the stream of non-stop cloud computing chatter that was surrounding VMWorld 2009 that wrapped up last week, I noticed a pattern – folks were paying disproportionate amount of attention to API, API portability and API standardization, as opposed to actual technology concepts and constructs that are going to power new clouds.

API indeed is important – I blogged about it before. But so is curb appeal of a house you might be looking to buy. But you are not going to buy a house just because it looks nice from the outside, right? You will want to consider interior, location, and many other factors before making a decision. Similarly, API alone (or portability of API between multiple vendors) is not nearly enough to get you to choose this cloud over its competitor. There are other things such as features, infrastructure decisions, bandwidth, pricing, tech ops, technical support that play a significant role (or at least should play a role in your decision making).

Well-thought-out, scalable, responsive and easy-to-use API is a NECESSARY condition of a successful cloud, but not SUFFICIENT.

It means that a successful cloud implies good API, not vice versa. Another way to read the same would be to say that bad API implies unsuccessful cloud (A->B is the same as (not B)->(not A)).
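
If the logic feels slippery, it can be checked by brute force over all four combinations. The few lines of Python below are only a truth-table check of the statement above; the variable names are mine.

```python
from itertools import product


def implies(a, b):
    """Material implication: 'a -> b' is false only when a is true and b is false."""
    return (not a) or b


# A -> B agrees with (not B) -> (not A) in every one of the four cases.
for successful, good_api in product([False, True], repeat=2):
    assert implies(successful, good_api) == implies(not good_api, not successful)

# The converse, B -> A ("good API implies successful cloud"), is a different claim.
converse_differs = any(
    implies(s, g) != implies(g, s) for s, g in product([False, True], repeat=2)
)
```

The loop passes silently, and converse_differs ends up True: a good API with an unsuccessful cloud is perfectly consistent with "necessary", which is the whole point of "not sufficient".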

I am very excited about recent developments in infrastructure-as-a-service space, but would like to see core concepts and technologies that power clouds discussed as much as new API.

Tags: cloud computing

The Concept of Hyper Distributed Application

August 18th, 2009 · 1 Comment

Most folks in the industry are familiar with “distributed applications.” If app components are running on multiple hosts and need to communicate with each other using network, the app is said to be distributed.

Distributed applications are known for complexity of assuring all components are on the same page as to what’s going on around them. Hardware failures, network failures, operator errors can all cause chaos; distributed applications foresee these exception situations and attempt to know how to deal with them.

Up until now, the network piece of the puzzle has been usually under application owner’s control – it could be a LAN, or it could be a leased line to remote datacenter. Occasionally, a VPN would be used to provide a dedicated communication channel between locations over public Internet but its use was rarely focused on important stuff – a mission critical application would usually get a leased line.

With advance of public clouds such as Amazon EC2 and Google AppEngine however, these notions are changing. One day you may decide to leverage each cloud’s strengths and distinct features to build your app, or may want to avoid cloud lock-in or provide redundancy. In short, you may want to multi-source your infrastructure.

Your multi-sourced infrastructure will of course be a distributed application. But there is a significant difference between this and old-style distributed apps – this time you no longer have network connectivity under your control. And as a result, you will face 3 significant phenomena that substantially complicate using today’s distributed algorithms – uneven bandwidth, uneven latency and increased probability of connectivity loss (I blogged about the latter here).

And this is what I call a hyper distributed application. In other words, a hyper distributed application is a distributed app which runs on a network with uneven bandwidth, uneven latencies and increased probability of connectivity loss (as measured against that on a regular LAN), usually outside of application owner’s control (for example, Internet).

One example of a hyper distributed application is VPN-Cubed that we at CohesiveFT created to address emerging needs to multisource infrastructure. By the very nature of functionality it provides, its components (we call them VPN-Cubed Managers – they act as virtual routers and switches) are sometimes distributed over LAN, sometimes distributed over WAN, sometimes both. Communications between manager 1 and manager 2 can be fast and reliable, but between manager 1 and 3 slow and less reliable, with more frequent resets. Or manager 3 may simply disappear (as seen by its peers) – no, it doesn’t have to be down due to crash; it can simply mean that its network connection to the outside world was down, possibly temporarily.

Hyper distributed applications are relatively rare, because most architects tend to avoid this if they can. For example, Amazon EC2 has 2 regions – US and EU. Each region is a distinct EC2 system, with its own API endpoint, its own AMI IDs, kernel IDs, security groups, keypairs. There is no replication or conflict resolution between the regions – they are totally independent of each other. Why? Because it would be quite difficult to interconnect them into a single entity over public Internet. (I won’t be surprised if it gets implemented in the future though.)

Another example showing that hyper distributed applications are a distinct breed comes from Facebook Engineering blog post titled Scaling Out:

This setup works really well with only one set of databases because we only delete the value from memcache after the database has confirmed the write of the new value. That way we are guaranteed the next read will get the updated value from the database and put it in to memcache. With a slave database on the east coast, however, the situation got a little tricky.

When we update a west coast master database with some new data there is a replication lag before the new value is properly reflected in the east coast slave database. Normally this replication lag is under a second but in periods of high load it can spike up to 20 seconds.

It nicely illustrates how hyper distributed nature of the application adds complexity on top of what a plain distributed app already has.
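
The race in the quoted passage is easy to reproduce in a toy model. The Python below is not Facebook's code, just a sketch with made-up names: one master, one lagging replica, and a cache that is invalidated once the master confirms a write. A read served from the replica during the lag window puts the old value back into the cache.

```python
class ReplicatedStore:
    """Toy model: a master, a lagging replica and a cache in front of the replica."""

    def __init__(self):
        self.master = {}
        self.replica = {}
        self.pending = []  # writes confirmed by the master, not yet on the replica
        self.cache = {}

    def write(self, key, value):
        self.master[key] = value
        self.pending.append((key, value))
        self.cache.pop(key, None)  # invalidate only after the master confirmed

    def replicate(self):
        """Apply all pending writes to the replica (the end of the lag window)."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_via_replica(self, key):
        """Cache miss falls through to the replica and repopulates the cache."""
        if key not in self.cache:
            self.cache[key] = self.replica.get(key)
        return self.cache[key]


db = ReplicatedStore()
db.write("name", "old")
db.replicate()
db.write("name", "new")                     # cache invalidated, replica still lags
stale = db.read_via_replica("name")         # re-caches the old value
db.replicate()                              # replica catches up...
still_stale = db.read_via_replica("name")   # ...but the cache keeps serving "old"
```

Both stale and still_stale come out as "old" even though the replica eventually holds "new". With a single set of databases the window does not exist, which is the extra complexity the quote is describing.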

In conclusion, I would like to propose to separate a category of distributed applications that run on top of networks with uneven bandwidth and uneven latencies into their own (I don’t care much if they end up being called hyper distributed or something else), and start building up research and practical approaches focusing specifically on this area.

P.S. Also consider the future: when we reach inter-planet or inter-galactic communications, you know that latencies and bandwidth in space would not be (initially) the same as on our planet Earth. Better start working on this research now in order to be prepared…

Tags: Internet · distributed · software engineering

Electrical and Plumbing Analogies in Application Monitoring

August 10th, 2009 · Comments Off

Water and electricity are two components without which a modern home can’t function well. Both are provided as a utility, and both have strictly defined access points from which they can be consumed – taps for water and outlets for electricity.

But there are also differences. Every child knows that electric shock can cause injury even as a result of a short exposure – hence most perceive electricity as a powerful force. This force however has a binary switch attached to it, in the form of switches, circuit breakers and distribution board. Turn it on – electricity is flowing, turn it off – it’s not. When off, electricity can’t leak by design.

Water, on the other hand, is not perceived as such a great force because damage from short exposure is unlikely to be too severe. Additionally, indoor plumbing has no binary on-off switches – it’s measured by a degree of “open” or “close”, “hot” or “cold.” As a result, leaks can and do occur from time to time. And it’s these leaks that have a potential to do costly damage over time but still are not perceived dangerous enough to warrant immediate attention.

There are many things in software applications that are binary in nature – web server daemon is up or down, for example. We all take these all-or-nothing components seriously, because when it’s nothing, the app is down.

But we have our fair share of potentially leaky stuff as well – memory leaks, file descriptor leaks, network connection leaks, and so on. In other words, things that don’t happen instantaneously but build up over time, often hidden behind another, bigger component. Some of us don’t take these issues seriously enough because they lack the perceived power of being able to cause significant damage quickly enough. And it’s a mistake.

When monitoring a component of “electricity” type, most common test is to send a probe – if it returns OK, the component is up (“active monitoring”, “active polling” or simply “polling”). But this doesn’t work when monitoring a component of “plumbing” type – if water is flowing, it doesn’t mean there is no leak. In this case, a set of alarms instrumented into the component itself would be a better fit.
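
Here is one way the two styles could look side by side in Python. It is a sketch under my own assumptions: probe_ok and LeakAlarm are invented names, and "grew on every sample in the window by at least min_growth in total" is just one simple leak heuristic among many.

```python
from collections import deque


def probe_ok(is_up):
    """'Electricity' style check: a single yes/no probe."""
    return "OK" if is_up() else "DOWN"


class LeakAlarm:
    """'Plumbing' style check, fed from inside the component.

    Fires when a resource counter (open file descriptors, say) has grown on
    every sample in the window and the total growth is at least min_growth.
    """

    def __init__(self, window=5, min_growth=10):
        self.samples = deque(maxlen=window)
        self.min_growth = min_growth

    def record(self, value):
        """Add a sample and report whether the alarm is firing."""
        self.samples.append(value)
        return self.firing()

    def firing(self):
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        values = list(self.samples)
        rising = all(later > earlier for earlier, later in zip(values, values[1:]))
        return rising and values[-1] - values[0] >= self.min_growth
```

Feeding the alarm descriptor counts of 100, 104, 109, 115, 122 makes it fire on the fifth sample, while probe_ok keeps answering "OK" the whole time because the daemon is still up. Counts that wobble around 100 never trip it.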

The sooner we realize different nature of various components of our applications and the need to monitor them differently, the higher uptime for our applications we are going to achieve.

Tags: software engineering

New Era in Internet Search – Google vs Bing

July 30th, 2009 · Comments Off

This week marks the beginning of a new era in Internet search. For the first time in modern Internet history, there is a number 2 with sizable market share. This is going to become interesting once Bing and Yahoo! finish integration.

I switched to Google Search many years ago because it was the best – its results were most appropriate, its query language was most predictable, it was fast. In other words, it allowed me to find things easier, faster and with least amount of effort. Search was the first social application on the Web – by clicking on a search result you let search engine know “this is what I was looking for,” which is a form of user participation which allows users to influence (“vote”) selection of content.

I now feel however that Bing search is as good as Google’s. When I end up working on someone else’s machine without Firefox, I end up using IE. And while at first, I always went to google.com explicitly before submitting my search, on a couple of occasions I got lazy and tried Bing (via their upper-righthand corner search textbox). And surprise – results didn’t suck.

While up until now competition was based on quality and technology, now it’s shifting to marketing, distribution, conversions, churn rates, and so on – because quality (I think) is pretty close and no longer is a distinguishing factor (in economics speak, search quality is no longer a competitive advantage). Interestingly, if you read Wikipedia article on Competitive Advantage, you will see that “many forms of competitive advantage cannot be sustained indefinitely” – exactly what happened here.

I also think that this event re-emphasizes increased importance of Google’s relationship with Mozilla (which makes #2 browser) – there is no way IE will default to anything but Bing. This also underscores importance of Google’s investment in Chrome, their own browser platform. If I were Mozilla, I think I could try to extract maybe better terms from Google next time they renegotiate the contract. It’s a win-win for both.

My final observation has nothing to do with search. With technological competitive advantage gone, Google vs Bing showdown is now going to be about execution and – most importantly – effectiveness of leveraging network effect. I find it very interesting, because the same type of showdown may occur in other areas. For example, micro blogging. Twitter currently is by far the #1 platform, but not due to a technological competitive advantage (their technology is complex, their traffic is huge – but it’s not so insanely mathematically complex as to require PhDs to figure it out). As a result, all potential entrants to the microblogging space face a single huge obstacle – overcoming huge Twitter network effect. I suspect that Bing vs Google will let us observe and study whether and how network effect can be tamed and ultimately reversed.

In other words, I am most interested in seeing whether network effect in social Internet can be sustained indefinitely as a competitive advantage or not.

Up until now, I can’t name a case when a social Internet site which dominated its field got pushed aside and slipped from #1. In all cases up until now, new entrants carve up a niche and end up dominating it, while the original #1 remains overall #1. Or did I miss an example – could you help in the comments below?

It’s about to get very interesting.

Tags: Internet

Evaluating Cloud Computing from Buy vs Rent Perspective

July 27th, 2009 · Comments Off

What is driving people, projects and organizations to adopt cloud computing?

There is no single answer. Everyone’s situation is different, and everyone assigns different weights to different factors. But what is common in “to cloud or not to cloud” decision making is that fundamentally it’s like buy vs rent in housing.

*aaS is all about rent vs buy – rent is housing-as-a-service pay-as-you-go after all. You either want to be able to get out fast, or staying in one place for a long time doesn’t scare you. Rent might be more expensive over time and might constrain you in certain ways (I never met a landlord who would agree to let tenants paint walls bright green, for example), but on the other hand it does not require an up-front payment and allows a certain degree of flexibility. Buy involves a commitment, but may provide some benefits (like ability to do that painting project).

The key observation is that there is no single factor that simplistically would let you choose one over the other. Rent vs buy decisions are based on personal preferences, current situation, future plans, and surrounding circumstances – all subjective. In nearly identical situations, one would choose rent, and another would choose buy – and both will end up making right decisions for themselves.

Similar logic should apply to cloud computing decisions. A popular phrase is “it depends on workload” – which is another way of saying it depends on your use case, what you are trying to accomplish and which obstacles you’re trying to overcome. It also depends on what kind of company you are, how you have been doing your infrastructure projects in the past, what your plans are, and so on.

There is no right or wrong in either case, and in spite of what some would like you to believe, cloud computing is not right for every use case. So focus on what’s the right tool for the job at hand, with an eye towards the future.

Tags: cloud computing