<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Fubaredness Is Contagious &#187; web</title>
	<atom:link href="http://somic.org/category/web/feed/" rel="self" type="application/rss+xml" />
	<link>http://somic.org</link>
	<description>Dmitriy Samovskiy's Blog</description>
	<lastBuildDate>Wed, 01 Sep 2010 07:55:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Adjustable Per-URI Backend Capacity in Rabbitbal</title>
		<link>http://somic.org/2009/03/11/adjustable-per-uri-backend-capacity-in-rabbitbal/</link>
		<comments>http://somic.org/2009/03/11/adjustable-per-uri-backend-capacity-in-rabbitbal/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 17:45:04 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[rabbitbal]]></category>

		<guid isPermaLink="false">http://somic.org/?p=296</guid>
		<description><![CDATA[I recently pushed a Rabbitbal update to GitHub &#8211; http://github.com/somic/rabbitbal.
The biggest enhancement (IMHO) is the ability to increase or decrease the number of backend consumers based on any HTTP request header. In &#8220;table&#8221; routing mode (see rabbitbal.yml), you can now specify an array of tests against which incoming request headers will be matched. This will cause your [...]]]></description>
			<content:encoded><![CDATA[<p>I recently pushed a Rabbitbal update to GitHub &#8211; <a href="http://github.com/somic/rabbitbal">http://github.com/somic/rabbitbal</a>.</p>
<p>The biggest enhancement (IMHO) is the ability to increase or decrease the number of backend consumers based on any HTTP request header. In &#8220;table&#8221; routing mode (see rabbitbal.yml), you can now specify an array of tests against which incoming request headers will be matched. The request is then published with the matching key (note :key). Your backend consumers use the same YAML file and can bind to all or only some of the queues, which gives you flexibility in adjusting capacity. The old functionality is still available in &#8220;topic&#8221; routing mode.</p>
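<p>To make the table idea concrete, here is a rough sketch in Python (Rabbitbal itself is Ruby). The table layout, header names, patterns and key names below are all invented for illustration; only the notion of a key per test comes from rabbitbal.yml, so please read this as a toy and not as Rabbitbal&#8217;s actual code.</p>
<pre style="font-size:14px">
```python
import re

# Hypothetical routing table: each test names a header, a pattern and a key.
ROUTING_TABLE = [
    {"header": "User-Agent", "pattern": r"bot", "key": "request.bots"},
    {"header": "Host", "pattern": r"^api\.", "key": "request.api"},
]
DEFAULT_KEY = "request.default"  # assumed fallback when no test matches


def routing_key(headers):
    """Return the key of the first test whose pattern matches its header."""
    for test in ROUTING_TABLE:
        value = headers.get(test["header"], "")
        if re.search(test["pattern"], value, re.IGNORECASE):
            return test["key"]
    return DEFAULT_KEY
```
</pre>
<p>In this sketch, consumers that bind only to the queue for request.api would add capacity for that slice of traffic and nothing else.</p>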
<p>Note that I still use a topic-based exchange, because I wanted to support a use case where you aggregate all incoming requests into separate queues (the routing key would be something like &#8220;request.#&#8221;) for bot detection, access log aggregation, etc. In other words, each request must ultimately end up in a single queue where it will be picked up by backend servers, while at the same time it can also be duplicated into other queues for other purposes.</p>
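<p>For readers new to topic exchanges, this is roughly how a pattern such as request.# is matched against a routing key. It is a small Python sketch of the AMQP wildcard rules (&#8220;*&#8221; stands for exactly one word, &#8220;#&#8221; for zero or more), not code taken from RabbitMQ or Rabbitbal, and the routing keys in the usage note are made up.</p>
<pre style="font-size:14px">
```python
def topic_matches(pattern, routing_key):
    """Check a dot-separated routing key against a topic pattern."""

    def match(p, k):
        if not p:
            return not k  # pattern used up: the key must be used up too
        if p[0] == "#":
            # "#" may swallow zero or more words of the key
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        # "*" matches any single word, anything else must match literally
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])

    return match(pattern.split("."), routing_key.split("."))
```
</pre>
<p>A queue bound with request.# would therefore receive a copy of request.feed and of request.feed.GET, while a key like response.feed would not match.</p>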
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/03/11/adjustable-per-uri-backend-capacity-in-rabbitbal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Protect Your Blog Against Misbehaving Bots With Apache</title>
		<link>http://somic.org/2009/02/16/protect-your-blog-against-misbehaving-bots-with-apache/</link>
		<comments>http://somic.org/2009/02/16/protect-your-blog-against-misbehaving-bots-with-apache/#comments</comments>
		<pubDate>Mon, 16 Feb 2009 05:17:31 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[web]]></category>
		<category><![CDATA[bots]]></category>
		<category><![CDATA[rss]]></category>

		<guid isPermaLink="false">http://somic.org/?p=365</guid>
		<description><![CDATA[I recently glanced over Apache httpd logs for this blog. Among other things, I discovered several bots that were making quite a few useless requests, thus driving load on the machine. It wasn&#8217;t a big deal but a matter of principle. If all of us webmasters start paying attention to misbehaving bots and block them, [...]]]></description>
			<content:encoded><![CDATA[<p>I recently glanced over Apache httpd logs for this blog. Among other things, I discovered several bots that were making quite a few useless requests, thus driving load on the machine. It wasn&#8217;t a big deal but a matter of principle. If all of us webmasters start paying attention to misbehaving bots and block them, their authors or maintainers might finally learn how to play by the rules &#8211; it&#8217;s not difficult really.</p>
<p>Let&#8217;s say you are running a blog and hosting your own feed. Regular crawler bots from search engines like Google, Yahoo, MSN, etc. will always check robots.txt before sending the next batch of requests, and they won&#8217;t ask for the same URL very frequently.</p>
<p>The second type of bot you are going to deal with is the RSS reader bot. There is no reason for these bots to hit anything but the URLs of your feeds, and they usually will not check robots.txt. Their user agent string (which you can see if you enable <em>CustomLog /path/to/log combined</em>) will usually contain &#8220;N subscribers,&#8221; which gives you a rough idea of how many people subscribe to your feed via that service. These bots may hit you at varying intervals, ranging from half an hour to several hours between requests. Additionally, well-behaved RSS reader bots always include If-Modified-Since in their requests &#8211; you will see in your logs that your usual response to these queries is 304 Not Modified, and only once, after a new post is published, should you see a 200 OK response.</p>
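<p>The 304 versus 200 behaviour comes down to a date comparison on the server. Here is a simplified Python sketch of that decision; the function name is mine, and a real server also looks at things like ETag, so this is an illustration and not what Apache or WordPress literally does.</p>
<pre style="font-size:14px">
```python
from email.utils import parsedate_to_datetime


def feed_status(last_modified, if_modified_since=None):
    """Return the status a server would send for a feed request."""
    if if_modified_since is None:
        # No validator in the request: the whole feed has to be sent again.
        return 200
    changed = parsedate_to_datetime(last_modified)
    cached = parsedate_to_datetime(if_modified_since)
    # Feed newer than the reader's copy: send it. Otherwise 304 and no body.
    return 200 if changed > cached else 304
```
</pre>
<p>A reader that repeats the date of its last successful fetch gets a 304 until something new is published; a reader that omits the header gets the full feed every single time.</p>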
<p>So what did I see in my logs? First, I noticed a bot requesting / from 2 different IPs every hour, without including If-Modified-Since. Wasteful and negligent. If this bot does not know how to appreciate my server resources, I am sure it won&#8217;t miss my content &#8211; block!</p>
<pre style="font-size:14px">&lt;Location /&gt;
order allow,deny
allow from all
# BadBot1 is from its User-Agent string
SetEnvIf User-Agent BadBot1 DenyBot=1
deny from env=DenyBot
&lt;/Location&gt;</pre>
<p>Then I noticed several RSS reader bots that were requesting /feed/ way too frequently (every 10 minutes for one subscriber in one case, every 60 minutes for one subscriber in another) and were very inconsistent with If-Modified-Since &#8211; I couldn&#8217;t detect any logic in their requests: sometimes I saw a 304 and sometimes a 200 response, even when there was no new content on my site.</p>
<p>I didn&#8217;t feel right blocking these altogether, so instead I opened a one-hour window during the night when such bots are allowed to fetch the feed&#8217;s content &#8211; at all other times, their requests are blocked.</p>
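<pre style="font-size:14px">RewriteEngine On
# Match either misbehaving reader by its User-Agent string
RewriteCond %{HTTP_USER_AGENT} .*BadBot2.* [OR]
RewriteCond %{HTTP_USER_AGENT} .*BadBot3.*
# ...but only when it asks for the feed
RewriteCond %{REQUEST_URI} /feed/
# ...and only while the flag file does not exist
RewriteCond /etc/allow_bad_bots !-f
# No rewriting, just answer 403 Forbidden
RewriteRule . - [forbidden]</pre>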
<p>Voila! If the file /etc/allow_bad_bots exists (I create and delete it from cron; it exists on my system between 1am and 2am), requests from these bots will succeed. During the rest of the day, these rude bots get 403 Forbidden.</p>
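<p>If the way the conditions combine is not obvious, the same decision can be written out in Python. This is only a restatement of the rules for illustration (the function name is mine, and BadBot2 and BadBot3 remain placeholders): the two user agent conditions are ORed together, and the result is ANDed with the other two conditions.</p>
<pre style="font-size:14px">
```python
import os

BAD_BOTS = ("BadBot2", "BadBot3")  # placeholder names, as in the rules above
FLAG_FILE = "/etc/allow_bad_bots"


def is_forbidden(user_agent, request_uri, flag_file=FLAG_FILE):
    """Mirror the rewrite conditions: (bot2 OR bot3) AND feed AND no flag."""
    bad_bot = any(name in user_agent for name in BAD_BOTS)
    wants_feed = "/feed/" in request_uri
    window_closed = not os.path.isfile(flag_file)
    return bad_bot and wants_feed and window_closed
```
</pre>
<p>Everybody else is unaffected at all hours, and the listed bots are only turned away from the feed while the flag file is missing.</p>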
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/02/16/protect-your-blog-against-misbehaving-bots-with-apache/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Perlbal Reproxy and HTTP Auth</title>
		<link>http://somic.org/2009/01/09/perlbal-reproxy-and-http-auth/</link>
		<comments>http://somic.org/2009/01/09/perlbal-reproxy-and-http-auth/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 17:22:56 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[web]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[http auth]]></category>
		<category><![CDATA[perlbal]]></category>
		<category><![CDATA[reproxy]]></category>
		<category><![CDATA[x-reproxy-url]]></category>

		<guid isPermaLink="false">http://somic.org/?p=236</guid>
		<description><![CDATA[I use Perlbal in one of the systems to reproxy requests to an internal URL. Reproxying to URL is a powerful feature that works like this.

An HTTP request comes to Perlbal.
Perlbal reverse-proxies it to one of its backend servers.
Backend server does some work (in my case, does extensive verification of URL) but instead of returning [...]]]></description>
			<content:encoded><![CDATA[<p>I use <a href="http://www.danga.com/perlbal">Perlbal</a> in one of the systems to reproxy requests to an internal URL. Reproxying to a URL is a powerful feature that works like this:</p>
<ol>
<li>An HTTP request comes to Perlbal.</li>
<li>Perlbal reverse-proxies it to one of its backend servers.</li>
<li>The backend server does some work (in my case, extensive verification of the URL), but instead of returning the entire response (status, headers, body), it returns an X-REPROXY-URL header, which includes a list of URLs.</li>
<li>Transparently to the end user, Perlbal attempts to fetch content from one of these URLs and returns that content to the user.</li>
</ol>
<p>The other day I found out that Perlbal can&#8217;t reproxy to URLs that require HTTP basic authentication. Here is the part of Perlbal that parses the X-REPROXY-URL header (this is from ClientProxy.pm); you can clearly see from the regex that it treats URLs as (host, port, path) tuples:</p>
<pre style="font-size:10px;background-color:#DDDDDD">    # construct reproxy_uri list
    if (defined $urls) {
        my @uris = split /\s+/, $urls;
        $self-&gt;{currently_reproxying} = undef;
        $self-&gt;{reproxy_uris} = [];
        foreach my $uri (@uris) {
            next unless $uri =~ m!^http://(.+?)(?::(\d+))?(/.*)?$!;
            push @{$self-&gt;{reproxy_uris}}, [ $1, $2 || 80, $3 || '/' ];
        }
    }</pre>
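<p>To see what that regex does with credentials, here is the same pattern ported to Python as a quick experiment. The function name and the sample URLs are made up, and this is only a sketch of the parsing step, not of Perlbal as a whole.</p>
<pre style="font-size:10px;background-color:#DDDDDD">
```python
import re

# Same pattern as in the Perl snippet above, minus the m!...! delimiters.
REPROXY_URI = re.compile(r"^http://(.+?)(?::(\d+))?(/.*)?$")


def parse_reproxy_uri(uri):
    """Split a reproxy URI into a (host, port, path) tuple."""
    match = REPROXY_URI.match(uri)
    if match is None:
        return None  # the Perl code simply skips such a URI
    host, port, path = match.groups()
    return host, int(port or 80), path or "/"
```
</pre>
<p>A plain URI such as http://backend1/file comes back as (backend1, 80, /file). For http://user:secret@10.0.0.5:8080/private/file the first field is user:secret@10.0.0.5, so the credentials are simply folded into the host and nothing in the tuple represents them separately.</p>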
<p>And my backend Apaches do require http auth. What to do?</p>
<p>RTFM to the rescue! Apache provides a very cool feature &#8211; the <a href="http://httpd.apache.org/docs/2.2/mod/core.html#satisfy">Satisfy (all|any)</a> directive. Essentially, it means that for a Directory or Location I can specify both http auth and IP-based access control, and with Satisfy Any I can allow access if at least one of these conditions is met (the default is Satisfy All).</p>
<p>Here is what it looks like in the httpd config:</p>
<pre style="font-size:10px;background-color:#DDDDDD">&lt;Location /foo&gt;
  # http auth
  AuthType basic
  AuthName "protected"
  AuthUserFile /etc/apache2/users
  Require valid-user

  order allow,deny
  # this is subnet where perlbal is running
  # backends see perlbal's reproxy requests from this subnet
  allow from 192.168.4 127.0.0.1
  satisfy any
&lt;/Location&gt;</pre>
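<p>Spelled out as plain logic, the difference between the two Satisfy modes looks like the Python sketch below. It only models the decision for the config above (the function name and the outside address in the usage note are mine); it is not how httpd implements it internally.</p>
<pre style="font-size:10px;background-color:#DDDDDD">
```python
import ipaddress

# "allow from 192.168.4 127.0.0.1": a partial address covers the whole /24.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.168.4.0/24"),
    ipaddress.ip_network("127.0.0.1/32"),
]


def access_granted(client_ip, has_valid_user, satisfy="any"):
    """Combine the host check and the auth check the way Satisfy does."""
    address = ipaddress.ip_address(client_ip)
    host_ok = any(address in network for network in ALLOWED_NETWORKS)
    if satisfy == "any":
        return host_ok or has_valid_user  # one passing check is enough
    return host_ok and has_valid_user  # "all": both checks must pass
```
</pre>
<p>With Satisfy Any, a reproxy request from 192.168.4.17 gets in without credentials, while a browser coming from an outside address such as 203.0.113.9 still has to log in.</p>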
<p>The alternative would be to create a fake URI outside of the http auth location and rewrite it with mod_rewrite, or possibly use a symlink &#8211; way less transparent.</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2009/01/09/perlbal-reproxy-and-http-auth/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introducing Rabbitbal</title>
		<link>http://somic.org/2008/12/18/introducing-rabbitbal/</link>
		<comments>http://somic.org/2008/12/18/introducing-rabbitbal/#comments</comments>
		<pubDate>Thu, 18 Dec 2008 14:47:39 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[amqp]]></category>
		<category><![CDATA[reverse proxy]]></category>

		<guid isPermaLink="false">http://somic.org/?p=253</guid>
		<description><![CDATA[Inspired by Nanite, a very interesting project by Ezra Zygmuntowicz of EngineYard that uses RabbitMQ and eventmachine-based ruby amqp library by Aman Gupta, I sat down and wrote Rabbitbal, a reverse proxy for Rails (as well as other web frameworks, not necessarily limited to Ruby) on top of RabbitMQ. It&#8217;s now available on github at [...]]]></description>
			<content:encoded><![CDATA[<p>Inspired by <a href="http://github.com/ezmobius/nanite/tree/master">Nanite</a>, a very interesting project by <a href="http://brainspl.at/articles/2008/10/11/merbcamp-keynote-and-introducing-nanite">Ezra Zygmuntowicz</a> of EngineYard that uses <a href="http://www.rabbitmq.com">RabbitMQ</a> and eventmachine-based <a href="http://github.com/tmm1/amqp">ruby amqp library</a> by Aman Gupta, I sat down and wrote Rabbitbal, a reverse proxy for Rails (as well as other web frameworks, not necessarily limited to Ruby) on top of RabbitMQ. It&#8217;s now available on github at <a href="http://github.com/somic/rabbitbal">http://github.com/somic/rabbitbal</a>. Rabbitbal code is based on Nanite.</p>
<p>Here are the benefits of the AMQP-based approach over traditional HTTP-based reverse proxies as I see them, taken from the Rabbitbal README file (in no particular order).</p>
<ol>
<li>A model where workers fetch work as fast as they can (each going at its own pace) should in theory provide more efficient load balancing than a model where the proxy assigns work based on certain criteria; methods based on round robin, arbitrary weights or least connections simply become unnecessary.</li>
<li>RabbitMQ broker implements intelligent failover out of the box &#8211; if a web server disconnects before ack&#8217;ing, the request will be automagically requeued for another server; all in all, RabbitMQ is way smarter than a bunch of low level TCP connections.</li>
<li>Enhanced security of actual web servers &#8211; servers behind Rabbitbal do not need inbound connectivity, they only need to be able to establish an outgoing connection to RabbitMQ broker(s).</li>
<li>Rabbitbal does not need to know IPs or have direct connectivity into its web servers (use case: Amazon EC2 without Elastic IPs)</li>
<li>Using Duplication pattern of RabbitMQ (see Resources below), you could be reading requests and responses off of the same broker in real time (access log aggregation, double-entry book keeping, logging, bot detection)</li>
<li>You could relatively easily have one request go to more than 1 web server</li>
<li>Add capacity as often and as much as you like &#8211; rabbitbal won&#8217;t even know</li>
<li>By slightly readjusting mapping between queues and URIs, you could add or remove capacity per URI if needed</li>
<li>TCP overhead savings compared with HTTP proxies (AMQP uses persistent TCP connections)</li>
</ol>
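<p>The first point can be illustrated with a toy calculation. The Python sketch below compares how long a batch of identical requests takes when workers pull the next request as soon as they are free versus when requests are dealt out round robin. The worker speeds and request count are invented and no real broker is involved, so take it as a back-of-the-envelope model and not a benchmark.</p>
<pre style="font-size:14px">
```python
import heapq


def pull_makespan(job_count, seconds_per_job):
    """Each worker takes the next job from a shared queue when it is free."""
    free_at = [(0, worker) for worker in range(len(seconds_per_job))]
    heapq.heapify(free_at)
    finish = 0
    for _ in range(job_count):
        now, worker = heapq.heappop(free_at)  # earliest available worker
        done = now + seconds_per_job[worker]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, worker))
    return finish


def round_robin_makespan(job_count, seconds_per_job):
    """Jobs are dealt out in turn, regardless of how fast each worker is."""
    totals = [0] * len(seconds_per_job)
    for job in range(job_count):
        worker = job % len(seconds_per_job)
        totals[worker] += seconds_per_job[worker]
    return max(totals)
```
</pre>
<p>With 10 requests and two workers that need 1 and 4 seconds per request, the pull model finishes after 8 seconds in this toy, while round robin needs 20 because the slow worker is handed half of the work.</p>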
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2008/12/18/introducing-rabbitbal/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>PalmOS Blazer-Friendly Browsing with GWT</title>
		<link>http://somic.org/2008/09/18/palmos-blazer-friendly-browsing-with-gwt/</link>
		<comments>http://somic.org/2008/09/18/palmos-blazer-friendly-browsing-with-gwt/#comments</comments>
		<pubDate>Thu, 18 Sep 2008 15:20:34 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[technology]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[blazer]]></category>
		<category><![CDATA[gwt]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[optimize]]></category>
		<category><![CDATA[palm]]></category>
		<category><![CDATA[palmos]]></category>

		<guid isPermaLink="false">http://somic.org/?p=108</guid>
		<description><![CDATA[Those of us who [still] have a PalmOS-based device and use its Blazer browser will probably know that Blazer may take some time to render complex pages, and the end result might not even be readable on a small screen. I recently found a solution to this problem that works great, at least for me.
On [...]]]></description>
			<content:encoded><![CDATA[<p>Those of us who [still] have a PalmOS-based device and use its Blazer browser will probably know that Blazer may take some time to render complex pages, and the end result might not even be readable on a small screen. I recently found a solution to this problem that works great, at least for me.</p>
<p>On your phone, head over to <a href="http://www.google.com/gwt/n">http://www.google.com/gwt/n</a> and enter the URL you are trying to reach. <a href="http://code.google.com/webtoolkit/">GWT</a> (which stands for Google Web Toolkit) will fetch the content and optimize it for your mobile browser. Additionally, it will adjust all links to also go through GWT, which makes Internet surfing with Blazer not painful at all.</p>
<p>For example, I like checking <a href="http://techmeme.com/">Techmeme</a> on the train on my way to work. They offer a <a href="http://techmeme.com/mini">mini</a> version, which renders well on my Palm Centro from Sprint. But if I want to follow a story and click on a link, I usually get a page that is not optimized for mobile (there are several exceptions that detect the user agent and adjust content formatting). Instead, in my Blazer bookmarks, I have this &#8211; <a href="http://www.google.com/gwt/n?u=http%3A%2F%2Ftechmeme.com">http://www.google.com/gwt/n?u=http%3A%2F%2Ftechmeme.com</a>. From this page, I can jump to any story and get the content nicely formatted for my Centro.</p>
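<p>The bookmark is just the target address percent-encoded and appended to the u parameter. If you want to generate such bookmarks for other sites, a few lines of Python will do; the helper name is mine, and the only thing taken from the service is the URL shape shown above.</p>
<pre style="font-size:14px">
```python
from urllib.parse import quote

GWT_PREFIX = "http://www.google.com/gwt/n?u="


def gwt_bookmark(url):
    """Build a bookmark that sends the given page through the transcoder."""
    # safe="" makes quote() escape ":" and "/" as well
    return GWT_PREFIX + quote(url, safe="")
```
</pre>
<p>Passing in http://techmeme.com reproduces the bookmark above.</p>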
<p><strong>UPDATE</strong>: It looks like I might have confused Google Wireless Transcoder (GWT) with Google Web Toolkit (GWT).</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2008/09/18/palmos-blazer-friendly-browsing-with-gwt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SEO and 301 Redirect</title>
		<link>http://somic.org/2008/09/09/seo-and-301-redirect/</link>
		<comments>http://somic.org/2008/09/09/seo-and-301-redirect/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 21:13:22 +0000</pubDate>
		<dc:creator>Dmitriy</dc:creator>
				<category><![CDATA[technology]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[bug]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[googlebot]]></category>
		<category><![CDATA[SEO]]></category>

		<guid isPermaLink="false">http://somic.org/?p=106</guid>
		<description><![CDATA[I was under the assumption that when a site moves to a new domain or URL space, the best thing to do from an SEO perspective was to put up one&#8217;s site at the new place and set up the old site to do HTTP 301 redirects (Moved Permanently).
I did it a couple of weeks ago when I [...]]]></description>
			<content:encoded><![CDATA[<p>I was under the assumption that when a site moves to a new domain or URL space, the best thing to do from an SEO perspective was to put up one&#8217;s site at the new place and set up the old site to do HTTP 301 redirects (Moved Permanently).</p>
<p>I did it a couple of weeks ago when I was moving this site to its current address, but noticed today that my old address still shows up in Google at the top of search results. I got curious and checked both Yahoo and MSN; both of them correctly omit links that have been redirected.</p>
<p>Am I missing anything, or is it a bug in GoogleBot?</p>
]]></content:encoded>
			<wfw:commentRss>http://somic.org/2008/09/09/seo-and-301-redirect/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
