The Advisory Boar

By <ams@toroid.org>

Dealing with messy data: Google Refine

Recently at work, I needed to import a substantial quantity of data from some spreadsheets into an SQL database. Due to multiple maintainers and changing needs, the spreadsheets were a mess of special cases, shorthand notation, and minor errors. I gave up on importing them as-is, and asked for the relevant data to be extracted into a format I defined. Writing a converter for the latter was much easier (although errors kept cropping up even in the new spreadsheet).

Today I read about Google Refine, a program dedicated to dealing with messy data. The video demonstrations on the project page show off some useful capabilities, such as grouping together and canonicalising values from a column in a few steps, scaling numeric values, and rearranging the data in various ways.
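To give a rough idea of what "grouping together and canonicalising values" can look like, here is a minimal Python sketch of one common approach: reduce each value to a normalised key, group values that share a key, and replace each one with the most frequent spelling in its group. This is an illustration only, not Refine's actual implementation; the function names and the sample column are made up for the example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores case, accents, punctuation and word order."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, then sort the unique tokens.
    tokens = re.sub(r"[^\w\s]", " ", stripped.lower()).split()
    return " ".join(sorted(set(tokens)))


def canonicalise(values):
    """Map each value to the most common spelling sharing its fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)

    mapping = {}
    for members in groups.values():
        # Most frequent spelling wins; on a tie, the first one seen wins.
        canonical = max(members, key=members.count)
        for member in members:
            mapping[member] = canonical
    return mapping


# Hypothetical messy column, as it might come out of a spreadsheet.
column = ["New Delhi", "new delhi", "Delhi, New", "New Delhi", "Kolkata", "kolkata "]
mapping = canonicalise(column)
cleaned = [mapping[value] for value in column]
```

With the sample column above, all four Delhi variants collapse to "New Delhi" and both Kolkata variants to "Kolkata". Real data would still need a human to review each group before accepting the merge.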

Refine is a Java program that you install and interact with using a web browser. I've never used such a program, and I suspect it may be painful to install, but I'm going to try it if any more spreadsheets appear on the horizon.

Update 2020-03-23: Google Refine was renamed again and is now known as OpenRefine.

Airtel's Fair Usage Policy

This morning, I got an email (and SMS) alert from Airtel:

Dear Airtel Customer,
You have consumed 100% of your high speed data transfer limit of 10000
MB. Now you will be getting a revised speed till the end of this bill
cycle (as per the bill plan subscribed by you) and the speed will be
back to normal at the beginning of the new bill cycle. You are still
on an unlimited plan and all your data transfer remains free.

Airtel was forced to institute a Fair Usage Policy for "unlimited" data transfer plans, because "A very small number of customers use an excessive amount of the network bandwidth, to the extent that it can impair the experience of others." But "…needless to mention, the usage levels set are very generous such that most customers will not be affected." And remember, they're only defining a "fair usage level", not a "limit".

I humbly apologise to everyone whose "experience" I unfairly impaired by downloading 10GB at 512Kbit/s in one month. The strain on the Airtel network must have been enormous.

But wait, there's more! The email goes on to say:

However, if you need a higher speed, you can visit www.airtel.in/sod
and subscribe to speed on demand - a service from Airtel where you
can increase your browsing speed by paying a nominal charge.

Oh good, I should have known a nominal charge could fix everything. I feel so… unlimited now.

(P.S. airtel.in/sod says "Unexpected error" when you try to sign up.)

No more Sai Baba

I can understand wanting to make sure that Sai Baba Mark II was actually dead, and not just setting the stage for a miraculous resurrection, but I think it was in poor taste for Prime Minister Manmohan Singh to do it personally.

A little more legroom

I've always had trouble fitting my 190cm+ frame into airline seats. Only on the rare occasions when I've made it to the airport before the hordes of (usually much shorter) people who want the front row or emergency exit row seats have I had a reasonably comfortable flight.

While checking in for a 0615 flight to Kolkata this morning, I heard the chap in the queue ahead of me being told that there were no more aisle or window seats available. I resigned myself to an uncomfortable two hours, but asked for an exit row seat anyway. To my pleasant surprise, I was told that some were available, and I could have one for an extra INR 300.

I lost no time forking over the money (which seemed to surprise them), and was duly assigned seat 13F (emergency exit over the wing). At least four of the exit row seats over the wings were unoccupied during the flight, and there were no takers despite an announcement that premium seats were available for a small fee.

If this means I can get an exit row seat on my return flight too, I'm not complaining.

Goodbye to fugue

My old server, fugue.toroid.org, is no more. After five years of sterling service, it has been retired and replaced with one of Hetzner's new entry-level root servers (with sixteen times the RAM and a very much faster CPU). It took me a long time to migrate the services across, but everything works now.

The new server is named raven.toroid.org.

Escaping from Delhi this October

For many months, we have been planning to be away from Delhi during the Commonwealth Games 2010 (October 3–14). Thanks to extra school holidays and invitations from friends, we're spending 25 days in Karnataka. I'm looking forward to this holiday very much.

Switching from RSS to Atom

A faithful reader complained this morning that my RSS feed didn't have timestamps. This was a surprise to me, because I had suppressed all memories of RSS after cutting-and-pasting a template together months ago. But he was right: my entries contained only a title, a link, and some text. I tried to figure out how to specify timestamps, and soon realised that RSS is a complete mess with divergent streams of development and lax specifications (and that this wasn't news to anyone who had been paying attention).

I fixed the timestamp problem by adding a pubDate element to each item, but I was forced to change the feed version from 0.91 to 2.0 because pubDate is a channel element (and not an item element) in 0.91. I also decided that I didn't want to keep using RSS any more. Atom is now supported widely, and I decided to switch to it.
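As an illustration of the fix described above, here is a minimal Python sketch that builds an RSS 2.0 item with a pubDate element. It is not the template this site actually uses; the `rss_item` helper, the example.org link, and the timestamp are all hypothetical. RSS 2.0 expects dates in the RFC 822 style, which the standard library's `email.utils.format_datetime` produces.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def rss_item(title, link, description, published):
    """Build an RSS 2.0 <item> that carries its own publication date."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = description
    # RSS 2.0 dates use the RFC 822 style, e.g. "Wed, 01 Sep 2010 08:30:00 GMT".
    # usegmt=True requires a timezone-aware datetime in UTC.
    ET.SubElement(item, "pubDate").text = format_datetime(published, usegmt=True)
    return item


# Hypothetical entry: the link and timestamp are placeholders.
published = datetime(2010, 9, 1, 8, 30, tzinfo=timezone.utc)
item = rss_item(
    "Switching from RSS to Atom",
    "https://example.org/etc/switching-to-atom",
    "Why I stopped advertising my RSS feed.",
    published,
)
item_xml = ET.tostring(item, encoding="unicode")
```

The item only validates inside a feed that declares `version="2.0"` on its `rss` element, which is why the version had to change along with the template.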

There are many reasons to prefer Atom to RSS. The most compelling is that the content-type problems of RSS disappear entirely: there is no ambiguity about what MIME type to serve Atom feeds as, and there is no ambiguity about whether the content is text or HTML, or how special characters are escaped. You no longer have to guess whether an entry contains a summary or full text, and relative URIs can be handled sanely. Best of all, Atom has a real specification. (Here's a detailed comparison of Atom 1.0 with RSS 2.0.)
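To make the content-type point concrete, here is a minimal Python sketch of a single Atom entry. It is an illustration under assumptions, not the code behind this site's feed: the `atom_entry` helper, the entry id, the link, and the timestamp are all hypothetical. The `type="html"` attribute says explicitly that the content is escaped HTML, and the `updated` element carries an RFC 3339 timestamp. A feed built from such entries would be served as `application/atom+xml`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialise Atom elements without a prefix (xmlns="..." on the root).
ET.register_namespace("", ATOM_NS)


def atom_entry(entry_id, title, link, html, updated):
    """Build an Atom <entry> whose content is explicitly marked as HTML."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title", type="text").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="alternate", href=link)
    # Atom timestamps follow RFC 3339; isoformat() on an aware datetime fits.
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated.isoformat()
    # type="html" tells readers the text is escaped HTML, not plain text.
    content = ET.SubElement(entry, f"{{{ATOM_NS}}}content", type="html")
    content.text = html  # ElementTree escapes < and & when serialising
    return entry


# Hypothetical entry: id, link and timestamp are placeholders.
entry = atom_entry(
    "tag:example.org,2010:etc/switching-to-atom",
    "Switching from RSS to Atom",
    "https://example.org/etc/switching-to-atom",
    "<p>Atom &amp; RSS</p>",
    datetime(2010, 9, 1, 8, 30, tzinfo=timezone.utc),
)
entry_xml = ET.tostring(entry, encoding="unicode")
```

A complete feed also needs a `feed` element with its own id, title, and updated timestamp wrapped around the entries; that part is omitted here.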

It took me only a few minutes to publish an Atom feed for ams/etc. I am no longer advertising the RSS feed, but it will continue to work for the benefit of people already subscribed to it.

No visa required?

The Economist says that as of August 2010, Indian citizens can visit (approximately) 50 "countries and territories" without a visa.

I really wonder which ones they are.

Update (2010-12-24): According to someone else's research on the IATA's timaticweb.com, Indians can apparently visit the following twenty-eight countries without a visa: Andorra, Bhutan, British Virgin Islands, Cook Islands, Ecuador, Egypt, El Salvador, Fiji, Grenada, Guatemala, Haiti, Honduras, Hong Kong, Jamaica, Kosovo, Macau, Micronesia, Montserrat, Nepal, Nicaragua, Niue, Palestine, Seychelles, St. Kitts and Nevis, St. Vincent and Grenadines, Trinidad and Tobago, Turks and Caicos, and Vanuatu.

In addition, the following thirty countries are said to grant Indian tourists a visa on arrival: Bangladesh, Bolivia, Cambodia, Cape Verde, Comoros Islands, Djibouti, Dominica, Ethiopia, Indonesia, Iran, Jordan, Kenya, Laos, Madagascar, Maldives, Mauritius, Mozambique, Nauru, Palau, Western Samoa, South Korea, Sri Lanka, St. Lucia, Tajikistan, Tanzania, Thailand, Timor-Leste, Togo, Tuvalu, and Uganda.

(Don't depend on the accuracy of this list if you plan to travel to any of these countries. Visa policies change regularly, especially for those granted on arrival.)

Prejudice lurks in dark corners

In "Women in computing: first, get the problem right", ESR explains that everyone else just misunderstood the problems that keep women away from computing and other technical fields; and that although achieving equality is precluded by the difference in dispersion of the IQ curves, his insights can help to establish the large, happy female minority that is the best we can hope for in its stead.

Talking about prejudice in this context is "lazy, stupid, [and] wrong", and the real reason women bail out of computing is that they have short fertile periods, and their biological instincts tell them not to waste time on the warrior-ethic ways of programming.

By these and other bold observations, ESR demonstrates the honesty and willingness to speak uncomfortable truths that are prerequisite to addressing the problem. For example:

I don't mean to deny that there is still prejudice against women lurking in dark corners of the field.

Prejudice. Lurking in dark corners. Who would have thought it?

I'll file this article away right next to his equally-insightful "Sex tips for geeks".

A new symbol for the Indian Rupee

The Union Cabinet selected a new symbol for the Indian Rupee, designed by a Mr. D. Udaya Kumar of IIT Bombay. It won't actually be printed on currency notes, and of course nobody will bother to use it until it is added to Unicode, but here it is:

(Image: Rupee symbol)

The Information & Broadcasting minister Ambika Soni told reporters that "It is just a symbol", but it apparently allows us to join the exclusive club of countries whose currencies have a distinct identity, and somehow represents the robustness of the Indian economy (R for Robustness?) while being a blend of modernity and Indian culture.

How dreadfully silly.

I practised drawing the symbol on my whiteboard a few times, and came to the happy realisation that—if you squint at it just right—it looks like a (strangely long-necked) raptor in soaring flight, as seen from below. But maybe that's just me.