The silent victories of open source

Tux, the Linux penguin

Image via Wikipedia

For years, free/libre/open source software (henceforth referred to as FLOSS) have proclaimed, year after year, how that year is the year of Linux, or the year that open source will become mainstream, or the year that open source will finally take off etc. But it never has, at least traditionally speaking. Linux based desktops haven’t penetrated either the enterprise or consumer markets; with a few notable exceptions (Apache httpd, for instance), most FLOSS products — be it office software like OpenOffice, multimedia software such as Gimp or Inkscape — remain popular with economically insignificant niches. And yet, this year, more than ever before, open source forges ahead with its silent victories.

Consider the following shifts:

  • all the top brands of the day — Apple, Google, Facebook, Twitter, Amazon — they ALLstand tall on the shoulders of FLOSS giants.
  • Contributing software back to the open source community is becoming increasingly common, even expected. Take a look at the GitHub repositories of Twitter and Facebook, or the various Google projects. In fact, when screening engineering candidates, I often look for and encourage people to talk about their open source contributions.
  • Most of the activity around “big data” and “cloud computing” is being driven in large part by FLOSS, whether it is the Hadoop-powered ecosystem or the Xen/Linux powered Amazon Web Services.
  • Given the current smartphone landscape, it is highly likely that Android will become ubiquitous on tablet devices and a variety of consumer smart phones. Already, Android has more search mindshare than Linux, despite the fact that Linux is part of the Android stack.
  • If you start a software company today, I would bet that you will find yourself bootstrapping almost entirely using open source software. The entire development process — from the GCC compiler toolchain, to the build systems, to the scripting languages, to the version control systems, to the code review systems, to the continuous integration systems — everything is dominated by FLOSS products. Good bug trackers and enterprise Wikis are the last bastions but it is just a matter of time.

I’ve had a chance to see the enterprise software market up close and increasingly find more and more open source everywhere I look. FLOSS has not arrived, it has taken over.

How do you use Twitter/Buzz/Facebook?

No no, I’m not late to the party and I’m not asking literally how does one use the above mentioned services. Rather, I’m asking how does one put these various services to use. When do you post something on Twitter but not on Buzz, Facebook but not on Twitter; or do you post everything everywhere (ping.fm style)? I’m not a heavy hitter by any means and my usage of social networks is mediocre at best. Yet I myself confounded with all of the various services and their accompanying warts and virtues. Don’t you?

To help sort out my thoughts, I drew a picture (don’t you dare judge me for my lack of creativity!):

Twitter/Facebook/Buzz

Below I elaborate more on how I currently use each of the services.

Twitter

  • I tend to use it for technical and/or non-personal content. Things that I would want to publicize.
  • Unlike Buzz/Facebook, I don’t pay too much attention to who is following me. Most tweets are public anyways.
  • The 140 character limit is sometimes amusing, but often irritating. Are people still using regular SMS with Twitter?
  • Multiple startups devoted to managing Twitter “noise” is not encouraging.
  • @ replies are bandaid. Twitter is a broadcast-and-forget medium — I can’t have (or follow) a conversation on it.

Facebook

  • Use it for sharing random, personal updates (or things I find interesting :p)
  • Mostly on because of network effect (read: don’t want to be left off the social bandwagon).
  • Like that I can “Like” most things and actually follow the conversation via comments.
  • Always worried if my privacy settings are working and if there’s a new “default” I need to worry about.
  • Pay more attention to who I friend. The noise level is still quite high despite that.

Buzz

  • Usage domain similar to that of Facebook. Unlike Facebook, can choose to make posts Public.
  • Love the email integration. Conversely, API/clients still have to catch up to Twitter.
  • Supports likes, comments and “resharing”.
  • Privacy is modeled around my contacts (chat or otherwise), which seems natural.

I’m fine with using Twitter for all of my public posts. The main confusion lies between Buzz and Facebook. Facebook obviously has more social traction. That said, Buzz is just more convenient to use (because of the email integration mostly). Of course, all of the various connectors available (Twitter <-> Buzz, Twitter <-> Facebook, multicast via ping.fm or Chromedeck etc) make the whole thing even more confusing. At the end of the day, I might just go back to not using anything on a regular basis.

How are you using Twitter, Buzz and Facebook?

Startup Infrastructure: Where Linux Fails

Category:WikiProject Cryptography participants

Image via Wikipedia

It is no secret that I’m an open source evangelist and so when it was time to set up internal infrastructure at work, naturally the first order of business was to evaluate the various OSS projects out there — everything from wikis, bug trackers, source control, code review and project management. Running Ubuntu LTS (10.04) on all of our servers was a no-brainer and there were plenty of excellent options for most everything else as well (a follow-up post on our final choices later). The Linux ecosystem is fabulous for most of the infrastructure needs of a startup, but I learnt the hard way that there are still some areas where Linux needs a lot of work before it can become competitive with proprietary, non-Linux solutions.

Authentication

Centralized account management (users and groups) and authentication is critical component in any IT deployment, no matter the size. Even for a small startup, creating users/groups repeatedly for each new server, separate authentication mechanisms for each new service is simply not scalable. That is precisely why Active Directory is so ubiquitous at enterprises.

LDAP was the obvious solution in Linux-land and I figured it would be trivial to setup an OpenLDAP server that can manage user/group information for us. It would also be the single authentication source for all servers and services. I was so wrong.

After struggling with OpenLDAP for several painful hours, I gave up — the documentation is fragmented, Google doesn’t help much and personally I think the LDAP creators had never heard of “usability” when designing it. The seemingly simple task of creating some new users and groups involved several black-magic incantations of the LDAP command line tools. Getting servers to authenticate against the resulting directory was even harder.

Just as I was about to throw in the towel and setup an AD instance in-house, I stumbled upon the 389 Directory Server (now known as the Fedora Directory Server). With a new found hope, I set about installing it on Ubuntu and hit another roadblock — there are no up-to-date packages of FDS for Ubuntu. Reluctantly, I setup a Fedora instance (the only one so far) and installed FDS. Thankfully, Red Hat has put together really comprehensive documentation and guides for the Directory Server, which was invaluable.

From there on, it was mostly downhill (only a few minor hiccups). Finally we have a nice GUI to manage users and groups, and all servers/services authenticate against a single Directory Server. But the journey was unnecessarily painful. Here’s what I’d like to see:

  • Up-to-date packages of FDS for Ubuntu. Sane defaults and functionality out-of-the-box
  • Ready to consume documentation on how to integrate LDAP with various web applications, Linux distros etc (I’ll put together some of this soon)
  • More awareness — I should have found FDS a lot sooner than I did, but it is certainly not very well marketed
  • Single sign on: This is a whole different beast

Remote Access

At my previous company, we had a Cisco VPN solution. There were plenty of Cisco compatible VPN clients on Windows and Mac. In fairness, it was relatively easy to get vpnc working on Ubuntu as well. In fact, with Network Manager, you can manage your VPN connections using a simple and intuitive UI. But the setup was not very reliable and my connections would get dropped relatively frequently. It was impossible to have a long-running VPN session without disruption. I’m not sure if the problem was with the Cisco hardware or the Ubuntu vpnc client; I did see similar issues with the built-in VPN client on Mac OS X.

But at least VPN on Linux works. I can’t say the same about other remote access mechanisms, in particular IPSec and L2TP over IPSec. It took me some time to figure out which package to use (Strongswan, Openswan, iked etc etc); another couple of hours to get the Openswan configuration just right; several hours of struggling to automatically setup DNS lookups when using the IPSec connection (gave up and ended up using entries in /etc/hosts!). There is no UI in Network Manager to manage IPSec connections either. Strongswan does have a NM plugin, but that only works for IKEv2 (certificate based authentication), while I had to use IKEv1 (shared key based authentication).

At the end of the day, I do have a working IPSec tunnel and it is definitely more reliable than the Cisco VPN (been up for more than 2 days without disruption). But all this can and should become a lot more seamless.

These are a few areas where Linux failed me in setting up the infrastructure for a startup; it shines most everywhere else. Hopefully these last few kinks will get ironed out soon.

The San Francisco Taiko Dojo

My wife and I had been thinking about learning Taiko, so after some quick Googling, one fine Tuesday we dropped in at the San Francisco Taiko Dojo to “observe” the adult beginners class. We only stayed the first hour or so, and it was interesting to say the least. First, there was the intimidating workout: everyone was counting in Japanese; the workout included sets of 60 pushup, situps, scissor kicks and tricep dips! And then there was the class itself — there seemed to be no “orientation” for beginners or a structured way to learn the ropes; everyone there just seemed to know what they were doing; there seemed to be a lot of understood etiquettes — there was an expected way of doing pretty much everything. Suffice to say that we decided to start classes the following week.

BTW, if don’t know what Taiko is or have never heard Taiko, I refer you to the mighty Wikipedia and the mightier YouTube:

I’m on a temporary hiatus from Taiko right now, but I had an amazing experience the few months I spent with SF Taiko Dojo.

Yes, there are rules and etiquettes. But in a society where anything goes and freedom rules and any kind of “discipline” is often frowned upon, SFTD was almost refreshing. In many ways, it was reminiscent of the Gurukul system in ancient India.

Taiko itself is a wonderful art form. There is something powerful about a Taiko performance. A single drum is an excellent percussion device, but in a group, Taiko takes a life of its own. Like most art forms, you can pick up the basics real quick. But to go deep into Taiko, you need time, patience and a lot of hard work. The veterans at SFTD have been playing for 10-15 years and still learning.

Needless to add, Taiko is also a fantastic full body workout. It is a combination of dance, drumming, music and more. The classes are fun, but you do need serious commitment if you want to become an advanced Taiko player. The folks in the adult beginners class are a merry bunch. Before our first class, I was extremely anxious, trying to memory numerals in Japanese from Wikipedia and worried whether I’ll be able to keep up with everyone. There was help every step of the way. The class won’t stop for you, but it will not leave you behind either :)

But the best part of SFTD is the opportunity to learn from Sensei Tanaka. His accomplishments in the world of Taiko are well known, so I won’t enlist them here. What surprised me was the humility and generosity and the energy he brings with him, even after doing this for more than four decades. He could easily delegate the adult beginners class to one of his many advanced students; yet he still routinely teaches the class himself, ever so patient and understanding. Better yet, his expertise in Taiko is matched only by his wistful humor.

So if you are in the Bay Area and are looking for some inspiration, do checkout San Francisco Taiko Dojo.

Some thoughts on dbShards

I heard about dbShards via two recent blog posts — one by Curt Monash and the other by Todd Hoff. It seemed like an interesting product, so I spent some time digging around on their website.

dbShards

dbShards

As the name suggests, dbShards is all about sharding. Sharding, also known as partitioning, is the process of distributing a given dataset into smaller chunks based some policy. AFAIK, the term “shard” was popularized recently by Google even though the concept of partitioning is at least a few decades old. Most distributed data management systems implement some form of sharding by necessity, since the entire data set will not fit in memory on a single node (if it would, you should not be using a distributed system). And therein lies the USP of dbShards — it brings sharding (and with it, performance and scalability) to commodity, single-node databases such as MySQL and Postgres.

So how does it work? Well, dbShards acts as a transparent layer sitting in front of multiple nodes running MySQL, lets say. Transparent, because they want to work with legacy code, meaning no or minimal client side modifications. Inserting new data is pretty simple: dbShards using a “sharding key” to route an incoming tuple to the appropriate destination. Queries are a bit more complex, and here the website is skimpy on details. Monash’s post mentions that join performance is good when sharding keys are the same — this is not a surprise. I’m not interested in what other kinds of query optimizations are in place. When data is partitioned, you really need a sophisticated query planner and optimizer that can minimize data movement and aggregation, and push down as much computation as possible to individual nodes.

I found the page on replication intriguing. I’m guessing when they say “reliable replication”, they mean “consistent replication” in more common parlance (alternative, that dbShards supports strong consistency, as opposed to eventual or lazy consistency). This particular bit in the first paragraph caught my eye: “deliver high performance yet reliable multi-threaded replication for High Availability (HA)”. I’m not sure how to read this. Are they implying that multi-threaded replication is typically not performant? And usually you do NOT want threading for high availability, because a single thread can still take the entire process down. The actual mechanism for replication seems like a straightforward application of the replicated state machine approach.

But making a replicated state machine based system scale requires very careful engineering, otherwise it is easy to hit performance bottlenecks. I’d be very interested in knowing a bit more about the transaction model in dbShards and how it performs on larger systems (tens to hundreds of nodes).

The pricing model is also quite interesting. I think it is the first vendor I know of that is pricing on CPU and not storage (their pricing is $5,000 per year per server). I think this is indicative of the target customer segment as well — I would imagine dbShards works well with a few TBs of data on machines with a lot of CPU and memory.