Some thoughts on dbShards

I heard about dbShards via two recent blog posts — one by Curt Monash and the other by Todd Hoff. It seemed like an interesting product, so I spent some time digging around on their website.

As the name suggests, dbShards is all about sharding. Sharding, also known as partitioning, is the process of splitting a given dataset into smaller chunks based on some policy. AFAIK, the term “shard” was popularized recently by Google even though the concept of partitioning is at least a few decades old. Most distributed data management systems implement some form of sharding by necessity, since the entire data set will not fit in memory on a single node (if it would, you should not be using a distributed system). And therein lies the USP of dbShards: it brings sharding (and with it, performance and scalability) to commodity, single-node databases such as MySQL and Postgres.

So how does it work? Well, dbShards acts as a transparent layer sitting in front of multiple nodes running, let’s say, MySQL. Transparent, because they want to work with legacy code, meaning no or minimal client-side modifications. Inserting new data is pretty simple: dbShards uses a “sharding key” to route an incoming tuple to the appropriate destination. Queries are a bit more complex, and here the website is skimpy on details. Monash’s post mentions that join performance is good when sharding keys are the same, which is not a surprise. I’m interested in what other kinds of query optimizations are in place. When data is partitioned, you really need a sophisticated query planner and optimizer that can minimize data movement and aggregation, and push down as much computation as possible to individual nodes.
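To make the routing idea concrete, here is a minimal sketch of key-based routing. The website does not document how dbShards actually maps keys to shards, so everything here is an assumption for illustration: the hash function, the `hashKey`, `shardFor` and `route` names, the shard names and the `customerId` column are all hypothetical.

```javascript
// Hypothetical shard list; a real deployment would hold connection details here.
const shards = ['mysql-0', 'mysql-1', 'mysql-2', 'mysql-3'];

// Simple string hash, kept as an unsigned 32-bit integer.
// Assumption: any stable hash would do; this one is chosen only for brevity.
function hashKey(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0;
  }
  return h;
}

// Map a sharding key to a shard index in the range [0, shardCount).
function shardFor(key, shardCount) {
  return hashKey(key) % shardCount;
}

// Route a row to a shard based on its (hypothetical) sharding key column.
function route(row) {
  return shards[shardFor(row.customerId, shards.length)];
}

// Rows from two different tables that share a sharding key land on the same
// shard, which is why a join on that key needs no cross-node data movement.
const order = { customerId: 42, total: 99.5 };
const customer = { customerId: 42, name: 'Ada' };
console.log(route(order) === route(customer)); // true
```

A plain hash-modulo scheme like this reshuffles most keys when the number of shards changes; whether dbShards avoids that, and how, is not something the website makes clear.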

I found the page on replication intriguing. I’m guessing when they say “reliable replication”, they mean “consistent replication” in more common parlance (alternatively, that dbShards supports strong consistency, as opposed to eventual or lazy consistency). This particular bit in the first paragraph caught my eye: “deliver high performance yet reliable multi-threaded replication for High Availability (HA)”. I’m not sure how to read this. Are they implying that multi-threaded replication is typically not performant? And usually you do NOT want threading for high availability, because a single thread can still take the entire process down. The actual mechanism for replication seems like a straightforward application of the replicated state machine approach.

But making a replicated state machine based system scale requires very careful engineering, otherwise it is easy to hit performance bottlenecks. I’d be very interested in knowing a bit more about the transaction model in dbShards and how it performs on larger systems (tens to hundreds of nodes).
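For readers unfamiliar with the replicated state machine approach mentioned above, here is a toy sketch of the core idea: every replica applies the same deterministic commands in the same order, so all replicas converge to the same state. This is not dbShards’ implementation; the `Replica` class, the command format and the key-value state are hypothetical and chosen only for illustration.

```javascript
// A toy replica: a key-value store that applies logged commands in order.
class Replica {
  constructor() {
    this.state = new Map();
    this.applied = 0; // number of log entries applied so far
  }

  // Apply every entry this replica has not seen yet, strictly in log order.
  apply(log) {
    for (; this.applied < log.length; this.applied++) {
      const { op, key, value } = log[this.applied];
      if (op === 'set') this.state.set(key, value);
      else if (op === 'delete') this.state.delete(key);
    }
  }
}

// The shared, ordered command log. Agreeing on this order is the hard part
// in a real system, and it is where the scaling bottlenecks tend to appear.
const log = [
  { op: 'set', key: 'a', value: 1 },
  { op: 'set', key: 'b', value: 2 },
  { op: 'delete', key: 'a' },
];

const primary = new Replica();
const secondary = new Replica();

primary.apply(log);
secondary.apply(log.slice(0, 2)); // a lagging replica has seen only a prefix
secondary.apply(log);             // it catches up by applying the remainder

console.log([...primary.state], [...secondary.state]); // both hold only b = 2
```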

The pricing model is also quite interesting. I think it is the first vendor I know of that is pricing on CPU and not storage (their pricing is $5,000 per year per server). I think this is indicative of the target customer segment as well — I would imagine dbShards works well with a few TBs of data on machines with a lot of CPU and memory.

What is node.js?

If you follow the world of Javascript and/or high-performance networking, you have probably heard of node.js. If you already grok Node, then this post is not for you; move along. If, however, you are a bit confused as to exactly what Node.js is and how it works, then you should read on.

The node.js website doesn’t mince words in describing the software: “Evented I/O for V8 JavaScript.” While that statement is precise and captures the essence of node.js succinctly, at first glance it did not tell me much about node.js. I did what anyone interested in node.js should do: downloaded the source and started playing around with it.

So what exactly is node.js? Well, first and foremost it is a Javascript runtime. Think of your web browser; how does it run Javascript? It implements a Javascript runtime and supports APIs that make sense in the browser such as DOM manipulation etc. Javascript as a language itself is fairly browser agnostic. So node.js is yet another runtime for Javascript, implemented primarily in C++.

Because node.js focuses on networking, it does not support the standard APIs available in a browser. Instead, it provides a different set of APIs (with fantastic documentation). Thus, for instance, HTTP support is built into node.js — it is not an external library.

The other salient feature of node.js is that it is event driven. If you are familiar with event driven programming (à la Python’s Twisted, Ruby’s EventMachine, the event loop in Qt etc), you know what I’m talking about. The key difference though is that unlike all these systems, you never explicitly invoke a blocking call to start the event loop: node.js automatically enters the event loop as soon as it has finished loading the program. A corollary is that you can only write event driven programs in node.js; no other programming models are supported. Another consequence of this design choice is that node.js is single-threaded. To exploit CPU parallelism, you need to run multiple node.js instances. Of course, there are several node.js modules and projects already available to address this very issue.
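A tiny sketch makes the implicit event loop visible. The script below only registers a callback with the standard `setTimeout` function; it never calls anything that starts a loop. The `order` array is just a hypothetical helper to record what ran when.

```javascript
const order = [];

// Register a callback. Nothing here blocks or explicitly starts an event loop.
setTimeout(() => {
  order.push('timer fired');
  console.log(order.join(' -> '));
}, 0);

// This line runs before the callback, even though the timer delay is zero.
order.push('script finished');

// Only once the script has run to completion does node.js enter its event
// loop, at which point the timer callback gets its turn.
```

Running this prints `script finished -> timer fired`: the callback cannot run until the top-level code has returned control to the event loop.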

To implement a runtime for Javascript, node.js first needs to parse the input Javascript. node.js leverages Google’s V8 Javascript engine to do this. V8 takes care of parsing and executing the Javascript, so node.js need not worry about syntactical issues; it only needs to implement the appropriate hooks and callbacks for V8.

node.js claims to be extremely memory efficient and scalable. This is possible because node.js does not expose any blocking APIs. As a result, the program is completely callback driven. Of course, some operations, such as disk I/O, will eventually block. node.js hands such blocking work to an internal thread pool (network sockets are handled with non-blocking system calls), so even though the application executes in a single thread, internally there are multiple threads that node.js manages.
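Here is a minimal sketch of what “completely callback driven” looks like from the application’s side, using the built-in `fs.readdir` call. The `events` array is a hypothetical helper used only to record ordering, and listing the current directory is an arbitrary choice of operation.

```javascript
const fs = require('fs');

const events = [];

// Ask for a directory listing. The call returns immediately; the actual
// file system work happens off the main thread.
fs.readdir(process.cwd(), (err, names) => {
  if (err) throw err;
  events.push('listing ready');
  console.log(events.join(' -> '), `(${names.length} entries)`);
});

// The application thread is free to continue while the listing is pending.
events.push('readdir returned');
```

The callback always runs after the rest of the script, so the application code itself never waits on the disk.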

Overall, node.js is very refreshing. The community seems great and there is a lot of buzz around the project right now, with some big companies like Yahoo starting to experiment with node.js. node.js is also driving the “server side Javascript” movement. For instance, Joyent’s Smart platform allows you to write your server code in Javascript, which they can then execute on their hosted platforms.

Finally, no blog post about node.js is complete without an example of node.js code. Here is a simple web server:
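A minimal sketch of such a server, using only the built-in `http` module. The port number (8124), the loopback address and the response text are arbitrary choices for illustration.

```javascript
const http = require('http');

// The callback runs once per incoming request; there is no explicit loop.
const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello World\n');
});

// Start accepting connections. node.js keeps running for as long as the
// server is listening.
server.listen(8124, '127.0.0.1');
console.log('Server running at http://127.0.0.1:8124/');
```

Save it to a file, run it with `node`, and point a browser at the printed URL.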

How the mouse moves

Random interesting find of the day: IOGraphica. Here’s mine for about 7 hours at work:

Such a simple app, but such a fascinating output. An easy way to create computer generated art! Couple of observations:

  • I have a dual-monitor setup at work. I use the left monitor for email and browsing and the right monitor for code. The mouse patterns clearly reflect this usage pattern. I tend to rest the mouse roughly equally on both monitors.
  • I was very intrigued by the fact that most of the mouse motions are very smooth. Most curves almost look parabolic. There are very few jerks and jittery lines. Once again, nature seems poetic even in the most chaotic and random actions.

Big Data Analytics

DISCLAIMER: As with all other material on this blog, these are my thoughts and do NOT reflect the opinions of my employer.

I really like the tagline on our logo: big data. fast insights.

But leaving the marketing aside, what does it mean really? What is all the hoopla about big data analytics?

The way I look at things, a few key observations here are:

  1. Data is increasing. This is almost self-evident, so I won’t bother with presenting any evidence.
  2. Data is driving businesses more than ever. Whether it is search, advertising, insurance, finance, health care, governance — data is becoming an integral part of more and more business processes.
  3. Finally, data movement is slow. And I mean really really slow, compared to our processing and memory speeds. Once you go into the range of hundreds of terabytes or petabytes of data, you really don’t want to keep moving that data around into isolated silos for doing analytics.

Clearly, none of these observations is particularly new or insightful. However, I do think some of the implications of these observations are quite powerful and were new at least for me. For instance, (3) implies that once you have accumulated a lot of data in one place (imagine hundreds of TB or more), it is extremely difficult and time consuming to move that data around. This, in turn, means that more often than not, data is likely to reside in a single place.

Traditionally, it was not uncommon to have a large data warehouse that would be the repository of all data. Then smaller data sets (also known as data marts) could be carved out from this master data set as required. This approach is becoming increasingly infeasible. Carving out 100TB data marts from a 1PB data warehouse is simply not going to scale.
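To put rough numbers on this, here is a back-of-the-envelope calculation. It assumes a link that sustains 10 Gb/s end to end, which is an assumption for illustration and is optimistic for most setups; the `transferHours` helper is hypothetical, and sizes use decimal units (1 TB = 10^12 bytes).

```javascript
const TB = 1e12; // bytes, decimal terabyte
const PB = 1e15; // bytes, decimal petabyte

// Hours needed to move `bytes` over a link sustaining `bitsPerSecond`.
function transferHours(bytes, bitsPerSecond) {
  const seconds = (bytes * 8) / bitsPerSecond;
  return seconds / 3600;
}

const LINK = 10e9; // assumed sustained throughput: 10 Gb/s

console.log(transferHours(100 * TB, LINK).toFixed(1)); // 22.2 hours for one 100TB data mart
console.log(transferHours(1 * PB, LINK).toFixed(1));   // 222.2 hours for the full 1PB warehouse
```

Under these assumptions a single 100TB data mart ties up the link for almost a day, and that is before counting the cost of reading the data from disk and writing it at the other end.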

At the same time, it is clear that a one-size-fits-all approach to data storage and analysis is not practical either. Some data sets naturally lend themselves to a relational data model, while others might be more suited to unstructured processing (Hadoop) or document oriented processing (CouchDB or MarkLogic) or graph analysis (Neo4J) and so on. Forcing a single model or access mechanism down all customers’ throats is not tenable.

So what would the ideal platform for big data analytics look like? One that allows you to store and access data in various ways, seamlessly.

My experiences with Apple: A poem

I’m a Linux guy; Windows was never my thing honey
Apple seemed interesting, but required too much money

I have ideological problems with Apple too,
What with all the DRM and hardware lock-in they do.

But people are crazy about Apple, and I used to wonder why,
I had a dream: to own Apple products that I didn’t have to buy.

A few months back my wife gifted me an iPhone, bro!
And then at work I got the new Macbook Pro!!

Thus suddenly I was an Apple user,
Sure, some people called me a sore loser.

Allow me to share my early experiences,
Some accolades and some grievances.

I’ll try to keep a neutral tone,
Shall focus on the Mac and not the iPhone.

Integration, integration, integration!
The attention to detail gives a wonderful sensation.

User experience is the key,
Excellent design is for all to see.

They’ve taken care of the enterprises,
Exchange support, Google integration — no surprises.

It’s by far the best laptop I’ve ever used,
The hardware is slick, the software is smooth.

But boy do I hate iTunes,
It’s so broken it should be called Looney Tunes.

Try connecting multiple iPhones to the same device,
Or plug your iPhone in another laptop (poor advice).

Sync is threatening, sounds like a bully.
“I shall sync or destroy”, that just sounds silly.

The Terminal app should aspire higher,
No 256-color support leaves much to desire.

Keyboard shortcuts are hard to find,
Change them? You must be out of your mind!

“Features” like “Spaces” are overrated,
More like awaited, belated and deflated.

I prefer iTerm over Terminal and Adium for chat,
Chrome over Safari, and this over that.

I’m certainly not blown away,
But a Mac is convenient, I have to say.