What I've come to understand about NoSQL
I've been following the NoSQL movement for a while now and have a few favorites I want to see succeed, though I am still waiting for all of them to mature. I want to give an overview of my understanding of what the movement is about and how I approach using these databases, since they are not general-purpose tools.
The first thing I've come to understand is that although the name suggests "no" SQL (or, as some say, "not only" SQL), NoSQL is not really related to SQL at all, apart from both being able to work with the same data. Whereas SQL is all about relationships and normalizing data, NoSQL goes in basically the opposite direction, at times supporting completely denormalized data.
To understand this about NoSQL, I first had to come to grips with why it was needed. Until an organization or developer reaches the stage where their database has to handle incredibly complex queries spanning many tables, with various unions, joins and optimizations, there really is no need to deviate from the excellent querying abilities of a relational database. MySQL scales a very long way (as do most commercial databases). There comes a time, however, when you really need to start 'sharding' your data, and that kicks off the process of denormalizing for performance.
For instance, take the typical 'user' table. On a normal site, you create the user table to hold all the user and login data, and then join it to other tables to get at user-specific information. Once you cross some massive number of users, this model creates query bottlenecks: the same tables get hit far more often than necessary with expensive join operations. You also realize it doesn't make sense to keep the login information alongside the rest of the user data, since that only adds to the problem. You can deal with this either by sharding the rows of the user table (keeping chunks of rows separate from each other so that queries for specific groups of users go to different physical locations), or by creating separate tables for each user and joining those to other tables specific to them.
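To make the sharding idea concrete, here is a minimal JavaScript sketch of hash-based shard routing. The shard count, host names and hash function are all made up for illustration; a real setup would lean on the database's own sharding support or a proper consistent-hashing scheme.

```javascript
// Hypothetical sketch of hash-based shard routing for the 'user' table.
// The shard count and the per-shard host map are made up for illustration.
const SHARD_COUNT = 4;

const shardHosts = {
  0: 'db-users-shard-0.internal',
  1: 'db-users-shard-1.internal',
  2: 'db-users-shard-2.internal',
  3: 'db-users-shard-3.internal',
};

// Pick a shard by hashing the user id, so rows for a given user
// always live on (and are queried from) the same physical machine.
function shardFor(userId) {
  let hash = 0;
  const key = String(userId);
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash % SHARD_COUNT;
}

// Every user-specific query now goes through the router first:
// an extra routing step on every request, which is part of the lookup overhead.
function hostFor(userId) {
  return shardHosts[shardFor(userId)];
}

console.log(hostFor(42)); // 'db-users-shard-2.internal'
```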
What you are starting to see are the seams in the process: the relational database was not meant to work this way. The real problem comes from synchronizing the information across your server farm(s). The user data needs fallbacks and backup machines, and the data on them needs to stay consistent. Sharding your data helps balance the load on your database servers, but it adds lookup overhead, and the sharded data needs to be spread out well, which means consistency checks of its own.
This is the whole problem with big data: beyond a certain threshold, the model cannot maintain reliability, consistency and availability all at once. One or more of them has to be sacrificed, and with relational models that tends to be availability; if your queries take too long, you are likely to lose your users entirely. How do you provide a good balance of all three when you have a massive number of users? Note that we are only talking about users because that is the example we started with. It could be anything: sales orders, logs, interaction data, basically anything whose volume balloons as your user base grows.
I'm going to discuss two of my current interests with regard to this problem of big data. The first is a graph database (neo4j). To lay out my best use case for a graph database, I'm going to discuss a pet project I've been working on, on and off, for two years now. It is still nowhere near a state where it can do anything useful, but I am slowly realizing which parts I need to make it work. The problem has always been the data structures.
The problem at its core is the organization of information. At the very root of it is the concept of a personal database. Most people would never think they need one, and they certainly would not want to do any programming just to use it.
Let's take a small example: say we want the user to be able to store their workout information. There are a lot of gym apps out there, so this is safe ground. When deciding what to store, you consider things like sets, reps, weight and time, but then you start to wonder whether you should also capture muscle groups and meta-information about each exercise (maybe a short video showing someone performing it with correct technique). To capture all of this, you decide to use a relational database. You create tables for 'exercise', 'muscle', 'user', 'workout_template', 'actual_workout' and so on, and then create all the associative tables needed to properly normalize the information.
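To make the shape of that schema concrete, here is a rough sketch of the normalized layout as a plain JavaScript object. The table and column names are my own guesses for illustration, not a definitive design.

```javascript
// A rough sketch of the normalized workout schema described above.
// Table and column names are hypothetical; the associative tables carry the many-to-many links.
const schema = {
  user:              ['id', 'name', 'email'],
  muscle:            ['id', 'name'],
  exercise:          ['id', 'name', 'video_url'],
  exercise_muscle:   ['exercise_id', 'muscle_id'],            // which muscles an exercise targets
  workout_template:  ['id', 'user_id', 'name'],
  template_exercise: ['template_id', 'exercise_id', 'planned_sets', 'planned_reps'],
  actual_workout:    ['id', 'user_id', 'template_id', 'performed_on'],
  workout_set:       ['id', 'actual_workout_id', 'exercise_id', 'set_number', 'reps', 'weight'],
};

// Answering "did I hit my workout requirements today?" means joining
// actual_workout -> workout_set -> exercise -> template_exercise,
// i.e. touching almost every table except the muscle-related ones.
console.log(Object.keys(schema).length, 'tables for one small personal app');
```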
Now let us try to query the information I need to determine whether I satisfied my workout requirements today, right after finishing a workout. I would need to touch every single one of these tables except the muscle-related ones. Worse, I might need to run union queries to account for all the different templates and set/rep combinations. It all depends on how detailed you want the results to be; let us assume that, in my usual style, I want the kitchen-sink approach.
The relational database would work here, but it is not ideal. This seems to be the trend with the whole NoSQL movement: a relational database would fit, but it is not ideal. In this example there are simply too many relationships among the data. It would be much better to have all of it denormalized, to help the queries rather than the inserts and updates. I am, of course, looking at the problem from the point of view of someone who has been using the app for a long time and has a lot of data to comb through (say about two years' worth of workout data, training almost every week, five days a week with five different workouts, plus some seasonal cycles of going heavy and cutting). Querying this beast would be very difficult with SQL. What if I also added notes to every workout, or gauged the difficulty of each one? The problem is that we are segregating the information by type and not by intent: sets are stored and indexed as sets, reps likewise. When you query by intent but store by type, you are not taking advantage of the platform you are using.
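As a contrast, here is what one denormalized workout record might look like if everything the queries need travels together. The field names and values are purely illustrative.

```javascript
// One hypothetical denormalized workout record: everything a query needs is in one place,
// at the cost of repeating exercise and muscle details in every document.
const workout = {
  type: 'workout_actual',
  user: 'some_user',
  date: '2012-03-14',
  template: 'push day (heavy)',
  difficulty: 4,                       // 1-5, the "how hard did it feel" note
  notes: 'shoulder felt tight on overhead press',
  exercises: [
    {
      name: 'bench press',
      muscles: ['chest', 'triceps'],
      sets: [
        { reps: 5, weight: 80 },
        { reps: 5, weight: 85 },
        { reps: 3, weight: 90 },
      ],
    },
    {
      name: 'overhead press',
      muscles: ['shoulders', 'triceps'],
      sets: [
        { reps: 8, weight: 40 },
        { reps: 8, weight: 40 },
      ],
    },
  ],
};
```

Querying by intent ("how much chest work did I do this month?") now means reading one kind of record instead of joining half a dozen tables.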
Now consider the graph approach. A graph database assumes that everything is connected; you just fill in the specifics. It would have no problem with a query this complex, because it stores data by intent: the data itself is just stored, but the relationships are indexed, which makes combing through them for intent very fast. I should elaborate a bit on querying by 'type' versus by 'intent'. In the former, we query information as it is stored, e.g. in the relational database, pulling all the 'set' information to see how many sets I did per week. By 'intent'-based querying, I mean querying across multiple tables to derive information that is not inherent in the way it is stored. For example, I might want to know how balanced my workouts are across my body and watch that over time. Perhaps I discovered last year that I was working my chest a lot more than my back, decided to change that, and now want to compare how I am doing this year versus last year. This would be very difficult to do efficiently with the relational model, but might be easier on a graph database, provided the front end can handle the data properly (in the case of neo4j, the java/android program).
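Here is a rough sketch of what that chest-versus-back query could look like. The graph model (Workout, Exercise and Muscle nodes, PERFORMED and TARGETS relationships) and the property names are my own assumptions, and I am using the neo4j JavaScript driver rather than the java/android front end just to keep the example short.

```javascript
// Sketch of the muscle-balance query against a hypothetical graph model:
// (:Workout {date})-[:PERFORMED]->(:Exercise)-[:TARGETS]->(:Muscle {name}).
// Uses the official neo4j JavaScript driver; dates are assumed stored as ISO strings.
const neo4j = require('neo4j-driver');

const driver = neo4j.driver('bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password'));

async function muscleBalance(since) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (w:Workout)-[:PERFORMED]->(e:Exercise)-[:TARGETS]->(m:Muscle)
       WHERE m.name IN ['chest', 'back'] AND w.date >= $since
       RETURN m.name AS muscle, count(e) AS volume
       ORDER BY muscle`,
      { since }
    );
    // One row per muscle group, ready to compare year over year in the front end.
    return result.records.map(r => ({
      muscle: r.get('muscle'),
      volume: r.get('volume').toNumber(),   // counts come back as neo4j Integer objects
    }));
  } finally {
    await session.close();
  }
}

muscleBalance('2011-01-01').then(rows => console.log(rows));
```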
There is also a case for CouchDB here, since all the data for a workout can be stored as one unit of JSON. Since I don't expect to be editing my workouts all the time (unless I am cheating), Couch's append-only storage works in my favor and lookups stay cheap. I can also query the changes feed for timeline-based queries. This approach cuts down on the number of implicit relationships in the data; I probably only need a few "large tables" worth of different types of JSON documents, say a "workout_template" type and a "workout_actual" type to hold all my workout information. Any analysis would then lean on the front end to handle some of the legwork. For the muscle-group balancing query, it would be trivial to write a map-reduce query (three, actually: one synchronous and the others asynchronous) and manipulate the results in javascript (or just display them in a tabular format).
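Here is roughly what one of those map-reduce views might look like as a CouchDB design document, assuming workout documents shaped like the denormalized record sketched above. The design document name, view name and field names are all my own.

```javascript
// A rough CouchDB design document for the muscle-group balance query.
// Assumes workout documents shaped like the denormalized record sketched earlier.
// The map function is stringified and runs inside CouchDB, where emit() is provided.
const designDoc = {
  _id: '_design/workouts',
  views: {
    volume_by_muscle: {
      // Map: one row per (muscle, month) with the number of sets performed.
      map: function (doc) {
        if (doc.type !== 'workout_actual') return;
        var month = doc.date.slice(0, 7);           // e.g. '2012-03'
        doc.exercises.forEach(function (exercise) {
          exercise.muscles.forEach(function (muscle) {
            emit([muscle, month], exercise.sets.length);
          });
        });
      }.toString(),
      // Reduce: built-in sum, so querying with group=true returns total sets per [muscle, month].
      reduce: '_sum',
    },
  },
};
```

Querying the view with `group=true` (e.g. `GET /workouts_db/_design/workouts/_view/volume_by_muscle?group=true`, database name hypothetical) then gives per-muscle, per-month totals that the front end can chart or compare year over year, while the changes feed can drive the asynchronous, timeline-style views.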
Most of all, I feel that NoSQL solutions are completely separate from relational ones. Mixing the two may or may not make sense for your data, but the overhead of connecting to different data sources can be a lot less than the overhead of stuffing all of your data into one kind of data store and trying to query it in inefficient ways.
Labels: couchdb, neo4j, nosql, why you should hire me