Log-In to post
Hadoop isn't a simple database; it's a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a simple purpose - HDFS handles storage, MapReduce is a parallel job distribution system, HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page.
Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:
Good question! NoSQL databases are not renowned for their data security.
Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won't fly in a modern, litigation fueled world. As a result, Hadoop has support for a Kerberos authentication (now meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera's CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles the proper authentication, but it is still up to Hadoop and HDFS(it's a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.
In short: Kerberos guarantees that I'm Jeremiah Peschka and Hadoop+HDFS guarantee that I can muck with data.
N.B. As of 2010-10-21 (when I'm writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you'll have to build from the source control repository yourself.
Or any RDBMS for that matter. An example makes this easier to understand.
Let's say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly - create reports, analyze trends, detect anomalous data, etc. If we are using an RDBMS, we'd have to batch this data up (to avoid transaction log problems). We would also need to deal with the limitations of parallelism in your OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). SQL dialects are remarkably bad languages for loop and flow control.
If we were using Hadoop for our batch operations, we'd write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in your cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously. Each server is going to process 50GB. The results will be combined once the job is done and then we're free to do whatever we want with it - if we're using SQL Server to store the data in tables for report, we would probably bcp the data into a table.
Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to solve computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.
Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it's the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.
The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.
HBase provides the random realtime read/write that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google's Big Table) and there are no complex transactions.
Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you're used to seeing in SQL, including a SQL-ish query language that should be easily learned but also provides support for using MapReduce functionality in ad hoc queries.
In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.
Unfortunately, there are not a lot of management tools on the market for Hadoop. APIs are available to develop your own management tools. From the sound of discussions overheard at Hadoop World 2010, I think a lot of developers are going down this route: they're developing monitoring tools that meet their own, internal, needs rather than build general purpose tools that they could sell to third parties. As the core product improves, I'm sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.
Right now, there are a few products on the market that support Hadoop:
As I mentioned earlier, there is no support for ACID transactions. On the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they're retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS, you get this consistency for free just by using Hadoop.
As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.
If this seems like an overly simple explanation of how HDFS and Hadoop write data, you're right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).