Saturday, March 01, 2014

Evaluating NoSQL Databases - Apache HBase


There is no doubt that adoption of NoSQL databases is growing, from startups to large enterprises.  There is also an array of database choices in the market, and the purpose of this series of posts is to assist you in evaluating various NoSQL databases.  In this post we will look at Apache HBase, the column-oriented database, and explore various criteria with the goal of helping you make an informed decision.  HBase has been an open-source Apache top-level project since May 2010 and is part of the Apache Hadoop ecosystem.  It touts itself as a fault-tolerant and consistent database and is modeled on Google's BigTable.

Key Features 
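  • Strongly consistent reads and writes, with ACID guarantees at the level of a single row.  Multi-row transactions are not provided out of the box, so transactional applications need to be designed around row-level atomicity.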
  • A table can hold billions of rows, and each row may have anywhere from one column to millions of columns.  HBase is recommended for huge volumes of data.
  • HBase supports several compression algorithms, including Gzip (GZ), Lempel-Ziv-Oberhumer (LZO) and Snappy.  LZO is often recommended over Gzip because it compresses and decompresses faster, but because of its GPL license it doesn't come packaged with HBase.
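  • Bloom filters are a really cool data structure supported by HBase.  A Bloom filter answers the question "Might this row be in this file?": a "no" is always correct, so reads can skip files that definitely don't contain the row, while a "yes" only means the row is possibly there.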
  • Out-of-the-box versioning support: each cell can keep multiple timestamped versions of its value, a feature which makes HBase stand out
  • Supports high availability through automatic failover
  • The architecture scales out quite nicely, so hardware can simply be added on demand
  • Thanks to tools from vendors like Cloudera and Hortonworks, administration has become easier over the years and is still improving.
  • To achieve fault tolerance, data can be replicated across racks within a data center or between data centers.
  • Migrating a system (application) from an RDBMS to HBase requires a complete redesign and refactoring of code, not just a switch of JDBC drivers.  See the APIs bullet below for the supported integration interfaces.
  • Availability of tested HBase packages from commercial vendors has enabled faster time-to-market
  • Thanks to Hortonworks, HBase is now packaged for Windows, so it's easier for .NET shops to ship to market faster
  • APIs are available for Java, Thrift and REST protocols.  Support is also available for Avro.
  • Spring-Hadoop integration supports HBase
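
To make the versioning and Bloom filter bullets above a little more concrete, here is a minimal, self-contained Python sketch.  It is a toy model, not HBase code and not the HBase API: the class names `BloomFilter` and `ToyTable`, the `max_versions` parameter and the sample row key are all made up for illustration, and a single filter per table is a simplification of what a real store would do.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Toy Bloom filter: answers "might this key be present?"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class ToyTable:
    """Toy versioned cells: (row, column) -> {timestamp: value}."""

    def __init__(self, max_versions=3):
        # Hypothetical knob; assumed to be a positive integer.
        self.max_versions = max_versions
        self.cells = defaultdict(dict)
        self.row_filter = BloomFilter()

    def put(self, row, column, value, timestamp):
        versions = self.cells[(row, column)]
        versions[timestamp] = value
        # Keep only the newest max_versions timestamps.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]
        self.row_filter.add(row)

    def get(self, row, column, versions=1):
        # Skip the lookup entirely when the filter rules the row out.
        if not self.row_filter.might_contain(row):
            return []
        stored = self.cells.get((row, column), {})
        return sorted(stored.items(), reverse=True)[:versions]


table = ToyTable(max_versions=2)
table.put("user1", "info:email", "a@example.com", timestamp=1)
table.put("user1", "info:email", "b@example.com", timestamp=2)
table.put("user1", "info:email", "c@example.com", timestamp=3)

latest = table.get("user1", "info:email")               # newest version only
history = table.get("user1", "info:email", versions=5)  # all retained versions
print(latest)
print(history)
```

With `max_versions=2`, the third `put` pushes the oldest value out, so `history` holds only the two newest (timestamp, value) pairs, newest first.  A row that was never written returns an empty list whether or not the filter happens to give a false positive.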
Community Support
  • HBase has a fast-growing community of companies.  Hadoop vendors are also investing heavily in HBase development as they see the adoption rate growing in the enterprise.
  • Sources of support include the IRC channel (irc://) and the mailing lists
Learning Curve
  • As noted above, APIs are available for a variety of popular platforms, which should shorten the learning curve.
  • Since HBase is part of the Hadoop ecosystem, the talent pool is growing quite rapidly.
  • Simple syntax for developers to learn and remember
Cost
  • Since HBase is open source, you pay practically nothing for the product itself.  There are still talent and infrastructure costs, however.
  • Runs on commodity hardware so the cost is reasonable
  • Vendors offer professional support which varies based on your needs
Prominent Users
  • Facebook, Meetup, eBay, Ning, StumbleUpon and Yahoo! use HBase in their production environments
Product Roadmap
  • Ability to take snapshots/backups on demand and restore them at a later point in time
  • Monitoring and diagnostics tools
  • Improvements to reliability and high availability
  • Cell-level security
As they say, one size doesn't fit all, and hopefully this post addresses the questions and concerns you have in mind.  We have seen that HBase is a good fit where there is a huge volume of data with columnar requirements.  Those are obviously not the only criteria, and you may need to consider the other factors listed above when making your selection.
