Thursday, March 06, 2014

Evaluating NoSQL databases - MongoDB

Introduction

MongoDB is undoubtedly one of the most popular document-oriented databases. It combines the powerful queryability of a relational database with the distributed nature of NoSQL databases like HBase or Cassandra. We will see in this post that MongoDB provides a sophisticated set of features, so you can decide how well it meets your needs.

Key Features 
  • High Availability through replicated servers and automatic master failover
  • ACID compliance at single document level including nested documents
  • Scalability is achieved through automatic sharding 
  • Provides a distributed filesystem known as GridFS, which can be accessed from the command line
  • Built-in functions and UDFs are written using JavaScript
  • Provides built-in support for Map/Reduce/Finalize
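
To make the Map/Reduce/Finalize item concrete, here is a minimal sketch of the three phases in plain Python. It is not MongoDB's API (as noted above, MongoDB's own functions are written in JavaScript); the `map_reduce` helper, the `orders` documents, and the field names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn, finalize_fn=None):
    """Run map over every document, reduce per key, then optionally finalize."""
    grouped = defaultdict(list)
    for doc in docs:
        # Map phase: each document emits zero or more (key, value) pairs.
        for key, value in map_fn(doc):
            grouped[key].append(value)

    results = {}
    for key, values in grouped.items():
        # Reduce phase: collapse all values emitted for one key.
        reduced = reduce_fn(key, values)
        # Finalize phase: post-process the reduced value (for example, an average).
        results[key] = finalize_fn(key, reduced) if finalize_fn else reduced
    return results


# Hypothetical order documents, made up for this example.
orders = [
    {"customer": "alice", "total": 30.0},
    {"customer": "bob", "total": 12.5},
    {"customer": "alice", "total": 20.0},
]


def map_order(doc):
    yield doc["customer"], {"count": 1, "sum": doc["total"]}


def reduce_orders(key, values):
    return {
        "count": sum(v["count"] for v in values),
        "sum": sum(v["sum"] for v in values),
    }


def finalize_order(key, reduced):
    return {**reduced, "avg": reduced["sum"] / reduced["count"]}


summary = map_reduce(orders, map_order, reduce_orders, finalize_order)
print(summary)
```

The finalize step is where a derived value such as the average belongs, because an average cannot be combined correctly across partial reduce results while a count and a sum can.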
Administration
  • Provides an admin shell for administrative tasks
  • There is a UI for administration through MongoDB Management Service (a hosted service)
Migration
  • Migrating a system (application) from an RDBMS to MongoDB requires a complete redesign and refactoring of code, not just a switch of drivers. Please see the Drivers section for supported integration interfaces.
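
As a rough illustration of why this is a redesign rather than a driver swap, the sketch below folds flat, joined relational rows into nested documents, which is the shape a document database typically stores. It uses only plain Python; the `rows` data, the field names, and the `rows_to_documents` helper are hypothetical, and no MongoDB driver call is involved.

```python
import json

# Hypothetical result of joining a customers table with an orders table.
rows = [
    {"customer_id": 1, "name": "Alice", "order_id": 101, "item": "book", "qty": 2},
    {"customer_id": 1, "name": "Alice", "order_id": 102, "item": "pen", "qty": 5},
    {"customer_id": 2, "name": "Bob", "order_id": 103, "item": "lamp", "qty": 1},
]


def rows_to_documents(rows):
    """Group joined rows into one document per customer with embedded orders."""
    docs = {}
    for row in rows:
        # One document per customer; the customer columns are stored once.
        doc = docs.setdefault(
            row["customer_id"],
            {"_id": row["customer_id"], "name": row["name"], "orders": []},
        )
        # Each order becomes a nested sub-document instead of a separate row.
        doc["orders"].append(
            {"order_id": row["order_id"], "item": row["item"], "qty": row["qty"]}
        )
    return list(docs.values())


documents = rows_to_documents(rows)
print(json.dumps(documents[0], indent=2))
```

Whether orders should be embedded like this or kept in their own collection depends on how the application reads and updates them, which is exactly the kind of decision the redesign has to make.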
Time-to-market
  • With a straightforward installation and administration model, it has a very fast time-to-market. In this respect it compares favorably with HBase.
Drivers
  • Drivers are available for Java, C, C++, Erlang, C#, Perl, Scala, Ruby, Python, and PHP
Community Support
  • MongoDB's huge installation base is expanding very rapidly.
  • Sources of support are listed here
Cost
  • Since MongoDB is open-source, you practically don't have to spend anything on the product itself. But there are talent and infrastructure costs
  • Runs on commodity hardware so the cost is reasonable
  • Vendors offer professional support which varies based on your needs
Prominent Users
  • SAP, MetLife, eBay, MTV,  SourceForge and many others use MongoDB in their production environments
Security
  • Control access to MongoDB instances using authentication and authorization
  • Control access to sharded clusters with key files
Supported Operating Systems
  • Windows, Linux, Mac OS X, Solaris
Resources
Conclusion

MongoDB is a widely adopted NoSQL database that suits mid-sized to huge data volumes, unlike HBase and other databases that are suited mainly to huge volumes of data. It provides a familiar programming paradigm with JavaScript and drivers for popular languages, which makes it a convenient choice.

Saturday, March 01, 2014

Evaluating NoSQL Databases - Apache HBase

Introduction

There is no doubt that adoption of NoSQL databases is growing, from startups to large enterprises. There is also an array of database choices in the market, and the purpose of this series of posts is to assist you in evaluating various NoSQL databases. In this post we will look at Apache HBase, the column-oriented database, and explore various criteria with the goal of helping you make an informed decision. HBase has been an open-source Apache top-level project since May 2010 and is part of the Apache Hadoop ecosystem. It touts itself as a fault-tolerant and consistent database and is based upon Google's BigTable.

Key Features 
  • ACID compliant database that can run transactional applications
  • A table may have billions of rows, and each row may have anywhere from one column to millions of columns. HBase is recommended for huge volumes of data.
  • HBase supports two types of compression algorithms: Gzip (GZ) and Lempel-Ziv-Oberhumer (LZO). LZO is highly recommended over Gzip, but due to licensing issues LZO doesn't come packaged with HBase
  • The Bloom filter is a really cool data structure supported by HBase which answers the question: "Might this data be present?" A "no" is always correct, while a "yes" can occasionally be a false positive.
  • Out of the box versioning support, a unique feature which makes HBase stand out
  • Supports High availability through automatic failover
  • Architecture facilitates scaling out quite nicely, so hardware can simply be added on demand
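
For readers who have not met the Bloom filter mentioned in the list above, here is a minimal sketch of the data structure in plain Python. It is not HBase's implementation; the `BloomFilter` class, its default sizes, and the row keys are assumptions chosen only to show the idea: several hash functions set bits on insert, and a lookup reports "possibly present" only if every one of those bits is set.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        # Sizes are arbitrary illustration values, not tuned recommendations.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


row_keys = BloomFilter()
for key in ("row-001", "row-002", "row-003"):
    row_keys.add(key)

print(row_keys.might_contain("row-002"))
```

The payoff is that a "definitely absent" answer lets a store skip reading a file from disk entirely, at the cost of a small bit array kept in memory.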
Administration
  • Thanks to tools from vendors like Cloudera and Hortonworks, administration has become easier over the years and is improving.
  • To achieve fault tolerance, data replication can be configured within a data center or across data center racks.
Migration
  • Migrating a system (application) from an RDBMS to HBase requires a complete redesign and refactoring of code, not just a switch of JDBC drivers. Please see the APIs section for supported integration interfaces.
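
One way to picture the redesign is that a relational row stops being a fixed tuple and becomes a set of cells, each addressed by a row key, a "family:qualifier" column name, and a timestamp, with older versions kept alongside the newest. The sketch below is a toy in-memory model in plain Python, not the HBase client API; the `VersionedTable` class, the `user#42` row key, the `info` column family, and the `max_versions` value are all hypothetical.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        # max_versions is an arbitrary illustration value.
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        versions = self.rows[row_key][column]
        versions.append((timestamp, value))
        # Keep the newest versions first and drop any beyond the limit.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row_key, column, versions=1):
        cells = self.rows.get(row_key, {}).get(column, [])
        return cells[:versions]


# Hypothetical relational row: users(id=42, name='Alice', email='alice@example.com')
table = VersionedTable()
table.put("user#42", "info:name", "Alice", timestamp=1)
table.put("user#42", "info:email", "alice@example.com", timestamp=1)
# An update does not overwrite the cell; it adds a newer version.
table.put("user#42", "info:email", "alice@new.example", timestamp=2)

print(table.get("user#42", "info:email"))
print(table.get("user#42", "info:email", versions=2))
```

Because reads are driven by the row key, choosing that key (here a made-up `user#42`) is usually the central design decision when moving a schema over.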
Time-to-market
  • Availability of tested HBase packages from commercial vendors has enabled faster time-to-market
  • Thanks to Hortonworks, HBase is now packaged for Windows, so it's easier for .NET shops to ship to market faster
APIs
  • APIs are available for Java, Thrift and REST protocols.  Support is also available for Avro.
  • Spring-Hadoop integration supports HBase
Community Support
  • HBase has a fast-growing community of companies. Hadoop vendors are also investing heavily in HBase development as they see the adoption rate growing in the enterprise.
  • Sources of support are IRC channel: irc://irc.freenode.net/#hbase and mailing lists
Learning Curve
  • As we have seen, there is support for a variety of APIs on popular platforms, which should shorten the learning curve.
  • Since HBase is part of the Hadoop ecosystem, the talent pool is growing quite rapidly.
  • Simple syntax for developers to learn and remember
Cost
  • Since HBase is open-source, you practically don't have to spend anything on the product itself. But there are talent and infrastructure costs
  • Runs on commodity hardware so the cost is reasonable
  • Vendors offer professional support which varies based on your needs
Prominent Users
  • Facebook, Meetup, eBay, Ning, StumbleUpon and Yahoo! use HBase in their production environments
Product Roadmap
  • Ability to take snapshots/backups and restore them on demand at a later point in time
  • Monitoring and diagnostics tools
  • Improvement to reliability and high-availability
  • Cell-level security
Resources
Conclusion
As they say, one size doesn't fit all, and hopefully this post addresses the questions and concerns you have in mind. We have seen that HBase fits where there is a huge volume of data with columnar requirements. Those are obviously not the only criteria; you may need to consider the other factors listed above when making the selection.
