A Big High Five to Hive

The Big Data landscape is like a kaleidoscope. It has taken different forms and shapes depending on the problem at hand. Some of these forms are:

  • MapReduce was the first popular and successful form of Big Data processing. It moved the computation to where the data lives, instead of fetching the data to the compute.
  • Hadoop Distributed File System (HDFS) became the de facto distributed storage. Given the volume of data that Big Data had to process, compute and storage became separate entities: data sits in distributed storage, and compute can be done by any other tool (unlike databases, where storage and compute are tightly integrated and monolithic).
  • Hive provided near-ANSI-SQL capabilities for querying data in HDFS. NoSQL stores followed, which led to a radically different view of handling data.
  • Another popular compute engine came in the form of Spark, which works in memory and runs distributed computation much faster than MapReduce.

As enterprises started adopting Big Data, they needed data ingestion, authentication, encryption, ACLs and scheduling, and along came many tools, including but not limited to Sqoop, Flume, Kafka, Kerberos, YARN, Ambari and Oozie.

The tools kept multiplying, and when deploying them all became too much, Hadoop distributions (Hortonworks, Cloudera) became commonplace. And I haven’t even started on EMR and HDInsight yet.

The question is:

Is this going to be an ever-expanding galaxy? Will there be no standardization? Will this change every six months, forcing developers to reorient?

Well, there is one standard. It is something all developers just know: SQL queries. For them, accessing data through SQL queries is the easiest and most familiar form of interaction.

Because of this, Hive is turning out to be the standard. I don’t mean Hive over MapReduce here; I mean the Hive Metastore. Practically every query engine over HDFS leverages the Hive Metastore:

  • Hive Over Tez (Query engine from Hortonworks)
  • Impala (Query Engine from Cloudera)
  • BigSQL (Query Engine from BigInsights)
  • Presto (Facebook’s query engine)
  • Spark-SQL (Spark’s own version of Query Engine)
  • Drill (although this query engine takes a different approach, it still supports the Hive Metastore)

They all claim to be the fastest, each in its own way and using its own technique. But they also say they use the Hive Metastore, either by default or as a plug-in.
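
To make the idea of a shared metastore concrete, here is a minimal Python sketch. It is a toy, not the Hive Metastore API: `TableInfo`, `ToyMetastore` and `describe_plan` are hypothetical names, the HDFS path is made up, and the engine names are used only as labels. It assumes nothing more than that a metastore records each table's schema, location and file format in one place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """What a catalog records about a table: schema, location, format."""
    name: str
    columns: tuple      # (column_name, type) pairs
    location: str       # where the files live, e.g. an HDFS path
    file_format: str


class ToyMetastore:
    """Hypothetical stand-in for a shared catalog (NOT the Hive Metastore API)."""

    def __init__(self):
        self._tables = {}

    def register(self, info):
        self._tables[info.name] = info

    def lookup(self, name):
        return self._tables[name]


def describe_plan(engine, metastore, table):
    """Every 'engine' resolves the table through the same catalog entry."""
    info = metastore.lookup(table)
    cols = ", ".join(col for col, _ in info.columns)
    return (f"{engine}: scan {info.file_format} files at "
            f"{info.location} for columns ({cols})")


# The table is defined once, in the shared catalog.
metastore = ToyMetastore()
metastore.register(TableInfo(
    name="sales",
    columns=(("order_id", "bigint"), ("amount", "double")),
    location="hdfs:///warehouse/sales",   # made-up path for illustration
    file_format="ORC",
))

# Engine names are plain labels here; no real engine is being called.
plans = [describe_plan(engine, metastore, "sales")
         for engine in ("Presto", "Impala", "Spark-SQL")]
for plan in plans:
    print(plan)
```

The point of the sketch is that the table definition lives in one place, so switching the engine does not mean redefining the table.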

So, in this ever-expanding galaxy of Big Data query engines and tools, there is one Pole Star: the Hive Metastore. Hence, a Big High Five to Hive.
