The Big Data landscape is like a kaleidoscope. It has taken different forms and shapes depending on the problem being solved. Some of its various forms are:
As enterprises started adopting Big Data, they needed data ingestion, authentication, encryption, ACLs and scheduling, and along came many tools, including but not limited to Sqoop, Flume, Kafka, Kerberos, YARN, Ambari and Oozie.
The tools kept multiplying, and when deployment became too much to handle, Hadoop distributions such as Hortonworks and Cloudera became commonplace. And I haven’t even started on EMR and HDInsight yet.
Is this going to be an ever-expanding galaxy? Will there never be any standardization? Will this change every six months, forcing developers to re-orient each time?
Well, there is one standard, and it’s something all developers just know: SQL queries. For them, querying data with SQL is the easiest and most familiar form of interaction.
Because of this, Hive is turning out to be the standard. I don’t mean Hive over MapReduce here; I mean the Hive Metastore. Practically every query engine over HDFS leverages the Hive Metastore.
These engines all claim to be the fastest, each in its own way and with its own technique. But they also all say that they use the Hive Metastore, either by default or as a plug-in.
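To make the idea concrete, here is a toy sketch of what "sharing a metastore" means: one catalog maps table names to a location, file format and schema, and several engines resolve the same table through it. Everything in it (`ToyMetastore`, `ToyEngine`, `TableEntry`, the `sales` table and its path) is hypothetical and made up for illustration; it is not the real Hive Metastore API, just a model of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """Hypothetical catalog record: where a table lives and what it looks like."""
    location: str      # path to the data files, e.g. on HDFS
    file_format: str   # storage format of those files
    columns: tuple     # (name, type) pairs


class ToyMetastore:
    """Stand-in for a shared catalog; NOT the real Hive Metastore API."""

    def __init__(self):
        self._tables = {}

    def register(self, name, entry):
        self._tables[name] = entry

    def lookup(self, name):
        return self._tables[name]


class ToyEngine:
    """Hypothetical query engine that resolves table names via the shared catalog."""

    def __init__(self, engine_name, metastore):
        self.engine_name = engine_name
        self.metastore = metastore

    def plan(self, table):
        # The engine does not store table definitions itself; it asks the catalog.
        entry = self.metastore.lookup(table)
        cols = ", ".join(name for name, _ in entry.columns)
        return f"[{self.engine_name}] scan {entry.file_format} at {entry.location} -> ({cols})"


# One catalog, one table definition (illustrative values only).
metastore = ToyMetastore()
metastore.register(
    "sales",
    TableEntry("hdfs:///warehouse/sales", "parquet", (("id", "bigint"), ("amount", "double"))),
)

# Two different engines see the same table without redefining it.
engine_a = ToyEngine("engine_a", metastore)
engine_b = ToyEngine("engine_b", metastore)
print(engine_a.plan("sales"))
print(engine_b.plan("sales"))
```

The point of the sketch is only that the table is defined once and every engine reads that single definition, which is why a common metastore works as a de facto standard even when the engines themselves keep changing.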
So, in this ever-expanding galaxy of big data query engines and tools, there is one Pole Star: the Hive Metastore. And hence, a big high five to Hive.