Introduction of Hive

Hive, Originally developed by Facebook, It provides us data warehousing facilities on top of an existing Hadoop cluster.
Along with that it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them.

All commands of Hive and queries go to the Driver, which compiles the input, optimizes the computation required, and executes the required steps, usually with MapReduce jobs. When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs. Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the “job plan.” In other words, these generic modules function like mini language interpreters and the “language” to drive the computation is encoded in XML.

Hive communicates with the JobTracker to initiate the MapReduce job. Hive does not have to be running on the same master node with the JobTracker. In larger clusters, it’s common to have edge nodes where tools like Hive run. They communicate remotely with the JobTracker on the master node to execute jobs. Usually, the data files to be processed are in HDFS, which is managed by the NameNode. The Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata.

Use Hive when you have warehousing needs and you are good at SQL and don’t want to write MapReduce jobs. One important point though, Hive queries get converted into a corresponding MapReduce job under the hood which runs on your cluster and gives you the result. Hive does the trick for you. But each and every problem cannot be solved using HiveQL. Sometimes, if you need really fine grained and complex processing you might have to take MapReduce’s shelter.

%d bloggers like this: