Hive In Microsoft Azure HDInsight

You need an active Azure account to try out the steps mentioned in this article. First, provision an Azure storage account, then create an HDInsight cluster that uses that storage account. The Hadoop compute cluster is also the storage cluster, so create the HDInsight cluster in the same region as the storage account. Select the cluster and click the Manage Cluster icon located at the bottom of the page. This brings up the Hive page, from which you can submit HiveQL statements as jobs.



Copy and paste the following HiveQL DDL statements into the text area and click Submit. This will create a Hive database called suketudemo and a Hive table called suketudemo.incident_by_premise_type.

  CREATE DATABASE IF NOT EXISTS suketudemo LOCATION '/hive/warehouse';
  CREATE TABLE IF NOT EXISTS suketudemo.incident_by_premise_type(
      incident_year INT,
      division STRING,
      incident STRING,
      premise_type STRING,
      incident_count INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001';
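To confirm that the database and table were created with the expected schema, you can optionally run a quick check from the same Hive Editor (a minimal sanity check I am adding here; it is standard HiveQL, not part of the original walkthrough):

  SHOW DATABASES;
  -- Shows column types, the warehouse location, and the field delimiter
  DESCRIBE FORMATTED suketudemo.incident_by_premise_type;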
The job status refreshes every 30 seconds and changes from Queued to Running to Succeeded.



For this article, I am loading data into the Hive table from text files, with the columns delimited by Ctrl-A, which is '\001' or char(1). I used an SSIS Script task to convert the data published by the author into Ctrl-A delimited text files. The Hive data files have to be copied from your local drive to the Azure storage account; I am using Azure Storage Explorer, a free downloadable tool, to upload the text files.
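To make the file layout concrete, a single record in one of these files would look like the following, where ^A stands in for the invisible Ctrl-A character and the values are purely illustrative (the real files contain the published incident data):

  2014^ASydney^AArson^AResidential^A12

The five fields map, in order, to the incident_year, division, incident, premise_type, and incident_count columns of the table.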

Follow the instructions given below to configure Azure Storage Explorer and upload the text files. Once you have added your storage account to Azure Storage Explorer, select the HDInsight cluster, click Blobs, then click the Upload button and select the three text files. Once the files are uploaded, they are available in the home directory of the Azure storage account. You can verify this from the Hive File Browser page.

Navigate to the Hive Editor and run the following Hive statements to populate the Hive table, suketudemo.incident_by_premise_type.
  LOAD DATA INPATH '/Arson_Sydney_TressPass_Data_By_PremiseType.txt' INTO TABLE suketudemo.incident_by_premise_type;
  LOAD DATA INPATH '/TransportOffence_Sydney_Data_By_PremiseType.txt' INTO TABLE suketudemo.incident_by_premise_type;
  LOAD DATA INPATH '/Tresspass_Sydney_Data_By_PremiseType.txt' INTO TABLE suketudemo.incident_by_premise_type;
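Once the load jobs succeed, a quick row count confirms that the rows from all three files landed in the table (an optional sanity check I am adding here, not part of the original walkthrough):

  -- Count all rows loaded from the three text files
  SELECT COUNT(*) FROM suketudemo.incident_by_premise_type;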
Once the job is successful, you will notice that the text files have been moved from the home directory to /hive/warehouse/incident_by_premise_type.

Now the Hive table is populated and can be queried by submitting HiveQL DML jobs.
  SELECT incident, COUNT(1)
  FROM suketudemo.incident_by_premise_type
  GROUP BY incident;
Click View Details to see the query result.
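The same pattern extends to other aggregations. For example, a breakdown by premise type, sorted by frequency, could be submitted the same way (an illustrative query I am adding here, not part of the original walkthrough):

  -- Count incidents per premise type, most frequent first
  SELECT premise_type, COUNT(1) AS incident_count
  FROM suketudemo.incident_by_premise_type
  GROUP BY premise_type
  ORDER BY incident_count DESC;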

 
