Introduction
This article will help you learn Hadoop using HDInsight on Windows. Here, we will be running a Hive Query from the Azure Portal. Look at my previous article to understand what is Hadoop, what is HDInsight, and how to create a cluster in Hadoop.
Links
Click here for Kickstart Hadoop in HDInsight on Windows.
What is Hive?
Hive provides a Database Query Interface to Apache Hadoop. It is a component of HDP (Hortonworks Data Platform). Hive provides an interface like SQL, for storing the data in HDP.
In the previous article, we have created an HDInsight Cluster. Now, we will be running a Hive job to query a sample Hive table. We will be using hivesampletable which is available with HDInsight clusters. The table contains data about mobile device manufacturers, models, and platforms. A Hive query, on this table, retrieves the data for mobile devices by a specific manufacturer.
Pre-requisites
- An Azure Subscription.
Note: Click here to get a free trial account of Azure.
Follow the below-mentioned steps now.
Step 1 - Go to the Azure Portal and sign in to your account.
Step 2 - Click on More Services - HDInsight Clusters and click on HDInsight Clusters.
You can find the list of the Clusters that are available.
Step 3 - Click the name of the cluster that you want to use. Here, I will be using CCUG Cluster that I have created.
Step 4 - Go on the Dashboard of the cluster.
Clicking on the Dashboard will open a new webpage for you. Enter your login credentials for the cluster and login here.
Note: Use the default name of “admin” for login and password that you gave while creating the particular cluster.
From the above home screen of your cluster, click on Hive Editor under the Dashboard. Here, you can find the URL – www.ccug.azurehdinsight.net.
This means, you can also login to the webpage directly without using Azure portal.
The Dashboard at the top of the page can help us submit the Hive queries. Check the Hadoop job logs and browse the files in the storage.
Step 5 - For Query name, enter “HTC20”. This is the Job Title. Followed by this, enter the following Hive Query.
- Select * from hivesampletable
- WHERE devicemake LIKE "HTC%"
- LIMIT 20;
Click on Submit button after entering the Hive Query.
Under the Job Session, you can find that the job has been completed.
Step 6 - Click on the query name, now, to see the output. Make a note of Job Start Time (UTC). We will be working on it later.
This page shows you the Job Output and Job Log. You have an option here where you can download the output file and log file.
Step 7 - Now, let’s browse the output file. Click on File Browser.
Click on the Storage account name over here, followed by the container name.
Container name
Step 8 - Click on User.
Step 9 - Click on Admin now.
Step 10 - Click on GUID that has the last modified time (the job which started after the job time that you noted earlier). Again, make a note of this GUID. We will be using this in our next module of work.
Follow my next article to learn connecting to Microsoft Business Intelligence tools for Excel.