Run Map Reduce WordCount Example On HDInsight Using PowerShell

First, create a Hadoop cluster on HDInsight in the Microsoft Azure portal. Open the portal with your active subscription, click New, and in the Data Services menu choose HDInsight, then the Hadoop option. In the form shown in the screenshot below, enter the cluster name, select the cluster size, and enter a password (it must combine digits, lowercase and uppercase letters, and a special character). Select the storage account you already created for Hadoop and click the Create HDInsight Cluster button. The cluster takes 10 to 15 minutes to provision.

[Screenshot: HDInsight]

Once the cluster has been created, its status is shown as Running, as in the screenshot below:

[Screenshot: HDInsight]

Next, open Windows PowerShell ISE on your workstation and add your Azure account with the following command. A sign-in window pops up, where you enter the email ID and password of your active Azure subscription. If you don't have an active Azure subscription, you can use the one-month Azure free trial.

Add-AzureAccount

Create a new .ps1 file and enter the following PowerShell commands in it. The script connects to the Hadoop cluster and runs a MapReduce job on the novel davinci.txt, using the wordcount class to count the occurrences of each word.

Editor – Untitled1.ps1

$ClusterName = "hdi0102"
# Define the MapReduce job (blob paths in wasb:// URIs are case-sensitive)
$WordCountJobDefinition = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///example/jars/hadoop-examples.jar" `
    -ClassName "wordcount" `
    -Arguments "wasb:///example/data/gutenberg/davinci.txt", `
               "wasb:///example/data/WordCountOutput"
# Submit the job
$WordCountJob = Start-AzureHDInsightJob `
    -Cluster $ClusterName `
    -JobDefinition $WordCountJobDefinition
# Wait for the job to finish (time out after one hour)
Wait-AzureHDInsightJob `
    -Job $WordCountJob `
    -WaitTimeoutSeconds 3600
# Show the job's standard error output
Get-AzureHDInsightJobOutput `
    -Cluster $ClusterName `
    -JobId $WordCountJob.JobId `
    -StandardError


Run the Untitled1.ps1 file.
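To make concrete what the wordcount job computes, here is a minimal local sketch in plain PowerShell. It is not the Hadoop implementation and does not touch the cluster. The sample lines are hypothetical stand-ins for davinci.txt, and splitting on whitespace with case-sensitive counting is an assumption about how the example class tokenizes its input.

```powershell
# Hypothetical sample input standing in for davinci.txt
$SampleLines = @(
    'the quick brown fox',
    'jumps over the lazy dog',
    'the end'
)

# "Map" step: split each line on whitespace and emit one token per word
$Words = foreach ($Line in $SampleLines) {
    $Line -split '\s+' | Where-Object { $_ -ne '' }
}

# "Reduce" step: sum the occurrences of each distinct word.
# A generic Dictionary keeps the keys case-sensitive (a plain @{} would not).
$WordCounts = New-Object 'System.Collections.Generic.Dictionary[string,int]'
foreach ($Word in $Words) {
    if ($WordCounts.ContainsKey($Word)) {
        $WordCounts[$Word] += 1
    }
    else {
        $WordCounts[$Word] = 1
    }
}

# Print one "word<TAB>count" line per word, the same layout as part-r-00000
$WordCounts.GetEnumerator() |
    Sort-Object Key |
    ForEach-Object { "{0}`t{1}" -f $_.Key, $_.Value }
```

On the cluster, the map and reduce steps run in parallel over blocks of the input file, which is why the job is worth submitting to Hadoop for large texts.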

To fetch the output data, create another .ps1 file with the following PowerShell commands. They access the storage account by its name and access key and download the result of the wordcount job. The output file is named part-r-00000.

Untitled2.ps1

mkdir \Tutorials
cd \Tutorials
$StorageAccountName = "hadoopstorage"
$ContainerName = "hdi0102"
# Create the storage account context object
$StorageAccountKey = Get-AzureStorageKey $StorageAccountName | %{ $_.Primary }
$StorageContext = New-AzureStorageContext `
    -StorageAccountName $StorageAccountName `
    -StorageAccountKey $StorageAccountKey
# Download the job output to the workstation
Get-AzureStorageBlobContent `
    -Context $StorageContext -Force `
    -Container $ContainerName `
    -Blob example/data/WordCountOutput/part-r-00000
# Show the lines of the output that contain "there"
cat ./example/data/WordCountOutput/part-r-00000 | findstr "there"


Run the Untitled2.ps1 file. Afterwards, you can load all the resulting data into an Excel sheet:

  • Open Excel and click the Power Query menu.
  • Click From Other Sources.
  • Click From HDInsight.
  • Enter the account name (the storage account).
  • Enter the account key.

The Navigator pane then opens on the right and fetches the data into the Excel sheet.
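As an alternative to Excel, the downloaded file can also be inspected directly in PowerShell. The sketch below assumes the usual word<TAB>count layout of part-r-00000. The sample lines and their counts are hypothetical, not real results from davinci.txt.

```powershell
# Hypothetical sample lines in the word<TAB>count layout of part-r-00000.
# With the real download, replace this array with:
#   Get-Content .\example\data\WordCountOutput\part-r-00000
$OutputLines = @(
    "there`t12",
    "therefore`t3",
    "word`t40",
    "zeal`t1"
)

# Turn each line into an object with a Word and a numeric Count property
$Results = foreach ($Line in $OutputLines) {
    $Fields = $Line -split "`t"
    [PSCustomObject]@{
        Word  = $Fields[0]
        Count = [int]$Fields[1]
    }
}

# Show the most frequent words first
$TopWords = $Results | Sort-Object Count -Descending | Select-Object -First 3
$TopWords | Format-Table -AutoSize

# Exact, case-sensitive lookup; findstr "there" would also match "therefore"
$Exact = $Results | Where-Object { $_.Word -ceq 'there' }
$Exact
```

Converting the count to an integer matters here: sorting the raw strings would place "3" after "12".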
