Let us start analyzing BigData in the Windows Azure-Hadoop environment

Posted by Niladri.Biswas in the Cloud category for Beginner level


Introduction

In this step-by-step article we will look at how to implement a Map-Reduce job for analyzing BigData in the Windows Azure-Hadoop environment.

What is Hadoop?

Hadoop is a Java-based framework for processing large volumes of data (often BigData) in a distributed computing environment.

Straight to the experiment

Step 1:

At present, a Developer Preview version of Hadoop is available on Windows Azure. So let us visit https://www.hadooponazure.com/

Step 2:

We need to provide an "Invitation Code", which we don't have yet, so we need to sign up for one. On the screen that appears, enter the required information.

Once the form is filled in, submit it and we will receive a confirmation message.

Be patient, as it takes some time to get the "Invitation Code" (I received mine within 3 business days).

Step 3:

Once the "Invitation Code" is received, enter it and click on the "Enroll" button.

Step 4:

That will take us to the cluster creation page, where we need to provide some information such as "DNS Name", "Cluster Login" and "Cluster Password", and finally click on the "Request Cluster" button.

It will take some time for the cluster to be set up. Once it is ready, click on the "Go to Cluster" button to access the cluster dashboard.

Step 5:

Now it's time to write the Map and Reduce programs. Fire up Visual Studio and create two console applications. Name one "WordCountMapper" and the other "WordCountReducer". In the "WordCountMapper" project, create a class file named "WordCountMapper.cs" and write the code below

using System;
using System.IO;

namespace WordCountMapper
{
    class WordCountMapper
    {
        static void Main(string[] args)
        {
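            // If a file path is passed as an argument, read from that file
            // instead of standard input (useful for local testing).
            if (args.Length > 0)
            {
                Console.SetIn(new StreamReader(args[0]));
            }

            string line;

            // Read the input line by line until the end of the stream.
            while ((line = Console.ReadLine()) != null)
            {
                // Emit every space-separated word on its own line.
                foreach (var word in line.Split(' '))
                {
                    Console.WriteLine("{0}", word);
                }
            }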
        }
    }
}

Here we simply read each line, split it on the space character (' ') and write each word on its own line. For example, if the input is "Bear Cat Deer", the output is three lines: "Bear", "Cat" and "Deer".
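The mapper above writes an empty line whenever the input contains consecutive spaces, and it emits bare words rather than key/value pairs. As a rough sketch only (the names WordCountMapperSketch, TabbedWordCountMapper and Tokenize are illustrative and are not part of the attached source), a variant could drop empty tokens and emit each word followed by a tab and a count of 1, which is the tab-separated key/value convention Hadoop Streaming uses by default. The reducer in this article expects bare words, so it would have to be adapted before it could consume this format.

```csharp
using System;
using System.Collections.Generic;

namespace WordCountMapperSketch
{
    // Hypothetical variant of the mapper; all names here are illustrative.
    public static class TabbedWordCountMapper
    {
        // Splits a line on spaces and tabs and drops empty tokens,
        // so repeated separators do not produce empty "words".
        public static List<string> Tokenize(string line)
        {
            var separators = new[] { ' ', '\t' };
            return new List<string>(
                line.Split(separators, StringSplitOptions.RemoveEmptyEntries));
        }

        public static void Main(string[] args)
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                foreach (var word in Tokenize(line))
                {
                    // Key (the word) and value (the count 1) separated by a tab.
                    Console.WriteLine("{0}\t1", word);
                }
            }
        }
    }
}
```

Emitting an explicit count leaves room for partial sums later on, at the cost of slightly more parsing in the reducer.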

In the "WordCountReducer" project, create a class file named "WordCountReducer.cs" and write the code below

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace WordCountReducer
{
    class WordCountReducer
    {
        static void Main(string[] args)
        {
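            // If a file path is passed as an argument, read from that file
            // instead of standard input (useful for local testing).
            if (args.Length > 0)
            {
                Console.SetIn(new StreamReader(args[0]));
            }

            string line;
            var collection = new List<string>();

            // Collect every word emitted by the mapper, one per line.
            while ((line = Console.ReadLine()) != null)
            {
                collection.Add(line);
            }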

            // Sort the words, group identical ones and print each word with its count.
            collection
                .OrderBy(o => o)
                .GroupBy(g => g)
                .Select(i => new
                {
                    Word = i.Key,
                    Occurrence = i.Count()
                })
                .ToList()
                .ForEach(i => Console.WriteLine("Word {0} appeared {1} time(s)", i.Word, i.Occurrence));
        }
    }
}

In this case we read the individual words received from the Map program (shown above) and put them into a collection of strings. Next we sort the collection, group it by word and count the number of occurrences of each word. Finally we display the result. For example, if the input lines are "Bear", "Cat" and "Bear", the output is "Word Bear appeared 2 time(s)" followed by "Word Cat appeared 1 time(s)".
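The reducer above holds every input line in memory before grouping. Hadoop Streaming sorts the mapper output by key before handing it to the reducer, so identical words arrive on consecutive lines and can be counted as a running total instead. The following is a minimal sketch under that assumption (the names WordCountReducerSketch, RunLengthReducer and CountRuns are hypothetical and not part of the attached source). Trimming a trailing tab is a defensive assumption, in case the framework appends a key/value separator to lines that carry no value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

namespace WordCountReducerSketch
{
    // Hypothetical alternative reducer; all names here are illustrative.
    public static class RunLengthReducer
    {
        // Counts runs of identical consecutive lines.
        // Assumes the input is already sorted, so equal words are adjacent.
        public static IEnumerable<KeyValuePair<string, int>> CountRuns(TextReader reader)
        {
            string current = null;
            int count = 0;
            string line;

            while ((line = reader.ReadLine()) != null)
            {
                // Defensive: drop a trailing tab or space after the word.
                string word = line.TrimEnd('\t', ' ');

                if (count > 0 && word == current)
                {
                    count++;
                }
                else
                {
                    // A new word starts: emit the finished run, if any.
                    if (count > 0)
                    {
                        yield return new KeyValuePair<string, int>(current, count);
                    }
                    current = word;
                    count = 1;
                }
            }

            // Emit the last run.
            if (count > 0)
            {
                yield return new KeyValuePair<string, int>(current, count);
            }
        }

        public static void Main(string[] args)
        {
            foreach (var pair in CountRuns(Console.In))
            {
                Console.WriteLine("Word {0} appeared {1} time(s)", pair.Key, pair.Value);
            }
        }
    }
}
```

Only the current word and its count are kept in memory, which matters once the input no longer fits in a single list.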

Once done, build both projects in Release mode.
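Before deploying, it can help to check the map, sort and reduce flow locally. The sketch below imitates it in a single process: it splits lines into words as the mapper does, sorts them as the framework would between the two phases, and then groups and counts as the reducer does. The names StreamingSketch, LocalPipeline, Map, Reduce and Run are made up for this illustration. It is only an approximation of what happens on the cluster, where the two executables run as separate processes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace StreamingSketch
{
    // Hypothetical in-process imitation of the map -> sort -> reduce flow.
    public static class LocalPipeline
    {
        // Mirrors WordCountMapper: one output item per space-separated word.
        public static IEnumerable<string> Map(IEnumerable<string> lines)
        {
            return lines.SelectMany(l => l.Split(' '));
        }

        // Mirrors WordCountReducer: group identical words and count them.
        public static IEnumerable<string> Reduce(IEnumerable<string> words)
        {
            return words
                .GroupBy(w => w)
                .Select(g => string.Format("Word {0} appeared {1} time(s)", g.Key, g.Count()));
        }

        public static List<string> Run(IEnumerable<string> inputLines)
        {
            var mapped = Map(inputLines);
            // Stand-in for the sort that happens between the map and reduce phases.
            var sorted = mapped.OrderBy(w => w, StringComparer.Ordinal);
            return Reduce(sorted).ToList();
        }

        public static void Main(string[] args)
        {
            var sample = new[] { "Bear Cat Deer", "Cat Bear Cat" };
            foreach (var resultLine in Run(sample))
            {
                Console.WriteLine(resultLine);
            }
        }
    }
}
```

With the two sample lines in Main, the counts are 2 for Bear, 3 for Cat and 1 for Deer.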

Step 6:

Now we will deploy our mapper, reducer and input file to the Azure Hadoop cluster. To do that, go to the Azure Hadoop cluster dashboard and select the JavaScript console. The interactive console window will appear.

Type fs.put() to upload the mapper file.

We need to specify a destination file path (the path in HDFS where the file will be stored and from which it will be read during processing). Here the destination path is "/example/apps/WordCountMapper.exe".

Click on the "Upload" button and the file will be uploaded.

We need to do the same for the reducer file (destination path: "/example/apps/WordCountReducer.exe") and for the input file (destination path: "/example/data/WordCountInputFile.txt").

To verify that the files have been properly uploaded to HDFS, we can issue the #ls command against the destination folder.

If we need to see the content of a file, we can issue the #cat command against it.

Step 7:

The next step is to create a job. Click on the "Create Job" icon on the cluster dashboard.

The "Create Job" window will appear.

We need to provide a "Job Name" and the "Hadoop Streaming Jar file name". Since our Map-Reduce programs are written in C# rather than Java (Hadoop's first-class language), we need to use Hadoop Streaming, which lets any executable that reads from standard input and writes to standard output act as a mapper or reducer. The Hadoop Streaming jar file can be obtained from here

Next, enter the values below (note that you may need to change them if you have chosen different paths).

Hadoop jar hadoop-streaming.jar 
-files "hdfs:///example/apps/WordCountMapper.exe,hdfs:///example/apps/WordCountReducer.exe" 
-input "/example/data/WordCountInputFile.txt" -output "/example/data/WordCountOutput" 
-mapper "WordCountMapper.exe" -reducer "WordCountReducer.exe"

Finally, click on the "Execute Job" button. If everything goes well, the job output tells us that the output file (i.e. the reducer's output) can be found at the "/example/data/WordCountOutput" location.

Step 8:

Let us go back to the "Interactive JavaScript" console and run the #cat command on the output file to get the desired result.


Conclusion

I hope this tutorial serves as a good starting point for working with Hadoop on Azure. The Map-Reduce programs are attached. Thanks for reading.


About the Author

Full Name: Niladri Biswas
Country: India
Technical Lead at HCL Technologies
