Tags: c#, mapreduce, hadoop-streaming, azure-hdinsight

MapReduce with C#: Process whole input files


Problem:

I'm creating a MapReduce application in C# for HDInsight. I need to process whole input files.

I understand there are two options available in Hadoop to achieve this:

  • Deriving from the InputFormat class and letting isSplitable always return false
  • Setting min_splitsize to a large enough value

I can't figure out how to achieve either of these options using C# on HDInsight.

Details:

I'm either

  • Using Microsoft.Hadoop.MapReduce and starting the job via hadoop.MapReduceJob.ExecuteJob<MyJob>(); (a minimal sketch follows below)

  • Or simply creating a console application and starting it from Azure PowerShell via

    $mrJobDef = New-AzureHDInsightStreamingMapReduceJobDefinition -JobName MyJob -StatusFolder $mrStatusOutput -Mapper $mrMapper -Reducer $mrReducer -InputPath $mrInput -OutputPath $mrOutput

    $mrJobDef.Files.Add($mrMapperFile)

    $mrJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mrJobDef
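
For context, here is a minimal sketch of what the first option looks like on my side (MyMapper, MyJob, and the paths are placeholders):

    using Microsoft.Hadoop.MapReduce;

    public class MyMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // Placeholder logic -- the real mapper needs the whole file, not single lines.
            context.EmitKeyValue("file", inputLine);
        }
    }

    public class MyJob : HadoopJob<MyMapper>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            return new HadoopJobConfiguration
            {
                InputPath = "/example/input",    // placeholder
                OutputFolder = "/example/output" // placeholder
            };
        }
    }

    public static class Program
    {
        public static void Main()
        {
            // Connect() without arguments targets the local cluster; an overload
            // taking the cluster URI and credentials exists for remote clusters.
            var hadoop = Hadoop.Connect();
            hadoop.MapReduceJob.ExecuteJob<MyJob>();
        }
    }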

A solution for either way would help a lot.


Solution

  • You can set min_splitsize using the -Defines parameter in PowerShell. (The streaming cmdlet New-AzureHDInsightStreamingMapReduceJobDefinition used in the question also has a -Defines parameter; note that in stock Hadoop the property is called mapred.min.split.size and takes a value in bytes.)

    $clusterName = "YourClusterName"
    $jobConfig = @{ "min_splitsize" = "512mb"; "mapred.output.compression.codec" = "org.apache.hadoop.io.compress.GzipCodec" }
    $myWordCountJob = New-AzureHDInsightMapReduceJobDefinition -JarFile "/example/jars/hadoop-examples.jar" `
        -ClassName "wordcount" -JobName "WordCountJob" `
        -StatusFolder "/MyMRJobs/WordCountJobStatus" -Defines $jobConfig

    or in C#:

        var mapReduceJob = new MapReduceJobCreateParameters()
        {
            ClassName = "wordcount",                        // required
            JobName = "MyWordCountJob",                     // optional
            JarFile = "/example/jars/hadoop-examples.jar",  // required; alternative syntax: wasb://[email protected]/example/jar/hadoop-examples.jar
            StatusFolder = "/AzimMRJobs/WordCountJobStatus" // optional, but useful for finding where logs are uploaded in Azure Storage
        };

        mapReduceJob.Defines.Add("min_splitsize", "512mb");
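
    The parameters object alone doesn't submit anything. If I remember the .NET SDK (Microsoft.Hadoop.Client) correctly, submission then looks roughly like this (the subscription ID, certificate path, and cluster name are placeholders):

        using System;
        using System.Security.Cryptography.X509Certificates;
        using Microsoft.Hadoop.Client;

        // Placeholder credentials -- adjust to your subscription and cluster.
        var cert = new X509Certificate2(@"C:\path\to\management-cert.pfx");
        var creds = new JobSubmissionCertificateCredential(
            new Guid("00000000-0000-0000-0000-000000000000"), // subscription ID
            cert,
            "YourClusterName");

        var jobClient = JobSubmissionClientFactory.Connect(creds);

        // CreateMapReduceJob submits the definition built above.
        JobCreationResults results = jobClient.CreateMapReduceJob(mapReduceJob);
        Console.WriteLine("Submitted job {0}", results.JobId);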
    

    Note, though, that I don't think this guarantees that each file will be read as a whole. To guarantee that, you may need the Java SDK and a custom InputFormat whose isSplitable returns false, as explained here: http://www.andrewsmoll.com/3-hacks-for-hadoop-and-hdinsight-clusters/

    Resources: http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/13/how-to-pass-hadoop-configuration-values-for-a-job-via-hdinsight-powershell-and-net-sdk.aspx