Problem:
I'm creating a MapReduce application in C# for HDInsight. I need to process whole input files (each file handled by a single mapper, never split).
I understand there are two options available in Hadoop to achieve this: raise the minimum split size so files aren't split across mappers, or implement a custom InputFormat whose isSplitable() returns false.
I can't figure out how to achieve either of these options using C# on HDInsight.
Details:
I'm either
Using Microsoft.Hadoop.MapReduce, and starting the job via hadoop.MapReduceJob.ExecuteJob<MyJob>();
Or by simply creating a console application and starting it from Azure PowerShell via
$mrJobDef = New-AzureHDInsightStreamingMapReduceJobDefinition -JobName MyJob -StatusFolder $mrStatusOutput -Mapper $mrMapper -Reducer $mrReducer -InputPath $mrInput -OutputPath $mrOutput
$mrJobDef.Files.Add($mrMapperFile)
$mrJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mrJobDef
A solution for either approach would help a lot.
You can set the minimum split size (mapred.min.split.size, which takes a value in bytes) using the -Defines parameter in PowerShell:
$clusterName = "YourClusterName"
$jobConfig = @{ "mapred.min.split.size"="536870912"; "mapred.output.compression.codec"="org.apache.hadoop.io.compress.GzipCodec" }
$myWordCountJob = New-AzureHDInsightMapReduceJobDefinition -JarFile "/example/jars/hadoop-examples.jar" -ClassName "wordcount" -JobName "WordCountJob" -StatusFolder "/MyMRJobs/WordCountJobStatus" -Defines $jobConfig
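Since the question uses a streaming job, the same hashtable should be usable there too. This is a sketch assuming New-AzureHDInsightStreamingMapReduceJobDefinition also exposes a -Defines parameter (as the other job-definition cmdlets do), reusing the variable names from the question:

```
$jobConfig = @{ "mapred.min.split.size"="536870912" }  # 512 MB, in bytes
$mrJobDef = New-AzureHDInsightStreamingMapReduceJobDefinition -JobName MyJob -StatusFolder $mrStatusOutput -Mapper $mrMapper -Reducer $mrReducer -InputPath $mrInput -OutputPath $mrOutput -Defines $jobConfig
$mrJobDef.Files.Add($mrMapperFile)
$mrJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mrJobDef
```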
Or in C#:
var mapReduceJob = new MapReduceJobCreateParameters()
{
    ClassName = "wordcount",                        // required
    JobName = "MyWordCountJob",                     // optional
    JarFile = "/example/jars/hadoop-examples.jar",  // required; alternative syntax: wasb://<container>@<account>.blob.core.windows.net/example/jars/hadoop-examples.jar
    StatusFolder = "/AzimMRJobs/WordCountJobStatus" // optional, but useful to know where logs are uploaded in Azure Storage
};
mapReduceJob.Defines.Add("mapred.min.split.size", "536870912");
Note that this doesn't guarantee that each file is read completely by a single mapper; it only raises the lower bound on split size. To truly guarantee whole-file processing you need a custom InputFormat, which requires the Java SDK, as explained here: http://www.andrewsmoll.com/3-hacks-for-hadoop-and-hdinsight-clusters/
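For reference, the usual Java-side pattern for this is a FileInputFormat subclass that overrides isSplitable to return false. This is an illustrative sketch, not code from the linked article; the class names (WholeFileInputFormat, WholeFileRecordReader) are my own placeholders, and it needs the Hadoop client jars on the classpath:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// An InputFormat that never splits its input files, so each map task
// receives exactly one whole file.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one file == one split == one mapper
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // A matching RecordReader would read the entire split into a single
        // BytesWritable record; its implementation is omitted here for brevity.
        return new WholeFileRecordReader();
    }
}
```

The job would then select it with job.setInputFormatClass(WholeFileInputFormat.class), which is why this route requires writing the job (or at least the input format) in Java rather than via the streaming/C# SDK path.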