I'm using Scalding on Hadoop and I have a large dataset in the form of a TypedPipe that I wish to output in chunks based on one of the data fields. For example, the data is <category, field1, field2>, and I want the data for each category stored as a SequenceFile in its own directory, e.g. outPath/cat1, outPath/cat2, etc. I also want this to happen in a single MapReduce phase (or at least avoid a loop).
I have read about the TemplatedTsv option here: How to bucket outputs in Scalding
and here: How to output data with Hive-style directory structure in Scalding?
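For reference, the approach from those answers looks roughly like this. It is only a sketch: the field names, the map to a tuple, and the (basePath, template, pathFields) constructor are my assumptions, and the snippet is written as if inside a Job so the Symbol-to-Fields conversions and the implicit flowDef/mode are in scope.

import com.twitter.scalding._

// The 'category value fills the "%s" path template, so each category is written
// under outPath/<category> by a single MapReduce phase.
data
  .map(t => (t.category, t.field1, t.field2))
  .toPipe(('category, 'field1, 'field2))
  .write(TemplatedTsv(outPath, "%s", 'category))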
However, this only works if you want a Tsv file, not a SequenceFile.
Obviously a loop works:
import com.twitter.scalding.WritableSequenceFile
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.commons.lang3.SerializationUtils // assuming Commons Lang for serialize()

for (category <- categories) {
  data
    .filter(_.category == category)
    // serialize the whole record into the value of a (NullWritable, BytesWritable) pair
    .map(t => (NullWritable.get, new BytesWritable(SerializationUtils.serialize(t))))
    .write(WritableSequenceFile(outPath + "/" + category))
}
So, is there an equivalent to TemplatedTsv that would work for writing a SequenceFile and avoid the loop?
There is com.twitter.scalding.TemplatedSequenceFile, which may do what you need. It works just like TemplatedTsv but writes its output to SequenceFiles.
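A rough sketch of how it might look for your data, assuming the constructor mirrors TemplatedTsv's (basePath, template, pathFields), that the sequence fields default to all fields, and that the code lives inside a Job (the 'category and 'value field names are just illustrative):

import com.twitter.scalding._
import org.apache.hadoop.io.BytesWritable
import org.apache.commons.lang3.SerializationUtils

// 'category fills the "%s" path template, so every category's records land under
// outPath/<category> in a single MapReduce phase; the serialized record travels
// along as 'value and ends up in the SequenceFiles.
data
  .map(t => (t.category, new BytesWritable(SerializationUtils.serialize(t))))
  .toPipe(('category, 'value))
  .write(TemplatedSequenceFile(outPath, "%s", 'category))

One caveat to check: since this goes through the SequenceFile scheme, the output files should contain serialized Cascading tuples (as with Scalding's plain SequenceFile source) rather than the raw NullWritable/BytesWritable pairs your WritableSequenceFile loop produces, so you would read them back through the same scheme.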