
Can Pig handle EBCDIC format files?


My question is simple: can Pig (Hadoop) handle EBCDIC files? I have some of them, and I'd like to process them with Pig on the Hadoop platform.

At the moment I've saved the file to HDFS and am trying to load it as follows:

A = LOAD '/user/enrico/FilesForPigs/IRIS.txt' AS (f1,f2,f3);

It seems to work, but when I run DUMP A; I get an error.

EDIT:

Following Donald's advice, I am trying to write a Java program to do the conversion. In particular, I am trying to create my own LOAD function.

I am stuck on the following part of the code:

    @Override
    public InputFormat getInputFormat() {
        return new TextInputFormat();
    }

This is the example I found, but TextInputFormat is not right for my case. Do you know how I can solve this?

Thanks


Solution

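  • No. The default storage mechanism (PigStorage) assumes the data is ASCII-compatible text with fields separated by tabs. You can use PigStorage(',') to change the delimiter to something else, such as a comma, but that does not change the expected encoding.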

    You have two options:

    • Convert the data from EBCDIC to some sort of CSV format. You can do this with a single-threaded program if the amount of data is small, or with a MapReduce job if it is not.
    • Write a custom EBCDIC load function. You can see how to do that here.

    Maybe someone else has already implemented this, but a quick Google search didn't turn anything up.
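
    For the first option, a minimal single-threaded sketch in Java might look like the following. It is only an illustration under several assumptions that are not in the question: the file uses IBM code page 037, records are fixed-length with no newline separators, every field is plain text (packed-decimal/COMP-3 fields cannot be decoded this way), and no field contains a comma. The class name EbcdicToCsv, the method toCsvLines, and the field widths in main are hypothetical; take the real widths from your record layout.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class EbcdicToCsv {
    // Assumption: the data is encoded in IBM code page 037 (US/Canada EBCDIC).
    static final Charset EBCDIC = Charset.forName("IBM037");

    /** Decodes fixed-length EBCDIC records and joins each record's fields with commas. */
    public static List<String> toCsvLines(byte[] data, int[] fieldWidths) {
        int recordLength = 0;
        for (int w : fieldWidths) {
            recordLength += w;
        }
        List<String> lines = new ArrayList<>();
        // Walk the buffer one whole record at a time; a trailing partial record is ignored.
        for (int start = 0; start + recordLength <= data.length; start += recordLength) {
            StringBuilder line = new StringBuilder();
            int offset = start;
            for (int i = 0; i < fieldWidths.length; i++) {
                // Decode this field's bytes and strip the space padding.
                String field = new String(data, offset, fieldWidths[i], EBCDIC).trim();
                if (i > 0) {
                    line.append(',');
                }
                line.append(field);
                offset += fieldWidths[i];
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: EbcdicToCsv <ebcdic-input> <csv-output>");
            return;
        }
        // Hypothetical layout: three fields of 10, 10 and 5 bytes per record.
        int[] fieldWidths = {10, 10, 5};
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), toCsvLines(data, fieldWidths), StandardCharsets.UTF_8);
    }
}
```

    The output file can then be copied to HDFS and loaded with PigStorage(','). Because this reads the whole file into memory, it only suits small inputs; for larger ones the same decoding step would have to run inside a MapReduce job or a custom load function.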