
Write PySpark DF to File of Specialized Format


I'm working with PySpark 2.1 and I need to come up with a way to write my dataframe to a .txt file in a specialized format; not the typical JSON or CSV, but the CTF format (for CNTK).

The file cannot contain extra parentheses, commas, etc. It follows the form:

|label val |features val val val ... val
|label val |features val val val ... val

Some code to show this might be as follows:

from pyspark.sql import Row

l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1]))).toDF()
people.show(n=4)

def Convert_to_String(r):
    return '|label ' + r.name + ' ' + '|features ' + str(r.age) + '\n'

m_p = people.rdd.map(lambda r: Row(Convert_to_String(r))).toDF()
m_p.show(n=3)

In the above example, I would want to simply append each string from each row into a file without any extra characters.

The real dataframe is quite large. It is likely OK for it to be split into multiple files, but a single file would be preferable.

Any insight would be quite helpful.

THANKS!


Solution

  • Converting my comment to an answer.

    Instead of converting each record to a Row and calling toDF(), just map each record to a string. Then call saveAsTextFile().

    path = 'path/to/output/file'
    
    # map each record to its CTF string; flatMap would only be needed if
    # the function returned a list of strings per record (flatMap over a
    # plain string would split it into individual characters)
    m_p = people.rdd.map(lambda r: Convert_to_String(r).rstrip('\n'))

    # saveAsTextFile writes one line per record and adds the newline
    # itself, hence the rstrip above
    m_p.saveAsTextFile(path)
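
    As an aside, if you would rather stay in the DataFrame API, Spark 2.x can write a dataframe with a single string column directly via the text writer. A minimal sketch, assuming the toy people dataframe and path from above:

    from pyspark.sql.functions import concat, lit, col

    # collapse each row into one string column; write.text expects
    # exactly one string-typed column
    ctf = people.select(
        concat(lit('|label '), col('name'),
               lit(' |features '), col('age').cast('string')).alias('value'))
    ctf.write.text(path)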
    

    Your output will likely be split across multiple part files, but you can concatenate them from the command line. The command would look something like this:

    hadoop fs -cat path/to/output/file/* > combined.txt
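
    If a single output file is strongly preferred and the data is small enough to pass through one worker, you can also coalesce to a single partition before writing, so Spark emits just one part file. A sketch using the same m_p and path as above:

    # force one output partition; only safe when the whole dataset
    # fits comfortably on a single worker
    m_p.coalesce(1).saveAsTextFile(path)

    Alternatively, hadoop fs -getmerge path/to/output/file combined.txt performs the same merge as the cat pipeline above in one step.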