I'm working with PySpark 2.1 and I need to write my DataFrame to a .txt file in a specialized format; not the typical JSON or CSV, but the CTF format (the CNTK Text Format, used by CNTK).
The file cannot contain extra parentheses, commas, etc. It follows the form:
|label val |features val val val ... val
|label val |features val val val ... val
Some code to show this might be as follows:
from pyspark.sql import Row

# build a toy DataFrame
l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1]))).toDF()
people.show(n=4)

# format one record as a CTF line
def Convert_to_String(r):
    return '|label ' + r.name + ' ' + '|features ' + str(r.age) + '\n'

m_p = people.rdd.map(lambda r: Row(Convert_to_String(r))).toDF()
m_p.show(n=3)
In the above example, I would want to simply append each string from each row to a file, without any extra characters.
The real DataFrame is quite large; it is probably fine for it to be split across multiple files, but a single file would be preferable.
Any insight would be very helpful.
Thanks!
Converting my comment to an answer.
Instead of converting each record to a Row and calling toDF(), just map each record to a string, then call saveAsTextFile().
path = 'path/to/output/file'

# map each Row to its CTF string; note this is map, not flatMap --
# flatMap over a function that returns a string would explode it
# into individual characters
m_p = people.rdd.map(lambda r: Convert_to_String(r).rstrip('\n'))

# m_p is now an RDD of strings; saveAsTextFile writes one record
# per line and adds the newlines itself, hence the rstrip above
m_p.saveAsTextFile(path)
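Before writing, you can sanity-check a few records with take() (the expected output below is just what the toy data above would produce):

m_p.take(2)
# ['|label Ankit |features 25', '|label Jalfaizy |features 22']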
Your data will likely be stored in multiple part files, but you can concatenate them into one file from the command line. The command would look something like this:
hadoop fs -cat path/to/output/file/* > combined.txt
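Alternatively, hadoop fs -getmerge path/to/output/file combined.txt does the concatenation in one step. If the combined result is small enough to fit on a single worker, you can also have Spark produce one part file directly by coalescing to a single partition before saving; a minimal sketch, with the caveat that coalesce(1) funnels all the data through a single task:

people.rdd.map(lambda r: Convert_to_String(r).rstrip('\n')) \
    .coalesce(1) \
    .saveAsTextFile(path)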