python csv dataframe parquet

How to prevent tabular format when writing a Parquet file to a CSV file using pandas.DataFrame?


I read a Parquet file that is the output of Spark MLlib using pyarrow.parquet. The output consists of rows, each holding a pair: a word and a vector (each line is a word2vec pair), like the following:

 word1 "[-0.10812066  0.04352815 0.00529436 -0.0492562 -0.0974493533  0.275364409  -0.06501597  -0.3123745185 0.28186324 -0.05055101 0.06338456   -0.0842542  -0.10491376 -0.09692618 0.02451115  0.10766134]"  
 word2 "[-0.10812066  0.04352815 0.1875908 -0.0492562 ...
 ... 

When I used a DataFrame to write the results to a CSV file, I got this:

 word1 "[-0.10812066  0.04352815 0.00529436 -0.0492562
    -0.0974493533  0.275364409  -0.06501597  -0.3123745185
    0.28186324 -0.05055101 0.06338456   -0.0842542   
    -0.10491376 -0.09692618 0.02451115  0.10766134]"  
 word2 "[-0.10812066  0.04352815 0.1875908 -0.0492562 ...
 ... 

As you can see, each vector is broken across several lines at certain positions. How can I get CSV output that looks like what I read from the Parquet file? My source code is here:

import pandas as pd
import pyarrow.parquet as pq

data = pq.read_pandas('C://Users//...//p.parquet', columns=['word', 'vector']).to_pandas()

df = pd.DataFrame(data)

pd.DataFrame.to_csv(df, 'C://Users/...//p.csv', sep=" ", encoding='utf-8', columns=['word', 'vector'], index=False, header=False)

The DataFrame size is: 47524 and DataFrame shape is: (23762, 2)


Solution

  • After a lot of searching, I didn't find a direct solution to my problem, but I solved it using Python lists.

    import csv

    import pyarrow.parquet as pq

    # Read only the two columns of interest into a pandas DataFrame.
    df = pq.read_pandas('C://...//p.parquet', columns=['word', 'vector']).to_pandas()

    # Build one flat row per word: the word followed by its vector components.
    rows = []
    for word, vector in zip(df['word'].tolist(), df['vector'].tolist()):
        rows.append([word, *vector])

    # newline="" keeps the csv module from emitting blank lines on Windows.
    with open('C://...//f.csv', "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)
    

    With this, each word and its vector were written on a single line, as expected.
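
A possible pandas-only alternative, sketched on made-up data rather than the original file: the wrapping is probably caused by the vector column holding NumPy arrays, whose default string form breaks long arrays across lines (NumPy's print line width defaults to 75 characters). Assuming that is the cause, turning each vector into a single-line string before calling to_csv should avoid it. The two-row DataFrame, the 16-dimensional random vectors, and the in-memory buffer below are stand-ins for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the parquet data: two words with 16-dimensional vectors.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "word": ["word1", "word2"],
    "vector": [rng.normal(size=16), rng.normal(size=16)],
})

# str() of a long NumPy array wraps across several lines; this is presumably
# what ends up in the CSV cell when the array is written as-is.
wrapped = str(df["vector"].iloc[0])

# Join the components explicitly so each vector becomes a single-line string.
flat = df.assign(
    vector=df["vector"].map(lambda v: "[" + " ".join(str(x) for x in v) + "]")
)

# Write to an in-memory buffer here; pass a file path instead in real use.
buffer = io.StringIO()
flat.to_csv(buffer, sep=" ", index=False, header=False)
csv_text = buffer.getvalue()
```

Because the separator is a space and the vector string itself contains spaces, to_csv quotes the vector field, so each line has the form word "[...]" as in the desired output. Unlike the list-based approach, the numbers stay in one quoted field instead of being spread over separate columns.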