Search code examples
pythondaskquoting

Export dask to csv without csv quoting but quotes within columns


I am trying to prepare billions of records with Dask. To later import them into influx db, the files need to be in format of in line protocol as csv (or txt, dat, ...) of the exact following structure:

What I need: ['output-dask-to-csv-0.dat']

weather temp=-6.73,air=963.7,prec=0.0,datetime="2011-01-01 00:00:13" 1293840013000000000  
weather temp=-6.74,air=963.7,prec=0.0,datetime="2011-01-01 00:00:13" 1293840013000000000 
weather air=963.7,datetime="2011-01-01 00:00:22" 1293840022000000000 
weather prec=0.0,datetime="2011-01-01 00:00:32" 1293840032000000000

I need to get rid of the quotes that occur automatically at the beginning and the end when I export dask to csv. At the same time need to preserve the double quotes for datetime with 1 empty space between date and time; as well as a comma separating entries in column fields; and empty space between columns measurement and fields, as well as fields and timestamp.

Sample code:

import dask.dataframe as dd
import pandas as pd
import csv

measurement = ["weather", "weather", "weather", "weather"]
fields = ["temp=8.73,air=962.71,prec=4.0", "temp=4.12,air=963.2,prec=30.0", "air=964.21", "prec=0.0"]
datetime = ["2012-01-01 00:00:13", "2012-01-01 00:00:13", "2012-01-01 00:00:22", "2012-01-01 00:00:32"]
timestamp = [1293840013000000000,1293840013000000000,1293840022000000000 ,1293840032000000000]

d = pd.DataFrame(data={"measurement": measurement, "fields":fields, "datetime":datetime,"timestamp":timestamp})
df = dd.from_pandas(d, npartitions=1)

For instance, what hasn't worked out so far and outputs the quotes at the beginning and at the end, as well as double quotes around datetime:

df['influx_format'] = df['measurement'] + ' ' + df.fields + df.timestamp.astype(str)
df.influx_format.to_csv(filename='output-dask-to-csv-*.dat', sep=" ", escapechar='"', header=False, index=0, decimal='.')
"weather temp=-6.73,air=963.7,prec=0.0,datetime=""2011-01-01 00:00:13"" 1293840013000000000"
"weather temp=-6.74,air=963.7,prec=0.0,datetime=""2011-01-01 00:00:13"" 1293840013000000000"
"weather air=963.7,datetime=""2011-01-01 00:00:22"" 1293840022000000000"
"weather prec=0.0,datetime=""2011-01-01 00:00:32"" 1293840032000000000"

I also couldn't get it to work with quoting=csv.QUOTE_NONE:

df['influx_format'] = df['measurement'] + ' ' + df.fields + df.timestamp.astype(str)
df.influx_format.to_csv(filename='output-dask-to-csv-*.dat', quoting=csv.QUOTE_NONE, quotechar="", sep=" ", escapechar='"', header=False, index=0, decimal='.')
weather" temp=-6.73,air=963.7,prec=0.0,datetime=""2011-01-01" 00:00:13""" 1293840013000000000
weather" temp=-6.74,air=963.7,prec=0.0,datetime=""2011-01-01" 00:00:13""" 1293840013000000000
weather" air=963.7,datetime=""2011-01-01" 00:00:22""" 1293840022000000000
weather" prec=0.0,datetime=""2011-01-01" 00:00:32""" 1293840032000000000

Does anybody have an idea & could help me out?


Solution

  • Following code gives me this:

    weather temp=8.73,air=962.71,prec=4.0,datetime="2012-01-01 00:00:13" 1293840013000000000 weather temp=4.12,air=963.2,prec=30.0,datetime="2012-01-01 00:00:13" 1293840013000000000 weather air=964.21,datetime="2012-01-01 00:00:22" 1293840022000000000 weather prec=0.0,datetime="2012-01-01 00:00:32" 1293840032000000000

    df['influx_format'] = df['measurement'] + ' ' + df.fields + ',datetime=\"'+df['datetime'] + '\" ' + df.timestamp.astype(str)
    df.influx_format.to_csv(filename='output-dask-to-csv-*.dat', quoting=csv.QUOTE_NONE, quotechar="", sep=" ", escapechar=' ',header=False, index=0, decimal='.')
    

    Hope this is helpful