python, pandas, csv, dask, intake

Column Name Shift using read_csv in Dask


I'm attempting to use Intake to catalog a csv dataset. It uses the Dask implementation of read_csv which in turn uses the pandas implementation.

The issue I'm seeing is that the csv files I'm loading don't have an index column, so Dask interprets the first column as the index and shifts the column names one place to the right.

An example: [screenshot of the resulting dataframe with the shifted column names]

The datetime (dt) column is supposed to be the first column, but when the csv is read it is interpreted as the index, so the column names are shifted and therefore offset from their proper place. I'm supplying the column names list and dtype dictionary to the read_csv call.

As far as I can tell, if I were using pandas I would supply the index_col=False kwarg to fix this, as illustrated here, but Dask raises an intentional error stating: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead. This appears to be due to a parallelization limitation.
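For concreteness, here is a minimal sketch of the two calls being contrasted (file path, names, and dtypes taken from the catalog below; the try/except wrapping is only there to surface the Dask error message):

import pandas as pd
import dask.dataframe as dd

names = ['dt', 'city', 'state', 'country', 'shape', 'duration_s',
         'duration_hm', 'comments', 'date_posted', 'latitude']

# pandas accepts index_col=False, which tells it not to use the first
# column as the index
pdf = pd.read_csv('data/ufo_scrubbed.csv', header=0, names=names,
                  index_col=False, dtype=str)

# the equivalent Dask call is rejected up front
try:
    ddf = dd.read_csv('data/ufo_scrubbed.csv', header=0, names=names,
                      index_col=False, dtype=str)
except ValueError as err:
    print(err)  # Keywords 'index' and 'index_col' not supported. ...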

The suggested fix (using set_index('my-index')) isn't effective in this case because it expects the whole file to be read while already having correct column names with which to set the index. The main issue is that I can't accurately set the index column if the names are offset.

What is the best way, in Dask, to load a csv that doesn't explicitly have an index column such that the interpreted index column at least retains the specified column name?

More information:

The play dataset I'm using: https://www.kaggle.com/NUFORC/ufo-sightings?select=scrubbed.csv

The Intake catalog.yml file I'm using is the following:

name:
  intake-explore-catalog
metadata:
  version: 1
sources:
    ufo_sightings:
      description: data around ufo sightings
      driver: csv
      args:
        urlpath: "{{CATALOG_DIR}}/data/ufo_scrubbed.csv"
        csv_kwargs:
          header: 0
          names: ['dt', 'city', 'state', 'country', 'shape', 'duration_s', 'duration_hm', 'comments', 'date_posted', 'latitude']
          dtype: {'dt': 'str', 'city': 'str', 'state': 'str', 'country': 'str', 'shape': 'str', 'duration_s': 'str', 'duration_hm': 'str', 'comments': 'str', 'date_posted': 'str', 'latitude': 'str'}
          infer_datetime_format: true
      metadata:
        version: 1
        custom_field: blah

I'm loading the catalog and corresponding dataset using the following:

cat = intake.open_catalog("catalog.yml")
ufo_ds = cat.ufo_sightings.read()

This results in the dataframe shown above; a csv copy of that data looks like this:

,dt,city,state,country,shape,duration_s,duration_hm,comments,date_posted,latitude
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit,4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX.  Lights racing across the sky & making 90 degree turns on a dime.,12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 England,1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home,1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a "clean" aircraft (no ordinan,1/22/2004,21.4180556,-157.8036111

Compared to the original/raw data csv (no leading comma):

datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude 
10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,"This event took place in early fall around 1949-50. It occurred after a Boy Scout meeting in the Baptist Church. The Baptist Church sit",4/27/2004,29.8830556,-97.9411111
10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,"1949 Lackland AFB&#44 TX.  Lights racing across the sky & making 90 degree turns on a dime.",12/16/2005,29.38421,-98.581082
10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,"Green/Orange circular disc over Chester&#44 England",1/21/2008,53.2,-2.916667
10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,"My older brother and twin sister were leaving the only Edna theater at about 9 PM&#44...we had our bikes and I took a different route home",1/17/2004,28.9783333,-96.6458333
10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,"AS a Marine 1st Lt. flying an FJ4B fighter/attack aircraft on a solo night exercise&#44 I was at 50&#44000&#39 in a "clean" aircraft (no ordinan",1/22/2004,21.4180556,-157.8036111
10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,"My father is now 89 my brother 52 the girl with us now 51 myself 49 and the other fellow which worked with my father if he&#39s still livi",4/27/2007,36.5950000,-82.1888889

Dask Invocation:

import dask.dataframe

df = dask.dataframe.read_csv('data/ufo_scrubbed.csv',
                             names=['dt',
                                    'city',
                                    'state',
                                    'country',
                                    'shape',
                                    'duration_s',
                                    'duration_hm',
                                    'comments',
                                    'date_posted',
                                    'latitude'],
                             dtype={'dt': 'str',
                                    'city': 'str',
                                    'state': 'str',
                                    'country': 'str',
                                    'shape': 'str',
                                    'duration_s': 'str',
                                    'duration_hm': 'str',
                                    'comments': 'str',
                                    'date_posted': 'str',
                                    'latitude': 'str'}
                             )
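As a side note, the shift can be reproduced with pandas alone (which Dask's read_csv wraps per partition). A minimal sketch with toy data, relying on pandas' documented behaviour of promoting the extra leftmost column to the index when names lists fewer columns than the data contains:

import io
import pandas as pd

# three data columns but only two names: the leftmost column becomes the
# index and the supplied names land one column to the right of the data
csv = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
df = pd.read_csv(csv, header=0, names=['x', 'y'])
print(df)
#    x  y
# 1  2  3
# 4  5  6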

Solution

  • Unfortunately, the header line begins with a comma, which is why your column names are off by one. You would be best off fixing that, rather than working around it.

    However, you do not get an index automatically if you don't supply column names:

    df = dask.dataframe.read_csv('file.csv', header=0)
    

    Here the index is just a range (counting from 0 in each partition). You could then assign column names after the fact:

    df2 = df.rename(columns=dict(zip(df.columns, df.columns[1:]), latitude='longitude')) 
    
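    The dict(zip(df.columns, df.columns[1:])) mapping renames each column to the name currently held by the column on its right, i.e. it shifts every name one position to the left, and the latitude='longitude' keyword covers the final column, which zip leaves unpaired.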

    You would not be able to achieve this with an Intake prescription alone; you would have to grab the dataframe via to_dask() or read() (for Dask or pandas output, respectively).
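    As a rough sketch, that route might look like the following (assuming the catalog entry is pared back to header: 0, mirroring the read_csv call above, and reusing the rename from the previous snippet):

    import intake

    cat = intake.open_catalog("catalog.yml")

    # to_dask() returns the lazy Dask dataframe; read() would return pandas
    df = cat.ufo_sightings.to_dask()

    # shift the inferred names one column to the left and name the final column
    df2 = df.rename(columns=dict(zip(df.columns, df.columns[1:]),
                                 latitude='longitude'))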