
How do I read a tab-separated CSV in blaze?


I have a "CSV" data file with the following format (well, it's rather a TSV):

event  pdg x   y   z   t   px  py  pz  ekin
3383    11  -161.515    5.01938e-05 -0.000187112    0.195413    0.664065    0.126078    -0.736968   0.00723234  
1694    11  -161.515    -0.000355633    0.000263174 0.195413    0.511853    -0.523429   0.681196    0.00472714  
4228    11  -161.535    6.59631e-06 -3.32796e-05    0.194947    -0.713983   -0.0265468  -0.69966    0.0108681   
4233    11  -161.515    -0.000524488    6.5069e-05  0.195413    0.942642    0.331324    0.0406377   0.017594

This file is interpretable as-is in pandas:

from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False)     # Works
data = read_table("test.csv", index_col=False)             # Works

However, when I try to read it in blaze (which claims to accept pandas keyword arguments), an exception is thrown:

from blaze import Data 
Data("test.csv")                             # Attempt 1
Data("test.csv", sep="\t")                   # Attempt 2
Data("test.csv", sep="\t", index_col=False)  # Attempt 3

None of these work, and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff() from the standard library, which fails.

Is there a way to read this file properly in blaze? Its "little brother" is a few hundred MB, so I want to use blaze's sequential processing capabilities.

Thanks for any ideas.

Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327

Edit2: Complete error:

Error                                     Traceback (most recent call last)
 in ()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
     54     if isinstance(data, _strtypes):
     55         data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56                         **kwargs)
     57     if (isinstance(data, Iterator) and
     58             not isinstance(data, tuple(not_an_iterator))):

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
     62 
     63     def __call__(self, s, *args, **kwargs):
---> 64         return self.dispatch(s)(s, *args, **kwargs)
     65 
     66     @property

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
    276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
    277 def resource_csv(uri, **kwargs):
--> 278     return CSV(uri, **kwargs)
    279 
    280 

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
    102         if has_header is None:
    103             self.has_header = (not os.path.exists(path) or
--> 104                                infer_header(path, sniff_nbytes))
    105         else:
    106             self.has_header = has_header

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
     58     with open_file(path, 'rb') as f:
     59         raw = f.read(nbytes)
---> 60     return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
     61 
     62 

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
    392         # subtracting from the likelihood of the first row being a header.
    393 
--> 394         rdr = reader(StringIO(sample), self.sniff(sample))
    395 
    396         header = next(rdr) # assume first row is header

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
    187 
    188         if not delimiter:
--> 189             raise Error("Could not determine delimiter")
    190 
    191         class dialect(Dialect):

Error: Could not determine delimiter
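
The last two frames show that the failure comes from the standard library's csv.Sniffer rather than from pandas. As a minimal sketch of that step in isolation (the three-line sample string is a made-up miniature of the file above, not the real data), the sniffer can be restricted to tab as the only candidate delimiter:

```python
import csv
import io

# Hypothetical miniature of the tab-separated file (four columns only).
sample = (
    "event\tpdg\tx\ty\n"
    "3383\t11\t-161.515\t5.01938e-05\n"
    "1694\t11\t-161.515\t-0.000355633\n"
)

# Only tab is offered as a candidate, so the sniffer cannot pick another character.
dialect = csv.Sniffer().sniff(sample, delimiters="\t")

# Parse the sample with the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(repr(dialect.delimiter), rows[0])
```

The traceback also suggests that odo's CSV.__init__ only calls infer_header when has_header is None, so passing has_header explicitly might avoid the sniffer altogether. That is an untested reading of the frames above, not a confirmed fix.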

Solution

  • I am working with Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.

    conda install dask
    conda install blaze
    

    Here is a way to import the data for use with blaze: parse the data with pandas first, then convert it into a blaze object. This may defeat the purpose of sequential processing, but it avoids the error.

    As a side note, to parse the data file correctly, the pandas call should be:

    from blaze import Data
    from pandas import read_csv
    # r"\s+" splits on runs of whitespace; "\s*" can also match the empty string
    data = read_csv("csvdata.dat", sep=r"\s+", index_col=False)
    bdata = Data(data)
    

    The data now loads without errors. Displaying bdata gives:

       event  pdg        x         y         z         t        px        py  \
    0   3383   11 -161.515  0.000050 -0.000187  0.195413  0.664065  0.126078   
    1   1694   11 -161.515 -0.000356  0.000263  0.195413  0.511853 -0.523429   
    2   4228   11 -161.535  0.000007 -0.000033  0.194947 -0.713983 -0.026547   
    3   4233   11 -161.515 -0.000524  0.000065  0.195413  0.942642  0.331324   
    
             pz      ekin  
    0 -0.736968  0.007232  
    1  0.681196  0.004727  
    2 -0.699660  0.010868  
    3  0.040638  0.017594  
    

    An alternative is to use dask, which can probably do the same chunked or large-scale processing you are looking for. Dask loads the TSV format correctly without extra work.

    In [17]: import dask.dataframe as dd
    
    In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)
    
    In [19]: df.head()
    Out[19]: 
       event  pdg        x         y         z         t        px        py  \
    0   3383   11 -161.515  0.000050 -0.000187  0.195413  0.664065  0.126078   
    1   1694   11 -161.515 -0.000356  0.000263  0.195413  0.511853 -0.523429   
    2   4228   11 -161.535  0.000007 -0.000033  0.194947 -0.713983 -0.026547   
    3   4233   11 -161.515 -0.000524  0.000065  0.195413  0.942642  0.331324   
    4    854   11 -161.515  0.000032  0.000418  0.195414  0.675752  0.315671   
    
             pz      ekin  
    0 -0.736968  0.007232  
    1  0.681196  0.004727  
    2 -0.699660  0.010868  
    3  0.040638  0.017594  
    4 -0.666116  0.012641  
    
    

    See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask
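
If the pandas route is used on the larger file, reading it in chunks keeps memory use bounded. The following is a minimal sketch, assuming the tab-separated layout from the question. The file contents, the chunk size of 2 and the reduction (a mean of ekin) are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the TSV from the question (three columns only).
tsv_text = (
    "event\tpdg\tekin\n"
    "3383\t11\t0.00723234\n"
    "1694\t11\t0.00472714\n"
    "4228\t11\t0.0108681\n"
    "4233\t11\t0.017594\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.tsv")
    with open(path, "w") as f:
        f.write(tsv_text)

    total_ekin = 0.0
    n_rows = 0
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(path, sep="\t", chunksize=2):
        total_ekin += chunk["ekin"].sum()
        n_rows += len(chunk)

# Combine the per-chunk partial sums into a global mean.
mean_ekin = total_ekin / n_rows
print(n_rows, mean_ekin)
```

Only one chunk is held in memory at a time, so the same loop should scale to a file of a few hundred MB with a larger chunksize. Reductions that cannot be combined from partial results (a median, for example) need a different approach.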