I have a "CSV" data file with the following format (well, it's rather a TSV):
event pdg x y z t px py pz ekin 3383 11 -161.515 5.01938e-05 -0.000187112 0.195413 0.664065 0.126078 -0.736968 0.00723234 1694 11 -161.515 -0.000355633 0.000263174 0.195413 0.511853 -0.523429 0.681196 0.00472714 4228 11 -161.535 6.59631e-06 -3.32796e-05 0.194947 -0.713983 -0.0265468 -0.69966 0.0108681 4233 11 -161.515 -0.000524488 6.5069e-05 0.195413 0.942642 0.331324 0.0406377 0.017594
This file is interpretable as-is in pandas
:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False) # Works
data = read_table("test.csv", index_col=False) # Works
However, when I try to read it in blaze
(that declares to use pandas keyword arguments), an exception is thrown:
from blaze import Data
Data("test.csv") # Attempt 1
Data("test.csv", sep="\t") # Attempt 2
Data("test.csv", sep="\t", index_col=False) # Attempt 3
None of these works and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff()
from standard library (which fails).
Is there a way how to properly read this file in blaze (given that its "little brother" has few hundred MBs, I want to use blaze's sequential processing capabilities)?
Thanks for any ideas.
Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327
Edit2: Complete error:
Error Traceback (most recent call last) in () ----> 1 bz.Data("test.csv", sep="\t", index_col=False) /home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs) 54 if isinstance(data, _strtypes): 55 data = resource(data, schema=schema, dshape=dshape, columns=columns, ---> 56 **kwargs) 57 if (isinstance(data, Iterator) and 58 not isinstance(data, tuple(not_an_iterator))): /home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs) 62 63 def __call__(self, s, *args, **kwargs): ---> 64 return self.dispatch(s)(s, *args, **kwargs) 65 66 @property /home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs) 276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?') 277 def resource_csv(uri, **kwargs): --> 278 return CSV(uri, **kwargs) 279 280 /home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs) 102 if has_header is None: 103 self.has_header = (not os.path.exists(path) or --> 104 infer_header(path, sniff_nbytes)) 105 else: 106 self.has_header = has_header /home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs) 58 with open_file(path, 'rb') as f: 59 raw = f.read(nbytes) ---> 60 return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding)) 61 62 /home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample) 392 # subtracting from the likelihood of the first row being a header. 393 --> 394 rdr = reader(StringIO(sample), self.sniff(sample)) 395 396 header = next(rdr) # assume first row is header /home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters) 187 188 if not delimiter: --> 189 raise Error("Could not determine delimiter") 190 191 class dialect(Dialect): Error: Could not determine delimiter
I am working with Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.
conda install dask
conda install blaze
Here is a way you can import the data for use with blaze. Parse the data first with pandas and then convert it into blaze. Perhaps this defeats the purpose, but there are no troubles this way.
As a side note in order to parse the data file correctly your line in pandas parse statment should be:
from blaze import Data
from pandas import DataFrame, read_csv
data = read_csv("csvdata.dat", sep="\s*", index_col=False)
bdata = Data(data)
Now the data is formatted correctly with no errors, bdata
:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
Here is an alternative, use dask, it probably can do the same chunking, or large scale processing you are looking for. Dask certainly makes it immediately easy to correctly load a tsv format.
In [17]: import dask.dataframe as dd
In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)
In [19]: df.head()
Out[19]:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
4 854 11 -161.515 0.000032 0.000418 0.195414 0.675752 0.315671
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
3 0.040638 0.017594
4 -0.666116 0.012641
In [20]:
See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask