I need to transfer a very large dataset (between 1 and 10 million records, possibly many more) from a domain-specific language (whose sole output mechanism is a C-style fprintf statement) to Python.
Currently, I'm using the DSL's fprintf to write records to a flat file. The flat file looks like this:
x['a',1,2]=1.23456789012345e-01
x['a',1,3]=1.23456789012345e-01
x['a',1,4]=1.23456789012345e-01
y1=1.23456789012345e-01
y2=1.23456789012345e-01
z['a',1,2]=1.23456789012345e-01
z['a',1,3]=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01
As you can see, the structure of each record is very simple (though the representation of the double-precision float as a 20-character string is grossly inefficient!):
<variable-length string> + "=" + <double-precision float>
I'm currently using Python to read each line and split it on the "=".
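For reference, this is roughly what that loop looks like (the file name data.dat and the dict accumulator are just illustrative; my actual code differs only in detail):

values = {}
with open("data.dat") as f:
    for line in f:
        # Split on the first "=" only; the name itself never contains one.
        name, _, value = line.rstrip("\n").partition("=")
        values[name] = float(value)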
Is there anything I can do to make the representation more compact, so as to make it faster for Python to read? Is some sort of binary encoding possible with fprintf?
Err.... How many times per minute are you reading this data from Python?
Because on my system I can read such a file with 20 million records (~400 MB) in well under a second.
Unless you are running this on limited hardware, I'd say you are worrying about nothing.
>>> from timeit import timeit
>>> timeit("all(b.read(20) for x in xrange(0, 20000000, 20))", "b=open('data.dat')", number=1)
0.2856929302215576
>>> c = open("data.dat").read()
>>> len(c)
380000172
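If what worries you is the split-and-convert step rather than the raw read, a sketch along these lines (same hypothetical data.dat, and assuming the record format shown above) will time the complete parse on your own machine; the number will of course depend on your hardware:

>>> def parse(path):
...     d = {}
...     for line in open(path):
...         # Same logic as the question: split on the first "=", convert the rest.
...         name, _, value = line.rstrip("\n").partition("=")
...         d[name] = float(value)
...     return d
...
>>> timeit("parse('data.dat')", "from __main__ import parse", number=1)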