Tags: python, binary-files, printf, data-format, data-representation

Efficient data interchange format using only C-style fprintf() statements?


I need to transfer a very large dataset (between 1 and 10 million records, possibly many more) from a domain-specific language (whose sole output mechanism is a C-style fprintf statement) to Python.

Currently, I'm using the DSL's fprintf to write records to a flat file. The flat file looks like this:

x['a',1,2]=1.23456789012345e-01
x['a',1,3]=1.23456789012345e-01
x['a',1,4]=1.23456789012345e-01
y1=1.23456789012345e-01
y2=1.23456789012345e-01
z['a',1,2]=1.23456789012345e-01
z['a',1,3]=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01

As you can see, the structure of each record is very simple (though representing a double-precision float as a 20-character string is grossly inefficient!):

<variable-length string> + "=" + <double-precision float>

I'm currently using Python to read each line and split it on the "=".
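Roughly, my current parsing loop looks like this (a sketch; the sample string stands in for the flat file shown above):

```python
# Sketch of the current approach: split each record on the first "=".
# The sample below mirrors the flat-file format from the question.
sample = """x['a',1,2]=1.23456789012345e-01
y1=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01"""

records = {}
for line in sample.splitlines():
    # partition splits on the first "=", so keys may not contain "="
    key, _, value = line.partition("=")
    records[key] = float(value)
```

In the real script the loop iterates over the open file object instead of `sample.splitlines()`.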

Is there anything I can do to make the representation more compact, so as to make it faster for Python to read? Is some sort of binary encoding possible with fprintf?


Solution

  • Err.... How many times per minute are you reading this data from Python?

    Because on my system I could read such a file with 20 million records (~400MB) in well under a second.

    Unless you are running this on very limited hardware, I'd say you are worrying too much about nothing.

    >>> from timeit import timeit  # Python 2 session (xrange); use range on Python 3
    >>> timeit("all(b.read(20) for x in xrange(0, 20000000, 20))", "b=open('data.dat')", number=1)
    0.2856929302215576
    >>> c = open("data.dat").read()
    >>> len(c)
    380000172
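
    If the parsing (rather than the raw I/O measured above) ever becomes the bottleneck, the whole file can be parsed in a single call. A sketch using NumPy's `fromregex` (NumPy is my addition, not part of the question's setup; `StringIO` stands in for the real file):

    ```python
    import io
    import numpy as np

    # Parse every "key=value" record in one pass with np.fromregex.
    # The StringIO object stands in for the real flat file on disk.
    sample = io.StringIO(
        "x['a',1,2]=1.23456789012345e-01\n"
        "y1=1.23456789012345e-01\n"
    )

    # Structured dtype: key as a (max 32-char) string, value as float64.
    dt = np.dtype([("key", "U32"), ("value", "f8")])

    # "." does not match newlines, so each line yields one (key, value) match.
    data = np.fromregex(sample, r"(.+)=(.+)", dt)
    ```

    With a real file you would pass the filename instead of the `StringIO` object; the `U32` key width is an assumption about the longest key in your data.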