python numpy parsing multiline multilinestring

How do I efficiently read text files with complex multi-line data in Python?


While searching for a suitable solution, I came across approaches such as numpy.loadtxt and numpy.genfromtxt that seemed promising at first glance but did not work out of the box.

The Challenges

Each record of my dataset spans multiple lines. Here are the first two records:

[array([[ 1,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0],
       [ 0,  0,  1,  0,  0],
       [ 1,  0, -1,  0,  0],
       [ 0,  0,  0,  1,  0],
       [ 0,  0,  0,  0,  1],
       [-2, -1,  0, -1, -1],
       [-2, -2,  0, -1,  0]]), 24, 4, 0, 232, 988, 1464, 10, 8, 246, 12]
[array([[ 1,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0],
       [ 0,  0,  1,  0,  0],
       [ 0,  0,  0,  1,  0],
       [-1, -1,  0,  1,  0],
       [ 0,  0,  0,  0,  1],
       [ 0, -2, -2, -2, -1]]), 28, 4, 0, 244, 1036, 1536, 10, 8, 260, 13]
...

A CAS (computer algebra system) generates this output, and my influence on the output format is limited.

The file is large: as more data is collected, it can grow to 3 GB.

What I tried so far

Reading the data with np.genfromtxt('polytopes_5d_reflexive.txt') does not work out of the box. Classic line-by-line reading is not straightforward either, because the number of lines describing one record varies.

The Background

The saying goes "mathematical data is inexpensive". In our case, we generate such data ourselves, namely five-dimensional polytopes, in order to examine properties such as their number of vertices, volume and others.

I am not asking for a ready-made solution, but I would be very grateful for a pointer in the right direction.


Solution

  • A more robust approach would be to use a proper serialization format (such as JSON), but if the output format cannot be changed, one possible solution is to parse the file with re and ast.literal_eval:

    import re
    from ast import literal_eval
    
    import numpy as np
    
    txt = """\
    [array([[ 1,  0,  0,  0,  0],
           [ 0,  1,  0,  0,  0],
           [ 0,  0,  1,  0,  0],
           [ 1,  0, -1,  0,  0],
           [ 0,  0,  0,  1,  0],
           [ 0,  0,  0,  0,  1],
           [-2, -1,  0, -1, -1],
           [-2, -2,  0, -1,  0]]), 24, 4, 0, 232, 988, 1464, 10, 8, 246, 12]
    [array([[ 1,  0,  0,  0,  0],
           [-1,  0,  0,  0,  0],
           [ 0,  1,  0,  0,  0],
           [ 0,  0,  1,  0,  0],
           [ 0,  0,  0,  1,  0],
           [-1, -1,  0,  1,  0],
           [ 0,  0,  0,  0,  1],
           [ 0, -2, -2, -2, -1]]), 28, 4, 0, 244, 1036, 1536, 10, 8, 260, 13]"""
    
    
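    # A record starts with "[" at the beginning of a line and ends with "]" at
    # the end of a line; inner matrix rows end with "]," and so do not match.
    for record in re.findall(r"^\[.*?\]$", txt, flags=re.S | re.M):
        # Drop the "array" wrapper so the record becomes a plain Python literal.
        arr, *rest = literal_eval(record.replace("array", ""))
        print(np.array(arr))
        print(rest)
        print()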
    

    Prints:

    [[ 1  0  0  0  0]
     [ 0  1  0  0  0]
     [ 0  0  1  0  0]
     [ 1  0 -1  0  0]
     [ 0  0  0  1  0]
     [ 0  0  0  0  1]
     [-2 -1  0 -1 -1]
     [-2 -2  0 -1  0]]
    [24, 4, 0, 232, 988, 1464, 10, 8, 246, 12]
    
    [[ 1  0  0  0  0]
     [-1  0  0  0  0]
     [ 0  1  0  0  0]
     [ 0  0  1  0  0]
     [ 0  0  0  1  0]
     [-1 -1  0  1  0]
     [ 0  0  0  0  1]
     [ 0 -2 -2 -2 -1]]
    [28, 4, 0, 244, 1036, 1536, 10, 8, 260, 13]
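
The snippet above holds the whole text in a single string, which may be uncomfortable for a file of up to 3 GB. A minimal sketch of a streaming variant follows. It assumes, based only on the two sample records, that the last line of every record ends with a bare "]" while inner matrix rows end with "],". The helper name iter_records is hypothetical and not part of any library.

```python
import io
from ast import literal_eval

import numpy as np


def iter_records(lines):
    """Yield (matrix, stats) pairs from an iterable of text lines.

    Hypothetical helper: assumes only the final line of a record ends
    with a bare "]" (inner matrix rows end with "],").
    """
    buffer = []
    for line in lines:
        stripped = line.rstrip()
        if not stripped:
            continue  # ignore blank lines between records
        buffer.append(stripped)
        if stripped.endswith("]"):
            # The record is complete: parse it exactly as in the answer above.
            arr, *rest = literal_eval("".join(buffer).replace("array", ""))
            buffer.clear()
            yield np.array(arr), rest


# The two sample records from the question, wrapped in a file-like object.
sample = io.StringIO("""\
[array([[ 1,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0],
       [ 0,  0,  1,  0,  0],
       [ 1,  0, -1,  0,  0],
       [ 0,  0,  0,  1,  0],
       [ 0,  0,  0,  0,  1],
       [-2, -1,  0, -1, -1],
       [-2, -2,  0, -1,  0]]), 24, 4, 0, 232, 988, 1464, 10, 8, 246, 12]
[array([[ 1,  0,  0,  0,  0],
       [-1,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0],
       [ 0,  0,  1,  0,  0],
       [ 0,  0,  0,  1,  0],
       [-1, -1,  0,  1,  0],
       [ 0,  0,  0,  0,  1],
       [ 0, -2, -2, -2, -1]]), 28, 4, 0, 244, 1036, 1536, 10, 8, 260, 13]
""")

records = list(iter_records(sample))
for matrix, stats in records:
    print(matrix.shape, stats)

# With the real file, iterate lazily instead of building a list:
# with open("polytopes_5d_reflexive.txt") as fh:
#     for matrix, stats in iter_records(fh):
#         ...
```

Because the generator keeps only the lines of the current record in memory, peak memory use should stay small regardless of file size, provided the assumption about record endings holds for the whole file.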