
Python Import Text Array with Numpy


I have a text file that looks like this:

...
5   [0, 1]  [512, 479]  991
10  [1, 0]  [706, 280]  986
15  [1, 0]  [807, 175]  982
20  [1, 0]  [895, 92]   987
...

Each column is tab separated, but there are arrays in some of the columns. Can I import these with np.genfromtxt in some way?

The resulting unpacked lists should be, for example:

data1 = [..., 5, 10, 15, 20, ...]
data2 = [..., [512, 479], [706, 280], ... ] (i.e. a 2D list)
etc.

I tried

data1, data2, data3, data4 = np.genfromtxt('data.txt', dtype=None, delimiter='\t', unpack=True)

but data2 and data3 are lists containing 'nan'.


Solution

  • Brackets in a csv file are clunky no matter how you look at it. The default csv structure is 2d: rows and uniform columns. The brackets add a level of nesting. But the fact that the columns are tab separated while the nested blocks are comma separated makes it a bit easier.

    Your comment code is (with added newlines)

    datastr = data[i][1][1:-1].split(',') 
    dataarray = [] 
    for j in range(0, len(datastr)): 
         dataarray.append(int(datastr[j])) 
    data2.append(dataarray)
    

    I assume data[i] looks something like (after a tab split):

    ['5', '[0, 1]', '[512, 479]',  '991']
    

    So for the '[0, 1]' you strip off the [], split the rest on the commas, convert the pieces to int, and append that list to data2.

    That certainly looks like a viable approach. genfromtxt does not handle brackets or quotes. The csv reader can handle quoted text, and might be adapted to treat [] as quotes. But other than that I think the '[]' have to be handled with some sort of string processing, as you do.

    Keep in mind that genfromtxt just reads lines, parses them, and collects the resulting lists in a master list. It then converts that list to an array at the end. So doing your own line by line, string by string parsing is not inferior.
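
    As a rough illustration of that line-by-line approach, here is a minimal sketch using only the standard library. The helper name parse_line and the two-line sample are assumptions made for this example; they are not from the question's code.

```python
def parse_line(line):
    """Split one tab-separated line; bracketed fields become lists of int."""
    fields = []
    for field in line.strip().split('\t'):
        field = field.strip()
        if field.startswith('[') and field.endswith(']'):
            # Drop the brackets, split on commas, convert each piece.
            fields.append([int(x) for x in field[1:-1].split(',')])
        else:
            fields.append(int(field))
    return fields


# Hypothetical sample standing in for lines read from data.txt
sample = [
    "5\t[0, 1]\t[512, 479]\t991",
    "10\t[1, 0]\t[706, 280]\t986",
]

rows = [parse_line(line) for line in sample]
# Transpose the rows into one list per column.
data1, data2, data3, data4 = (list(col) for col in zip(*rows))
```

    With a real file, the sample list could be replaced by iterating over the open file object.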

    =============

    With your sample as a byte string standing in for the text file:

     In [173]: txt=b"""
     ...: 5  \t [0, 1] \t [512, 479] \t 991
     ...: 10 \t [1, 0] \t [706, 280] \t 986
     ...: 15 \t [1, 0] \t [807, 175] \t 982
     ...: 20 \t [1, 0] \t [895, 92]  \t 987"""
    

    A simple genfromtxt call with dtype=None:

    In [186]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter='\t', autostrip=True)
    

    The result is a structured array with integer and string fields:

    In [187]: data
    Out[187]: 
    array([(5, b'[0, 1]', b'[512, 479]', 991),
           (10, b'[1, 0]', b'[706, 280]', 986),
           (15, b'[1, 0]', b'[807, 175]', 982),
           (20, b'[1, 0]', b'[895, 92]', 987)], 
          dtype=[('f0', '<i4'), ('f1', 'S6'), ('f2', 'S10'), ('f3', '<i4')])
    

    Fields are accessed by name

    In [188]: data['f0']
    Out[188]: array([ 5, 10, 15, 20])
    In [189]: data['f1']
    Out[189]: 
    array([b'[0, 1]', b'[1, 0]', b'[1, 0]', b'[1, 0]'], 
          dtype='|S6')
    

    If we can deal with the [], your data could be nicely represented as a structured array with a compound dtype:

    In [191]: dt=np.dtype('i,2i,2i,i')
    In [192]: np.ones((3,),dtype=dt)
    Out[192]: 
    array([(1, [1, 1], [1, 1], 1), (1, [1, 1], [1, 1], 1),
           (1, [1, 1], [1, 1], 1)], 
          dtype=[('f0', '<i4'), ('f1', '<i4', (2,)), ('f2', '<i4', (2,)), ('f3', '<i4')])
    

    where the 'f1' field is a (3,2) array.
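
    The same layout can also be spelled out with explicit field names, which may read better than f0..f3. This is only a sketch: the names step, flags, pos and value are made up for illustration, since the question never says what the columns mean.

```python
import numpy as np

# Field names are illustrative assumptions; the source never names the columns.
dt_named = np.dtype([
    ('step', 'i4'),
    ('flags', 'i4', (2,)),   # sub-array field: two ints per record
    ('pos', 'i4', (2,)),
    ('value', 'i4'),
])

arr = np.zeros(3, dtype=dt_named)
arr['pos'][0] = [512, 479]          # fill the sub-array of the first record

flags_shape = arr['flags'].shape    # one row per record, two values each
pos_first = arr['pos'][0].tolist()
```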

    One approach is to pass the text/file through a function that filters out the extra characters. genfromtxt works with anything that will feed it a line at a time.

    def afilter(txt):
        for line in txt.splitlines():
            line=line.replace(b'[', b' ').replace(b']', b'').replace(b',' ,b'\t')
            yield line
    

    This generator strips out the [] and replaces each , with a tab, in effect producing a flat, tab-separated file:

    In [205]: list(afilter(txt))
    Out[205]: 
    [b'',
     b'5  \t  0\t 1  \t  512\t 479  \t 991',
     b'10 \t  1\t 0  \t  706\t 280  \t 986',
     b'15 \t  1\t 0  \t  807\t 175  \t 982',
     b'20 \t  1\t 0  \t  895\t 92   \t 987']
    

    genfromtxt with dtype=None then produces a plain 2d array with 6 columns:

    In [209]: data=np.genfromtxt(afilter(txt),delimiter='\t',dtype=None)
    In [210]: data
    Out[210]: 
    array([[  5,   0,   1, 512, 479, 991],
           [ 10,   1,   0, 706, 280, 986],
           [ 15,   1,   0, 807, 175, 982],
           [ 20,   1,   0, 895,  92, 987]])
    In [211]: data.shape
    Out[211]: (4, 6)
    

    But if I give it the dt dtype I defined above, I get a structured array:

    In [206]: data=np.genfromtxt(afilter(txt),delimiter='\t',dtype=dt)
    In [207]: data
    Out[207]: 
    array([(5, [0, 1], [512, 479], 991), (10, [1, 0], [706, 280], 986),
           (15, [1, 0], [807, 175], 982), (20, [1, 0], [895, 92], 987)], 
          dtype=[('f0', '<i4'), ('f1', '<i4', (2,)), ('f2', '<i4', (2,)), ('f3', '<i4')])
    In [208]: data['f1']
    Out[208]: 
    array([[0, 1],
           [1, 0],
           [1, 0],
           [1, 0]], dtype=int32)
    

    The brackets could be dealt with at several levels. I don't think there's a lot of advantage of one over the other.
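
    To get from there to the four unpacked variables the question asks for, something like the following sketch might work. It assumes a recent NumPy that accepts str lines, and the helper name strip_brackets and the inline sample are assumptions for this example.

```python
import numpy as np

# Inline sample standing in for the lines of data.txt
lines = [
    "5\t[0, 1]\t[512, 479]\t991",
    "10\t[1, 0]\t[706, 280]\t986",
    "15\t[1, 0]\t[807, 175]\t982",
    "20\t[1, 0]\t[895, 92]\t987",
]


def strip_brackets(source):
    """Yield each line with [] removed and commas turned into tabs."""
    for line in source:
        yield line.replace('[', '').replace(']', '').replace(',', '\t')


# int, 2 ints, 2 ints, int: six flat columns grouped into four fields
dt = np.dtype('i,2i,2i,i')
data = np.genfromtxt(strip_brackets(lines), delimiter='\t',
                     dtype=dt, autostrip=True)

# One array per field, in column order
data1, data2, data3, data4 = (data[name] for name in data.dtype.names)
```

    Because the generator takes any iterable of lines, an open text file could be passed in place of the list.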