I have a text file that looks like this:
5 [0, 1] [512, 479] 991
10 [1, 0] [706, 280] 986
15 [1, 0] [807, 175] 982
20 [1, 0] [895, 92] 987
Each column is tab separated, but there are arrays in some of the columns. Can I import these with np.genfromtxt in some way?
The resulting unpacked lists should be, for example:
data1 = [..., 5, 10, 15, 20, ...]
data2 = [..., [512, 479], [706, 280], ... ] (i.e. a 2D list)
I tried
data1, data2, data3, data4 = np.genfromtxt('data.txt', dtype=None, delimiter='\t', unpack=True)
but data2 and data3 are lists containing 'nan'.
Brackets in a csv file are clunky no matter how you look at it. The default csv structure is 2d: rows and uniform columns. The brackets add a level of nesting. But the fact that the columns are tab separated, while the nested blocks are comma separated, makes it a bit easier.
Your comment code is (with added newlines)
datastr = data[i][1][1:-1].split(',')
dataarray = []
for j in range(0, len(datastr)):
I assume data[i] looks something like (after a tab split):
['5', '[0, 1]', '[512, 479]', '991']
So for the '[0, 1]' you strip off the [], split the rest, and put that list back on to data2.
That certainly looks like a viable approach. genfromtxt does not handle brackets or quotes. The csv reader can handle quoted text, and might be adapted to treat [] as quotes. But other than that I think the '[]' have to be handled with some sort of string processing, as you do.
Keep in mind that genfromtxt just reads lines, parses them, and collects the resulting lists in a master list. It then converts that list to an array at the end. So doing your own line by line, string by string parsing is not inferior.
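As a rough sketch of that kind of do-it-yourself parsing, using only the standard library: the helper name parse_line and the two-row sample string are made up for illustration, and ast.literal_eval is one possible way to turn '[512, 479]' into a list, not necessarily what your own code does.

```python
import ast

def parse_line(line):
    """Split one tab-separated line and evaluate each field as a Python literal."""
    fields = [f.strip() for f in line.split('\t')]
    # '5' becomes the int 5, '[512, 479]' becomes the list [512, 479]
    return [ast.literal_eval(f) for f in fields]

# Hypothetical two-row sample in the same layout as the question's file.
sample = "5\t[0, 1]\t[512, 479]\t991\n10\t[1, 0]\t[706, 280]\t986\n"

rows = [parse_line(line) for line in sample.splitlines() if line.strip()]

# Transpose the rows into one list per column.
data1, data2, data3, data4 = (list(col) for col in zip(*rows))

print(data1)  # [5, 10]
print(data3)  # [[512, 479], [706, 280]]
```

For a real file, the same list comprehension could iterate over an open file object instead of sample.splitlines().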
With your sample as a text file:
In [173]: txt=b"""
...: 5 \t [0, 1] \t [512, 479] \t 991
...: 10 \t [1, 0] \t [706, 280] \t 986
...: 15 \t [1, 0] \t [807, 175] \t 982
...: 20 \t [1, 0] \t [895, 92] \t 987"""
A simple genfromtxt call with dtype=None:
In [186]: data = np.genfromtxt(txt.splitlines(), dtype=None, delimiter='\t', autostrip=True)
The result is a structured array with integer and string fields:
In [187]: data
array([(5, b'[0, 1]', b'[512, 479]', 991),
(10, b'[1, 0]', b'[706, 280]', 986),
(15, b'[1, 0]', b'[807, 175]', 982),
(20, b'[1, 0]', b'[895, 92]', 987)],
dtype=[('f0', '<i4'), ('f1', 'S6'), ('f2', 'S10'), ('f3', '<i4')])
Fields are accessed by name
In [188]: data['f0']
Out[188]: array([ 5, 10, 15, 20])
In [189]: data['f1']
array([b'[0, 1]', b'[1, 0]', b'[1, 0]', b'[1, 0]'],
dtype='|S6')
If we can deal with the [], your data could be nicely represented as a structured array with a compound dtype:
In [191]: dt=np.dtype('i,2i,2i,i')
In [192]: np.ones((3,),dtype=dt)
array([(1, [1, 1], [1, 1], 1), (1, [1, 1], [1, 1], 1),
(1, [1, 1], [1, 1], 1)],
dtype=[('f0', '<i4'), ('f1', '<i4', (2,)), ('f2', '<i4', (2,)), ('f3', '<i4')])
where the 'f1' field is a (3,2) array.
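To illustrate how parsed values would fit that dtype, here is a minimal sketch that builds such an array directly from Python tuples. The rows list is hypothetical and stands in for whatever the bracket handling produces; only the dtype string comes from the session above.

```python
import numpy as np

# Compound dtype from the text above: int, 2 ints, 2 ints, int.
dt = np.dtype('i,2i,2i,i')

# Hypothetical rows, as they might look after the brackets have been parsed.
# Each record must be a tuple; the sub-array fields take 2-element sequences.
rows = [
    (5, [0, 1], [512, 479], 991),
    (10, [1, 0], [706, 280], 986),
]

arr = np.array(rows, dtype=dt)
print(arr['f0'])        # the first column as a 1-D array
print(arr['f2'].shape)  # (2, 2)
```

Each sub-array field comes back as an ordinary 2d array with one row per record.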
One approach is to pass the text/file through a function that filters out the extra characters. genfromtxt works with anything that will feed it a line at a time.
def afilter(txt):
    for line in txt.splitlines():
        line = line.replace(b'[', b' ').replace(b']', b'').replace(b',', b'\t')
        yield line
This generator strips out the [] and replaces each , with a tab, in effect producing a flat csv file:
In [205]: list(afilter(txt))
Out[205]:
[b'',
 b'5 \t  0\t 1 \t  512\t 479 \t 991',
 b'10 \t  1\t 0 \t  706\t 280 \t 986',
 b'15 \t  1\t 0 \t  807\t 175 \t 982',
 b'20 \t  1\t 0 \t  895\t 92 \t 987']
genfromtxt with dtype=None will produce an array with 6 columns.
In [209]: data=np.genfromtxt(afilter(txt),delimiter='\t',dtype=None)
In [210]: data
array([[ 5, 0, 1, 512, 479, 991],
[ 10, 1, 0, 706, 280, 986],
[ 15, 1, 0, 807, 175, 982],
[ 20, 1, 0, 895, 92, 987]])
In [211]: data.shape
Out[211]: (4, 6)
But if I give it the dt dtype I defined above, I get a structured array:
In [206]: data=np.genfromtxt(afilter(txt),delimiter='\t',dtype=dt)
In [207]: data
array([(5, [0, 1], [512, 479], 991), (10, [1, 0], [706, 280], 986),
(15, [1, 0], [807, 175], 982), (20, [1, 0], [895, 92], 987)],
dtype=[('f0', '<i4'), ('f1', '<i4', (2,)), ('f2', '<i4', (2,)), ('f3', '<i4')])
In [208]: data['f1']
array([[0, 1],
[1, 0],
[1, 0],
[1, 0]], dtype=int32)
The brackets could be dealt with at several levels. I don't think there's a lot of advantage of one over the other.
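For instance, if the file is opened in text mode rather than held as a bytes string, the same filtering idea might be sketched with str.translate. The name strip_brackets is hypothetical, and io.StringIO stands in for an open file here.

```python
import io

# Delete '[' and ']' and turn ',' into a tab, mirroring afilter for str input.
_TABLE = str.maketrans({'[': None, ']': None, ',': '\t'})

def strip_brackets(lines):
    """Yield each line with the brackets removed and commas replaced by tabs."""
    for line in lines:
        yield line.translate(_TABLE)

# Stand-in for an open text file with two rows in the question's layout.
fake_file = io.StringIO("5\t[0, 1]\t[512, 479]\t991\n10\t[1, 0]\t[706, 280]\t986\n")

flat = [line.split('\t') for line in strip_brackets(fake_file)]
print([int(x) for x in flat[0]])  # [5, 0, 1, 512, 479, 991]
```

In principle the generator can be handed to genfromtxt in the same way as afilter(txt), though I have only checked the flattened lines themselves here.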