Numpy's genfromtxt returns different structured data depending on dtype parameters

I have the following:

from numpy import genfromtxt
seg_data1 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype="|S5")
seg_data2 = genfromtxt('./datasets/segmentation.all', delimiter=',', dtype=["|S5"] + ["float" for n in range(19)])

print seg_data1
print seg_data2

print seg_data1[:,0:1]
print seg_data2[:,0:1]

It turns out that seg_data1 and seg_data2 are not the same kind of structure. Here's what was printed:

[['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']
 ['BRICK' '188.0' '133.0' ..., '8.444' '0.538' '-0.92']
 ['BRICK' '105.0' '139.0' ..., '7.555' '0.532' '-0.96']
 ..., 
 ['CEMEN' '128.0' '161.0' ..., '10.88' '0.540' '-1.99']
 ['CEMEN' '150.0' '158.0' ..., '12.22' '0.503' '-1.94']
 ['CEMEN' '124.0' '162.0' ..., '14.55' '0.479' '-2.02']]
[ ('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)
 ('BRICK', 188.0, 133.0, 9.0, 0.0, 0.0, 0.33333334, 0.26666674, 0.5, 0.077777736, 6.6666665, 8.333334, 7.7777777, 3.8888888, 5.0, 3.3333333, -8.333333, 8.444445, 0.53858024, -0.92481726)
 ('BRICK', 105.0, 139.0, 9.0, 0.0, 0.0, 0.27777782, 0.107407436, 0.83333325, 0.52222216, 6.111111, 7.5555553, 7.2222223, 3.5555556, 4.3333335, 3.3333333, -7.6666665, 7.5555553, 0.5326279, -0.96594584)
 ...,
 ('CEMEN', 128.0, 161.0, 9.0, 0.0, 0.0, 0.55555534, 0.25185192, 0.77777785, 0.16296278, 7.148148, 5.5555553, 10.888889, 5.0, -4.7777777, 11.222222, -6.4444447, 10.888889, 0.5409177, -1.9963073)
 ('CEMEN', 150.0, 158.0, 9.0, 0.0, 0.0, 2.166667, 1.6333338, 1.388889, 0.41851807, 8.444445, 7.0, 12.222222, 6.111111, -4.3333335, 11.333333, -7.0, 12.222222, 0.50308645, -1.9434487)
 ('CEMEN', 124.0, 162.0, 9.0, 0.11111111, 0.0, 1.3888888, 1.1296295, 2.0, 0.8888891, 10.037037, 8.0, 14.555555, 7.5555553, -6.111111, 13.555555, -7.4444447, 14.555555, 0.4799313, -2.0293121)]
[['BRICK']
 ['BRICK']
 ['BRICK']
 ..., 
 ['CEMEN']
 ['CEMEN']
 ['CEMEN']]
Traceback (most recent call last):
  File "segmentationdata.py", line 14, in <module>
    print seg_data2[:,0:1]
IndexError: too many indices for array

I'd rather have genfromtxt return data in the form of seg_data1, though I don't know of any built-in way to force seg_data2 to conform to that type. As far as I know there's no easy way to do:

seg_target1 = seg_data1[:,0:1]
seg_data1 = seg_data1[:,1:]

for seg_data2. Now, I could do data.astype(float), but the point is: isn't that what genfromtxt should have done to begin with when I gave it that dtype list?

Solution

With dtype="|S5" you import every column as a 5-character string. The result is a 2D array with rows like

['BRICK' '140.0' '125.0' ..., '7.777' '0.545' '-1.12']

With dtype=["|S5"] + ["float" for n in range(19)] you specify a dtype for each column, and the result is a structured array. It is 1D with 20 fields, which is why seg_data2[:,0:1] fails with "too many indices". You access the fields by name (look at seg_data2.dtype), not by column number.

An element, or record, of this array is displayed as a tuple and holds one string and 19 floats:

('BRICK', 140.0, 125.0, 9.0, 0.0, 0.0, 0.2777779, 0.06296301, 0.66666675, 0.31111118, 6.185185, 7.3333335, 7.6666665, 3.5555556, 3.4444444, 4.4444447, -7.888889, 7.7777777, 0.5456349, -1.1218182)

# the initial character column

print(seg_data2['f0'])

Specifying dtype=None should produce the same thing, possibly with some integer columns instead of all floats.
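
For the split the question asks about (labels in one array, numbers in a plain 2D array), one option is to pull the fields out by name and stack the numeric ones. This is only a minimal sketch: the two-row `lines` sample is made up to stand in for segmentation.all, and 'U5' is used instead of '|S5' on the assumption of Python 3 with a reasonably recent NumPy.

```python
import numpy as np

# Made-up sample standing in for the real file (1 label + 2 numeric columns)
lines = ["BRICK,140.0,125.0", "CEMEN,128.0,161.0"]

# One dtype per column gives a 1D structured array with fields f0, f1, f2
data = np.genfromtxt(lines, delimiter=",", dtype="U5,f8,f8")

# Label column, selected by field name rather than by column index
labels = data["f0"]

# Stack the remaining fields side by side into an ordinary 2D float array
numeric = np.column_stack([data[name] for name in data.dtype.names[1:]])

print(data.dtype.names)
print(labels)
print(numeric.shape)
```

After this, `labels` and `numeric` play the roles of seg_target1 and the sliced seg_data1 from the question, with the numbers already stored as floats.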

It is also possible to specify a dtype with 2 fields: one for the string column and one holding all 19 floats. I'd have to check the docs and run a few test cases to be sure of the format.

I think you read enough of genfromtxt docs to see that you could specify a compound dtype, but not enough to understand the results.

=================

Example of importing csv with text and numbers:

In [139]: txt=b"""one 1 2 3
     ...: two 4 5 6
     ...: """

default: all floats

In [140]: np.genfromtxt(txt.splitlines())
Out[140]: 
array([[ nan,   1.,   2.,   3.],
       [ nan,   4.,   5.,   6.]])

automatic dtype selection - 4 fields

In [141]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[141]: 
array([(b'one', 1, 2, 3), (b'two', 4, 5, 6)], 
      dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])

user specified field dtypes

In [142]: np.genfromtxt(txt.splitlines(),dtype='str,int,float,int')
Out[142]: 
array([('', 1, 2.0, 3), ('', 4, 5.0, 6)], 
      dtype=[('f0', '<U'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

Compound dtype, with a column count for the numeric field (and a correction to the string column: 'S5' instead of 'str', which has zero width and produced the empty strings above)

In [145]: np.genfromtxt(txt.splitlines(),dtype='S5,(3)int')
Out[145]: 
array([(b'one', [1, 2, 3]), (b'two', [4, 5, 6])], 
      dtype=[('f0', 'S5'), ('f1', '<i4', (3,))])

In [146]: _['f0']
Out[146]: 
array([b'one', b'two'], 
      dtype='|S5')

In [149]: _145['f1']
Out[149]: 
array([[1, 2, 3],
       [4, 5, 6]])

If you need to do math across the numeric fields, this last case (or something more elaborate) might be most convenient.
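
As a rough illustration of that convenience, the sketch below reuses the two-line sample and the 'S5,(3)int' dtype from the session above. The row sums and column means are just examples of the kind of math that becomes a one-liner once the numeric columns sit in a single 2D field.

```python
import numpy as np

# Same two-line sample as in the session above
txt = b"""one 1 2 3
two 4 5 6
"""

# Field f0 holds the label, field f1 holds the three numeric columns
data = np.genfromtxt(txt.splitlines(), dtype="S5,(3)int")

# data['f1'] is an ordinary 2D array of shape (rows, 3)
values = data["f1"]

row_sums = values.sum(axis=1)    # one sum per record
col_means = values.mean(axis=0)  # one mean per numeric column

print(row_sums)
print(col_means)
```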

To generate something more complicated, it may be best to build the dtype in a separate expression first (dtype syntax can be tricky):

In [172]: dt=np.dtype([('f0','|S5'),('f1',[('f10',int),('f11',float,(2))])])

In [173]: np.genfromtxt(txt.splitlines(),dtype=dt)
Out[173]: 
array([(b'one', (1, [2.0, 3.0])), (b'two', (4, [5.0, 6.0]))], 
      dtype=[('f0', 'S5'), ('f1', [('f10', '<i4'), ('f11', '<f8', (2,))])])
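
Fields of a nested dtype like this one are reached by chaining the names. The sketch below rebuilds the same `dt` and sample text so it runs on its own; the variable names `inner`, `ints` and `pairs` are just illustrative.

```python
import numpy as np

# Same sample text and nested dtype as in the session above
txt = b"""one 1 2 3
two 4 5 6
"""
dt = np.dtype([("f0", "|S5"),
               ("f1", [("f10", int), ("f11", float, (2,))])])

data = np.genfromtxt(txt.splitlines(), dtype=dt)

# Indexing by the outer name returns a structured array of the inner fields
inner = data["f1"]

ints = inner["f10"]   # 1D: one integer per record
pairs = inner["f11"]  # 2D: two floats per record

print(ints)
print(pairs)
```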