How to import same column name data with np.genfromtxt?


I have data in the file data.dat of the form:

column_1    col col col col col
1   2   3   1   2   3
4   3   2   3   2   4
1   4   3   1   4   3
5   6   4   5   6   4

I am trying to import it with np.genfromtxt so that all the data under the column name col is stored in the variable y. I tried the following code:

import numpy as np
data = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True).transpose()
y = data['col']

But it gives me the following error:

ValueError: two fields with the same name

How can this be solved in Python?


Solution

  • When you use names=True, np.genfromtxt returns a structured array. Notice that the columns labelled col in data.dat get disambiguated to field names of the form col_n:

    In [114]: arr = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True)
    
    In [115]: arr
    Out[115]: 
    array([(1, 2, 3, 1, 2, 3), (4, 3, 2, 3, 2, 4), (1, 4, 3, 1, 4, 3),
           (5, 6, 4, 5, 6, 4)], 
          dtype=[('column_1', '<i8'), ('col', '<i8'), ('col_1', '<i8'), ('col_2', '<i8'), ('col_3', '<i8'), ('col_4', '<i8')])
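
If you do want to keep the structured array, the disambiguated fields can still be collected by name. The following is a minimal sketch rather than part of the original answer: it rebuilds the array shown above by hand instead of reading a file, and it assumes that every field named col or col_<n> came from a col column in the file (which would be wrong if the file also had a genuine column called, say, col_1).

```python
import numpy as np

# Structured array with the same dtype as the output above, built by hand
# so the example does not depend on a file on disk.
dtype = [('column_1', '<i8'), ('col', '<i8'), ('col_1', '<i8'),
         ('col_2', '<i8'), ('col_3', '<i8'), ('col_4', '<i8')]
arr = np.array([(1, 2, 3, 1, 2, 3), (4, 3, 2, 3, 2, 4),
                (1, 4, 3, 1, 4, 3), (5, 6, 4, 5, 6, 4)], dtype=dtype)

# Assumption: the original "col" columns are exactly the fields named
# "col" or "col_<n>" after disambiguation.
fields = [name for name in arr.dtype.names
          if name == 'col' or name.startswith('col_')]

# Stack the selected 1-D fields side by side into a plain 2-D array.
y = np.column_stack([arr[name] for name in fields])
print(y)
```

This works, but it depends on guessing the renamed fields, which is the extra effort the rest of this answer avoids.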
    

    So once you use names=True, it becomes harder to select all the data associated with the column name col. Moreover, selecting several fields from a structured array gives you another structured array, not a plain 2-D array that you can slice by column. It is therefore more convenient to load the data into an array of homogeneous dtype (which is what you get without names=True) and read the header line separately:

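    # Open in text mode: in Python 3, splitting a bytes line on the str '\t' raises a TypeError.
    with open('data.dat', 'r') as f:
        # Consume the header line ourselves; genfromtxt then parses the remaining rows.
        header = f.readline().strip().split('\t')
        arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)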
    

    Then you can find the numeric indices of the columns whose name is col:

    idx = [i for i, col in enumerate(header) if col == 'col']
    

    and select all the data with

    y = arr[:, idx]
    

    For example,

    import numpy as np
    
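    # Text mode, so that the header can be split on the str '\t' in Python 3.
    with open('data.dat', 'r') as f:
        header = f.readline().strip().split('\t')
        arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)

    # Positions of every column whose header is exactly 'col'.
    idx = [i for i, col in enumerate(header) if col == 'col']
    y = arr[:, idx]
    print(y)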
    

    yields

    [[2 3 1 2 3]
     [3 2 3 2 4]
     [4 3 1 4 3]
     [6 4 5 6 4]]
    

    If you want y to be 1-dimensional, you could use ravel():

    print(y.ravel())
    

    yields

    [2 3 1 2 3 3 2 3 2 4 4 3 1 4 3 6 4 5 6 4]
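
The steps above can also be wrapped in a small function. This is a sketch under stated assumptions: load_columns_named is a hypothetical helper (not part of NumPy), and io.StringIO stands in for data.dat so the example runs without a file on disk. It assumes a tab-delimited file with at least two data rows, so that genfromtxt returns a 2-D array.

```python
import io
import numpy as np

def load_columns_named(source, name, delimiter='\t'):
    """Return a 2-D array holding every column whose header equals `name`.

    `source` is an open text-mode file or file-like object.
    Hypothetical helper for illustration only.
    """
    # Read the header line by hand; genfromtxt parses whatever rows remain.
    header = source.readline().strip().split(delimiter)
    arr = np.genfromtxt(source, comments='#', delimiter=delimiter, dtype=None)
    # Positions of the columns with a matching header.
    idx = [i for i, col in enumerate(header) if col == name]
    return arr[:, idx]

# In-memory stand-in for data.dat (tab-delimited, same values as the question).
sample = io.StringIO(
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)
y = load_columns_named(sample, 'col')
print(y)
```

With a real file you would pass the handle from open('data.dat', 'r') instead of the StringIO object.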