Tags: python, numpy, python-2.7, pandas, data-analysis

How to read from an array without a particular column in Python


I have a NumPy array with dtype=object whose elements are lists of mixed data types, so in effect it is a 2D structure: an array of lists. I want to copy every row, but only certain columns, of this array into another array. The data comes from a CSV file with several fields (columns) and a large number of rows. Here is the code I used to load the data into the array:

import numpy as np

# csv_file_object: iterable over the rows of the csv file, defined elsewhere
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row

The data can be depicted roughly as follows:

column1  column2  column3  column4  column5 ....
1         none     2       'gona'    5.3
2         34       2       'gina'    5.5
3         none     2       'gana'    5.1
4         43       2       'gena'    5.0
5         none     2       'guna'    5.7
.....     ....   .....      .....    ....
.....     ....   .....      .....    ....
.....     ....   .....      .....    ....

There are unwanted fields in the middle that I want to remove. Suppose I don't want column3. How do I remove only that column from my array, or copy only the relevant columns to another array?


Solution

  • Use pandas. For mixed-type data like yours, a pandas.DataFrame is probably a better fit than an object array.

    # Python 2 import; on Python 3 use: from io import StringIO
    from StringIO import StringIO
    import pandas as pd

    data = """column1  column2  column3  column4  column5
    1         none     2       'gona'    5.3
    2         34       2       'gina'    5.5
    3         none     2       'gana'    5.1
    4         43       2       'gena'    5.0
    5         none     2       'guna'    5.7"""

    # Parse the whitespace-delimited text, then drop the unwanted column
    df = pd.read_csv(StringIO(data), delim_whitespace=True).drop('column3', axis=1)
    print(df)
    

    out:

       column1 column2 column4  column5
    0        1    none  'gona'      5.3
    1        2      34  'gina'      5.5
    2        3    none  'gana'      5.1
    3        4      43  'gena'      5.0
    4        5    none  'guna'      5.7
    

    If you need an array instead of a DataFrame, call the to_records() method on the DataFrame returned by drop() (named df here):

    df.to_records(index = False)
    #output:
    rec.array([(1L, 'none', "'gona'", 5.3),
               (2L, '34', "'gina'", 5.5),
               (3L, 'none', "'gana'", 5.1),
               (4L, '43', "'gena'", 5.0),
               (5L, 'none', "'guna'", 5.7)], 
                dtype=[('column1', '<i8'), ('column2', '|O4'),
                       ('column4', '|O4'), ('column5', '<f8')])
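
If you would rather stay in NumPy, the same column removal can be sketched directly on the kind of array described in the question. This is only a minimal sketch under a few assumptions: it is written for Python 3, the `rows` list is a hypothetical stand-in for what the csv reader yields (lists of strings), and every row is assumed to have the same number of fields, since that is what allows the array of lists to be converted into a real 2D array.

```python
import numpy as np

# Hypothetical stand-in for the rows read from the CSV file;
# each row is assumed to arrive as a list of strings.
rows = [
    ["1", "none", "2", "'gona'", "5.3"],
    ["2", "34", "2", "'gina'", "5.5"],
    ["3", "none", "2", "'gana'", "5.1"],
    ["4", "43", "2", "'gena'", "5.0"],
    ["5", "none", "2", "'guna'", "5.7"],
]

# Rebuild the structure from the question: a 1-D object array of lists.
data = np.zeros((len(rows),), dtype=object)
for i, row in enumerate(rows):
    data[i] = row

# Convert to a true 2-D object array (needs rows of equal length).
table = np.array(data.tolist(), dtype=object)

# Option 1: delete the column at index 2 (column3); returns a new array.
without_col3 = np.delete(table, 2, axis=1)

# Option 2: copy only the columns you want to keep.
keep = [0, 1, 3, 4]
selected = table[:, keep]

print(without_col3.shape)
```

Both options return a new array and leave `table` untouched. `np.delete` is convenient when you know which columns to drop, while indexing with a list of column positions is convenient when you know which ones to keep.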