Tags: python, csv, numpy, genfromtxt

Python: Reading a CSV with np.genfromtxt results in a different number of columns


I am using np.genfromtxt to read a CSV file, and it raises a ValueError on the data. I am not sure why: when I open the file in Excel, it shows 23 columns for all 33 rows.

Here is the code and error:

csv = np.genfromtxt(fname, delimiter=",", names=True)

Here is a snippet of the csv records:

,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_NN__alpha,param_NN__hidden_layer_sizes,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,split3_test_score,split3_train_score,split4_test_score,split4_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.34166226387023924,0.0010362625122070312,0.842927342927343,0.8468980402379758,0.1,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25,0.8420706295240185,0.8475292052871167,0.8398771660451854,0.8463774474853288,0.845360824742268,0.846158065046893,0.8385256691531373,0.8486892618185806,0.8488040377441299,0.8457362215519605,0.05093153997183547,0.00018195987247183776,0.0037378988316037944,0.0010747322296072162
1,0.5543142318725586,0.0018250465393066407,0.8465250965250966,0.8527554135893668,0.1,"(25, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 7)}",5,0.846018863785918,0.8530137662480118,0.846018863785918,0.8589919376953875,0.8479929809168677,0.8496681840618658,0.8400614304519526,0.851486234506965,0.8525345622119815,0.8506169454346038,0.10835399357094619,0.00018853748087819175,0.004013613789285713,0.003306836154659678
2,0.5266880512237548,0.0013680458068847656,0.8437609687609687,0.8478413817137904,0.1,"(11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (11, 7)}",17,0.842509322219785,0.8479679701639884,0.8354902390875192,0.8431964021280096,0.8455801710901514,0.8520265452750507,0.8433523475208424,0.851595919710431,0.8518762343647136,0.8444200712914725,0.1041624682160838,0.0003233587082439388,0.005278162504355272,0.0036030369022985215
3,0.49459095001220704,0.0011162281036376954,0.8406458406458407,0.845428443186931,0.1,"(7, 5)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7, 5)}",32,0.8383417416100022,0.848461580650469,0.8429480149155516,0.8501617945483464,0.8468962491774512,0.8514780891789612,0.8312856516015796,0.8381046396841066,0.8437568575817423,0.8389361118727722,0.10397613499936685,0.00018889068500539376,0.005421511394261151,0.005726975087304059
4,0.6175418376922608,0.0024899959564208983,0.8449017199017199,0.8508140227747922,0.1,"(25, 11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 11, 7)}",11,0.8414125904803685,0.8493939560138211,0.8427286685676684,0.8546591345362804,0.8501864443957008,0.8519716996654417,0.8459850811759544,0.8564769112646704,0.8441957428132544,0.8415684123937482,0.1940231074769015,0.00047604030307216253,0.003049662553913791,0.005209439647677219

Error Received:

ValueError: Some errors were detected !
    Line #2 (got 26 columns instead of 22)
    Line #3 (got 26 columns instead of 22)
    Line #4 (got 26 columns instead of 22)
    Line #5 (got 26 columns instead of 22)
    Line #6 (got 28 columns instead of 22)
    Line #7 (got 26 columns instead of 22)
    Line #8 (got 28 columns instead of 22)
    Line #9 (got 26 columns instead of 22)
    Line #10 (got 26 columns instead of 22)
    Line #11 (got 26 columns instead of 22)
    Line #12 (got 26 columns instead of 22)
    Line #13 (got 26 columns instead of 22)
    Line #14 (got 28 columns instead of 22)
    Line #15 (got 26 columns instead of 22)
    Line #16 (got 28 columns instead of 22)
    Line #17 (got 26 columns instead of 22)
    Line #18 (got 26 columns instead of 22)
    Line #19 (got 26 columns instead of 22)
    Line #20 (got 26 columns instead of 22)
    Line #21 (got 26 columns instead of 22)
    Line #22 (got 28 columns instead of 22)
    Line #23 (got 26 columns instead of 22)
    Line #24 (got 28 columns instead of 22)
    Line #25 (got 26 columns instead of 22)
    Line #26 (got 26 columns instead of 22)
    Line #27 (got 26 columns instead of 22)
    Line #28 (got 26 columns instead of 22)
    Line #29 (got 26 columns instead of 22)
    Line #30 (got 28 columns instead of 22)
    Line #31 (got 26 columns instead of 22)
    Line #32 (got 28 columns instead of 22)
    Line #33 (got 26 columns instead of 22)

Solution

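  • You're passing , as the delimiter, but several of your column values contain commas themselves (for example "(25, 7)" and the params dict). np.genfromtxt splits on every comma, including those inside double quotes, which is why the data rows come out with 26 or 28 columns instead of the expected 22. It has no quotechar option, so you need a reader that honors quoting.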

    Fortunately, pandas handles this really well without much handholding. You could try loading your data with read_csv and then convert the loaded dataframe to an array.

    import pandas as pd
    array = pd.read_csv(fname, index_col=[0]).values
    

    The loaded dataframe (what you get before calling .values) looks like this:

    df = pd.read_csv(fname, index_col=[0])
    print(df)
    
       mean_fit_time  mean_score_time  mean_test_score  mean_train_score  \
    0       0.341662         0.001036         0.842927          0.846898   
    1       0.554314         0.001825         0.846525          0.852755   
    2       0.526688         0.001368         0.843761          0.847841   
    3       0.494591         0.001116         0.840646          0.845428   
    4       0.617542         0.002490         0.844902          0.850814   
    
       param_NN__alpha param_NN__hidden_layer_sizes  \
    0              0.1                         (7,)   
    1              0.1                      (25, 7)   
    2              0.1                      (11, 7)   
    3              0.1                       (7, 5)   
    4              0.1                  (25, 11, 7)   
    
                                                  params  rank_test_score  \
    0  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               25   
    1  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...                5   
    2  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               17   
    3  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               32   
    4  {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (...               11   
    
       split0_test_score  split0_train_score       ...         split2_test_score  \
    0           0.842071            0.847529       ...                  0.845361   
    1           0.846019            0.853014       ...                  0.847993   
    2           0.842509            0.847968       ...                  0.845580   
    3           0.838342            0.848462       ...                  0.846896   
    4           0.841413            0.849394       ...                  0.850186   
    
       split2_train_score  split3_test_score  split3_train_score  \
    0            0.846158           0.838526            0.848689   
    1            0.849668           0.840061            0.851486   
    2            0.852027           0.843352            0.851596   
    3            0.851478           0.831286            0.838105   
    4            0.851972           0.845985            0.856477   
    
       split4_test_score  split4_train_score  std_fit_time  std_score_time  \
    0           0.848804            0.845736      0.050932        0.000182   
    1           0.852535            0.850617      0.108354        0.000189   
    2           0.851876            0.844420      0.104162        0.000323   
    3           0.843757            0.838936      0.103976        0.000189   
    4           0.844196            0.841568      0.194023        0.000476   
    
       std_test_score  std_train_score  
    0        0.003738         0.001075  
    1        0.004014         0.003307  
    2        0.005278         0.003603  
    3        0.005422         0.005727  
    4        0.003050         0.005209  
    
    [5 rows x 22 columns]
    

    And yes, columns are automatically converted to the appropriate datatypes.

    print(df.dtypes)
    
    mean_fit_time                   float64
    mean_score_time                 float64
    mean_test_score                 float64
    mean_train_score                float64
    param_NN__alpha                 float64
    param_NN__hidden_layer_sizes     object
    params                           object
    rank_test_score                   int64
    split0_test_score               float64
    split0_train_score              float64
    split1_test_score               float64
    split1_train_score              float64
    split2_test_score               float64
    split2_train_score              float64
    split3_test_score               float64
    split3_train_score              float64
    split4_test_score               float64
    split4_train_score              float64
    std_fit_time                    float64
    std_score_time                  float64
    std_test_score                  float64
    std_train_score                 float64
    dtype: object
    

    Statutory warning: Because the columns mix floats, integers, and strings, .values gives you an array of dtype object. This data will probably be more useful to you as a Python list than as a NumPy array (which is optimised for homogeneous numeric data).
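
If you would rather not pull in pandas, the same quoting issue can be seen and handled with the standard library alone. The sketch below is only an illustration: SAMPLE is a hypothetical, shortened version of the file (two rows, a few of the columns), and the keys of the records dicts are names chosen for this example. It first compares a plain split on commas with csv.reader, then uses ast.literal_eval to turn the tuple and dict strings into real Python objects, which fits the "Python list" suggestion above.

```python
import ast
import csv
import io

# Hypothetical, shortened sample in the same shape as the file in the question.
SAMPLE = ''',mean_fit_time,param_NN__hidden_layer_sizes,params,rank_test_score
0,0.3417,"(7,)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (7,)}",25
4,0.6175,"(25, 11, 7)","{'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (25, 11, 7)}",11
'''

# Splitting on every comma ignores the quotes, so data rows gain extra fields.
naive_counts = [len(line.split(",")) for line in SAMPLE.splitlines()]

# csv.reader keeps a double-quoted field together (quotechar='"' is the default).
quoted_counts = [len(row) for row in csv.reader(io.StringIO(SAMPLE))]

print(naive_counts)   # [5, 8, 10]
print(quoted_counts)  # [5, 5, 5]

# Build a plain list of dicts, converting each field to a suitable Python type.
records = []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    records.append({
        "mean_fit_time": float(row["mean_fit_time"]),
        # literal_eval safely parses the tuple/dict literals stored as text.
        "hidden_layer_sizes": ast.literal_eval(row["param_NN__hidden_layer_sizes"]),
        "params": ast.literal_eval(row["params"]),
        "rank_test_score": int(row["rank_test_score"]),
    })

print(records[1]["hidden_layer_sizes"])  # (25, 11, 7)
```

The mismatch between the naive counts and the quoted counts mirrors the "got 26 columns" and "got 28 columns" lines in the error: each comma inside a quoted field adds one spurious column. For the real file you would open it with open(fname, newline="") instead of io.StringIO and convert the remaining numeric columns the same way.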