
Issues importing datasets (txt file) with Python using numpy library genfromtxt function


I am trying to learn Python. I am trying to import a dataset, but I can't get it to work correctly...

The dataset contains 16 columns and 16,320 rows saved as a txt file. I used the genfromtxt function as follows:

import numpy as np  
dt = np.dtype([('name', np.str_, 16), ('platform', np.str_, 16),
               ('year', np.float_, (2,)), ('genre', np.str_, 16),
               ('publisher', np.str_, 16), ('na_sales', np.float_, (2,)),
               ('eu_sales', np.float64, (2,)), ('jp_sales', np.float64, (2,)),
               ('other_sales', np.float64, (2,)), ('global_sales', np.float64, (2,)),
               ('critic_scores', np.float64, (2,)), ('critic_count', np.float64, (2,)),
               ('user_scores', np.float64, (2,)), ('user_count', np.float64, (2,)),
               ('developer', np.str_, 16), ('rating', np.str_, 16)])
data = np.genfromtxt('D:\\data3.txt', delimiter=',', names=True, dtype=dt)

I get this error :

ValueError: size of tuple must match number of fields.

But my dt variable contains 16 types, one for each column. I specify the dtype because otherwise the strings are replaced by nan.

Any help would be appreciated.


Solution

  • Look at an array made with your dt:

    In [78]: np.ones((1,),dt)
    Out[78]: 
    array([ ('1', '1', [ 1.,  1.], '1', '1', [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], 
          [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], [ 1.,  1.], 
          [ 1.,  1.], '1', '1')], 
          dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8', (2,)), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8', (2,)), ('eu_sales', '<f8', (2,)), ('jp_sales', '<f8', (2,)), ('other_sales', '<f8', (2,)), ('global_sales', '<f8', (2,)), ('critic_scores', '<f8', (2,)), ('critic_count', '<f8', (2,)), ('user_scores', '<f8', (2,)), ('user_count', '<f8', (2,)), ('developer', '<U16'), ('rating', '<U16')])
    

    I count 26 values (6 strings plus 10 pairs of floats), not the 16 you need. Were you thinking the (2,) denoted a double? It denotes a 2-element subfield.

    Take out all those (2,) shapes:

    In [80]: np.ones((1,),dt)
    Out[80]: 
    array([ ('1', '1',  1., '1', '1',  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., '1', '1')], 
          dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8'), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8'), ('eu_sales', '<f8'), ('jp_sales', '<f8'), ('other_sales', '<f8'), ('global_sales', '<f8'), ('critic_scores', '<f8'), ('critic_count', '<f8'), ('user_scores', '<f8'), ('user_count', '<f8'), ('developer', '<U16'), ('rating', '<U16')])
    

    Now I have 16 fields that should parse your 16 columns just right.
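
    For reference, the corrected dt spelled out in full would look something like this (a sketch reusing the field names, 16-character string lengths, delimiter, and file path from the question):

    import numpy as np

    # Corrected dtype: 16 scalar fields, one per column, with the (2,) sub-shapes removed
    dt = np.dtype([('name', np.str_, 16),
                   ('platform', np.str_, 16),
                   ('year', np.float64),
                   ('genre', np.str_, 16),
                   ('publisher', np.str_, 16),
                   ('na_sales', np.float64),
                   ('eu_sales', np.float64),
                   ('jp_sales', np.float64),
                   ('other_sales', np.float64),
                   ('global_sales', np.float64),
                   ('critic_scores', np.float64),
                   ('critic_count', np.float64),
                   ('user_scores', np.float64),
                   ('user_count', np.float64),
                   ('developer', np.str_, 16),
                   ('rating', np.str_, 16)])

    # Same call as in the question, now with one field per column
    data = np.genfromtxt('D:\\data3.txt', delimiter=',', names=True, dtype=dt)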

    But often dtype=None works just as well. It lets genfromtxt deduce the best dtype for each field. In that case it will take field names from the column header line (your names=True parameter).
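
    A minimal sketch of that variant (same path and delimiter as above; encoding=None is an extra assumption here, it just makes the text columns come back as Python str rather than bytes):

    import numpy as np

    # Let genfromtxt infer a dtype per column and take the field names from the header row
    data = np.genfromtxt('D:\\data3.txt', delimiter=',', names=True,
                         dtype=None, encoding=None)

    print(data.dtype.names)    # field names taken from the header line
    print(data['year'][:5])    # columns are then accessed by name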

    It's a good idea to test complicated lines of code before throwing them into larger scripts, especially if you are in the process of learning.