I am trying to learn Python, but I can't get a dataset to import correctly.
The dataset has 16 columns and 16,320 rows and is saved as a txt file. I used the genfromtxt function as follows:
import numpy as np
dt=np.dtype([('name', np.str_, 16),('platform', np.str_, 16),('year', np.float_, (2,)),('genre', np.str_, 16),('publisher', np.str_, 16),('na_sales', np.float_, (2,)), ('eu_sales', np.float64, (2,)), ('jp_sales', np.float64, (2,)), ('other_sales', np.float64, (2,)), ('global_sales', np.float64, (2,)), ('critic_scores', np.float64, (2,)),('critic_count', np.float64, (2,)),('user_scores', np.float64, (2,)),('user_count', np.float64, (2,)),('developer', np.str_, 16),('rating', np.str_, 16)])
data=np.genfromtxt('D:\\data3.txt',delimiter=',',names=True,dtype=dt)
I get this error:
ValueError: size of tuple must match number of fields.
But my dt variable contains 16 types, one for each column. I specify the dtype because otherwise the strings are replaced by nan.
Any help would be appreciated.
Look at an array made with your dt:
In [78]: np.ones((1,),dt)
Out[78]:
array([ ('1', '1', [ 1., 1.], '1', '1', [ 1., 1.], [ 1., 1.], [ 1., 1.],
[ 1., 1.], [ 1., 1.], [ 1., 1.], [ 1., 1.], [ 1., 1.],
[ 1., 1.], '1', '1')],
dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8', (2,)), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8', (2,)), ('eu_sales', '<f8', (2,)), ('jp_sales', '<f8', (2,)), ('other_sales', '<f8', (2,)), ('global_sales', '<f8', (2,)), ('critic_scores', '<f8', (2,)), ('critic_count', '<f8', (2,)), ('user_scores', '<f8', (2,)), ('user_count', '<f8', (2,)), ('developer', '<U16'), ('rating', '<U16')])
I count 26 1s (string and float), not the 16 you need. Were you thinking the (2,) denoted a double? It does not; it gives the field a shape, so each of those fields holds a 2-element subarray.
Take out all those (2,):
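To make the difference concrete, here is a minimal sketch comparing a field declared with a (2,) shape against one declared without it. The two-field dtypes are cut-down, illustrative versions of your dt, not the full 16-field one.

```python
import numpy as np

# 'year' declared with shape (2,): every record stores two floats in this field.
with_sub = np.dtype([('name', 'U16'), ('year', 'f8', (2,))])

# 'year' declared without a shape: every record stores a single float.
scalar = np.dtype([('name', 'U16'), ('year', 'f8')])

# One record of each kind; compare the shape of the 'year' column.
sub_shape = np.zeros(1, with_sub)['year'].shape
scalar_shape = np.zeros(1, scalar)['year'].shape

print(sub_shape)      # one record, two values per record
print(scalar_shape)   # one record, one value per record
```

So a row with one number in that column cannot fill a field that expects two, which is why the number of values in each parsed row does not match what the dtype asks for.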
In [80]: np.ones((1,),dt)
Out[80]:
array([ ('1', '1', 1., '1', '1', 1., 1., 1., 1., 1., 1., 1., 1., 1., '1', '1')],
dtype=[('name', '<U16'), ('platform', '<U16'), ('year', '<f8'), ('genre', '<U16'), ('publisher', '<U16'), ('na_sales', '<f8'), ('eu_sales', '<f8'), ('jp_sales', '<f8'), ('other_sales', '<f8'), ('global_sales', '<f8'), ('critic_scores', '<f8'), ('critic_count', '<f8'), ('user_scores', '<f8'), ('user_count', '<f8'), ('developer', '<U16'), ('rating', '<U16')])
Now I have 16 fields, one per column, so your 16 columns should parse correctly.
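Here is a small sketch of the same idea that can be run without your file. It assumes a made-up three-column sample held in memory (the rows and the reduced dtype are illustrative only, not taken from data3.txt) and a reasonably recent NumPy.

```python
import io
import numpy as np

# Hypothetical stand-in for the real file: a header line plus two rows.
sample = io.StringIO(
    "name,year,na_sales\n"
    "Wii Sports,2006,41.36\n"
    "Tetris,1989,23.2\n"
)

# One scalar field per column, with no (2,) shapes.
dt = np.dtype([('name', 'U16'), ('year', 'f8'), ('na_sales', 'f8')])

data = np.genfromtxt(sample, delimiter=',', names=True, dtype=dt)

print(data.dtype.names)   # field names
print(data['name'])       # the string column
print(data['year'])       # the year column as floats
```

The result is a 1-d structured array with one record per row, and each column is reached by its field name.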
But often dtype=None works just as well. It lets genfromtxt deduce the best dtype for each field. In that case it will take field names from the column header line (your names=True parameter).
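A minimal sketch of the dtype=None route, again on a made-up in-memory sample rather than your file. Passing encoding='utf-8' is my addition: on older NumPy versions it makes the string columns come back as str rather than bytes.

```python
import io
import numpy as np

# Hypothetical stand-in for the real file: a header line plus two rows.
sample = io.StringIO(
    "name,year,na_sales\n"
    "Wii Sports,2006,41.36\n"
    "Tetris,1989,23.2\n"
)

# dtype=None asks genfromtxt to work out a type for each column;
# names=True takes the field names from the header line.
data = np.genfromtxt(sample, delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype)      # the deduced per-column types
print(data['name'])    # the string column
```

Note that the deduced types depend on the data: a column of whole numbers such as year comes back as an integer type, not a float. If a column needs a specific type, spell out the dtype as above.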
It's a good idea to test complicated lines of code before putting them into larger scripts, especially while you are still learning.