python csv sklearn-pandas

Why can't this program convert a string to float in Python?


What's wrong with this?

from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions

namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values

# split the array into features and target
X = array[:,0:9]
Y = array[:,9]

skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)

# resulting data
set_printoptions(precision = 3)
print(normalisasiX[0:10,:])

And when I run this program, I get this error:

File "C:\Users\Dini\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array

array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 'ib'

[picture of the csv file]

Please help me.


Solution

  • I was looking at the docs (the same page that @OliverRadini referred to), and that page says the following:

    header : int, list of int, default ‘infer’

    Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file

    You're defining the names in code, so you shouldn't also have a header row in the file. Do one (put the headers in the CSV data) or the other (put the column names in code), not both.
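
    A minimal sketch of that either/or choice, using a made-up two-column file. The columns a and b and their values are invented for illustration, and StringIO stands in for a file on disk:

```python
from io import StringIO
from pandas import read_csv

# Hypothetical file contents with a header row (not the asker's data).
csv_text = "a,b\n1,2\n3,4\n"

# Names passed in code AND a header row in the file:
# the header row is read as the first row of data.
both = read_csv(StringIO(csv_text), names=["a", "b"])
print(both.dtypes)  # non-numeric, because every value was read as a string
print(len(both))    # one row more than the file has data rows

# Option 1: rely on the header row in the file.
from_file = read_csv(StringIO(csv_text))

# Option 2: keep the names in code and replace the file's header row.
from_code = read_csv(StringIO(csv_text), names=["a", "b"], header=0)

print(from_file.dtypes)  # numeric columns
print(from_code.dtypes)  # numeric columns
```

    Either option gives numeric columns; only the combination of names in code plus an unreplaced header row pulls the header text into the data.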

    EDIT: My answer remains the same, but here's one way you could have discovered this yourself:

    With the following csv data (what you showed in the picture):

    BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
    13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
    13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
    13-Mar,75836,902,1060,1905,3166,161,39,133,785,83987
    13-Apr,75571,902,112,1878,3190,158,39,133,635,82618
    13-May,83797,1156,134,1900,3518,218,39,133,709,91604
    13-Jun,91648,1291,127,2220,3596,249,39,133,659,99967
    13-Jul,79063,1346,107,1844,3428,247,39,133,951,86798
    

    Running this code...

    from pandas import read_csv
    from numpy import set_printoptions
    
    namaFile = 'dataset.csv'
    nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
    
    dataFrame = read_csv(namaFile, names=nama)
    array = dataFrame.values
    
    print("with names=nama...")
    print(array)
    
    dataFrame = read_csv(namaFile)
    array = dataFrame.values
    
    print("with no names...")
    print(array)
    
    dataFrame = read_csv(namaFile, names=nama, header=0)
    array = dataFrame.values
    
    print("with names=nama and header=0...")
    print(array)
    

    You get this output:

    with names=nama...
    [['rt' 'nigak' 'niagab' 'sosum' 'soskhus' 'p' 'tni' 'ik' 'ib' 'TARGET']
     ['84876' '902' '1192' '2098' '3623' '169' '39' '133' '1063' '94095']
     ['79194' '902' '1050' '2109' '3606' '153' '39' '133' '806' '87992']
     ['75836' '902' '1060' '1905' '3166' '161' '39' '133' '785' '83987']
     ['75571' '902' '112' '1878' '3190' '158' '39' '133' '635' '82618']
     ['83797' '1156' '134' '1900' '3518' '218' '39' '133' '709' '91604']
     ['91648' '1291' '127' '2220' '3596' '249' '39' '133' '659' '99967']
     ['79063' '1346' '107' '1844' '3428' '247' '39' '133' '951' '86798']]
    
    with no names...
    [['13-Jan' 84876 902 1192 2098 3623 169 39 133 1063 94095]
     ['13-Feb' 79194 902 1050 2109 3606 153 39 133 806 87992]
     ['13-Mar' 75836 902 1060 1905 3166 161 39 133 785 83987]
     ['13-Apr' 75571 902 112 1878 3190 158 39 133 635 82618]
     ['13-May' 83797 1156 134 1900 3518 218 39 133 709 91604]
     ['13-Jun' 91648 1291 127 2220 3596 249 39 133 659 99967]
     ['13-Jul' 79063 1346 107 1844 3428 247 39 133 951 86798]]
    
    with names=nama and header=0...
    [[84876   902  1192  2098  3623   169    39   133  1063 94095]
     [79194   902  1050  2109  3606   153    39   133   806 87992]
     [75836   902  1060  1905  3166   161    39   133   785 83987]
     [75571   902   112  1878  3190   158    39   133   635 82618]
     [83797  1156   134  1900  3518   218    39   133   709 91604]
     [91648  1291   127  2220  3596   249    39   133   659 99967]
     [79063  1346   107  1844  3428   247    39   133   951 86798]]
    

    We can see clearly that when the names are given both in the file and in the code, the file's header row ends up as the first row of data and every value is read as a string, which is not what we want. When you remove names=nama, you get all of the data from the file, with the header row used for the column names. When you explicitly overwrite the names with names=nama and header=0, you also get numeric data. However, note that the names in your code are missing the BULAN column, which is why the dates do not appear in the first and third outputs, so be careful with that.
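
    Putting this together, one possible version of the original script might look like the sketch below. It assumes the file has the 11 columns shown in the sample data, names BULAN explicitly, and drops it before normalizing because it holds text. StringIO with two of the sample rows stands in for dataset.csv so that the snippet runs on its own:

```python
from io import StringIO

from numpy import set_printoptions
from pandas import read_csv
from sklearn.preprocessing import Normalizer

# Two rows from the sample data above; StringIO stands in for 'dataset.csv'.
csv_text = """BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
"""

# Name all 11 columns, including BULAN, and pass header=0 so these names
# replace the header row in the file instead of being added on top of it.
nama = ['BULAN', 'rt', 'niagak', 'niagab', 'sosum', 'soskhus',
        'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(StringIO(csv_text), names=nama, header=0)

# Drop the text column (assumed not to be a feature) and work with floats.
array = dataFrame.drop(columns='BULAN').values.astype(float)

# split the array into features and target
X = array[:, 0:9]
Y = array[:, 9]

# Normalizer rescales each row to unit L2 norm by default.
skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)

set_printoptions(precision=3)
print(normalisasiX)
```

    To run this against the real file, replace StringIO(csv_text) with namaFile = 'dataset.csv' as in the question.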

    print() is your friend. Use it. It will tell you what your problems are.