What's wrong with this?
from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
#membagi array
X = array[:,0:9]
Y = array[:,9]
skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)
#data hasil
set_printoptions(precision = 3)
print(normalisasiX[0:10,:])
And when I run this program, I get this error:
File "C:\Users\Dini\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: 'ib'
I was looking at the docs (the same page that @OliverRadini referred to), and that page states the following:
header : int, list of int, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
You're defining the names in code, so you shouldn't also include the header in the file. Either do one (write the headers in the CSV data) or the other (write the column names in code). Don't do both.
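A minimal sketch of the difference, assuming a tiny in-memory CSV as a hypothetical stand-in for dataset.csv (the three-column data and the names a, b, c are made up for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row stand-in for dataset.csv, header row included.
csv_text = (
    "rt,niagak,TARGET\n"
    "84876,902,94095\n"
    "79194,902,87992\n"
)

# Option 1: let pandas take the column names from the file's header row.
from_file = pd.read_csv(StringIO(csv_text))

# Option 2: supply names in code and pass header=0, so the header row
# in the file is replaced instead of being read as data.
names = ["a", "b", "c"]  # illustrative names, not from the original file
from_code = pd.read_csv(StringIO(csv_text), names=names, header=0)

# The problem case: names given, header row still in the file.
# The header row is read as a data row, so the columns hold strings.
broken = pd.read_csv(StringIO(csv_text), names=names)

print(from_file.dtypes)
print(from_code.dtypes)
print(broken.head())
```

In the problem case the first row of the frame is the header text itself, which is exactly the kind of value Normalizer cannot convert to float.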
EDIT: My answer remains the same, but here's one way you could have discovered this yourself:
With the following csv data (what you showed in the picture):
BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
13-Mar,75836,902,1060,1905,3166,161,39,133,785,83987
13-Apr,75571,902,112,1878,3190,158,39,133,635,82618
13-May,83797,1156,134,1900,3518,218,39,133,709,91604
13-Jun,91648,1291,127,2220,3596,249,39,133,659,99967
13-Jul,79063,1346,107,1844,3428,247,39,133,951,86798
Running this code...
from pandas import read_csv
from numpy import set_printoptions
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
print("with names=nama...")
print(array)
dataFrame = read_csv(namaFile)
array = dataFrame.values
print("with no names...")
print(array)
dataFrame = read_csv(namaFile, names=nama, header=0)
array = dataFrame.values
print("with names=nama and header=0...")
print(array)
You get this output:
with names=nama...
[['rt' 'nigak' 'niagab' 'sosum' 'soskhus' 'p' 'tni' 'ik' 'ib' 'TARGET']
['84876' '902' '1192' '2098' '3623' '169' '39' '133' '1063' '94095']
['79194' '902' '1050' '2109' '3606' '153' '39' '133' '806' '87992']
['75836' '902' '1060' '1905' '3166' '161' '39' '133' '785' '83987']
['75571' '902' '112' '1878' '3190' '158' '39' '133' '635' '82618']
['83797' '1156' '134' '1900' '3518' '218' '39' '133' '709' '91604']
['91648' '1291' '127' '2220' '3596' '249' '39' '133' '659' '99967']
['79063' '1346' '107' '1844' '3428' '247' '39' '133' '951' '86798']]
with no names...
[['13-Jan' 84876 902 1192 2098 3623 169 39 133 1063 94095]
['13-Feb' 79194 902 1050 2109 3606 153 39 133 806 87992]
['13-Mar' 75836 902 1060 1905 3166 161 39 133 785 83987]
['13-Apr' 75571 902 112 1878 3190 158 39 133 635 82618]
['13-May' 83797 1156 134 1900 3518 218 39 133 709 91604]
['13-Jun' 91648 1291 127 2220 3596 249 39 133 659 99967]
['13-Jul' 79063 1346 107 1844 3428 247 39 133 951 86798]]
with names=nama and header=0...
[[84876 902 1192 2098 3623 169 39 133 1063 94095]
[79194 902 1050 2109 3606 153 39 133 806 87992]
[75836 902 1060 1905 3166 161 39 133 785 83987]
[75571 902 112 1878 3190 158 39 133 635 82618]
[83797 1156 134 1900 3518 218 39 133 709 91604]
[91648 1291 127 2220 3596 249 39 133 659 99967]
[79063 1346 107 1844 3428 247 39 133 951 86798]]
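We can see clearly here that when you pass names=nama while the file also has a header row, the header is read as the first row of data and every value becomes a string, which is not what we want. When you remove names=nama, you get all of the data from the file, including the BULAN column. When you explicitly overwrite the names with names=nama and header=0, you get the desired numeric result. HOWEVER, I would also like to note that the headers in your code are missing the BULAN column, so be careful with that: because there are 10 names for 11 columns, pandas treats the first column (the dates) as the index, which is why it does not show up in the values.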
print() is your friend. Use it. It will tell you what your problems are.
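Putting it together, here is one possible corrected version of the original program. It is a sketch, not the only fix: it assumes you keep the header row in the file and use index_col to keep BULAN out of the values, and it inlines the first three rows of the CSV shown above so that it runs on its own.

```python
from io import StringIO

from numpy import set_printoptions
from pandas import read_csv
from sklearn.preprocessing import Normalizer

# First rows of the CSV shown above, inlined so the sketch is self-contained.
# With the real file, pass 'dataset.csv' instead of StringIO(csv_text).
csv_text = """BULAN,rt,nigak,niagab,sosum,soskhus,p,tni,ik,ib,TARGET
13-Jan,84876,902,1192,2098,3623,169,39,133,1063,94095
13-Feb,79194,902,1050,2109,3606,153,39,133,806,87992
13-Mar,75836,902,1060,1905,3166,161,39,133,785,83987
"""

# Use the file's own header row, and make BULAN the index so the
# date strings stay out of the numeric values.
dataFrame = read_csv(StringIO(csv_text), index_col="BULAN")
array = dataFrame.values

# Split the array: nine feature columns, then TARGET.
X = array[:, 0:9]
Y = array[:, 9]

# Normalizer rescales each row to unit L2 norm by default.
normalisasiX = Normalizer().fit_transform(X)

set_printoptions(precision=3)
print(normalisasiX)
```

Printing dataFrame.dtypes right after read_csv is a quick way to confirm that every column was parsed as a number before handing the array to scikit-learn.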