Tags: python, csv, scikit-learn, defaultdict

ValueError: Inconsistent number of samples when using sklearn on defaultdict


I am reading columns in .csv files as inputs to a sklearn Naive Bayes fit. However, I am running into these errors and warnings:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

and

ValueError: Found arrays with inconsistent numbers of samples: [ 1 10509]

And here is my code:

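import csv
from collections import defaultdict

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

# Collect each CSV column into its own list, keyed by column index
columns = defaultdict(list)
with open('file.CSV', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        for i, v in enumerate(row):
            columns[i].append(v)

# This call triggers the warning and the ValueError shown above
clf.fit(columns[9], columns[10])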

As a note, len(columns[9]) and len(columns[10]) are both 10509.

As the warning suggested, I tried many different combinations of reshape(), flatten(), and ravel(), and also tried using NumPy arrays, but nothing seems to work.

Any suggestions? It seems that most people use some data structure other than a defaultdict, but I'm not sure how to use other data structures to read from a .csv file.


Solution

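  • I found the solution to my problem. It seems the issue wasn't just the shape of the data structure, but that the values were strings rather than numbers. csv.reader returns every field as a string, so the feature column has to be converted to a numeric type (and reshaped into a single column) before fitting.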

    import numpy as np

    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    x = np.array(columns[9]).reshape(len(columns[10]), 1).astype(float)
    y = np.array(columns[10])
    clf.fit(x, y)
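
For anyone who would rather avoid the defaultdict altogether, here is a minimal sketch of the same idea using only the standard library. It assumes the CSV has no header row and that the feature column parses cleanly as floats. The function name load_feature_and_label and the three-row sample are made up for illustration; they are not from the original code.

```python
import csv
import io


def load_feature_and_label(handle, feature_col, label_col):
    """Read one numeric feature column and one label column from CSV text."""
    X, y = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        # csv yields strings, so convert the feature to a number explicitly
        X.append([float(row[feature_col])])  # one row per sample, one feature
        y.append(row[label_col])  # labels are left as strings
    return X, y


# Hypothetical three-row sample standing in for file.CSV
sample = io.StringIO("1.5,a\n2.0,b\n3.25,a\n")
X, y = load_feature_and_label(sample, feature_col=0, label_col=1)
print(X)  # [[1.5], [2.0], [3.25]]
print(y)  # ['a', 'b', 'a']
```

Because X is already a list of one-element rows, it has the (n_samples, 1) layout the warning asks for and should be usable directly as clf.fit(X, y). With a real file on Python 3, open it in text mode, e.g. open('file.CSV', newline=''), rather than 'rb'.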