python, machine-learning, nlp, keras, training-data

How to train an NLP classifier using the Keras library?


Here is my training data. I want to predict 'y' from X_data using the Keras library. I have been getting an error for a long time; I know it's about the data shape, but I have been stuck for a while. I hope you can help.

X_data =
0     [construction, materials, labour, charges, con...
1     [catering, catering, lunch]
2     [passenger, transport, local, transport, passe...
3     [goods, transport, road, transport, goods, inl...
4     [rental, rental, aircrafts]
5     [supporting, transport, cargo, handling, agenc...
6     [postal, courier, postal, courier, local, deli...
7     [electricity, charges, reimbursement, electric...
8     [facility, management, facility, management, p...
9     [leasing, leasing, aircrafts]
10    [professional, technical, business, selling, s...
11    [telecommunications, broadcasting, information...
12    [support, personnel, search, contract, tempora...
13    [maintenance, repair, installation, maintenanc...
14    [manufacturing, physical, inputs, owned, other...
15    [accommodation, hotel, accommodation, hotel, i...
16    [leasing, rental, leasing, renting, motor, veh...
17    [real, estate, rental, leasing, involving, pro...
18    [rental, transport, vehicles, rental, road, ve...
19    [cleaning, sanitary, pad, vending, machine]
20    [royalty, transfer, use, ip, intellectual, pro...
21    [legal, accounting, legal, accounting, legal, ...
22    [veterinary, clinic, health, care, relation, a...
23    [human, health, social, care, inpatient, medic...
Name: Data, dtype: object

And here is my training target:

y = 

0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15    10
16     2
17    10
18     2
19     2
20    10
21    10
22    10
23    10

I am using this model:

top_words = 5000
length= len(X_data)
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(embedding_vecor_length, top_words, input_length=length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_data, y, epochs=3, batch_size=32)

ValueError: Error when checking input: expected embedding_8_input to have shape (None, 24) but got array with shape (24, 1)

What is the problem with using this data in this model? I want to predict 'y' using X_data as input.


Solution

  • You need to convert your pandas Series to NumPy arrays. The arrays are going to be ragged, so you need to pad them. You also need to set up a dictionary that maps each word to an integer index, as you cannot pass words directly into a neural network; the Embedding layer then learns a vector for each index. Some examples are here, here, and here. You're going to need to do your own research here; it's not possible to do much with the data sample you provided.

    length = len(X_data) is the number of samples you have. Keras doesn't care about this for input_length; it wants to know how many words each input sample contains (this has to be the same for every sample, which is why padding was mentioned earlier).

    So the input length for the network is the number of columns in the padded array:

    # assuming you converted X_data correctly to a padded NumPy array of word indices
    # Embedding takes (input_dim, output_dim): vocabulary size first, then vector length
    model.add(Embedding(top_words, embedding_vecor_length, input_length=X_data.shape[1]))
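
The conversion and padding steps described above can be sketched in plain Python. This is a minimal illustration, not the only way to do it: the helper names build_vocab and encode_and_pad and the reserved indices PAD and OOV are assumptions made for this example, and the three rows are copied from the question's X_data.

```python
from collections import Counter

PAD = 0  # index reserved for padding (assumption for this sketch)
OOV = 1  # index reserved for words outside the vocabulary (assumption)


def build_vocab(docs, top_words):
    """Map the most frequent words to integer indices starting at 2."""
    counts = Counter(word for doc in docs for word in doc)
    most_common = counts.most_common(top_words - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode_and_pad(docs, vocab, max_len):
    """Turn each token list into a fixed-length list of integer indices."""
    rows = []
    for doc in docs:
        # Truncate long rows, then right-pad short rows with PAD.
        ids = [vocab.get(word, OOV) for word in doc][:max_len]
        rows.append(ids + [PAD] * (max_len - len(ids)))
    return rows


# Three rows taken from the question's X_data.
docs = [
    ["catering", "catering", "lunch"],
    ["rental", "rental", "aircrafts"],
    ["cleaning", "sanitary", "pad", "vending", "machine"],
]
vocab = build_vocab(docs, top_words=5000)
max_len = max(len(doc) for doc in docs)
X = encode_and_pad(docs, vocab, max_len)

# Every row now has the same length, so the number of columns
# (max_len) is the value to pass as input_length.
print(len(X), len(X[0]))
```

Wrapping X in a NumPy array gives the (samples, words) shape the Embedding layer expects. Keras also ships its own Tokenizer and pad_sequences utilities for this job; note that pad_sequences pads at the front by default, whereas this sketch pads at the end.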
    

    Your class labels need to be one-hot encoded.

    from keras.utils import to_categorical
    
    y = to_categorical(y)
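
One caveat with the labels in the question: to_categorical sizes its output from the largest label plus one, so labels 1, 2 and 10 would produce 11 columns, most of them unused. A possible workaround, sketched here in plain Python, is to remap the labels to 0..n-1 before encoding. The helper name one_hot_labels is hypothetical; the label values are copied from the question's y.

```python
def one_hot_labels(labels):
    """Remap arbitrary integer labels to 0..n-1 and one-hot encode them."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    rows = [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return rows, classes


# The 24 labels shown in the question.
y = [1] * 15 + [10, 2, 10, 2, 2, 10, 10, 10, 10]
y_onehot, classes = one_hot_labels(y)

# classes keeps the original label for each column, so a predicted
# column index can be translated back with classes[column].
print(classes, len(y_onehot[0]))
```

With this remapping the number of output units in the final Dense layer would be len(classes), which is 3 for the 24 rows shown rather than 10.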
    

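    Your last Dense layer now needs one unit per class, and the correct activation for a multiclass problem is softmax. The line below uses 10, assuming that you have 10 categories; the unit count has to match the number of columns in the one-hot encoded y, and to_categorical(y) yields max(y) + 1 columns.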

    model.add(Dense(10, activation='softmax'))
    

    Your loss now has to be categorical_crossentropy, since this is a multiclass problem:

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])