I'm completely new to programming; I've only just started learning with various free tools, so I don't understand much about programming yet.
I'm trying to write a neural network as a self-study project.
The idea is as follows: I have 3 Excel files. The first (categ) has 1 column named category with 37 values.
The second (ex) has 2 columns: the first, called categ, and the second, called fix, each with 785 rows.
The third (match) has 1 column called match with 3543 rows.
I need the match file to get a second column, so that each of its values is assigned a category from the categ file, based on the example data in the ex file.
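Roughly, this is the structure I mean, with made-up values since I can't share the real data (I'm using the column names that appear in my code below):
import pandas as pd

# Toy stand-ins for my real files (values invented for illustration only)
df_categories = pd.DataFrame({'categ': ['A', 'B', 'C']})                   # categ.xlsx: 37 categories
df_examples = pd.DataFrame({'categ': ['A', 'B'],
                            'fix': ['example text 1', 'example text 2']})  # ex.xlsx: 785 labelled examples
df_to_distribute = pd.DataFrame({'match': ['new text 1', 'new text 2']})   # match.xlsx: 3543 rows to classify

# What I want in the end: a second column on the match data, where every row
# gets a category (taken from categ.xlsx) based on the examples in ex.xlsx
# df_to_distribute['categ'] = <predicted category for each row of 'match'>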
At the moment, I have this code:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
# Reading downloaded Excel files
# File with categories
from google.colab import files
upload = files.upload()
!ls
df_categories = pd.read_excel('categ.xlsx', index_col=None)
print(df_categories.columns)
# File with examples
from google.colab import files
upload = files.upload()
!ls
df_examples = pd.read_excel('ex.xlsx', index_col=None)
print(df_examples.columns)
# File with values for distribution
from google.colab import files
upload = files.upload()
!ls
df_to_distribute = pd.read_excel('match.xlsx', index_col=None)
print(df_to_distribute.columns)
# Data preprocessing
categories = df_categories['categ'].tolist()
values = df_examples['fix'].tolist()
to_distribute = df_to_distribute['match'].tolist()
categories = [str(category) for category in categories]
values = [str(value) for value in values]
to_distribute = [str(item) for item in to_distribute]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(categories + values + to_distribute)
category_sequences = tokenizer.texts_to_sequences(categories)
value_sequences = tokenizer.texts_to_sequences(values)
to_distribute_sequences = tokenizer.texts_to_sequences(to_distribute)
max_length = max(len(seq) for seq in category_sequences + value_sequences + to_distribute_sequences)
padded_category_sequences = pad_sequences(category_sequences, maxlen=max_length, padding='post')
padded_value_sequences = pad_sequences(value_sequences, maxlen=max_length, padding='post')
padded_to_distribute_sequences = pad_sequences(to_distribute_sequences, maxlen=max_length, padding='post')
# Creating a model
input_layer = Input(shape=(max_length,))
embedding_layer = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64)(input_layer)
lstm_layer = LSTM(64)(embedding_layer)
output_layer = Dense(36, activation='softmax')(lstm_layer)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Model training
#model.fit(padded_to_distribute_sequences, padded_category_sequences, epochs=10, batch_size=32, validation_split=0.2)
#model.fit(np.array(data_list), np.array(y), verbose=0, epochs=100)
model.fit(np.array(padded_to_distribute_sequences), np.array(padded_category_sequences), verbose=0, epochs=100)
At the moment I'm getting the following error, and I don't know how to fix it:
ValueError Traceback (most recent call last)
<ipython-input-18-4e982bc70a7f> in <cell line: 37>()
35 #model.fit(padded_to_distribute_sequences, padded_category_sequences, epochs=10, batch_size=32, validation_split=0.2)
36 #model.fit(np.array(data_list), np.array(y), verbose=0, epochs=100)
---> 37 model.fit(np.array(padded_to_distribute_sequences), np.array(padded_category_sequences), verbose=0, epochs=100)
1 frames
/usr/local/lib/python3.10/dist-packages/keras/src/engine/data_adapter.py in _check_data_cardinality(data)
1958 )
1959 msg += "Make sure all arrays contain the same number of samples."
-> 1960 raise ValueError(msg)
1961
1962
ValueError: Data cardinality is ambiguous:
x sizes: 3549
y sizes: 36
Make sure all arrays contain the same number of samples.
I've tried changing lines of code based on recommendations from websites and forums, but nothing has helped so far. I would be glad of your help!
I'm writing the code in Google Colab.
Unfortunately, I can't share the original files I'm using, since they contain personal data, but I can give a brief summary so that the logic of my actions is clear. I've attached it above in the description.
I think the problem is that the target data padded_category_sequences and the input data padded_to_distribute_sequences have different numbers of samples, which causes the ValueError.
Add this after the "Data preprocessing" step:
target_data = np.tile(padded_category_sequences, (len(padded_to_distribute_sequences) // len(padded_category_sequences), 1))
I am assuming that padded_category_sequences is your target data.
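Either way, it's worth printing the shapes right before model.fit to confirm that both arrays have the same number of samples (a quick sanity check, using the variable names from your code):
import numpy as np

x = np.array(padded_to_distribute_sequences)
y = np.array(target_data)        # or padded_category_sequences, before tiling

print(x.shape)   # your traceback reports 3549 samples here
print(y.shape)   # model.fit needs the same value in the first dimension here

# Keras raises "Data cardinality is ambiguous" whenever x.shape[0] != y.shape[0]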