My goal is to train a classifier able to do sentiment analysis in Slovak language using loaded SlovakBert model and HuggingFace library. Code is executed on Google Colaboratory.
My test dataset is read from this csv file: https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv
and train dataset: https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv
Data has two columns: column of Slovak sentences and 2nd column of labels which indicate sentiment of the sentence. Labels have values -1, 0 or 1.
Load_dataset() function throws this error:
ValueError: Couldn't cast Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string -1: int64 -- schema metadata -- pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 954 to {'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)} because column names don't match
Code:
!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq
from re import M
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'
model_name = 'gerulata/slovakbert'
#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})
What is done wrong while loading the dataset?
The reason is since delimiter is used in first column multiple times the code fails to automatically determine number of columns ( some time segment a sentence into multiple columns as it cannot automatically determine ,
is a delimiter or a part of sentence.
But, the solution is simple: (just add column names)
dataset = load_dataset('csv', data_files={'train': train,'test':test},column_names=['sentence','label'])
output:
DatasetDict({
train: Dataset({
features: ['sentence', 'label'],
num_rows: 89
})
test: Dataset({
features: ['sentence', 'label'],
num_rows: 91
})
})