python pandas dataframe csv etl

Expected String or bytes-like object, got 'float'


I'm trying to build an ETL (extract, transform, load) pipeline in Python. I have an Amazon reviews dataset, but when I use the DataFrame.apply() method to apply a regex-based function, I get this error:

expected string or bytes-like object, got 'float'

The code I used is the following:

import pandas as pd
import pathlib
#from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

#Create the pattern for regex ETL process
pattern = re.compile(r"[\u0041-\u1EFF\s]+\s?")

def iterator_func (x):
    match = pattern.search(x[1])
    return "".join(i for i in match.groups() if i not in stop_words)


try:
    #Open the database, create a connection and upload the data to a database after the ETL process.
    with open(pathlib.Path("database\\test.csv"), encoding="utf-8") as f:
        csv_table = pd.read_csv(f, header=None)

    #Remove incorrect values from the first index, stop words and punctuation characters using regex and nltk
    csv_table[1] = csv_table.apply(iterator_func)
    csv_table[2] = csv_table[2].apply(iterator_func)

You can download and inspect the dataset here: Amazon reviews on Kaggle

I've tried iterating over each row manually, and it works, but I noticed it will have serious performance issues.

    for x in csv_table.index:
        # Drop rows whose label is neither "1" nor "2", then move on to the next row
        if csv_table.loc[x, 0] != "1" and csv_table.loc[x, 0] != "2":
            csv_table.drop(x, inplace=True, errors="ignore")
            continue
        #TODO: Create a regex function to avoid numbers, punctuation and stop words.
        temp_phrase = "".join(i for i in pattern.findall(csv_table.loc[x, 1]) if i not in stop_words)

        temp_phrase_two = "".join(i for i in pattern.findall(csv_table.loc[x, 2]) if i not in stop_words)

        csv_table.loc[x, 1] = temp_phrase

        csv_table.loc[x, 2] = temp_phrase_two

Solution

  • I converted the column to the correct type (string) and that worked fine. The error most likely comes from empty cells, which pandas reads as NaN, and NaN is a float.

    csv_table[1] = csv_table[1].astype("str")
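For anyone hitting the same error, here is a minimal sketch of where the float probably comes from and one way to clean a column element by element. It assumes a tiny made-up CSV in place of the Kaggle file, a hand-written STOP_WORDS set in place of the nltk list, and a simplified ASCII-only pattern; clean_text is a hypothetical helper, not the function from the question. Depending on the pandas version, astype("str") on its own may turn missing values into the text "nan", so the sketch fills them with empty strings first.

```python
import io
import re

import pandas as pd

# Hypothetical three-row sample standing in for the reviews CSV
# (label, title, review text). The second row has an empty title.
SAMPLE_CSV = "2,Great book,I loved it\n1,,It was not good\n2,Fine,Works as expected\n"

# Small stand-in for the nltk English stop-word list (an assumption, not the real list).
STOP_WORDS = {"i", "it", "was", "as", "not"}

# Simplified pattern: runs of ASCII letters only.
WORD_PATTERN = re.compile(r"[A-Za-z]+")


def clean_text(text: str) -> str:
    """Keep alphabetic words that are not stop words, joined by single spaces."""
    words = WORD_PATTERN.findall(text)
    return " ".join(w for w in words if w.lower() not in STOP_WORDS)


csv_table = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None)

# The empty title is read as NaN, which is a float, so the re functions reject it.
missing_title = csv_table.loc[1, 1]
print(type(missing_title), pd.isna(missing_title))

# Replace missing values with empty strings, make sure every value is a str,
# then apply the cleaning function to each element of the column.
for col in (1, 2):
    csv_table[col] = csv_table[col].fillna("").astype(str).apply(clean_text)

print(csv_table)
```

Note that Series.apply passes one cell at a time to the function, while DataFrame.apply passes a whole column (or row), so the cleaning function should be applied to csv_table[1] and csv_table[2] rather than to csv_table itself.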