python pandas dataframe nlp nltk

NLP pre-processing on two columns in data frame gives error


I have the following data frame:

gmeDateDf.head(2)
title                                              score  id      url                              comms_num  body  timestamp
It's not about the money, it's about sending a...  55.0   l6ulcx  https://v.redd.it/6j75regs72e61  6.0        NaN   2021-01-28 21:37:41
Math Professor Scott Steiner says the numbers ...  110.0  l6uibd  https://v.redd.it/ah50lyny62e61  23.0       NaN   2021-01-28 21:32:10

I have the following function to pre-process the text (with the proper libraries imported and so on):

def preprocess_text(text):
  # Tokenize words
  tokens = word_tokenize(text.lower())

  # Remove stopwords and non-alphabetic words, and lemmatize
  processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]

  return processed_tokens

Then calling it on specific column:

gmeDateDf.loc[:, 'body'] = gmeDateDf['body'].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf['body'].apply(preprocess_text)

That works properly as expected. However, when I try to do it on two columns, like so:

gmeDateDf.loc[:, 'title','body'] = gmeDateDf['title', 'body'].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf['title', 'body'].apply(preprocess_text)

I get the following error:

   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: ('title', 'body')

I’ve looked around and asked ChatGPT for help, but I can’t figure it out.

Please bear with me, as I’m still learning the basics of Python.

Why can’t I give it a listlike key? And why is it a listlike key only when I have the two columns? When I call it only on gmeDateDf.loc[:, 'body'] it is some kind of listlike key, no? So why would it not work otherwise?

I’m confused, and don’t even know where to look to see what I’m doing wrong now.


Solution

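  • As the error suggests, gmeDateDf['title', 'body'] looks for a single column whose key is the tuple ('title', 'body'), because Python packs comma-separated values inside square brackets into one tuple. No column in your DataFrame has that name, so the lookup raises a KeyError. The "listlike key" text in the traceback is only a comment from the pandas source code that is displayed as context; it does not describe your key.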

    If you wish to select multiple columns at once, you need to provide them in a list, like so: gmeDateDf[['title', 'body']]. For more information, head to the documentation page on data selection from a DataFrame.

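    For your specific example, select the columns with a list (the same applies inside .loc, i.e. gmeDateDf.loc[:, ['title', 'body']]), and then apply preprocess_text to each cell. DataFrame.apply passes a whole column (a Series) to the function, while preprocess_text expects a single string, so map it over the values of each column:

    cleaned = gmeDateDf[['title', 'body']].fillna('NaN').astype(str)
    gmeDateDfProcessed = cleaned.apply(lambda col: col.map(preprocess_text))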
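
To see the difference between the tuple key and the list key in isolation, here is a minimal sketch on a toy DataFrame. The data is made up, and simple_preprocess is a hypothetical stand-in for preprocess_text that only lowercases and splits on whitespace, so the snippet runs without any NLTK downloads. It is an illustration under those assumptions, not the exact pipeline from the question.

```python
import pandas as pd

# Toy frame standing in for gmeDateDf (values are invented for illustration).
df = pd.DataFrame({
    "title": ["Hold the line", "Numbers do not lie"],
    "body": [None, "To the Moon"],
})


def simple_preprocess(text):
    # Hypothetical stand-in for preprocess_text: lowercase, split on
    # whitespace, keep purely alphabetic tokens. No NLTK data is required.
    return [word for word in text.lower().split() if word.isalpha()]


# df["title", "body"] is the same as df[("title", "body")]: a single tuple key.
# No column has that name, so pandas raises a KeyError.
try:
    df["title", "body"]
except KeyError as err:
    tuple_key_error = err

# A list selects several columns and returns a DataFrame.
cols = ["title", "body"]
cleaned = df[cols].fillna("").astype(str)

# DataFrame.apply passes each column to the function as a Series, so map the
# per-string function over the values of every column.
processed = cleaned.apply(lambda col: col.map(simple_preprocess))
print(processed)
```

Each cell of processed holds a list of tokens, and the column names are unchanged. Swapping simple_preprocess for the real preprocess_text should work the same way, assuming the NLTK resources it relies on are already downloaded.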