NLP pre-processing on two columns in data frame gives error

I have the following data frame:

gmeDateDf.head(2)

title	score	id	url	comms_num	body	timestamp
It's not about the money, it's about sending a...	55.0	l6ulcx	https://v.redd.it/6j75regs72e61	6.0	NaN	2021-01-28 21:37:41
Math Professor Scott Steiner says the numbers ...	110.0	l6uibd	https://v.redd.it/ah50lyny62e61	23.0	NaN	2021-01-28 21:32:10

I have the following function to pre-process the text (with the proper libraries imported and so on):

def preprocess_text(text):
  # Tokenize words
  tokens = word_tokenize(text.lower())

  # Remove stopwords and non-alphabetic words, and lemmatize
  processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]

  return processed_tokens

Then calling it on specific column:

gmeDateDf.loc[:, 'body'] = gmeDateDf['body'].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf['body'].apply(preprocess_text)

That works properly as expected. However, when I try to do it on two columns, like so:

gmeDateDf.loc[:, 'title','body'] = gmeDateDf['title', 'body'].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf['title', 'body'].apply(preprocess_text)

I get the following error:

   3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:
-> 3804                 raise KeyError(key) from err
   3805             except TypeError:
   3806                 # If we have a listlike key, _check_indexing_error will raise

KeyError: ('title', 'body')

I’ve looked around and asked ChatGPT for some help, but I can’t figure it out.

Please bear with me, as I’m still learning the basics of Python.

Why can’t I give it a listlike key? And why is it a listlike key only when I have the two columns? When I call it only on gmeDateDf.loc[:, 'body'] it is some kind of listlike key, no? So why would it not work otherwise?

I’m confused, and don’t even know where to look to see what I’m doing wrong now.

Solution

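As the error suggests, gmeDateDf['title', 'body'] attempts to find a single column in the DataFrame under the key ('title', 'body'). Python packs the comma-separated values inside the square brackets into one tuple, and pandas treats that tuple as one column label. No column in your DataFrame is called that, so the lookup raises a KeyError. By contrast, gmeDateDf['body'] passes a plain string, which is a single column label rather than a list-like key, and that is why the one-column version works.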

If you wish to select multiple columns at once, you need to provide them in a list, like so: gmeDateDf[['title', 'body']]. For more information, head to the documentation page on data selection from a DataFrame.
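To make the difference concrete, here is a minimal sketch on a made-up two-row DataFrame (the data and the variable names `df`, `subset`, `single` and `tuple_key_failed` are only for illustration, not taken from your data):

```python
import pandas as pd

# Toy data standing in for the real DataFrame (illustrative only)
df = pd.DataFrame({
    "title": ["First post", "Second post"],
    "body": [None, "Some text"],
})

# df["title", "body"] is the same as df[("title", "body")]: one tuple key,
# which pandas looks up as a single column label.
try:
    df["title", "body"]
    tuple_key_failed = False
except KeyError as err:
    tuple_key_failed = True
    print("KeyError:", err)

# A list inside the brackets selects several columns and returns a DataFrame.
subset = df[["title", "body"]]
print(type(subset).__name__, list(subset.columns))

# A single string selects one column and returns a Series.
single = df["body"]
print(type(single).__name__)
```

The same rule applies to .loc: the column part has to be a list, as in `df.loc[:, ["title", "body"]]`.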

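Given your specific example, you will need to fix the data selection in both lines and then apply the function to each cell. preprocess_text expects a single string, but DataFrame.apply passes a whole column (a Series) to the function, so map it over the values of each column, something like:

gmeDateDf.loc[:, ['title', 'body']] = gmeDateDf[['title', 'body']].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf[['title', 'body']].apply(lambda col: col.map(preprocess_text))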
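If you want to try the cell-by-cell pattern without NLTK, the sketch below should run on its own. Note the assumptions: `preprocess_text` here is a simplified stand-in that only lowercases, splits on whitespace and keeps alphabetic words (no stopword removal or lemmatization), and the DataFrame contents are invented.

```python
import pandas as pd


def preprocess_text(text):
    # Simplified stand-in for the NLTK-based function in the question:
    # lowercase, split on whitespace, keep alphabetic words only.
    return [word for word in text.lower().split() if word.isalpha()]


# Toy data (illustrative only); None plays the role of a missing body.
df = pd.DataFrame({
    "title": ["Math professor says the numbers", "Another title"],
    "body": [None, "Some body text here"],
})

cols = ["title", "body"]

# Replace missing values with an empty string and make sure every cell is a str.
cleaned = df[cols].fillna("").astype(str)

# DataFrame.apply hands each column (a Series) to the lambda;
# Series.map then calls preprocess_text once per cell.
processed = cleaned.apply(lambda col: col.map(preprocess_text))
print(processed)
```

One design choice here: missing values are filled with an empty string rather than 'NaN'. With 'NaN', the lowercased word 'nan' is alphabetic, so it would likely survive the filter and end up as a token unless your stopword list happens to contain it.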