I have the following data frame:
gmeDateDf.head(2)
title | score | id | url | comms_num | body | timestamp |
---|---|---|---|---|---|---|
It's not about the money, it's about sending a... | 55.0 | l6ulcx | https://v.redd.it/6j75regs72e61 | 6.0 | NaN | 2021-01-28 21:37:41 |
Math Professor Scott Steiner says the numbers ... | 110.0 | l6uibd | https://v.redd.it/ah50lyny62e61 | 23.0 | NaN | 2021-01-28 21:32:10 |
I have the following function to pre-process the text (with the proper libraries imported and so on):
def preprocess_text(text):
    # Tokenize words
    tokens = word_tokenize(text.lower())
    # Remove stopwords and non-alphabetic words, and lemmatize
    processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]
    return processed_tokens
Then I call it on a specific column:
gmeDateDf.loc[:, 'body'] = gmeDateDf['body'].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf['body'].apply(preprocess_text)
That works properly as expected. However, when I try to do it on two columns, like so:
gmeDateDf.loc[:, 'title','body'] = gmeDateDf['title', 'body'].fillna('NaN').astype(str)
gmeDateDfProcessed = gmeDateDf['title', 'body'].apply(preprocess_text)
I get the following error:
3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
-> 3804 raise KeyError(key) from err
3805 except TypeError:
3806 # If we have a listlike key, _check_indexing_error will raise
KeyError: ('title', 'body')
I’ve looked around and asked ChatGPT for some help, but I can’t figure it out.
Please bear with me, as I’m still learning the basics of Python.
Why can’t I give it a listlike key? And why is it a listlike key only when I have the two columns? When I call it only on gmeDateDf.loc[:, 'body'] it is some kind of listlike key, no? So why would it not work otherwise?
I’m confused, and don’t even know where to look to see what I’m doing wrong now.
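As the error suggests, gmeDateDf['title', 'body'] attempts to find a single column stored under the key ('title', 'body'), because Python packs the comma-separated values inside the brackets into one tuple. No column in your DataFrame has that name, so the lookup raises a KeyError. A single string such as 'body' is a plain scalar key that matches an existing column, which is why the one-column version works.
If you wish to select multiple columns at once, you need to provide them in a list, like so: gmeDateDf[['title', 'body']]. The same applies to .loc: gmeDateDf.loc[:, ['title', 'body']]. For more information, head to the documentation page on data selection from a DataFrame.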
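To make the difference between the two kinds of key concrete, here is a minimal sketch on a made-up two-row frame (the name df and its contents are placeholders, not your real data). It shows that the comma-separated form is treated as one tuple key and fails, while the list form selects both columns and also works with .loc:

```python
import pandas as pd

# Hypothetical stand-in for gmeDateDf with just the two text columns.
df = pd.DataFrame({"title": ["first post", "second post"],
                   "body": ["some text", None]})

# df["title", "body"] is the same as df[("title", "body")]: one tuple key.
try:
    df["title", "body"]
    tuple_key_failed = False
except KeyError:
    tuple_key_failed = True

# A list of labels selects several columns and returns a DataFrame.
subset = df[["title", "body"]]

# The same list form works as the column indexer of .loc.
df.loc[:, ["title", "body"]] = df[["title", "body"]].fillna("NaN").astype(str)
```

The extra pair of brackets is the whole fix: the outer pair is the indexing operation, the inner pair is the list of column labels.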
Given your specific example, you will need to fix the data selection, and then apply the function to every cell of both columns, since preprocess_text expects a single string rather than a whole column. Something like:
gmeDateDfProcessed = gmeDateDf[['title', 'body']].apply(lambda col: col.apply(preprocess_text))
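As a rough, self-contained illustration of applying a per-string function to two columns, the sketch below uses a simplified stand-in for preprocess_text so that it runs without NLTK. The names simple_preprocess and STOP_WORDS, and the sample rows, are assumptions made for the example; the stand-in splits on whitespace and does not lemmatize:

```python
import pandas as pd

# Tiny hypothetical stop-word set, standing in for NLTK's list.
STOP_WORDS = {"it", "is", "about", "the"}

def simple_preprocess(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop-words."""
    return [w for w in text.lower().split()
            if w.isalpha() and w not in STOP_WORDS]

# Made-up rows; the missing body mimics the NaN values in the question.
df = pd.DataFrame({
    "title": ["It is about the money", "Math Professor says the numbers"],
    "body": [None, "Hold the line"],
})

cols = ["title", "body"]

# Fill missing values and make sure every cell is a string.
filled = df[cols].fillna("NaN").astype(str)

# DataFrame.apply passes one column (a Series) at a time to the lambda;
# Series.map then calls the function on each individual string.
processed = filled.apply(lambda col: col.map(simple_preprocess))
```

Each cell of processed then holds a list of tokens. Note that filling missing values with the string 'NaN' leaves a literal 'nan' token in those cells; filling with an empty string instead would give an empty list.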