I have the following sample data frame with a 'problem_definition' column:
ID   problem_definition
1    cat, dog fish
2    turtle; cat; fish fish
3    hello book fish
4    dog hello fish cat
I want to word tokenize the 'problem_definition' column.
Below is my code:
from nltk.tokenize import sent_tokenize, word_tokenize
import pandas as pd
df = pd.read_csv('log_page_nlp_subset.csv')
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)
The code above gives me the following error:
TypeError: expected string or bytes-like object
There is probably a non-string-like object (such as NaN) in your actual df['problem_definition'] which is not shown in the data you posted. (The examples below use a column named 'TEXT' in place of your 'problem_definition'.) Here is how you might find the problematic values:
mask = df['TEXT'].apply(lambda item: isinstance(item, (str, bytes)))  # boolean Series: True where the value is a string
print(df.loc[~mask])  # rows whose value is not a string (e.g. NaN)
If you wish to remove these rows, you could use
df = df.loc[mask]
Or, as PineNuts0 points out, the entire column can be coerced to str dtype using
df['TEXT'] = df['TEXT'].astype(str)
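One caveat with the astype(str) route: it converts NaN to the literal string 'nan', so those rows are kept and end up tokenized as ['nan'] rather than being dropped. A minimal, self-contained sketch of that behaviour:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize

s = pd.Series(['cat, dog fish', np.nan])
print(s.astype(str).apply(word_tokenize))
# 0    [cat, ,, dog, fish]
# 1                  [nan]
# dtype: object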
For example, if there is a NaN value in df['TEXT']:
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'TEXT': ['cat, dog fish',
                            'turtle; cat; fish fish',
                            'hello book fish',
                            np.nan]})
# ID TEXT
# 0 1 cat, dog fish
# 1 2 turtle; cat; fish fish
# 2 3 hello book fish
# 3 4 NaN
# df['TEXT'].apply(word_tokenize) raises:
# TypeError: expected string or bytes-like object
mask = df['TEXT'].apply(lambda item: isinstance(item, (str, bytes)))
df = df.loc[mask]  # keep only the rows with string values
# ID TEXT
# 0 1 cat, dog fish
# 1 2 turtle; cat; fish fish
# 2 3 hello book fish
and now applying word_tokenize works:
In [108]: df['TEXT'].apply(word_tokenize)
Out[108]:
0 [cat, ,, dog, fish]
1 [turtle, ;, cat, ;, fish, fish]
2 [hello, book, fish]
Name: TEXT, dtype: object
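Translating that back to your own data (assuming the offending values are indeed in 'problem_definition'; adjust the column name if not), the fix would look roughly like this:
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_csv('log_page_nlp_subset.csv')
# keep only the rows whose 'problem_definition' is an actual string
mask = df['problem_definition'].apply(lambda item: isinstance(item, (str, bytes)))
df = df.loc[mask]
df['problem_definition_tokenized'] = df['problem_definition'].apply(word_tokenize)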