I have a table of about 30,000 rows and need to extract non-English words from a column named outcome in a dummy_df dataframe. I need to put the non-English words in an adjacent column named non_english.
Here is some dummy data:
import pandas
dummy_df = pandas.DataFrame({'outcome': ["I want to go to church", "I love Matauranga", "Take me to Oranga Tamariki"]})
My idea is to extract non-English words from a sentence, and then iterate the process over a dataframe. I was able to accurately extract non-English words from a sentence with this code:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "I love Matauranga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
         if not w.lower() in words or not w.isalpha())
The result of the above code is 'Matauranga', which is perfectly correct.
But when I try to iterate the code over a dataframe using this code:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
             if not w.lower() in words or not w.isalpha())
dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)
I get an undesirable result: the non_english column contains None instead of the desired non-English words (see below):
                      outcome non_english
0      I want to go to church        None
1           I love Matauranga        None
2  Take me to Oranga Tamariki        None
3                                    None
Instead, the desired result should be:
                      outcome      non_english
0      I want to go to church
1           I love Matauranga       Matauranga
2  Take me to Oranga Tamariki  Oranga Tamariki
You are missing the return in your function, so it implicitly returns None for every row. With the return added:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
    words = set(nltk.corpus.words.words())
    return " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
                    if not w.lower() in words or not w.isalpha())
dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)
output:
                      outcome      non_english
0      I want to go to church
1           I love Matauranga       Matauranga
2  Take me to Oranga Tamariki  Oranga Tamariki
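To see why the version without return produced None, here is a minimal sketch that isolates the behaviour without pandas or nltk. The ENGLISH set is a tiny hypothetical stand-in for the nltk word list, and str.split is used in place of nltk.wordpunct_tokenize purely to keep the example self-contained; both are assumptions for illustration, not what the real code uses.

```python
# Hypothetical stand-in for set(nltk.corpus.words.words()); illustration only.
ENGLISH = {"i", "love", "take", "me", "to"}

def without_return(text):
    # The string is built and then discarded, so the function returns None.
    " ".join(w for w in text.split() if w.lower() not in ENGLISH)

def with_return(text):
    # Same expression, but the result is handed back to the caller.
    return " ".join(w for w in text.split() if w.lower() not in ENGLISH)

sentence = "Take me to Oranga Tamariki"
missing = without_return(sentence)
fixed = with_return(sentence)
print(missing)  # None
print(fixed)    # Oranga Tamariki
```

Since apply stores whatever the function returns for each row, a function that falls off the end without a return fills the whole column with None.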