I want to remove a whole tweet (that is, a row of the data frame) if it contains any non-English word. My data frame looks like this:
text
1 | morning why didnt i go to sleep earlier oh well im seEING DNP TODAY!!
JIP UHH <f0><U+009F><U+0092><U+0096><f0><U+009F><U+0092><U+0096>
2 | @natefrancis00 @SimplyAJ10 <f0><U+009F><U+0098><U+0086><f0><U+009F
<U+0086> if only Alan had a Twitter hahaha
3 | @pchirsch23 @The_0nceler @livetennis Whoa whoa let’s not take this too
far now
4 | @pchirsch23 @The_0nceler @livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president
The expected dataframe should be like this:
text
3 | @pchirsch23 @The_0nceler @livetennis Whoa whoa let’s not take this too
far now
4 | @pchirsch23 @The_0nceler @livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president.
You want to preserve the alphanumeric characters along with some punctuation, such as @ and !.
If the unwanted content in your column consists mainly of <unicode> escapes, then this should do.
For a data frame df with a text column, using grep:
# keep only the strings that contain no angle-bracket escape such as <f0> or <U+009F>
new_str <- grep("<[^>]+>", df$text, value = TRUE, invert = TRUE)
new_str[new_str != ""]
To put the result back into your original text column, you can work with the row indices and set the unwanted rows to NA:
# indices of the rows that contain an escape such as <f0> or <U+009F>
idx <- grep("<[^>]+>", df$text)
df$text[idx] <- NA
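If the goal is to drop those rows entirely rather than blank them out, a logical index from grepl can do it in one step. This is only a minimal sketch: it assumes the emoji have already been converted to literal <f0> / <U+009F> style text, as in the question. The sample data frame and the names has_escape and df_clean are made up for illustration.

```r
# Hypothetical sample that mirrors a few rows from the question
df <- data.frame(
  text = c(
    "JIP UHH <f0><U+009F><U+0092><U+0096>",
    "@pchirsch23 Well Pat that's just not true",
    "One word #Shame on you! #Ji allowing looters to become president"
  ),
  stringsAsFactors = FALSE
)

# TRUE for every row that contains an angle-bracket escape
has_escape <- grepl("<[^>]+>", df$text)

# Keep only the rows without escapes; drop = FALSE keeps the result a data frame
df_clean <- df[!has_escape, , drop = FALSE]
```

Using a logical vector instead of negative integer indices also behaves sensibly when no row matches, since nothing is dropped in that case.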
To clean the tweet text itself, you can use the gsub function. Refer to the post "cleaning tweet in R".
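As a rough illustration of the gsub route (not necessarily what the linked post does), the escapes can be stripped from a tweet instead of discarding the whole row. The sample string and the variable names below are assumptions made for the sketch.

```r
# Hypothetical tweet with escaped emoji in the middle
tweet <- "if only Alan had a Twitter <f0><U+009F><U+0098><U+0086> hahaha"

# Remove every angle-bracket escape such as <f0> or <U+009F>
no_escapes <- gsub("<[^>]+>", "", tweet)

# Collapse the leftover runs of whitespace and trim both ends
cleaned <- trimws(gsub("[[:space:]]+", " ", no_escapes))
```

Note that this pattern removes anything wrapped in angle brackets, so it would also strip legitimate text such as "<3" followed later by ">".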