I want to replicate a result similar to df_long below using Python pandas. This is the R code:
df <- data.frame("id" = 1, "author" = 'trump', "Tweet" = "RT @kin2souls: @KimStrassel Anyone that votes")
unnest_regex <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
df_long <- df %>%
unnest_tokens(
word, Tweet, token = "regex", pattern = unnest_regex)
If I understand correctly, unnest_regex is written so that it also matches digits (along with whitespace and a few punctuation marks). I don't get why R would treat a number inside a string, for example the "2" in "@kin2souls", as a non-match; as a result, df_long keeps "@kin2souls" as a single row. However, when I try to replicate this in pandas:
unnest_regex = r"([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
df_long = df.assign(word=df['Tweet'].str.split(unnest_regex)).explode('word')
df_long.drop("Tweet", axis=1, inplace=True)
It splits the "@kin2souls" string into "@kin" and "souls" as separate rows. Furthermore, since unnest_regex uses capturing parentheses, which Python's str.split would include in the output, I modified it to:
unnest_regex = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"
This is to avoid empty strings in the result; I wonder if it is also a contributing factor. However, the split at "2" still happens. Could anyone propose a solution in Python and, ideally, explain why R behaves this way? Thank you!
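A minimal check suggests where the split at "2" comes from (assuming the pattern was carried over from the R string verbatim): R parses the string literal "\\d" down to \d before the regex engine sees it, but a Python raw string passes both backslashes through, so the character class no longer excludes digits.

```python
import re

# r"\\d" inside the class is a literal backslash plus a literal "d",
# so digits are NOT excluded and "2" becomes a split point:
print(re.split(r"[^A-Za-z_\\d#@']", "@kin2souls"))  # ['@kin', 'souls']

# With a single backslash, \d is the digit class and "2" is kept:
print(re.split(r"[^A-Za-z_\d#@']", "@kin2souls"))   # ['@kin2souls']
```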
Here's the data in Python:
data = {'id':[1], "author":["trump"], "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)
And the expected result:
data_long = {'id':[1,1,1,1,1,1], "author":["trump","trump","trump","trump","trump","trump"], "word": ["rt", "@kin2souls", "@kimstrassel", "anyone", "that", "votes"]}
df_long = pd.DataFrame.from_dict(data_long)
A combination of str.split and explode should replicate your output:
(df
.assign(Tweet=df.Tweet.str.lower().str.split(r"[:\s]"))
.explode("Tweet")
.query('Tweet != ""')
.reset_index(drop=True)
)
id author Tweet
0 1 trump rt
1 1 trump @kin2souls
2 1 trump @kimstrassel
3 1 trump anyone
4 1 trump that
5 1 trump votes
I took advantage of the fact that the text is delimited by spaces and the occasional ":".
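To land exactly on the expected df_long, with a word column and no empty tokens to filter out, the same split-and-explode idea can be pointed at a new column (a small variation on the snippet above; explode needs pandas >= 0.25):

```python
import pandas as pd

data = {"id": [1], "author": ["trump"],
        "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame(data)

# Split on runs of colons/whitespace so no empty tokens are produced,
# then explode one token per row and drop the original Tweet column.
df_long = (df
           .assign(word=df.Tweet.str.lower().str.split(r"[:\s]+"))
           .drop(columns="Tweet")
           .explode("word")
           .reset_index(drop=True))
print(df_long.word.tolist())
# ['rt', '@kin2souls', '@kimstrassel', 'anyone', 'that', 'votes']
```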
Alternatively, you could use str.extractall, though I feel it is a bit longer:
(
df.set_index(["id", "author"])
.Tweet.str.lower()
.str.extractall(r"\s*([a-z@\d]+)[:\s]*")
.droplevel(-1)
.rename(columns={0: "Tweet"})
.reset_index()
)
Not sure how unnest_tokens works with regex - maybe someone else can resolve that.
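For the regex-token path itself, writing the class with \d (which is what R's string literal "\\d" becomes by the time the regex engine sees it) appears to replicate the unnest_tokens output, assuming its defaults of lowercasing and dropping empty tokens:

```python
import pandas as pd

data = {"id": [1], "author": ["trump"],
        "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame(data)

# Non-capturing version of the tidytext pattern, with \d (not \\d) so
# digits stay inside tokens like "@kin2souls".
unnest_regex = r"[^A-Za-z_\d#@']|'(?![A-Za-z_\d#@])"

df_long = (df
           .assign(word=df.Tweet.str.lower().str.split(unnest_regex))
           .drop(columns="Tweet")
           .explode("word")
           .query('word != ""')        # drop empty tokens, e.g. around ": "
           .reset_index(drop=True))
print(df_long.word.tolist())
# ['rt', '@kin2souls', '@kimstrassel', 'anyone', 'that', 'votes']
```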