Tags: python, r, regex, pandas, tidytext

Regular Expression Behavior in R unnest_tokens() vs. Python pandas str.split()


I want to replicate a result similar to df_long below using Python pandas. This is the R code:

library(dplyr)
library(tidytext)

df <- data.frame("id" = 1, "author" = 'trump', "Tweet" = "RT @kin2souls: @KimStrassel Anyone that votes")

unnest_regex  <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

df_long <- df %>%
  unnest_tokens(
    word, Tweet, token = "regex", pattern = unnest_regex)

If I understand correctly, the unnest_regex is written in a way that it also matches numbers (along with whitespace and a few punctuation marks). I don't get why R would treat a number in a string, for example the "2" in "@kin2souls", as a non-match. As a result, df_long keeps "@kin2souls" as a row of its own. However, when I try to replicate this in pandas:

unnest_regex = r"([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

df_long = df.assign(word=df['Tweet'].str.split(unnest_regex)).explode('word')
df_long.drop("Tweet", axis=1, inplace=True)

It splits the "@kin2souls" string into "@kin" and "souls" as separate rows. Furthermore, since the unnest_regex uses capturing parentheses, which Python's split keeps in the output, I modify it to:

unnest_regex = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"

This is to avoid empty strings in the result. I wonder if that is also a contributing factor. However, the split at "2" still happens. Could anyone propose a solution in Python and potentially explain why R behaves this way? Thank you!
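To illustrate the capturing-group point in isolation (a minimal example, separate from my data): Python's re.split returns captured delimiters as extra list elements, and adjacent delimiters produce empty strings either way.

```python
import re

# A capturing group makes re.split keep each delimiter in the output list.
print(re.split(r"(\s)", "a b"))  # ['a', ' ', 'b']

# Without a group, delimiters are dropped, but adjacent delimiters
# still leave empty strings behind.
print(re.split(r"\s", "a  b"))   # ['a', '', 'b']
```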

Here's the data in Python:

data = {'id':[1], "author":["trump"], "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)

And the expected result:

data_long = {'id':[1,1,1,1,1,1], "author":["trump","trump","trump","trump","trump","trump"], "word": ["rt", "@kin2souls", "@kimstrassel", "anyone", "that", "votes"]}
df_long = pd.DataFrame.from_dict(data_long)

Solution

  • A combination of str.split and explode should replicate your output:

    (df
     .assign(Tweet=df.Tweet.str.lower().str.split(r"[:\s]"))
     .explode("Tweet")
     .query('Tweet != ""')
     .reset_index(drop=True)
    )
    
    
        id  author  Tweet
    0   1   trump   rt
    1   1   trump   @kin2souls
    2   1   trump   @kimstrassel
    3   1   trump   anyone
    4   1   trump   that
    5   1   trump   votes
    

    I took advantage of the fact that the text is delimited by spaces and the occasional colon (:).

    Alternatively, you could use str.extractall - I feel it is a bit longer, though:

    (
        df.set_index(["id", "author"])
        .Tweet.str.lower()
        .str.extractall(r"\s*([a-z@\d]+)[:\s]*")
        .droplevel(-1)
        .rename(columns={0: "Tweet"})
        .reset_index()
    )
    

    Not sure how unnest_tokens works with regex - maybe someone else can resolve that.
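As for why the two languages behave differently, one likely cause (my reading, worth double-checking) is string escaping rather than the regex engines: in an R string literal, "\\d" produces the two characters \d, i.e. the digit class, so the negated class excludes digits. In a Python raw string, r"\\d" stays as a literal backslash plus the letter d, so digits are no longer excluded and "2" becomes a split point. Dropping one backslash in the Python pattern should then reproduce the R output - a sketch, reusing the question's df:

```python
import pandas as pd

# Same pattern as the R code, but with a single backslash before "d":
# r"\d" in Python is the digit class that R's "\\d" denotes, whereas
# r"\\d" in a raw string is a literal backslash followed by "d".
unnest_regex = r"[^A-Za-z_\d#@']|'(?![A-Za-z_\d#@])"

df = pd.DataFrame({"id": [1], "author": ["trump"],
                   "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]})

df_long = (
    df.assign(word=df["Tweet"].str.lower().str.split(unnest_regex))
      .explode("word")
      .query('word != ""')   # drop empties produced by adjacent delimiters
      .drop(columns="Tweet")
      .reset_index(drop=True)
)
print(df_long["word"].tolist())
# ['rt', '@kin2souls', '@kimstrassel', 'anyone', 'that', 'votes']
```

The .str.lower() mirrors unnest_tokens(), which lowercases by default (to_lower = TRUE).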