Tags: python, r, regex, pandas, tidytext

Regular Expression Behavior in R unnest_tokens() vs. Python pandas str.split()


I want to replicate a result similar to df_long below using Python pandas. This is the R code:

library(dplyr)
library(tidytext)

df <- data.frame("id" = 1, "author" = 'trump', "Tweet" = "RT @kin2souls: @KimStrassel Anyone that votes")

unnest_regex  <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

df_long <- df %>%
  unnest_tokens(
    word, Tweet, token = "regex", pattern = unnest_regex)

If I understand correctly, the unnest_regex is written in a way that it also matches numbers (along with whitespace and a few punctuation marks). I don't get why R would treat a number in a string, for example the "2" in "@kin2souls", as a non-match. As a result, df_long keeps "@kin2souls" as a row of its own. However, when I try to replicate this in pandas:

unnest_regex = r"([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

df_long = df.assign(word=df['Tweet'].str.split(unnest_regex)).explode('word')
df_long.drop("Tweet", axis=1, inplace=True)

It splits the "@kin2souls" string into "@kin" and "souls" as separate rows. Furthermore, since the unnest_regex uses capturing parentheses, which Python's split keeps in the output, I modify it to:

unnest_regex = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"

This is to avoid empty strings in the result. I wonder if that is also a contributing factor. However, the split at "2" still happens. Could anyone propose a solution in Python and potentially explain why R behaves this way? Thank you!
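To illustrate the capturing-group point in isolation (a minimal example, separate from my data): Python's re.split returns captured delimiters as extra list elements, and adjacent delimiters produce empty strings either way.

```python
import re

# A capturing group makes re.split keep each delimiter in the output list.
print(re.split(r"(\s)", "a b"))  # ['a', ' ', 'b']

# Without a group, delimiters are dropped, but adjacent delimiters
# still leave empty strings behind.
print(re.split(r"\s", "a  b"))   # ['a', '', 'b']
```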

Here's the data in Python:

data = {'id':[1], "author":["trump"], "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)

And the expected result:

data_long = {'id':[1,1,1,1,1,1], "author":["trump","trump","trump","trump","trump","trump"], "word": ["rt", "@kin2souls", "@kimstrassel", "anyone", "that", "votes"]}
df_long = pd.DataFrame.from_dict(data_long)

Solution

  • A combination of str.split and explode should replicate your output:

    (df
     .assign(Tweet=df.Tweet.str.lower().str.split(r"[:\s]"))
     .explode("Tweet")
     .query('Tweet != ""')
     .reset_index(drop=True)
    )
    
    
        id  author  Tweet
    0   1   trump   rt
    1   1   trump   @kin2souls
    2   1   trump   @kimstrassel
    3   1   trump   anyone
    4   1   trump   that
    5   1   trump   votes
    

    I took advantage of the fact that the text is delimited by spaces and the occasional colon (:).

    Alternatively, you could use str.extractall - I feel it is a bit longer, though:

    (
        df.set_index(["id", "author"])
        .Tweet.str.lower()
        .str.extractall(r"\s*([a-z@\d]+)[:\s]*")
        .droplevel(-1)
        .rename(columns={0: "Tweet"})
        .reset_index()
    )
    

    Not sure how unnest_tokens works with regex - maybe someone else can resolve that.
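As for why the two languages behave differently, one likely cause (my reading, worth double-checking) is string escaping rather than the regex engines: in an R string literal, "\\d" produces the two characters \d, i.e. the digit class, so the negated class excludes digits. In a Python raw string, r"\\d" stays as a literal backslash plus the letter d, so digits are no longer excluded and "2" becomes a split point. Dropping one backslash in the Python pattern should then reproduce the R output - a sketch, reusing the question's df:

```python
import pandas as pd

# Same pattern as the R code, but with a single backslash before "d":
# r"\d" in Python is the digit class that R's "\\d" denotes, whereas
# r"\\d" in a raw string is a literal backslash followed by "d".
unnest_regex = r"[^A-Za-z_\d#@']|'(?![A-Za-z_\d#@])"

df = pd.DataFrame({"id": [1], "author": ["trump"],
                   "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]})

df_long = (
    df.assign(word=df["Tweet"].str.lower().str.split(unnest_regex))
      .explode("word")
      .query('word != ""')   # drop empties produced by adjacent delimiters
      .drop(columns="Tweet")
      .reset_index(drop=True)
)
print(df_long["word"].tolist())
# ['rt', '@kin2souls', '@kimstrassel', 'anyone', 'that', 'votes']
```

The .str.lower() mirrors unnest_tokens(), which lowercases by default (to_lower = TRUE).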