python, pandas, dataframe, nlp, gensim

Tokenizing and summarizing textual data by group efficiently in Python


I have a dataset in Python that looks like this:

data = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'TEXT': [
        "Mouthwatering BBQ ribs cheese, and coleslaw.",
        "Delicious pizza with pepperoni and extra cheese.",
        "Spicy Thai curry with cheese and jasmine rice.",
        "Tiramisu dessert topped with cocoa powder.",
        "Sushi rolls with fresh fish and soy sauce.",
        "Freshly baked chocolate chip cookies.",
        "Homemade lasagna with layers of cheese and pasta.",
        "Gourmet burgers with all the toppings and extra cheese.",
        "Crispy fried chicken with mashed potatoes and extra cheese.",
        "Creamy tomato soup with a grilled cheese sandwich."
    ],
    'DATE': [
        '2023-02-01', '2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
        '2023-02-02', '2023-02-01', '2023-02-01', '2023-02-02', '2023-02-02'
    ]
})

What I'd like to do is group by DATE and get the frequency of each token after removing punctuation. I'm very new to the Python environment; I come from R and have been looking into the gensim library for reference, but it looks quite complicated to me. My desired output would look like this: for each group (DATE), the frequency of each unique token.

TOKEN          SUBTOTAL  DATE
cheese         5         1/02/2023
and            5         1/02/2023
with           5         1/02/2023
extra          2         1/02/2023
mouthwatering  1         1/02/2023
bbq            1         1/02/2023
ribs           1         1/02/2023
coleslaw       1         1/02/2023
delicious      1         1/02/2023
pizza          1         1/02/2023
pepperoni      1         1/02/2023

In R this can be done very easily with quanteda:

corpus_food <- corpus(data,
                  docid_field = "ID",
                  text_field = "TEXT")

corpus_food %>%
  tokens(remove_punct = TRUE) %>% 
  dfm() %>% 
  textstat_frequency(groups = lubridate::date(DATE)) 

This simply creates a corpus, tokenizes it while removing punctuation, builds a document-feature matrix, and finally summarizes the tokens and their frequencies by group.
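
For context, my rough understanding is that the same steps could be written out by hand in Python along the lines of the sketch below, using gensim's simple_preprocess for tokenization (it lowercases and strips punctuation) and collections.Counter for the counts. This is only a sketch of the logic I'm after, and it feels clunky and slow compared to the quanteda pipeline:

from collections import Counter

import pandas as pd
from gensim.utils import simple_preprocess  # lowercases, strips punctuation, drops very short/long tokens

rows = []
for date, group in data.groupby("DATE"):
    # count every token across all the texts that share this DATE
    counts = Counter(
        token
        for text in group["TEXT"]
        for token in simple_preprocess(text)
    )
    for token, n in counts.most_common():
        rows.append({"TOKEN": token, "SUBTOTAL": n, "DATE": date})

result = pd.DataFrame(rows)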

I am in no way comparing the two languages; both are amazing. At the moment I'm simply interested in a straightforward, fast way to get this result in Python. If you don't use the gensim library, I'd still be interested in any approach that achieves what I'm looking for efficiently. I'm new to Python.


Solution

  • I would simply extractall the words, then value_counts:

    out = (
        data[["DATE"]].join(
            data["TEXT"].str.extractall(r"(\w+)")[0]       # one row per word match (raw string avoids an escape warning)
                .droplevel(1).rename("TOKEN").str.lower()  # drop the match level and lowercase the tokens
        )
        .groupby(["DATE", "TOKEN"]).value_counts()         # count each (DATE, TOKEN) pair
        .reset_index(name="SUBTOTAL")
        .sort_values(["DATE", "SUBTOTAL"], ascending=[True, False])
    )
    

    Output:

    print(out)
    
              DATE   TOKEN  SUBTOTAL
    1   2023-02-01     and         5
    4   2023-02-01  cheese         5
    30  2023-02-01    with         5
    10  2023-02-01   extra         2
    0   2023-02-01     all         1
    ..         ...     ...       ...
    51  2023-02-02   sauce         1
    52  2023-02-02    soup         1
    53  2023-02-02     soy         1
    54  2023-02-02   sushi         1
    55  2023-02-02  tomato         1
    
    [57 rows x 3 columns]
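
    If you'd rather avoid extractall's MultiIndex, a variant that should give the same counts (I have only checked it against the sample data above) is to findall the tokens into lists, explode them, and let groupby/size do the aggregation:

    out2 = (
        data.assign(TOKEN=data["TEXT"].str.lower().str.findall(r"\w+"))  # list of tokens per row
            .explode("TOKEN")                              # one row per token occurrence
            .groupby(["DATE", "TOKEN"]).size()             # token counts per date
            .reset_index(name="SUBTOTAL")
            .sort_values(["DATE", "SUBTOTAL"], ascending=[True, False])
    )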