Tags: python, pandas, dataframe, nltk, data-analysis

Group rows in Pandas dataframe, apply custom function and store results in a new dataframe as rows


I have a pandas dataframe df_org with three columns - Index (integer), Titles (string) and Dates (date).


I have a method process_title(text), which takes a string as input, tokenizes it, lemmatizes the tokens, removes stop words and returns the remaining words as a list.

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def process_title(text):
    # Tokenize the lower-cased title
    tokens = word_tokenize(text.lower())
    # Drop unwanted tokens; filtering avoids the problem that a bare
    # try/except around list.remove() stops removing as soon as one of
    # them is missing from the title
    tokens = [t for t in tokens if t not in ("google", "search", "-")]
    # Lemmatize each token, then drop English stop words
    lemm_tokens = [lemmatizer.lemmatize(t) for t in tokens]
    without_stop = [word for word in lemm_tokens if word not in stop_words]
    return without_stop
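
For example (a made-up title of my own, and assuming the usual NLTK data for punkt, wordnet and stopwords has been downloaded), the function would return:

process_title("Google Search - Computer Science")
# -> ['computer', 'science']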

I want a new dataframe with three columns: Word (string), Frequency (integer) and Date (date). The Word column contains single words from the lists returned by process_title(text), the Frequency column contains how many times that word appears in the titles on a given date, and the Date column contains that date.

    ---------------------------------------
    |  Word    | Frequency     |   Date   |
    ---------------------------------------
    | computer | 1             |2021-08-01|
    | science  | 1             |2021-08-01|
    | something| 5             |2021-08-02|
.....

How can I group the df_org dataframe by date and create the new dataframe? Changes can be made to the process_title(text) method as long as the end result stays the same.


Solution

  • You can use the DataFrame.explode method, followed by groupby and size:

    For the demo below I am going to use a simple .str.split() instead of your process_title function, so the example runs without the NLTK setup; the approach is the same either way.

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame({'title': ['Hello World', 'Foo Bar'], 'date': ['2021-01-12T20:00', '2021-02-10T22:00']})
    
    In [3]: df['words'] = df['title'].str.split()
    
    In [4]: df
    Out[4]:
             title              date           words
    0  Hello World  2021-01-12T20:00  [Hello, World]
    1      Foo Bar  2021-02-10T22:00      [Foo, Bar]
    
    In [5]: exploded = df.explode('words')
    
    In [6]: exploded
    Out[6]:
             title              date  words
    0  Hello World  2021-01-12T20:00  Hello
    0  Hello World  2021-01-12T20:00  World
    1      Foo Bar  2021-02-10T22:00    Foo
    1      Foo Bar  2021-02-10T22:00    Bar
    
    In [7]: exploded.groupby(['date', 'words']).size()
    Out[7]:
    date              words
    2021-01-12T20:00  Hello    1
                      World    1
    2021-02-10T22:00  Bar      1
                      Foo      1
    dtype: int64
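
    To get the exact Word / Frequency / Date dataframe the question asks for, you can reset the index of that result and rename the columns. This is a minimal sketch continuing the session above; the column names are simply the ones requested in the question:

    In [8]: counts = exploded.groupby(['date', 'words']).size().reset_index(name='Frequency')

    In [9]: counts.rename(columns={'words': 'Word', 'date': 'Date'})
    Out[9]:
                   Date   Word  Frequency
    0  2021-01-12T20:00  Hello          1
    1  2021-01-12T20:00  World          1
    2  2021-02-10T22:00    Bar          1
    3  2021-02-10T22:00    Foo          1

    If your Dates column holds timestamps rather than plain dates, you may want to group on pd.to_datetime(df['date']).dt.date instead of the raw column so that all titles from the same day are counted together.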