I have a pandas dataframe df_org with three columns - Index (integer), Titles (string) and Dates (date).
I have a method process_title(text), which takes a string as input and tokenize, remove stop words and lemmatize the input string and returns the words as a list.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def process_title(text):
tokens = word_tokenize(text.lower())
try:
tokens.remove("google")
tokens.remove("search")
tokens.remove("-")
except:
pass
lemm_tokens = list(map(lemmatizer.lemmatize,tokens))
without_stop = [word for word in lemm_tokens if word not in stop_words]
return without_stop
I want a new dataframe which contains three columns - Word(string), Frequency(integer), Date(date). The Word column contain words(single word) from the list returned by process_title(text), the Frequency column contains the frequency of that word appearing on a given date and Date column contains the date.
---------------------------------------
| Word | Frequency | Date |
---------------------------------------
| computer | 1 |2021-08-01|
| science | 1 |2021-08-01|
| something| 5 |2021-08-02|
.....
How can I group the df_org dataframe along date and create the new dataframe? Changes can be made to the process_title(text) method without compromising the end requirement.
You can use the DataFrame.explode
method, followed by groupby
and size
:
I am going to just use a simple .str.split
instead of your function, as I don't know where word_tokenize
comes from.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'title': ['Hello World', 'Foo Bar'], 'date': ['2021-01-12T20:00', '2021-02-10T22:00']})
In [3]: df['words'] = df['title'].apply(lambda s: process_title(str(s)))
In [4]: df
Out[4]:
title date words
0 Hello World 2021-01-12T20:00 [Hello, World]
1 Foo Bar 2021-02-10T22:00 [Foo, Bar]
In [5]: exploded = df.explode('words')
In [6]: exploded
Out[6]:
title date words
0 Hello World 2021-01-12T20:00 Hello
0 Hello World 2021-01-12T20:00 World
1 Foo Bar 2021-02-10T22:00 Foo
1 Foo Bar 2021-02-10T22:00 Bar
In [7]: exploded.groupby(['date', 'words']).size()
Out[7]:
date words
2021-01-12T20:00 Hello 1
World 1
2021-02-10T22:00 Bar 1
Foo 1
dtype: int64