I have a pandas dataframe that contains a list of articles: the outlet, publish date, link, etc. One of the columns in this dataframe is a list of keywords. For example, each cell in the keyword column contains a list like [drop, right, states, laws].
My ultimate goal is to count the number of occurrences of each unique word on each day. The challenge that I'm having is breaking the keywords out of their lists and then matching them to the date on which they occurred. ...assuming this is even the most logical first step.
At present I have a solution in the code below, but I'm new to Python and still approach these things with an Excel mindset. The code below works, but it's very slow. Is there a faster way to do this?
# Create a list of the keywords for articles in the last 30 days to determine their quantity
keyword_list = stories_full_recent_df['Keywords'].tolist()
keyword_list = [item for sublist in keyword_list for item in sublist]

# Create a blank dataframe and new iterator to write the keyword appearances to
wordtrends_df = pd.DataFrame(columns=['Captured_Date', 'Brand', 'Coverage', 'Keyword'])
r = 0

print("Creating table on keywords: {:,}".format(len(keyword_list)))
print(time.strftime("%H:%M:%S"))

# Write the keywords out into their own rows with the dates and origins in which they occur
while r < len(keyword_list):
    for i in stories_full_recent_df.index:
        words = stories_full_recent_df.loc[i]['Keywords']
        for word in words:
            wordtrends_df.loc[r] = [stories_full_recent_df.loc[i]['Captured_Date'],
                                    stories_full_recent_df.loc[i]['Brand'],
                                    stories_full_recent_df.loc[i]['Coverage'],
                                    word]
            r += 1

print(time.strftime("%H:%M:%S"))
print("Keyword compilation complete.")
Once I have each word on its own row, I'm simply using .groupby() to figure out the number of occurrences each day.
# Group and count the keywords and days to find the day with the least of each word
test_min = wordtrends_df.groupby(['Keyword', 'Captured_Date'], as_index=False).count() \
                        .sort_values(by=['Keyword', 'Brand'], ascending=True)
keyword_min = test_min.groupby(['Keyword'], as_index=False).first()
At present there are around 100,000 words in this list and it takes about an hour to run through it. I'd love thoughts on a faster way to do this.
I think you can get the expected result by doing this:
wordtrends_df = pd.melt(pd.concat((stories_full_recent_df[['Brand', 'Captured_Date', 'Coverage']],
                                   stories_full_recent_df.Keywords.apply(pd.Series)), axis=1),
                        id_vars=['Brand','Captured_Date','Coverage'], value_name='Keyword')\
                  .drop(['variable'], axis=1).dropna(subset=['Keyword'])
An explanation with a small example below.
Consider an example dataframe:
df = pd.DataFrame({'Brand': ['X', 'Y'],
                   'Captured_Date': ['2017-04-01', '2017-04-02'],
                   'Coverage': [10, 20],
                   'Keywords': [['a', 'b', 'c'], ['c', 'd']]})

#   Brand Captured_Date  Coverage   Keywords
# 0     X    2017-04-01        10  [a, b, c]
# 1     Y    2017-04-02        20     [c, d]
The first thing you can do is expand the Keywords column so that each keyword occupies its own column:
a = df.Keywords.apply(pd.Series)
# 0 1 2
# 0 a b c
# 1 c d NaN
Concatenate this with the original df, minus the Keywords column:
b = pd.concat((df[['Captured_Date','Brand','Coverage']],a),axis=1)
# Captured_Date Brand Coverage 0 1 2
# 0 2017-04-01 X 10 a b c
# 1 2017-04-02 Y 20 c d NaN
Melt this last result to create a row per keyword:
c = pd.melt(b,id_vars=['Captured_Date','Brand','Coverage'],value_name='Keyword')
# Captured_Date Brand Coverage variable Keyword
# 0 2017-04-01 X 10 0 a
# 1 2017-04-02 Y 20 0 c
# 2 2017-04-01 X 10 1 b
# 3 2017-04-02 Y 20 1 d
# 4 2017-04-01 X 10 2 c
# 5 2017-04-02 Y 20 2 NaN
Finally, drop the useless variable column and drop rows where Keyword is missing:
d = c.drop(['variable'],axis=1).dropna(subset=['Keyword'])
# Captured_Date Brand Coverage Keyword
# 0 2017-04-01 X 10 a
# 1 2017-04-02 Y 20 c
# 2 2017-04-01 X 10 b
# 3 2017-04-02 Y 20 d
# 4 2017-04-01 X 10 c
Now you're ready to count by keywords and dates.
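For example, on the small frame above, something like this should give the number of occurrences of each keyword on each day (the Count column name is just an illustrative choice):

counts = d.groupby(['Keyword','Captured_Date']).size().reset_index(name='Count')
#   Keyword Captured_Date  Count
# 0       a    2017-04-01      1
# 1       b    2017-04-01      1
# 2       c    2017-04-01      1
# 3       c    2017-04-02      1
# 4       d    2017-04-02      1

As a side note, if your pandas version has DataFrame.explode, df.explode('Keywords') performs the list-to-rows step in a single call and can replace the apply(pd.Series)/melt combination.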