
Python Count occurrences of a substring in pandas by row appending distinct string as column


Initial note: I cannot use many third-party packages, and it's highly likely I won't be able to use any you suggest. Please keep solutions to pandas, NumPy, or the Python 3.7 standard library. My end goal is a word-bubble-style graph, where word frequency is coded by categoricalInt.

Say I have a pandas data frame like:

index | categoricalInt1 | categoricalInt2 | sanitizedStrings 
0     |    -4           |    -5           |   some lowercase strings
1     |     2           |     4           |   addtnl lowercase strings here
2     |     3           |     3           |   words

Is there an easier way than iterating over every single value in sanitizedStrings to return a structure like:

index | categoricalInt1 | categoricalInt2 | sanitizedStrings | some | lowercase | strings | addtnl | here | words
0     |     -4          |    -5           |      ...         |  1   |    1      |   1     |   0    |  0  | 0
1     |      2          |     4           |      ...         |  0   |    1      |   1     |   1    |  1  | 0
2     |      3          |     3           |      ...         |  0   |    0      |   0     |   0    |  0  | 1

My overall goal is simple: count all substrings by categorical grouping. I've managed to get the strings aggregated and condensed into the categorical bins, but I'm struggling to get the counts together.

So far my code looks like:

import string

df['Comments'] = df['Comments'].str.lower()

# strip punctuation, keeping '|' as a temporary join delimiter
punct = string.punctuation.replace('|', '')
transtab = str.maketrans(dict.fromkeys(punct, ''))

df['Comments'] = '|'.join(df['Comments'].tolist()).translate(transtab).split('|')

pattern = '|'.join(r'\b{}\b'.format(w) for w in commonStrings)  # commonStrings defined elsewhere
df['SanitizedStrings'] = df['Comments'].str.replace(pattern, '', regex=True)
df = df.drop(columns='Comments')
# end splitting bad values out of strings

# group the dataframe on like categories
groupedComments = df.groupby(['categoricalInt1', 'categoricalInt2'], as_index=False, sort=False).agg(' '.join)

print(groupedComments)

Before realizing I needed to bin these strings by categoricalInt, I was using: groupedComments['SanitizedStrings'].str.split(expand=True).stack().value_counts()

If I could get that to return counts per row instead of stacking everything into a single column, I bet we'd be pretty close.
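For reference, a vectorized sketch of what the question asks for (assuming pandas >= 0.25 for Series.explode): split each string into one row per word, cross-tabulate the row index against the words to get per-row counts, then group the count columns by the categorical bins.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'sanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# One row per word, keeping the original row index
words = df['sanitizedStrings'].str.split().explode()

# Cross-tabulate row index vs. word -> per-row word counts
counts = pd.crosstab(words.index, words)

# Attach the count columns back to the original frame
result = df.join(counts)

# Counts per categorical bin: group and sum only the word columns
binned = result.groupby(['categoricalInt1', 'categoricalInt2'])[list(counts.columns)].sum()
```

Unlike get_dummies, crosstab counts repeated words within a row rather than just flagging presence.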


Solution

  • This isn't a particularly elegant solution, and I'm not sure how much data you're working with, but you could use an apply function to add the additional columns.

    After reading your comment, it seems you also want to group by your categorical columns.

    This can be achieved by also tweaking the columns you create.

    import pandas as pd
    import numpy as np

    ## Make test data
    string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
    categorical_int = [-4, 3, 2]
    df = pd.DataFrame(zip(categorical_int, string_list), columns=['categoricalInt1', 'sanitizedStrings'])

    # create apply function
    def add_cols(row):
        col_dict = {}
        new_cols = row['sanitizedStrings'].split(' ')
        for col in new_cols:
            if col not in col_dict:
                col_dict[col] = 1
            else:
                col_dict[col] += 1
        for key, value in col_dict.items():
            # add _ so we can query these columns later
            row['_' + key] = value
        return row

    # run apply function on dataframe
    final_df = df.apply(add_cols, axis=1).fillna(0)
    final_df
    
       _addtnl  _here  _lowercase  _some  _string  _strings  _words  \
    0      0.0    0.0         1.0    1.0      1.0       0.0     0.0   
    1      1.0    1.0         1.0    0.0      0.0       1.0     0.0   
    2      0.0    0.0         0.0    0.0      0.0       0.0     1.0   
    
       categoricalInt1               sanitizedStrings  
    0               -4          some lowercase string  
    1                3  addtnl lowercase strings here  
    2                2                          words 
    
    #add the group by and sum
    final_group = final_df.groupby(['categoricalInt1'])[[col for col in final_df.columns if col.startswith('_')]].sum()
    final_group.columns = [col.replace('_','') for col in final_group.columns]
    final_group
    
    
                     addtnl  here  lowercase  some  string  strings  words
    categoricalInt1                                                       
    -4                  0.0   0.0        1.0   1.0     1.0      0.0    0.0
     2                  0.0   0.0        0.0   0.0     0.0      0.0    1.0
     3                  1.0   1.0        1.0   0.0     0.0      1.0    0.0
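As a side note, since the question allows the standard library, the hand-rolled counting dict in add_cols can also be written with collections.Counter, letting pandas align the columns. A sketch on the same toy data as the answer:

```python
from collections import Counter

import pandas as pd

string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(list(zip(categorical_int, string_list)),
                  columns=['categoricalInt1', 'sanitizedStrings'])

# Counter(...) -> {word: count}; pd.Series turns each dict into a row,
# and apply stitches the rows together, NaN-filling words a row lacks
word_counts = (df['sanitizedStrings']
               .apply(lambda s: pd.Series(Counter(s.split())))
               .fillna(0)
               .astype(int))

final_df = df.join(word_counts)

# Same group-and-sum step as above
final_group = final_df.groupby('categoricalInt1')[list(word_counts.columns)].sum()
```

This produces the same per-row counts as the answer's apply function, without the manual if/else bookkeeping.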