
Python Count occurrences of a substring in pandas by row appending distinct string as column


Initial note: I cannot use many third-party packages, and it's highly likely I won't be able to use any you suggest. Please keep solutions to pandas, NumPy, or the Python 3.7 standard library. My end goal is a word-bubble-style graph, where word frequency is coded by categoricalInt.

Say I have a pandas data frame like:

index | categoricalInt1 | categoricalInt2 | sanitizedStrings 
0     |    -4           |    -5           |   some lowercase strings
1     |     2           |     4           |   addtnl lowercase strings here
2     |     3           |     3           |   words

Is there an easier way than iterating over every single value in sanitizedStrings to return a structure like:

index | categoricalInt1 | categoricalInt2 | sanitizedStrings | some | lowercase | strings | addtnl | here | words
0     |     -4          |    -5           |      ...         |  1   |    1      |   1     |   0    |  0  | 0
1     |      2          |     4           |      ...         |  0   |    1      |   1     |   1    |  1  | 0
2     |      3          |     3           |      ...         |  0   |    0      |   0     |   0    |  0  | 1

My overall goal is simple: count all substrings by categorical grouping. I've managed to get the strings aggregated and condensed into the categorical bins, but I'm struggling to get the counts together.

So far my code looks like:

import string

df['Comments'] = df['Comments'].str.lower()

# strip punctuation, keeping '|' as a temporary join delimiter
punct = string.punctuation.replace('|', '')
transtab = str.maketrans(dict.fromkeys(punct, ''))

df['Comments'] = '|'.join(df['Comments'].tolist()).translate(transtab).split('|')

pattern = '|'.join(r'\b{}\b'.format(w) for w in commonStrings)  # commonStrings defined elsewhere
df['SanitizedStrings'] = df['Comments'].str.replace(pattern, '', regex=True)
df = df.drop(columns='Comments')
# end splitting bad values out of strings

# group the dataframe on like categories
groupedComments = df.groupby(['categoricalInt1', 'categoricalInt2'], as_index=False, sort=False).agg(' '.join)

print(groupedComments)

Before realizing I needed to bin these strings by categoricalInt, I was using: groupedComments['SanitizedStrings'].str.split(expand=True).stack().value_counts()

If I could get that to return counts per row instead of stacking everything into a single column, I bet we'd be pretty close.
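For reference, a vectorized sketch of what the question asks for (assuming pandas >= 0.25 for Series.explode): split each string into one row per word, cross-tabulate the row index against the words to get per-row counts, then group the count columns by the categorical bins.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'sanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# One row per word, keeping the original row index
words = df['sanitizedStrings'].str.split().explode()

# Cross-tabulate row index vs. word -> per-row word counts
counts = pd.crosstab(words.index, words)

# Attach the count columns back to the original frame
result = df.join(counts)

# Counts per categorical bin: group and sum only the word columns
binned = result.groupby(['categoricalInt1', 'categoricalInt2'])[list(counts.columns)].sum()
```

Unlike get_dummies, crosstab counts repeated words within a row rather than just flagging presence.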


Solution

  • This isn't a particularly elegant solution, and I'm not sure how much data you're working with, but you could use an apply function to add the additional columns.

    After reading your comment, it seems you also want to group by your categorical columns.

    This can be achieved by also tweaking the columns you create.

    import pandas as pd
    import numpy as np

    ## Make test data
    string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
    categorical_int = [-4, 3, 2]
    df = pd.DataFrame(zip(categorical_int, string_list), columns=['categoricalInt1', 'sanitizedStrings'])

    # create apply function
    def add_cols(row):
        col_dict = {}
        new_cols = row['sanitizedStrings'].split(' ')
        for col in new_cols:
            if col not in col_dict:
                col_dict[col] = 1
            else:
                col_dict[col] += 1
        for key, value in col_dict.items():
            # add _ so we can query these columns later
            row['_' + key] = value
        return row

    # run apply function on dataframe
    final_df = df.apply(add_cols, axis=1).fillna(0)
    final_df
    
       _addtnl  _here  _lowercase  _some  _string  _strings  _words  \
    0      0.0    0.0         1.0    1.0      1.0       0.0     0.0   
    1      1.0    1.0         1.0    0.0      0.0       1.0     0.0   
    2      0.0    0.0         0.0    0.0      0.0       0.0     1.0   
    
       categoricalInt1               sanitizedStrings  
    0               -4          some lowercase string  
    1                3  addtnl lowercase strings here  
    2                2                          words 
    
    #add the group by and sum
    final_group = final_df.groupby(['categoricalInt1'])[[col for col in final_df.columns if col.startswith('_')]].sum()
    final_group.columns = [col.replace('_','') for col in final_group.columns]
    final_group
    
    
                     addtnl  here  lowercase  some  string  strings  words
    categoricalInt1                                                       
    -4                  0.0   0.0        1.0   1.0     1.0      0.0    0.0
     2                  0.0   0.0        0.0   0.0     0.0      0.0    1.0
     3                  1.0   1.0        1.0   0.0     0.0      1.0    0.0
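As a side note, since the question allows the standard library, the hand-rolled counting dict in add_cols can also be written with collections.Counter, letting pandas align the columns. A sketch on the same toy data as the answer:

```python
from collections import Counter

import pandas as pd

string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(list(zip(categorical_int, string_list)),
                  columns=['categoricalInt1', 'sanitizedStrings'])

# Counter(...) -> {word: count}; pd.Series turns each dict into a row,
# and apply stitches the rows together, NaN-filling words a row lacks
word_counts = (df['sanitizedStrings']
               .apply(lambda s: pd.Series(Counter(s.split())))
               .fillna(0)
               .astype(int))

final_df = df.join(word_counts)

# Same group-and-sum step as above
final_group = final_df.groupby('categoricalInt1')[list(word_counts.columns)].sum()
```

This produces the same per-row counts as the answer's apply function, without the manual if/else bookkeeping.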