Initial note: I cannot use many third-party packages, and it's highly likely I won't be able to use any you suggest, so please keep solutions to pandas, NumPy, or the Python 3.7 standard library. My end goal is a word-bubble-style graph where word frequency is coded by categoricalInt.
Say I have a pandas data frame like:
index | categoricalInt1 | categoricalInt2 | sanitizedStrings
0     | -4              | -5              | some lowercase strings
1     | 2               | 4               | addtnl lowercase strings here
2     | 3               | 3               | words
Is there an easier way than iterating over every single value in sanitizedStrings to return a structure like this?
index | categoricalInt1 | categoricalInt2 | sanitizedStrings | some | lowercase | strings | addtnl | here | words
0 | -4 | -5 | ... | 1 | 1 | 1 | 0 | 0 | 0
1 | 2 | 4 | ... | 0 | 1 | 1 | 1 | 1 | 0
2 | 3 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1
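For what it's worth, if 0/1 presence per row is enough for that table (a word repeated within one row would still show as 1, not its count), pandas can build it without an explicit loop. A sketch on the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'sanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# One indicator column per whitespace-separated word (0/1 presence).
dummies = df['sanitizedStrings'].str.get_dummies(sep=' ')
result = df.join(dummies)
```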
My overall goal is simple: count all substrings by categorical grouping. I've managed to aggregate the strings into the categorical bins, but I'm struggling to produce the counts.
So far my code looks like:
import string

df['Comments'] = df['Comments'].str.lower()

# strip punctuation, except '|', which is used as a temporary delimiter below
punct = string.punctuation.replace('|', '')
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['Comments'] = '|'.join(df['Comments'].tolist()).translate(transtab).split('|')

# remove common/noise words; commonStrings is defined elsewhere
pattern = '|'.join([r'\b{}\b'.format(w) for w in commonStrings])
df['SanitizedStrings'] = df['Comments'].str.replace(pattern, '')  # regex replace
df = df.drop(columns='Comments')
# end splitting bad values out of strings
# group the dataframe on like categories
groupedComments = df.groupby(['categoricalInt1', 'categoricalInt2'], as_index = False, sort=False).agg(' '.join)
print(groupedComments)
Before realizing I needed to bin these strings by categoricalInt, I was using the following:
groupedComments['SanitizedStrings'].str.split(expand=True).stack().value_counts()
If I could get that to return counts by row instead of stacking everything into a single column, I bet we'd be pretty close.
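One way to make that return counts by row instead of one global tally is to group the stacked Series on its first index level (the original row number) before counting. A sketch using a small stand-in for groupedComments:

```python
import pandas as pd

# Stand-in for groupedComments; the real one comes from the groupby above.
groupedComments = pd.DataFrame({
    'categoricalInt1': [-4, 2, 3],
    'categoricalInt2': [-5, 4, 3],
    'SanitizedStrings': ['some lowercase strings',
                         'addtnl lowercase strings here',
                         'words'],
})

# stack() keeps the original row number as the first index level, so
# grouping on that level before value_counts yields one row of counts per bin.
perRow = (groupedComments['SanitizedStrings']
          .str.split(expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))

out = groupedComments.drop(columns='SanitizedStrings').join(perRow)
```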
This isn't a particularly elegant solution, and I'm not sure how much data you're working with, but you could use an apply function to add the additional columns.
After reading your comment, it seems you also want to group by your categorical columns; that can be achieved by tweaking the columns the function creates.
import pandas as pd
import numpy as np
## Make test data
string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(zip(categorical_int, string_list),
                  columns=['categoricalInt1', 'sanitizedStrings'])
# create apply function
def add_cols(row):
    col_dict = {}
    new_cols = row['sanitizedStrings'].split(' ')
    # count each word in this row
    for col in new_cols:
        if col not in col_dict:
            col_dict[col] = 1
        else:
            col_dict[col] += 1
    for key, value in col_dict.items():
        # prefix with _ so we can find these columns later
        row['_' + key] = value
    return row

# run apply function on dataframe
final_df = df.apply(add_cols, axis=1).fillna(0)
final_df
_addtnl _here _lowercase _some _string _strings _words \
0 0.0 0.0 1.0 1.0 1.0 0.0 0.0
1 1.0 1.0 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
categoricalInt1 sanitizedStrings
0 -4 some lowercase string
1 3 addtnl lowercase strings here
2 2 words
# group by the categorical bin and sum the word columns
word_cols = [col for col in final_df.columns if col.startswith('_')]
final_group = final_df.groupby('categoricalInt1')[word_cols].sum()
final_group.columns = [col.replace('_', '', 1) for col in final_group.columns]
final_group
addtnl here lowercase some string strings words
categoricalInt1
-4 0.0 0.0 1.0 1.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 1.0 1.0 1.0 0.0 0.0 1.0 0.0
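For comparison, the same grouped result can be had without a row-wise apply by building indicator columns first and summing per bin. A sketch against the same test data (note str.get_dummies gives 0/1 presence per row, so a word repeated within one row would be undercounted relative to the apply version):

```python
import pandas as pd

string_list = ['some lowercase string', 'addtnl lowercase strings here', 'words']
categorical_int = [-4, 3, 2]
df = pd.DataFrame(list(zip(categorical_int, string_list)),
                  columns=['categoricalInt1', 'sanitizedStrings'])

# Indicator columns per word, then one sum per categorical bin --
# vectorized equivalent of the apply-based version above.
counts = df['sanitizedStrings'].str.get_dummies(sep=' ')
final_group = counts.groupby(df['categoricalInt1']).sum()
```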