Search code examples
pandaspandas-groupby

Creating batches based on city in pandas


I have two different dataframes that I want to fuzzy match against each other to find and remove duplicates. To make the process faster/more accurate I want to only fuzzy match records from both dataframes in the same cities. So that makes it necessary to create batches based on cities in the one dataframe then running the fuzzy matcher between each batch and a subset of the other dataframe with like cities. I can't find another post that does this and I am stuck. Here is what I have so far. Thanks!

df = pd.DataFrame({'A':[1,1,2,2,2,2,3,3],'B':['Q','Q','R','R','R','P','L','L'],'origin':['file1','file2','file3','file4','file5','file6','file7','file8']})
cols = ['B']
df1 = df[df.duplicated(subset=cols,keep=False)].copy()
df1 = df1.sort_values(cols)
df1['group'] = 'g' + (df1.groupby(cols).ngroup() + 1).astype(str)
df1['duplicate_count'] = df1.groupby(cols)['origin'].transform('size')
df1_g1 = df1.loc[df1['group'] == 'g1']
print(df1_g1)

which will not factor in anything that isn't duplicated so if a value only appears once then it will be skipped as is the case with 'P' in column B. It also requires me to go in and hard-code the group in each time which is not ideal. I haven't been able to figure out a for loop or any other method to solve this. Thanks!


Solution

  • You can pass to locals

    variables = locals()
    for i,j in df1.groupby('group'):
        variables["df1_{0}".format(i)] = j
    df1_g1
    Out[314]: 
       A  B origin group  duplicate_count
    6  3  L  file7    g1                2
    7  3  L  file8    g1                2