Search code examples
pythonpandasmergedifferencegeopandas

How can I get the difference between two dataframes based on one column?


I have a dataframe (allPop) and a geodataframe (allTracts). I'm merging them on a the column GEOID, which they both share:

newTracts = allTracts.merge(allPop, on='GEOID')

My problem is that I'm losing data on this merge, which conceptually shouldn't be happening. Each of the records in allPop should match with one of the records from allTracts, but newTracts has a couple hundred fewer records than allPop. I'd like to be able to look at the records not being included in the merge to try to diagnose the problem. Is there a way to do this? Or else, can I find the difference between allPop and allTracts based on their columns 'GEOID'? I've seen how to do this when both dataframes have all of the same column names/types, but can I do this based only on one column? I'm not sure what the output for this would look like, but lists of the GEOIDs that aren't being merged from both dataframes would be good. Or else the dataframes themselves without the records that were merged. Thanks!


Solution

  • You can use the isin method in Pandas.

    badPop = allPop[~allPop['GEOID'].isin(allTracts['GEOID'])
    

    You can also use the indicator option of the merge method along with how='outer' to find the offending rows.