I have a dataframe (allPop) and a geodataframe (allTracts). I'm merging them on a the column GEOID, which they both share:
newTracts = allTracts.merge(allPop, on='GEOID')
My problem is that I'm losing data on this merge, which conceptually shouldn't be happening. Each of the records in allPop should match with one of the records from allTracts, but newTracts has a couple hundred fewer records than allPop. I'd like to be able to look at the records not being included in the merge to try to diagnose the problem. Is there a way to do this? Or else, can I find the difference between allPop and allTracts based on their columns 'GEOID'? I've seen how to do this when both dataframes have all of the same column names/types, but can I do this based only on one column? I'm not sure what the output for this would look like, but lists of the GEOIDs that aren't being merged from both dataframes would be good. Or else the dataframes themselves without the records that were merged. Thanks!
You can use the isin
method in Pandas.
badPop = allPop[~allPop['GEOID'].isin(allTracts['GEOID'])
You can also use the indicator
option of the merge
method along with how='outer'
to find the offending rows.