Search code examples
python-3.xpandasdataframemerget-test

Create two Dataframes based on series membership in Pandas


I'm a beginner, I can't seem to find an exact answer to this.

I have two dataframes, the first has localized economic data (df1):

(index)  (index)     2000     2010  Diff   
State    Region    
NY       NYC         1000     1100   100
NY       Upstate      200      270    70
NY       Long_Island 1700     1800   100 
IL       Chicago      300      500   200
IL       South         50       35    15
IL       Suburbs      800      650  -150

The second has a list of state and regions, (df2):

index   State   Region
0        NY      NYC
1        NY      Long_Island
2        IL      Chicago

Ultimately what I'm trying to do is run a t-test on the Diff column between the state and regions in df2 vs all the other ones in df1 that are not included in df2. However, I haven't managed to divide the groups yet so I can't run the test.

My latest attempt (of many) looks like this:

df1['Region', 'State'].isin(df2['Region', 'State'])

I've tried pd.merge too but can't seem to get it to work. I think it's because of the multi-level indexing but I still don't know how to get the state/regions that are not in df2.


Solution

  • It seems you need difference of MultiIndexes and then select by loc:

    print (df1.index)
    MultiIndex(levels=[['IL', 'NY'], ['Chicago', 'Long_Island', 
                                      'NYC', 'South', 'Suburbs', 'Upstate']],
               labels=[[1, 1, 1, 0, 0, 0], [2, 5, 1, 0, 3, 4]],
               names=['State', 'Region'])
    
    print (df2.index)
    Int64Index([0, 1, 2], dtype='int64', name='index')
    
    print (df1.index.names)
    ['State', 'Region']
    

    #create index from both columns
    df2 =  df2.set_index(df1.index.names)
    what is same as
    #df2 = df2.set_index(['State','Region'])
    
    mux = df1.index.difference(df2.index)
    print (mux)
    MultiIndex(levels=[['IL', 'NY'], ['South', 'Suburbs', 'Upstate']],
               labels=[[0, 0, 1], [0, 1, 2]],
               names=['State', 'Region'],
               sortorder=0)
    
    print (df1.loc[mux])
                   2000  2010  Diff
    State Region                   
    IL    South      50    35    15
          Suburbs   800   650  -150
    NY    Upstate   200   270    70
    

    All together:

    df2 =  df2.set_index(df1.index.names)
    df = df1.loc[df1.index.difference(df2.index)]
    print (df)