Search code examples
pythonpandasdataframeindexingdata-masking

Pandas- masking rows/columns between two dataframes where indexes are not shared


The Problem

I have two datasets that describe, let's say, the temperature at certain depths and at certain latitudes for a sea. The datasets are from two different models and therefore have differing resolution, with model 1 have a higher resolution for latitude and both models having different levels for the depth dimension. I've turned both datasets into pandas dataframes with depth as the vertical index and latitudes as the column labels. I want to mask out the rows (depths) and columns (latitudes) that are not shared between both dataframes since I'll be taking a difference and don't want to interpolate data. I've found how to mask out certain values within rows and columns, but I want to mask out rows and columns in their entirety.

I've used np.intersect1d on the depths as lists to find which depths are not shared between the models, and I created a boolean list using a conditional statement showing True for every index where the value is unique to that dataframe. However, I'm not sure how to use this as a mask or even if I can. DataFrame.mask says the "array conditional must be the same shape as self", but array conditional is one-dimensional and the dataframe is two-dimensional. I'm not sure how to refer to the indexes of the dataframe only to apply the mask. I feel like I'm on the right track, but I'm not entirely sure since I'm still new to pandas. (I've tried searching for similar questions, but none match my problem quite exactly from what I've seen.)

The Code (simplified working example)

Note- This was written in the Jupyter notebook environment

import numpy as np
import pandas as pd

# Model 1 data
depthmod1 = [5, 10, 15, 20, 30, 50, 60, 80, 100]  #depth in meters
latmod1 = [50, 50.5, 51, 51.5, 52, 52.5, 53] #latitude in degrees north
tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)
print(dfmod1)
     50.0  50.5  51.0  51.5  52.0  52.5  53.0
5     299   300   300   293   285   293   273
10    273   288   293   292   290   302   273
15    277   279   284   302   280   294   284
20    291   295   277   276   295   279   274
30    281   284   284   275   295   284   282
50    284   276   291   282   286   295   295
60    298   294   289   294   285   289   288
80    285   284   275   298   287   277   300
100   292   295   294   273   291   276   290
# Model 2 data
depthmod2  = [5, 10, 15, 25, 35, 50, 60, 100]
latmod2  = [50, 51, 52, 53]
tmpumod2  = np.random.randint(273,303,size=(len(depthmod2), len(latmod2)))
dfmod2 = pd.DataFrame(tmpumod2,index=depthmod2,columns=latmod2)
print(dfmod2)
      50   51   52   53
5    297  282  275  292
10   298  286  292  282
15   286  285  288  273
25   292  288  279  299
35   301  295  300  288
50   277  301  281  277
60   276  293  295  297
100  275  279  292  287
# Find shared depths
depthxsect = np.intersect1d(depthmod1, depthmod2)
print(depthxsect, depthxsect.shape)
Shared depths:  [  5  10  15  50  60 100] (6,)
# Boolean mask for model 1
depthmask = dfmod1.index.isin(depthxsect) == False
print("Bool showing where mod1 index is NOT in mod2: ", depthmask)
Bool showing where mod1 index is NOT in mod2:  [False False False  True  True False False  True False]
# Mask data
dfmod1masked = dfmod1.mask(depthmask1,np.nan)
print(dfmod1masked)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-fedf013c2200> in <module>
----> 1 dfmod1masked = dfmod1.mask(depthmask1,np.nan)
      2 print(dfmod1masked)
[...]
ValueError: Array conditional must be same shape as self

The Question

How can I mask the rows by index such that I'm left only with rows/indexes [ 5 10 15 50 60 100] useable in both dataframes? I'll be doing similar masking for the columns (latitudes) so hopefully the solution for the rows will work for columns as well. I also do not want to merge the dataframes. They should remain separate unless a merge is needed for this.


Solution

  • depthxsect returns an np.array of the indices that you need. So, you can skip creating the boolean array depthmask and just pass the np.array to your datframe using .loc. You should use .mask if you are trying to keep all of the rows but just return NaN values on the other indices.

    After getting dfmod1 and depthxsect, you can simply use:

    dfmod1.loc[depthxsect]
    

    Full reproducible code:

    import pandas as pd
    import numpy as np
    
    # Model 1 data
    depthmod1 = [5, 10, 15, 20, 30, 50, 60, 80, 100]  #depth in meters
    latmod1 = [50, 50.5, 51, 51.5, 52, 52.5, 53] #latitude in degrees north
    tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
    dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)
    
    depthmod2  = [5, 10, 15, 25, 35, 50, 60, 100]
    latmod2  = [50, 51, 52, 53]
    tmpumod2  = np.random.randint(273,303,size=(len(depthmod2), len(latmod2)))
    dfmod2 = pd.DataFrame(tmpumod2,index=depthmod2,columns=latmod2)
    depthxsect = np.intersect1d(depthmod1, depthmod2)
    dfmod1.loc[depthxsect]
    Out[2]: 
         50.0  50.5  51.0  51.5  52.0  52.5  53.0
    5     284   291   280   287   297   286   277
    10    294   279   302   283   284   298   291
    15    278   296   286   298   279   275   286
    50    284   281   297   290   302   299   280
    60    290   301   302   298   283   286   287
    100   285   283   297   287   289   282   283
    

    I have included the approach you were trying as well. You have t ospecify mask on a column. You were doing it on the entire dataframe:

    import pandas as pd
    import numpy as np
    # Model 1 data
    depthmod1 = [5, 10, 15, 20, 30, 50, 60, 80, 100]  #depth in meters
    latmod1 = [50, 50.5, 51, 51.5, 52, 52.5, 53] #latitude in degrees north
    tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
    dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)
    dfmod1
    depthmod2  = [5, 10, 15, 25, 35, 50, 60, 100]
    latmod2  = [50, 51, 52, 53]
    tmpumod2  = np.random.randint(273,303,size=(len(depthmod2), len(latmod2)))
    dfmod2 = pd.DataFrame(tmpumod2,index=depthmod2,columns=latmod2)
    depthxsect = np.intersect1d(depthmod1, depthmod2)
    depthmask = dfmod1.index.isin(depthxsect) == False
    for col in dfmod1.columns:
        dfmod1[col] = dfmod1[col].mask(depthmask, np.nan)
    dfmod1
    Out[3]: 
          50.0   50.5   51.0   51.5   52.0   52.5   53.0
    5    289.0  274.0  297.0  274.0  277.0  278.0  277.0
    10   282.0  280.0  277.0  302.0  297.0  289.0  278.0
    15   300.0  282.0  297.0  297.0  300.0  279.0  291.0
    20     NaN    NaN    NaN    NaN    NaN    NaN    NaN
    30     NaN    NaN    NaN    NaN    NaN    NaN    NaN
    50   285.0  297.0  292.0  301.0  296.0  289.0  291.0
    60   295.0  299.0  278.0  295.0  299.0  293.0  277.0
    80     NaN    NaN    NaN    NaN    NaN    NaN    NaN
    100  292.0  293.0  289.0  291.0  289.0  276.0  286.0