Search code examples
pythonpandasmachine-learningdata-scienceconcatenation

How to remove NaN values in pandas dataframe whose columns are referenced in another dataframe as encoded values indicating missing or unknown values


Dataframe 1

1  C1  C2  C3  .  .  .  C85
2  
3  
4  
.
.
800000  .  .  .  .  .  .

Columns with missing values across rows

0       32
100     10
200     7
300     7
400     6
1000    5
2000    3
3000    3
9000    3
12000   2
13000   1
15000   1
20000   1
30000   1
40000   1
50000   1
60000   1    

Dataframe 2

attribute  missing_or_unknown
C1         [-1,X]
C2         [XX]
.          .
.          .
C85        []

Missing values sorted by value_Counts()

[-1]        26
[-1,9]      17
[-1,0]      16
[0]         12
[]          10
[-1,0,9]     1
[-1,XX]      1
[-1,X]       1
[XX]         1

Need Dataframe 1 is the master table that has many missing or unknown values that needs to be cleaned up. However that determination needs to happen by referencing dataframe 2 and using those encoded indicators in missing_or_unknown column

Approach To be able to do that, I was trying to concat the 2 dataframes and see if i can add that missing_or_unknown column to dataframe 1 before i could proceed and use replace function to replace those indicators with np.nan

Question How do i perform concatenation when the 2 dataframes don't have same number of rows? Basically 1st dataframe's columns are rows in 2nd dataframe?


Solution

  • I suggest that you transpose the Dataframe2 and replace the column headings with the values of 1st row and then concanate Dataframe1 and Dataframe2. After this, you can operate on the Row1 of the resultant Dataframe to further replace it with "Nan" values.

    Here is a sample of this:

    import pandas as pd
    
    dummy_data1 = {
            'C1': ['11', '12', '13', '14', '15', '16', '17', '18', '19', '20'],
            'C2': ['A', 'E', 'I', 'M', 'Q', 'A', 'E', 'I', 'M', 'Q', ],
            'C3': ['B', 'F', 'J', 'N', 'R', 'B', 'F', 'J', 'N', 'R', ],
            'C4': ['C', 'G', 'K', 'O', 'S', 'C', 'G', 'K', 'O', 'S', ],
            'C5': ['D', 'H', 'L', 'P', 'T', 'D', 'H', 'L', 'P', 'T', ]}
    
    df1 = pd.DataFrame(dummy_data1, columns = ['C1', 'C2', 'C3', 'C4', 'C5'])
    
    dummy_data2 = {
            'attribute': ['C1', 'C2', 'C4', 'C5', 'C3', ],
            'missing_or_unknown': ['X1', 'X2', 'X4', 'X5', 'X3', ]}
    
    df2 = pd.DataFrame(dummy_data2, columns = ['attribute', 'missing_or_unknown'])
    
    df2_transposed = df2.transpose()
    print("df2_transposed=\n", df2_transposed)
    df2_transposed.columns = df2_transposed.iloc[0]
    df2_transposed = df2_transposed.drop(df2_transposed.index[0])
    print("df2_transposed with HEADER Replaced=\n", df2_transposed)
    
    df_new = pd.concat([df2_transposed, df1])
    print("df_new=\n", df_new)