Dataframe 1
1 C1 C2 C3 . . . C85
2
3
4
.
.
800000 . . . . . .
Columns with missing values across rows
0 32
100 10
200 7
300 7
400 6
1000 5
2000 3
3000 3
9000 3
12000 2
13000 1
15000 1
20000 1
30000 1
40000 1
50000 1
60000 1
Dataframe 2
attribute missing_or_unknown
C1 [-1,X]
C2 [XX]
. .
. .
C85 []
Missing values sorted by value_Counts()
[-1] 26
[-1,9] 17
[-1,0] 16
[0] 12
[] 10
[-1,0,9] 1
[-1,XX] 1
[-1,X] 1
[XX] 1
Need Dataframe 1 is the master table that has many missing or unknown values that needs to be cleaned up. However that determination needs to happen by referencing dataframe 2 and using those encoded indicators in missing_or_unknown column
Approach To be able to do that, I was trying to concat the 2 dataframes and see if i can add that missing_or_unknown column to dataframe 1 before i could proceed and use replace function to replace those indicators with np.nan
Question How do i perform concatenation when the 2 dataframes don't have same number of rows? Basically 1st dataframe's columns are rows in 2nd dataframe?
I suggest that you transpose the Dataframe2 and replace the column headings with the values of 1st row and then concanate Dataframe1 and Dataframe2. After this, you can operate on the Row1 of the resultant Dataframe to further replace it with "Nan" values.
Here is a sample of this:
import pandas as pd
dummy_data1 = {
'C1': ['11', '12', '13', '14', '15', '16', '17', '18', '19', '20'],
'C2': ['A', 'E', 'I', 'M', 'Q', 'A', 'E', 'I', 'M', 'Q', ],
'C3': ['B', 'F', 'J', 'N', 'R', 'B', 'F', 'J', 'N', 'R', ],
'C4': ['C', 'G', 'K', 'O', 'S', 'C', 'G', 'K', 'O', 'S', ],
'C5': ['D', 'H', 'L', 'P', 'T', 'D', 'H', 'L', 'P', 'T', ]}
df1 = pd.DataFrame(dummy_data1, columns = ['C1', 'C2', 'C3', 'C4', 'C5'])
dummy_data2 = {
'attribute': ['C1', 'C2', 'C4', 'C5', 'C3', ],
'missing_or_unknown': ['X1', 'X2', 'X4', 'X5', 'X3', ]}
df2 = pd.DataFrame(dummy_data2, columns = ['attribute', 'missing_or_unknown'])
df2_transposed = df2.transpose()
print("df2_transposed=\n", df2_transposed)
df2_transposed.columns = df2_transposed.iloc[0]
df2_transposed = df2_transposed.drop(df2_transposed.index[0])
print("df2_transposed with HEADER Replaced=\n", df2_transposed)
df_new = pd.concat([df2_transposed, df1])
print("df_new=\n", df_new)