Search code examples
pythonpandasdataframeduplicatescell

How can I remove duplicate cells of any row in pandas DataFrame?


I need to update a pandas DataFrame as below. Is it possible by any means? [I highly appreciate all of your time and endeavors. Sorry that my question arose confusion among you. I tried to update the question. Thanks again]

Sample1:

import pandas as pd    
#original dataframe
data = {'row_1': ['x','y','x','y'], 'row_2': ['a', 'b', 'a', None]}
data=pd.DataFrame.from_dict(data, orient='index')
print(data)

#desired dataframe from data
data1 = {'row_1': ['x','y'], 'row_2': ['a', 'b']}
data1=pd.DataFrame.from_dict(data1, orient='index')
print(data1)

Sample 2:

import pandas as pd    
#original dataframe
data = {'row_1': ['x','y','p','x'], 'row_2': ['a', 'b', 'a', None]}
data=pd.DataFrame.from_dict(data, orient='index')
print(data)

#desired dataframe from data
data1 = {'row_1': ['x','y','p'], 'row_2': ['a', 'b']}
data1=pd.DataFrame.from_dict(data1, orient='index')
print(data1)

Solution

  • data = data.apply(lambda x: x.transpose().dropna().unique().transpose(), axis=1)
    

    This is what you are looking for. Use dropna to get rid of NaN's and then only keep the unique elements. Apply this logic to each row of the dataframe to get the desired result.