python pandas string error-handling python-dedupe

How to fix the "malformed node or string" error in pandas?


I have the following DataFrame and I am trying to remove the duplicate elements from each array in Column 2, with the resulting array stored in Column 3:

Column1  Column 2                                           Column3
0        [ABC|QWER|12345, ABC|QWER|12345]                   [ABC|QWER|12345]
1        [TBC|WERT|567890,TBC|WERT|567890]                  [TBC|WERT|567890]
2        [ERT|TYIO|9845366, ERT|TYIO|9845366,ERT|TYIO|5]    [ERT|TYIO|9845366, ERT|TYIO|5]
3        NaN                                                NaN
4        [SAR|QWPO|34564557,SAR|QWPO|3456455]               [SAR|QWPO|34564557,SAR|QWPO|3456455]
5        NaN                                                NaN
6        [SE|WERT|12233412]                                 [SE|WERT|12233412]
7        NaN                                                NaN

I'm using the following code, but it raises a "malformed node or string" error. Please help me solve this.

    import ast

    def ddpe(a):
        return list(dict.fromkeys(ast.literal_eval(a)))

    df['column3'] = df['column2'].apply(ddpe)

Solution

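  • I'm assuming the values of 'column2' are strings, since you are trying to use ast.literal_eval. That call fails because elements such as ABC|QWER|12345 are not quoted, so the string is not a valid Python literal, and because the NaN cells are floats rather than strings. In that case, try this instead: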

    import pandas as pd
    import numpy as np
    
    def ddpe(str_val):
        if pd.isna(str_val):  # return NaN if value is NaN
            return np.nan  
        # Remove the square brackets, split on ',' and strip any
        # whitespace around the elements
        vals = [v.strip() for v in str_val.strip('[]').split(',')]
        # remove duplicates keeping the original order
        return list(dict.fromkeys(vals))
    
    df['column3'] = df['column2'].apply(ddpe)
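
As a quick check, the sketch below first reproduces the failure and then applies the string-based `ddpe` from above. It assumes the cells are bracketed strings or NaN; the three sample rows and the helper `literal_eval_fails` are hypothetical, added only for illustration.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's data: bracketed strings and NaN.
df = pd.DataFrame({
    "column2": [
        "[ABC|QWER|12345, ABC|QWER|12345]",
        np.nan,
        "[ERT|TYIO|9845366, ERT|TYIO|9845366,ERT|TYIO|5]",
    ]
})


def literal_eval_fails(value):
    """Return True if ast.literal_eval rejects the value with a ValueError."""
    try:
        ast.literal_eval(value)
    except ValueError:  # this is the "malformed node or string" error
        return True
    return False


# Unquoted elements and float NaN cells are both rejected.
failures = [literal_eval_fails(v) for v in df["column2"]]


def ddpe(str_val):
    if pd.isna(str_val):  # keep missing values as NaN
        return np.nan
    # Strip the brackets, split on ',' and trim whitespace around each element.
    vals = [v.strip() for v in str_val.strip("[]").split(",")]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(vals))


df["column3"] = df["column2"].apply(ddpe)
print(failures)
print(df["column3"].tolist())
```

Note that this approach splits on every comma, so it only suits data whose elements never contain a comma themselves.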
    

    If the column values are already lists, you just need:

    def ddpe(lst_val):
        # Return NaN if the value is not a list,
        # assuming those are the only two options.
        if not isinstance(lst_val, list):   
            return np.nan  
        return list(dict.fromkeys(lst_val))
    
    df['column3'] = df['column2'].apply(ddpe)
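
If it is unclear whether the column holds strings or lists, the two variants above can be folded into one function. This is only a sketch under that assumption: the name `dedupe_cell` and the sample rows are hypothetical, and the extra filter for empty elements (so that the string "[]" becomes an empty list) goes beyond what either variant above does.

```python
import numpy as np
import pandas as pd


def dedupe_cell(value):
    """Deduplicate one cell, keeping first-seen order.

    Accepts a list, a bracketed string such as "[A|B|1, A|B|1]",
    or a missing value. Hypothetical helper, not part of pandas.
    """
    if isinstance(value, list):
        items = value
    elif isinstance(value, str):
        # Strip the brackets, split on ',' and trim each element.
        items = [v.strip() for v in value.strip().strip("[]").split(",")]
        items = [v for v in items if v]  # drop empty pieces, e.g. from "[]"
    else:
        return np.nan  # NaN, None and any other type stay missing
    return list(dict.fromkeys(items))


# Hypothetical sample mixing a list, a string, NaN and an empty "[]".
df = pd.DataFrame({
    "column2": [
        ["SE|WERT|12233412"],
        "[TBC|WERT|567890,TBC|WERT|567890]",
        np.nan,
        "[]",
    ]
})
df["column3"] = df["column2"].apply(dedupe_cell)
print(df["column3"].tolist())
```

Checking the type explicitly avoids calling pd.isna on a list, which would return an array instead of a single boolean.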