Search code examples
pythonpandasstringsubstringruntime

Check if XOR substring appears in String in pandas


I have a Pandas Dataframe containing Numbers and strings I want the entries with the numbers that contain exactly 1 entry of a list of strings i have

Pd dataframe df:

Value subjects
"P1" "'A'"
"P2" "'A','B'"
"P3" "'A','B','C'"
"P4" "'B'"
"P5" "'C'"
"P6" "'A','K'"
"P7" "'B','ABC'"

substring list sub:

["'A'","'B'","'C'"]

after the Operation i want to have the PD dataframe:

"P1" "'A'"
"P4" "'B'"
"P5" "'C'"
"P6" "'A','K'"
"P7" "'B','ABC'"

I have a working solution, but it requires concatenating Dataframes which scale very badly with more data entries and i read shouldn't be done.

completedf = pd.DataFrame()
for string in sub:
    for entry in df[df["subjects"].str.contains.(str("'"+ string+"'")).replace(np.nan,False)]
        for s in sub:
            if s != string:      
                m = df[~df["subjects"].str.contains(str("'"+ s[0]+"'")).replace(np.nan,False)]
    completedf = pd.concat(completedf,m)

Solution

  • I would use set operations after splitting your strings:

    df = pd.DataFrame({'Value': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'],
                       'subjects': ["'A'", "'A','B'", "'A','B','C'", "'B'", "'C'", "'A','K'", "'B','ABC'"]})
    lst = ["'A'", "'B'", "'C'"]
    
    S = set(lst)
    
    out = df[[len(S.intersection(x.split(',')))<2 for x in df['subjects']]]
    

    Output:

      Value   subjects
    0    P1        'A'
    3    P4        'B'
    4    P5        'C'
    5    P6    'A','K'
    6    P7  'B','ABC'