I have a Pandas Dataframe containing Numbers and strings I want the entries with the numbers that contain exactly 1 entry of a list of strings i have
Pd dataframe df:
Value subjects
"P1" "'A'"
"P2" "'A','B'"
"P3" "'A','B','C'"
"P4" "'B'"
"P5" "'C'"
"P6" "'A','K'"
"P7" "'B','ABC'"
substring list sub:
["'A'","'B'","'C'"]
after the Operation i want to have the PD dataframe:
"P1" "'A'"
"P4" "'B'"
"P5" "'C'"
"P6" "'A','K'"
"P7" "'B','ABC'"
I have a working solution, but it requires concatenating Dataframes which scale very badly with more data entries and i read shouldn't be done.
completedf = pd.DataFrame()
for string in sub:
for entry in df[df["subjects"].str.contains.(str("'"+ string+"'")).replace(np.nan,False)]
for s in sub:
if s != string:
m = df[~df["subjects"].str.contains(str("'"+ s[0]+"'")).replace(np.nan,False)]
completedf = pd.concat(completedf,m)
I would use set operations after splitting your strings:
df = pd.DataFrame({'Value': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7'],
'subjects': ["'A'", "'A','B'", "'A','B','C'", "'B'", "'C'", "'A','K'", "'B','ABC'"]})
lst = ["'A'", "'B'", "'C'"]
S = set(lst)
out = df[[len(S.intersection(x.split(',')))<2 for x in df['subjects']]]
Output:
Value subjects
0 P1 'A'
3 P4 'B'
4 P5 'C'
5 P6 'A','K'
6 P7 'B','ABC'