I have following dataframe (called items) for example:
| index | itemID | maintopic | subtopics |
|:----- |:------:|:---------:| ------------------:|
| 1 | 235 | FBR | [FZ, 1RH, FL] |
| 2 | 1787 | NaN | [1RH, YRS, FZ, FL] |
| 3 | 2454 | NaN | [FZX, 1RH, FZL] |
| 4 | 3165 | NaN | [YHS] |
I would like to fill the NaN-Values in the maintopic-column with the first element of the subtopics list which starts with a letter. Does someone has an idea? (Question No 1)
I tried this, but it didn´t work:
import pandas as pd
import string
alphabet = list(string.ascii_lowercase)
items['maintopic'] = items['maintopic'].apply(lambda x : items['maintopic'].fillna(items['subtopics'][x][0]) if items['subtopics'][x][0].lower().startswith(tuple(alphabet)) else x)
Advanced (Question No 2): Even better would be to have a look at all elements of the subtopics list and if there are more elements which have the first letter or even the first and the second letter in common, then I would like to take this. For example in line 2 there is FZ and FL, so i would like to fill the maintopic in this row with an F. And in line 3 there is FZX and FZL, then I would like to fill the maintopic with FZ. But if this is way too complicated then I would be also very happy with an answer to Question No 1.
I appreciate any help!
Try:
from itertools import chain, combinations
def commonprefix(m):
"Given a list of pathnames, returns the longest common leading component"
if not m:
return ""
s1 = min(m)
s2 = max(m)
for i, c in enumerate(s1):
if c != s2[i]:
return s1[:i]
return s1
def powerset(iterable, n=0):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(n, len(s) + 1))
def choose(x):
if not isinstance(x, list):
return x
if len(x) == 1:
return x[0]
filtered = [v for v in x if not v[0].isdigit()]
if not filtered:
return np.nan
longest = ""
for s in powerset(filtered, 2):
pref = commonprefix(s)
if len(pref) > len(longest):
longest = pref
return filtered[0] if longest == "" else longest
m = df["maintopic"].isna()
df.loc[m, "maintopic"] = df.loc[m, "subtopics"].apply(choose)
print(df)
Prints:
index itemID maintopic subtopics
0 1 235 FBR [FZ, 1RH, FL]
1 2 1787 F [1RH, YRS, FZ, FL]
2 3 2454 FZ [FZX, 1RH, FZL]
3 4 3165 YHS [YHS]
EDIT: Added checks for a list/float.