Search code examples
pythonpandasdataframefillna

Can I fill NaN-Values of one column with specific list elements of another column?


I have following dataframe (called items) for example:

| index | itemID | maintopic | subtopics          |
|:----- |:------:|:---------:| ------------------:|
| 1     | 235    | FBR       | [FZ, 1RH, FL]      |
| 2     | 1787   | NaN       | [1RH, YRS, FZ, FL] |
| 3     | 2454   | NaN       | [FZX, 1RH, FZL]    |
| 4     | 3165   | NaN       | [YHS]              |

I would like to fill the NaN-Values in the maintopic-column with the first element of the subtopics list which starts with a letter. Does someone has an idea? (Question No 1)

I tried this, but it didn´t work:

import pandas as pd
import string
alphabet = list(string.ascii_lowercase)
    
items['maintopic'] = items['maintopic'].apply(lambda x : items['maintopic'].fillna(items['subtopics'][x][0]) if items['subtopics'][x][0].lower().startswith(tuple(alphabet)) else x)

Advanced (Question No 2): Even better would be to have a look at all elements of the subtopics list and if there are more elements which have the first letter or even the first and the second letter in common, then I would like to take this. For example in line 2 there is FZ and FL, so i would like to fill the maintopic in this row with an F. And in line 3 there is FZX and FZL, then I would like to fill the maintopic with FZ. But if this is way too complicated then I would be also very happy with an answer to Question No 1.

I appreciate any help!


Solution

  • Try:

    from itertools import chain, combinations
    
    
    def commonprefix(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m:
            return ""
        s1 = min(m)
        s2 = max(m)
        for i, c in enumerate(s1):
            if c != s2[i]:
                return s1[:i]
        return s1
    
    
    def powerset(iterable, n=0):
        "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
        s = list(iterable)
        return chain.from_iterable(combinations(s, r) for r in range(n, len(s) + 1))
    
    
    def choose(x):
        if not isinstance(x, list):
            return x
    
        if len(x) == 1:
            return x[0]
    
        filtered = [v for v in x if not v[0].isdigit()]
        if not filtered:
            return np.nan
    
        longest = ""
        for s in powerset(filtered, 2):
            pref = commonprefix(s)
            if len(pref) > len(longest):
                longest = pref
    
        return filtered[0] if longest == "" else longest
    
    
    m = df["maintopic"].isna()
    df.loc[m, "maintopic"] = df.loc[m, "subtopics"].apply(choose)
    print(df)
    

    Prints:

       index  itemID maintopic           subtopics
    0      1     235       FBR       [FZ, 1RH, FL]
    1      2    1787         F  [1RH, YRS, FZ, FL]
    2      3    2454        FZ     [FZX, 1RH, FZL]
    3      4    3165       YHS               [YHS]
    

    EDIT: Added checks for a list/float.