Search code examples
pythonpandasdata-transform

Pandas Filling Missing Rows Based on Defined List


Original DF has values that are calculated from a table.

It may have type missing for few ids. However, the goal is explicitly list if the flag exists. If not, type row should be added with no in the exists column.

id=1 is good in this example. If I filter on id=2 only, and run the following code, it returns c. which I can then append.

aa = set(type)
bb = set(b['type'].to_list())
list(aa - bb)

However, as you can tell, this is not feasible when applying to all id.

I needed help so the code can be applied for all the ids.

This returns the type that are no

type = [a, b, c]

Original
id      type    flag    
1       a       y        
1       b       y
1       c       y
2       a       y      
2       b       y
3       b       y     

Goal
id      type    flag    exists 
1       a       y       yes        
1       b       y       yes       
1       c       y       yes       
2       a       y       yes            
2       b       y       yes   
2       c       y       no     
3       a       y       yes        
3       b       y       no 
3       b       y       no    

Hoping there is a better way to handle this.


Solution

  • The key is to generate a Cartesian product between set of ids and set of types. I've used this Cartesian product axiom

    data = '''id ,     type ,   flag    
    1   ,    a    ,   y        
    1    ,   b    ,   y
    1    ,   c    ,   y
    2   ,    a    ,   y      
    2    ,   b    ,   y
    3     ,  b     ,  y     '''
    da = [[i.strip() for i in l.split(",")] for l in data.split("\n")]
    df = pd.DataFrame(da[1:], columns=da[0])
    # generate cartesian product https://apassionatechie.wordpress.com/2018/01/15/pandas-equivalent-of-sql-cross-join-cartesian-product/
    cp = pd.DataFrame(["a","b","c"], columns=["type"])\
            .assign(foo=1)\
            .merge(pd.DataFrame(df["id"].unique(), columns=["id"]).assign(foo=1)).drop('foo',1)
    df = cp.merge(df, how="outer", on=["id","type"])
    df["exists"] = ~df["flag"].isnull()
    df.fillna(method="ffill")
    

    output

        type    id  flag    exists
    0   a   1   y   True
    1   a   2   y   True
    2   a   3   y   False
    3   b   1   y   True
    4   b   2   y   True
    5   b   3   y   True
    6   c   1   y   True
    7   c   2   y   False
    8   c   3   y   False