Search code examples
pythonpandasdataframenormalizationhierarchy

Conditional replacement of values in column A, B, C with value in column D


I'm cleaning up a messy data source describing a hierarchical structure identified as follows. I'm using Python and pandas.

¦ A ¦ B ¦ C ¦ D ¦
-----------------
¦ x ¦   ¦   ¦ a ¦
¦   ¦ x ¦   ¦ b ¦
¦   ¦   ¦ x ¦ c ¦
¦   ¦   ¦ x ¦ d ¦
¦ x ¦   ¦   ¦ e ¦
¦   ¦ x ¦   ¦ f ¦
¦   ¦   ¦ x ¦ g ¦
¦   ¦   ¦ x ¦ h ¦

I'd like to generate unique IDs that also keep the hierarchical nature of the data. (Names per parent are unique, do not focus on that part please.)

¦ A ¦ B ¦ C ¦ D ¦ ID    ¦
-------------------------
¦ x ¦   ¦   ¦ a ¦ a     ¦
¦   ¦ x ¦   ¦ b ¦ a.b   ¦
¦   ¦   ¦ x ¦ c ¦ a.b.c ¦
¦   ¦   ¦ x ¦ d ¦ a.b.d ¦
¦ x ¦   ¦   ¦ e ¦ e     ¦ <-- note, this is NOT e.b.d,
¦   ¦ x ¦   ¦ f ¦ e.f   ¦     so when parent changes
¦   ¦   ¦ x ¦ g ¦ e.f.g ¦     fillna must not be applied
¦   ¦   ¦ x ¦ h ¦ e.f.h ¦

My strategy is:

  1. replace 'x' values in A, B, C with value from D
  2. use pandas' forward na fill
  3. concatenate A, B and C into column ID

2 and 3 are easy, but I can not pass 1. I can replace the x-es with a single value:

df[df.loc[:,'A':'C'] == 'x'] = 1

but that does not work if I try to pass df.D instead of 1.

Please recommend an elegant pythonic solution.


Source to work with:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

TESTDATA=StringIO("""
A;B;C;D;solution
x;;;x;x
;x;;a;xa
;x;;b;xb
;x;;c;xc
;;x;1;xc1
;;x;2;xc2
;x;;d;xd
;;x;3;xd3
;;x;4;xd4
x;;;y;y
;x;;e;ye
;;x;5;ye5
;;x;6;ye6
;x;;f;yf
;;x;7;yf7
;;x;8;yf8
;;x;9;yf9""")

df = pd.read_csv(TESTDATA, sep=";", header=False)

Solution

  • Not the prettiest ever, but something like

    w0 = df.iloc[:,:3]
    wx = w0 == 'x'
    wempty = (wx.cumsum(axis=1) >= 1).shift(axis=1).fillna(False)
    wfilled = w0.where(~wx, df.D, axis=0).ffill()
    w = w0.where(wempty, wfilled, axis=1).fillna('')
    df["new_solution"] = w.apply('.'.join,axis=1).str.rstrip(".")
    

    gives me

    >>> df
          A    B    C  D solution new_solution
    0     x  NaN  NaN  x        x            x
    1   NaN    x  NaN  a       xa          x.a
    2   NaN    x  NaN  b       xb          x.b
    3   NaN    x  NaN  c       xc          x.c
    4   NaN  NaN    x  1      xc1        x.c.1
    5   NaN  NaN    x  2      xc2        x.c.2
    6   NaN    x  NaN  d       xd          x.d
    7   NaN  NaN    x  3      xd3        x.d.3
    8   NaN  NaN    x  4      xd4        x.d.4
    9     x  NaN  NaN  y        y            y
    10  NaN    x  NaN  e       ye          y.e
    11  NaN  NaN    x  5      ye5        y.e.5
    12  NaN  NaN    x  6      ye6        y.e.6
    13  NaN    x  NaN  f       yf          y.f
    14  NaN  NaN    x  7      yf7        y.f.7
    15  NaN  NaN    x  8      yf8        y.f.8
    16  NaN  NaN    x  9      yf9        y.f.9
    

    The trick here is the use of cumsum, which lets us distinguish the cells which should be empty from the cells which should be filled.