Search code examples
pythonpandasregexgroup-bysplit

Pandas Column Split a row with Conditional and create a separate Column


It seems the problem is not difficult, but somehow I am not able to make it work. My problem is as follows. I have a dataframe say as follows:

       dfin 

           A      B     C
           a      1     198q24
           a      2     128q6
           a      6     1456
           b      7     67q22
           b      1     56
           c      3     451q2
           d      11    1q789

So now what I want to do is as follows, whenever the script will encounter a 'q', it will split the values and create a separate column with values starting from 'q'. The part before 'q' will remain in the original ( or maybe can crate a new column). So my desired output should be as follows:

        dfout 

           A      B     C        D
           a      1     198     q24
           a      2     128     q6
           a      6     1456
           b      7     67      q22
           b      1     56
           c      3     451     q2
           d      11    1       q789

So what I have tried till now is as follows:

       dfout = dfin.replace('\q\d*', '', regex=True)

Its creating one column without q, but I am not able to create the column D and not working as expected.

Any help/ideas will help and be appreciated.


Solution

  • import pandas as pd
    
    
    def get_input() -> pd.DataFrame:
        csv_text = """
             a      1     198q24
             a      2     128q6
             a      6     1456
             b      7     67q22
             b      1     56
             c      3     451q2
             d      11    1q789
             """.strip()
        return pd.DataFrame(map(str.split, csv_text.splitlines()), columns=["a", "b", "c"])
    
    
    def split_on_q(df_in: pd.DataFrame) -> pd.DataFrame:
        df = df_in.c.str.split("q", expand=True)
        df_out = df_in.copy()
        df_out["c"] = df[0]
        df_out["d"] = _prepend_q(df[1])
        return df_out
    
    
    def _prepend_q(series: pd.Series) -> pd.Series:
        return series.apply(lambda s: None if s is None else f"q{s}")
    
    
    if __name__ == "__main__":
        print(split_on_q(get_input()))
    
    

    Output:

       a   b     c     d
    0  a   1   198   q24
    1  a   2   128    q6
    2  a   6  1456  None
    3  b   7    67   q22
    4  b   1    56  None
    5  c   3   451    q2
    6  d  11     1  q789