Search code examples
pythonpandasdummy-variable

Create a dummy column and input value based on condition


I have data containing a list of topics (topics 1-6; and 0 meaning no topic is assigned) and their value. I want to create a new column for each topic and each label, and fill the column with the value. Here's what the table looks like...

reviewId   topic   value     label
      01       2      -4  negative
      02       2       9  positive
      03       0      -7  negative
      04       5      -1  negative
      05       1      38  positive

What should I do to create a table looking like this?

reviewId topic value label t1p t1n t2p t2n t3p t3n t4p t4n t5p t5n
01 2 -4 negative 0 0 0 -4 0 0 0 0 0 0
02 2 9 positive 0 0 9 0 0 0 0 0 0 0
03 0 -7 negative 0 0 0 0 0 0 0 0 0 0
04 5 -1 negative 0 0 0 0 0 0 0 0 0 -1
05 1 38 positive 38 0 0 0 0 0 0 0 0 0

Here's reproducible data to work on.

raw_df = pd.DataFrame({
    'reviewId': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'value': ['-4', '9', '-7', '-1', '38']})

Here's a link similar topic on this problem..

  • Problem about inputting value: link
  • Problem about creating columns based on label: link

Solution

  • from itertools import product
    
    (df.join(pd.get_dummies(zip(df["topic"], df["label"]))
               .reindex(columns=product(range(1, 5+1), ["positive", "negative"]),
                        fill_value=0)
               .mul(df["value"], axis="rows")
               .rename(columns=lambda c: f"t{c[0]}{c[1][0]}")))
    
    • dummify the "topic"-"label" pairs
      • reindex with the entire (1...5) X (p, n) possible values to ensure all are here
      • and initiate the missings, e.g., "4"-"negative" pair in the sample data with 0
    • given the 1/0 frame at this point, multiply it with "values" row-wise
      • 1/0s will act as a selector of values
    • column names are like "(2, positive)" etc.; rename them to "t2p" etc.

    and join it with the original frame to get

       reviewId  topic  value     label  t1p  t1n  t2p  t2n  t3p  t3n  t4p  t4n  t5p  t5n
    0         1      2     -4  negative    0    0    0   -4    0    0    0    0    0    0
    1         2      2      9  positive    0    0    9    0    0    0    0    0    0    0
    2         3      0     -7  negative    0    0    0    0    0    0    0    0    0    0
    3         4      5     -1  negative    0    0    0    0    0    0    0    0    0   -1
    4         5      1     38  positive   38    0    0    0    0    0    0    0    0    0