Search code examples
pythonpandasdummy-variable

Create dummy column and input value from other column


I have data containing a list of topics (topics 1-5; and 0 meaning no topic is assigned) and their value. I want to create a new column for each topic and fill the column with the value. Here's what the table looks like...

reviewId   topic   value
      01       2      -4
      02       2       9
      03       0      -7
      04       5      -1
      05       1      38

What should I do to create a table looking like this?

reviewId topic value t1 t2 t3 t4 t5
01 2 -4 0 -4 0 0 0
02 2 9 0 9 0 0 0
03 0 -7 0 0 0 0 0
04 5 -1 0 0 0 0 -1
05 1 38 38 0 0 0 0

Here's reproducible data to work on.

raw_df = pd.DataFrame({
    'reviewId': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'value': ['-4', '9', '-7', '-1', '38']})

Here's a link to a similar problem with this one.


Solution

  • df.join(pd.get_dummies(df["topic"])
              .reindex(columns=range(1, 5+1), fill_value=0)
              .mul(df["value"], axis="rows")
              .add_prefix("t"))
    
    • get "1-hot" representation of the "topic" column
    • some topics may be missing, e.g., 3, 4 in the sample data, so reindex to guarantee that
      • and put zeros to those topics values as they were missing
    • now we have a 1/0 dataframe of shape (len(df), len(topics))
      • we multiply this with the "value" column row-wise
      • 1/0's will act as a "selector" of values
    • add "t" from left to the column names for "t1" ... "t5"

    and join it with the original frame to get

      reviewId  topic  value  t1  t2  t3  t4  t5
    0       01      2     -4   0  -4   0   0   0
    1       02      2      9   0   9   0   0   0
    2       03      0     -7   0   0   0   0   0
    3       04      5     -1   0   0   0   0  -1
    4       05      1     38  38   0   0   0   0