Search code examples
pythonpython-polars

How to combine columns with extra strings into a concatenated string column in Polars Python?


I am trying to add another column that will contain combination of two columns (Total & percentage) into a result column(labels_value) which look like: (Total) percentage%.

Basically to wrap bracket strings on Total column and add % string at the end of combination of these two columns.

df

import polars as pl

so_df = pl.DataFrame(
[{'Flag': 'Outof Range',
  'Category': 'Thyroid',
  'len': 7,
  'Total': 21,
  'percentage': 33.33,
  'value': 33.33},
 {'Flag': 'Outof Range',
  'Category': 'Inflammatory Marker',
  'len': 2,
  'Total': 8,
  'percentage': 25.0,
  'value': 25.0},
 {'Flag': 'Outof Range',
  'Category': 'Lipid',
  'len': 12,
  'Total': 63,
  'percentage': 19.05,
  'value': 19.05},
 {'Flag': 'Outof Range',
  'Category': 'LFT',
  'len': 14,
  'Total': 87,
  'percentage': 16.09,
  'value': 16.09},
 {'Flag': 'Outof Range',
  'Category': 'DLC',
  'len': 11,
  'Total': 126,
  'percentage': 8.73,
  'value': 8.73},
 {'Flag': 'Outof Range',
  'Category': 'Vitamin',
  'len': 1,
  'Total': 14,
  'percentage': 7.14,
  'value': 7.14},
 {'Flag': 'Outof Range',
  'Category': 'CBC',
  'len': 2,
  'Total': 45,
  'percentage': 4.44,
  'value': 4.44},
 {'Flag': 'Outof Range',
  'Category': 'KFT',
  'len': 2,
  'Total': 56,
  'percentage': 3.57,
  'value': 3.57},
 {'Flag': 'Outof Range',
  'Category': 'Urine Examination',
  'len': 1,
  'Total': 28,
  'percentage': 3.57,
  'value': 3.57},
 {'Flag': 'Within Range',
  'Category': 'Thyroid',
  'len': 14,
  'Total': 21,
  'percentage': 66.67,
  'value': -66.67},
 {'Flag': 'Within Range',
  'Category': 'Inflammatory Marker',
  'len': 6,
  'Total': 8,
  'percentage': 75.0,
  'value': -75.0},
 {'Flag': 'Within Range',
  'Category': 'Lipid',
  'len': 51,
  'Total': 63,
  'percentage': 80.95,
  'value': -80.95},
 {'Flag': 'Within Range',
  'Category': 'LFT',
  'len': 73,
  'Total': 87,
  'percentage': 83.91,
  'value': -83.91},
 {'Flag': 'Within Range',
  'Category': 'DLC',
  'len': 115,
  'Total': 126,
  'percentage': 91.27,
  'value': -91.27},
 {'Flag': 'Within Range',
  'Category': 'Vitamin',
  'len': 13,
  'Total': 14,
  'percentage': 92.86,
  'value': -92.86},
 {'Flag': 'Within Range',
  'Category': 'CBC',
  'len': 43,
  'Total': 45,
  'percentage': 95.56,
  'value': -95.56},
 {'Flag': 'Within Range',
  'Category': 'KFT',
  'len': 54,
  'Total': 56,
  'percentage': 96.43,
  'value': -96.43},
 {'Flag': 'Within Range',
  'Category': 'Urine Examination',
  'len': 27,
  'Total': 28,
  'percentage': 96.43,
  'value': -96.43},
 {'Flag': 'Within Range',
  'Category': 'Anemia',
  'len': 38,
  'Total': 38,
  'percentage': 100.0,
  'value': -100.0},
 {'Flag': 'Within Range',
  'Category': 'Diabetes',
  'len': 22,
  'Total': 22,
  'percentage': 100.0,
  'value': -100.0},
 {'Flag': 'Within Range',
  'Category': 'Electrolyte',
  'len': 46,
  'Total': 46,
  'percentage': 100.0,
  'value': -100.0}]
)
shape: (21, 6)
Flag    Category    len Total   percentage  value
str str i64 i64 f64 f64
"Outof Range"   "Thyroid"   7   21  33.33   33.33
"Outof Range"   "Inflammatory Marker"   2   8   25.0    25.0
"Outof Range"   "Lipid" 12  63  19.05   19.05
"Outof Range"   "LFT"   14  87  16.09   16.09
"Outof Range"   "DLC"   11  126 8.73    8.73
…   …   …   …   …   …
"Within Range"  "KFT"   54  56  96.43   -96.43
"Within Range"  "Urine Examination" 27  28  96.43   -96.43
"Within Range"  "Anemia"    38  38  100.0   -100.0
"Within Range"  "Diabetes"  22  22  100.0   -100.0
"Within Range"  "Electrolyte"   46  46  100.0   -100.0

I have tried below three ways and none of them worked:

(so_df
#  .with_columns(labels_value = "("+str(pl.col("Total"))+") "+str(pl.col("percentage"))+"%")
#  .with_columns(labels_value = "".join(["(",str(pl.col("Total")),") ",str(pl.col("percentage")),"%"]))
#  .with_columns(labels_value =pl.concat_str([pl.col("Total"),pl.col("percentage")])))

Desired Result would be like:

(21) 33.33% # 1st row of labels_value column
(8) 25%
(63) 19.05%
.
.


Solution

  • Here are a few options:

    so_df.with_columns(
        labels_value_1=pl.format("({}) {}%", "Total", "percentage"),
        labels_value_2=pl.concat_str(
            pl.lit("("), "Total", pl.lit(") "), "percentage", pl.lit("%")
        ),
        labels_value_3=(
            "("
            + pl.col("Total").cast(pl.String)
            + ") "
            + pl.col("percentage").cast(pl.String)
            + "%"
        ),
    )
    
    # Only Total, percentage and outputs shown for brevity
    # ┌───────┬────────────┬────────────────┬────────────────┬────────────────┐
    # │ Total ┆ percentage ┆ labels_value_1 ┆ labels_value_2 ┆ labels_value_3 │
    # │ ---   ┆ ---        ┆ ---            ┆ ---            ┆ ---            │
    # │ i64   ┆ f64        ┆ str            ┆ str            ┆ str            │
    # ╞═══════╪════════════╪════════════════╪════════════════╪════════════════╡
    # │ 21    ┆ 33.33      ┆ (21) 33.33%    ┆ (21) 33.33%    ┆ (21) 33.33%    │
    # │ 8     ┆ 25.0       ┆ (8) 25.0%      ┆ (8) 25.0%      ┆ (8) 25.0%      │
    # │ 63    ┆ 19.05      ┆ (63) 19.05%    ┆ (63) 19.05%    ┆ (63) 19.05%    │
    # │ 87    ┆ 16.09      ┆ (87) 16.09%    ┆ (87) 16.09%    ┆ (87) 16.09%    │
    # │ 126   ┆ 8.73       ┆ (126) 8.73%    ┆ (126) 8.73%    ┆ (126) 8.73%    │
    # │ …     ┆ …          ┆ …              ┆ …              ┆ …              │
    # │ 56    ┆ 96.43      ┆ (56) 96.43%    ┆ (56) 96.43%    ┆ (56) 96.43%    │
    # │ 28    ┆ 96.43      ┆ (28) 96.43%    ┆ (28) 96.43%    ┆ (28) 96.43%    │
    # │ 38    ┆ 100.0      ┆ (38) 100.0%    ┆ (38) 100.0%    ┆ (38) 100.0%    │
    # │ 22    ┆ 100.0      ┆ (22) 100.0%    ┆ (22) 100.0%    ┆ (22) 100.0%    │
    # │ 46    ┆ 100.0      ┆ (46) 100.0%    ┆ (46) 100.0%    ┆ (46) 100.0%    │
    # └───────┴────────────┴────────────────┴────────────────┴────────────────┘
    

    pl.format I would say is most idiomatic, and it does the casting to string for you (much like Python's f-strings). It uses {} as a placeholder. This is what I'd personally recommend out of the options presented.

    For concat_str, you can add the parentheses and "%" with pl.lit (strings are parsed as column names in this function). It will cast numeric columns to string for you.

    Lastly, string concatenation with the + operator does work, but you must cast the column(s) with Polars' .cast method, rather than Python's builtin str.

    One thing to note is that row 2 of the output does differ slightly from your desired output. (8) 25% (expected) vs (8) 25.0% (actual). This is because the string representation of floating point columns in Polars contain at least one decimal place. If that is an issue, drop a comment and I will try to come up with a solution.