Polars .str.replace with expression or .str.split with regex


I have this dataframe:

import polars as pl

sample = pl.DataFrame({"equip": ['AmuletsMedals', 'Guns, CrossbowsOff-Hands', 'Melee WeaponsShieldsOff-Hands',
    'All Armor', 'Chest Armor', 'Shields', 'All WeaponsShieldsOff-Hands']})
print(sample)

shape: (7, 1)
┌───────────────────────────────┐
│ equip                         │
│ ---                           │
│ str                           │
╞═══════════════════════════════╡
│ AmuletsMedals                 │
│ Guns, CrossbowsOff-Hands      │
│ Melee WeaponsShieldsOff-Hands │
│ All Armor                     │
│ Chest Armor                   │
│ Shields                       │
│ All WeaponsShieldsOff-Hands   │
└───────────────────────────────┘

My aim is to insert a comma and a space wherever two words run together:

answer = pl.DataFrame({"equip": ['Amulets, Medals', 'Guns, Crossbows, Off-Hands', 'Melee Weapons, Shields, Off-Hands',
    'All Armor', 'Chest Armor', 'Shields', 'All Weapons, Shields, Off-Hands']})
print(answer)

shape: (7, 1)
┌─────────────────────────────────────┐
│ equip                               │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ Amulets, Medals                     │
│ Guns, Crossbows, Off-Hands          │
│ Melee Weapons, Shields, Off-Hand... │
│ All Armor                           │
│ Chest Armor                         │
│ Shields                             │
│ All Weapons, Shields, Off-Hands     │
└─────────────────────────────────────┘

I tried str.replace, but the replacement string is inserted literally rather than interpreted as a pattern:

sample.with_columns(pl.col("equip").str.replace("[a-z][A-Z]", "[a-z], [A-Z]"))

I also tried a tip I found on the Polars GitHub, but at each match it removes the last letter of the first word and the first letter of the next word, just as this does:

sample.with_columns(pl.col("equip").str.replace("[a-z][A-Z]", ", "))

Any ideas?

Bonus question: I imagine the answer for the simple case would also solve the harder case, but in case it does not, here is the hard case:

I have another column that needs a slightly harder regex pattern than "[a-z][A-Z]", probably something like "[a-z][A-Z]|[a-z]+|[a-z][1-9]" (I have not worked out the exact regex yet). The aim is again just to put a comma between attributes:

sample2 = pl.DataFrame({"attributes": ['+10% Aether Damage+30 Defensive Ability16% Aether Resistance6% Less Damage from Aetherials6% Less Damage from Aether Corruptions',
     '4-6 Aether Damage+25% Aether Damage10% Physical Damage converted to Aether DamageAether Tendril (Granted by Item)',
     '2-8 Lightning Damage+25% Lightning Damage+25% Electrocute Damage10% Physical Damage converted to Lightning DamageEmpowered Lightning Nova (Granted by Item)',
     '+10 Health Regenerated per Second+24 Armor20% Poison & Acid Resistance',
     '+22 Defensive Ability10% Chance to Avoid Projectiles+18 Armor',
     '+15 Physique+10% Shield Block ChanceShield Slam (Granted by Item)',
     '+10% Chaos Damage+30 Defensive Ability16% Chaos Resistance6% Less Damage from Chthonics']})

Solution

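  • You can use capture groups in the pattern and refer to them as $1 and $2 in the replacement. str.replace_all applies this to every match, not just the first:

    sample.with_columns(pl.col("equip").str.replace_all(r"([a-z])([A-Z])", "$1, $2"))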
    
    shape: (7, 1)
    ┌─────────────────────────────────────┐
    │ equip                               │
    │ ---                                 │
    │ str                                 │
    ╞═════════════════════════════════════╡
    │ Amulets, Medals                     │
    │ Guns, Crossbows, Off-Hands          │
    │ Melee Weapons, Shields, Off-Hand... │
    │ All Armor                           │
    │ Chest Armor                         │
    │ Shields                             │
    │ All Weapons, Shields, Off-Hands     │
    └─────────────────────────────────────┘
    

    You may also want to use the Unicode classes \p{lower} and \p{upper} instead of [a-z] and [A-Z], so that non-ASCII letters are matched as well.

    Polars uses the Rust regex crate, whose syntax is documented at https://docs.rs/regex/latest/regex/
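
For the bonus case, the same capture-group idea should carry over. The sketch below prototypes a candidate pattern with Python's standard re module on two of the attribute strings from the question. The boundary rule (a lowercase letter directly followed by an uppercase letter, a digit, or "+") is an assumption read off the sample data, and add_commas is a hypothetical helper name. Note that Python's re writes group references as \1 and \2, whereas polars uses $1 and $2.

```python
import re

# Assumed boundary rule (not part of the original answer): one attribute ends on a
# lowercase letter and the next starts with an uppercase letter, a digit, or "+".
BOUNDARY = re.compile(r"([a-z])([A-Z0-9+])")


def add_commas(text: str) -> str:
    """Insert ", " at each boundary while keeping both captured characters."""
    # Python's re uses \1 and \2 for groups; polars (Rust regex) uses $1 and $2.
    return BOUNDARY.sub(r"\1, \2", text)


# Two rows copied from the "attributes" column in the question.
attributes = [
    "+10 Health Regenerated per Second+24 Armor20% Poison & Acid Resistance",
    "+15 Physique+10% Shield Block ChanceShield Slam (Granted by Item)",
]

for row in attributes:
    print(add_commas(row))
```

If the pattern holds up on the full column, it could presumably be passed to str.replace_all with "$1, $2" as the replacement, as in the answer above. It would miss a boundary that follows a non-letter such as ")" or "%", which does not occur in the sample rows.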