python regex pyspark

Regex that removes whitespace between two specific characters


In PySpark I have the following expression:

df.withColumn('new_descriptions',lower(regexp_replace('descriptions',r"\t+",'')))

It removes tab characters and lowercases my descriptions column.

Here is a sample of values from my descriptions column:

['banha frimesa 450 gr','manteiga com sal tourinho pote 200 g','acucar refinado caravelas pacote 1kg',
'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']

What I want is to remove the whitespace between a value and its unit. For example, banha frimesa 450 gr should become banha frimesa 450gr.

But I also need to avoid removing the whitespace between a digit and a following digit that carries the unit.

For example, farinha de trigo especial 101 5kg should stay the same.

What kind of regex should I use to remove only the whitespace between a kg, ml, l, or g unit and its value?

Wanted Result:

['banha frimesa 450gr','manteiga com sal tourinho pote 200g','acucar refinado caravelas pacote 1kg',
    'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']

Solution

  • You could replace whitespace that is preceded by a digit (lookbehind) and followed by a letter (lookahead) with an empty string. To be stricter, you could list all the possible units in the lookahead instead of matching any letter.

    r'(?<=\d)\s+(?=[a-zA-Z])'
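
As a rough check of the stricter variant, here is a minimal sketch in plain Python using the re module. The unit list (kg, ml, gr, g, l) is an assumption pieced together from the question and its samples, and the names UNIT_GAP and join_value_and_unit are made up for illustration. The trailing \b is there so that a unit does not match the start of a longer word.

```python
import re

# Whitespace preceded by a digit and followed by a whole unit token.
# The unit list is an assumption based on the question; extend it as needed.
UNIT_GAP = re.compile(r'(?<=\d)\s+(?=(?:kg|ml|gr|g|l)\b)')


def join_value_and_unit(text):
    """Remove the whitespace between a number and the unit that follows it."""
    return UNIT_GAP.sub('', text)


samples = [
    'banha frimesa 450 gr',
    'manteiga com sal tourinho pote 200 g',
    'acucar refinado caravelas pacote 1kg',
    'acucar refinado light uniao fit pacote 500g',
    'farinha de trigo especial 101 5kg',
]

for s in samples:
    print(join_value_and_unit(s))
```

Spark's regexp_replace uses Java regular expressions, which also support lookbehind and lookahead, so the same pattern should be usable as the second argument of a regexp_replace wrapped around the existing expression. That part has not been run here.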