In PySpark I have the following expression:
df.withColumn('new_descriptions',lower(regexp_replace('descriptions',r"\t+",'')))
This removes tab characters and lowercases the descriptions column.
Here is a sample of values from the descriptions column:
['banha frimesa 450 gr','manteiga com sal tourinho pote 200 g','acucar refinado caravelas pacote 1kg',
'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']
What I want is to remove the whitespace between a value and its unit. For example, 'banha frimesa 450 gr' should become 'banha frimesa 450gr'.
But I also need to avoid removing the whitespace between a digit and a following digit that carries the unit.
For example, 'farinha de trigo especial 101 5kg' should stay the same.
What regex should I use to remove only the whitespace between a value and its unit (kg, ml, l, g)?
Desired result:
['banha frimesa 450gr','manteiga com sal tourinho pote 200g','acucar refinado caravelas pacote 1kg',
'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']
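You could replace whitespace that is preceded by a digit and followed by a letter (you could also list the allowed units in the lookahead instead of matching any letter). In '101 5kg' the whitespace is followed by a digit, not a letter, so it is left alone.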
r'(?<=\d)\s+(?=[a-zA-Z])'
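To check the pattern against the sample values without a Spark session, here is a minimal sketch using Python's built-in re module. The names GENERIC, UNITS and join_units are made up for illustration, and the unit list (kg, gr, g, ml, l) is an assumption taken from the question's examples; adjust it to your data.

```python
import re

samples = [
    'banha frimesa 450 gr',
    'manteiga com sal tourinho pote 200 g',
    'acucar refinado caravelas pacote 1kg',
    'acucar refinado light uniao fit pacote 500g',
    'farinha de trigo especial 101 5kg',
]

# Pattern from the answer: whitespace between a digit and any letter.
GENERIC = re.compile(r'(?<=\d)\s+(?=[a-zA-Z])')

# Stricter variant (assumed unit list): whitespace between a digit and a
# known unit that ends at a word boundary.
UNITS = re.compile(r'(?<=\d)\s+(?=(?:kg|gr|g|ml|l)\b)')


def join_units(text, pattern=UNITS):
    """Remove the whitespace matched by `pattern` (hypothetical helper)."""
    return pattern.sub('', text)


cleaned = [join_units(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f'{before!r} -> {after!r}')
```

The stricter variant leaves something like 'caixa 12 unidades' untouched, whereas the generic pattern would turn it into 'caixa 12unidades'. Spark's regexp_replace uses Java regular expressions, which also support lookbehind and lookahead, so either pattern string should be usable as the second argument of regexp_replace in the original expression.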