Search code examples
pythoncharacter-encodingstrip

python - removing whitespaces not working - encoding issue?


I have a dataset which contains data scraped from a real estate agency. I want to clean the variables from special characters (€, m², etc), remove whitespaces and convert them into integers. For instance I have these for the price (Prix de vente) and size (Surface habitable):

enter image description here

I manage to strip the "€" from the price variable with the following:

data["Prix de vente"] = data["Prix de vente"].str.strip("€")

However none of these two techniques work to remove the whitespaces

data["Prix de vente"] = data["Prix de vente"].str.strip() 
data["Prix de vente"] = data["Prix de vente"].replace(" ","")

I keep having:

      540 000
      900 000
      890 000
      900 000
      900 000
      ...

I have a feeling this is due to the way my data is encoded. Would that be the issue? And if so, how do I deal with it? I'm using a jupyter notebook.


Solution

  • You can do

    data["Prix de vente"] = data["Prix de vente"].str.replace("\s","", regex=True)
    

    Matching "\s" combined with regex=True instead of " " makes sure to match not only regular spaces but any whitespace character

    More background:

    Your first approach

    data["Prix de vente"] = data["Prix de vente"].str.strip()
    

    does not work, since .strip() only removes leading and trailing characters.

    Your second approach

    data["Prix de vente"] = data["Prix de vente"].replace(" ","")
    

    does not work, since it is using not str.replace, but pd.Series.replace, which only replaces exact matches of values (e.g. "900 000").