I have a dataset which contains data scraped from a real estate agency. I want to clean the variables from special characters (€, m², etc), remove whitespaces and convert them into integers. For instance I have these for the price (Prix de vente) and size (Surface habitable):
I manage to strip the "€" from the price variable with the following:
data["Prix de vente"] = data["Prix de vente"].str.strip("€")
However none of these two techniques work to remove the whitespaces
data["Prix de vente"] = data["Prix de vente"].str.strip()
data["Prix de vente"] = data["Prix de vente"].replace(" ","")
I keep having:
540 000
900 000
890 000
900 000
900 000
...
I have a feeling this is due to the way my data is encoded. Would that be the issue? And if so, how do I deal with it? I'm using a jupyter notebook.
You can do
data["Prix de vente"] = data["Prix de vente"].str.replace("\s","", regex=True)
Matching "\s"
combined with regex=True
instead of " "
makes sure to match not only regular spaces but any whitespace character
More background:
Your first approach
data["Prix de vente"] = data["Prix de vente"].str.strip()
does not work, since .strip() only removes leading and trailing characters.
Your second approach
data["Prix de vente"] = data["Prix de vente"].replace(" ","")
does not work, since it is using not str.replace, but pd.Series.replace, which only replaces exact matches of values (e.g. "900 000").