I have a string column in a DataFrame whose values contain accents, like
'México', 'Albânia', 'Japão'
How can I replace the accented letters to get this:
'Mexico', 'Albania', 'Japao'
I tried many of the solutions available on Stack Overflow, like this one:
    import unicodedata

    def strip_accents(s):
        # Decompose each character (NFD), then drop the combining marks (category 'Mn')
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')
But, disappointingly, it returns
>>> 'M?xico'
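You can use translate, which replaces each character found in one string with the character at the same position in another: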
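    df = spark.createDataFrame(
        [('1', 'Japão'), ('2', 'Irã'), ('3', 'São Paulo'), ('5', 'Canadá'),
         ('6', 'Tókio'), ('7', 'México'), ('8', 'Albânia')],
        ["id", "Local"]
    )
    df.show(truncate=False)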
    +---+---------+
    |id |Local    |
    +---+---------+
    |1  |Japão    |
    |2  |Irã      |
    |3  |São Paulo|
    |5  |Canadá   |
    |6  |Tókio    |
    |7  |México   |
    |8  |Albânia  |
    +---+---------+
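    from pyspark.sql import functions as F
    # Assumed, illustrative character lists covering only the accents replaced in the output below
    df.withColumn('Loc_norm', F.translate('Local', 'ãáóé', 'aaoe')) \
      .show(truncate=False)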
    +---+---------+---------+
    |id |Local    |Loc_norm |
    +---+---------+---------+
    |1  |Japão    |Japao    |
    |2  |Irã      |Ira      |
    |3  |São Paulo|Sao Paulo|
    |5  |Canadá   |Canada   |
    |6  |Tókio    |Tokio    |
    |7  |México   |Mexico   |
    |8  |Albânia  |Albânia  |
    +---+---------+---------+
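
In the output above, Albânia is unchanged, which suggests that â is missing from the character lists passed to translate. Since translate maps characters one-to-one by position, every accented character you expect has to be listed explicitly. Below is a minimal sketch of one way to build the two strings instead of typing them by hand. The names build_translate_args and accented are hypothetical, and the character list is only an assumption about which accents appear in your data; extend it as needed.

```python
import unicodedata

def build_translate_args(chars):
    """Return (matching, replace) strings pairing each accented
    character with its unaccented base letter."""
    matching, replace = [], []
    for ch in chars:
        # NFD splits 'â' into 'a' plus a combining circumflex (category 'Mn')
        decomposed = unicodedata.normalize('NFD', ch)
        base = ''.join(c for c in decomposed
                       if unicodedata.category(c) != 'Mn')
        # Keep only clean one-to-one mappings, as translate requires
        if len(base) == 1 and base != ch:
            matching.append(ch)
            replace.append(base)
    return ''.join(matching), ''.join(replace)

# Assumed list of accented characters; adjust to your data
accented = 'áàâãäéèêëíìîïóòôõöúùûüçñÁÀÂÃÄÉÈÊËÍÌÎÏÓÒÔÕÖÚÙÛÜÇÑ'
matching, replace = build_translate_args(accented)

# Plain-Python check of the same positional mapping
table = str.maketrans(matching, replace)
print('Albânia'.translate(table))  # → Albania

# In Spark, with the df and F from the answer above:
# df.withColumn('Loc_norm', F.translate('Local', matching, replace)).show(truncate=False)
```

Because this stays within Spark's built-in translate, it avoids a Python UDF; the trade-off is that any accented character not listed in accented passes through unchanged.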