Search code examples
python-3.xstringpandasencoding

Remove unicode '\xa0' from pandas column


I was given a latin-1 .txt dataset, which I am trying to clean up to use for proper analysis using python 3 and pandas. The dataset, being scraped from html contains a number of \xa0 occurences, which I can't seem to deal with using decode, strip, -u replace, or any other method which I found on stack overflow. All of my attempts seem to be ignored by python, still printing out the same results. As I am new to data scraping, chances are that I missed out on something obvious, but right now I don't see a way forward

I have tried to decode to ascii, strip to str and then replace, or replace using a -u clause, not leading to anything

filepath = 'meow3.txt'  
outF = open("myOutFile.txt", "a")
with open(filepath) as fp:  
    line = fp.readline()
    for line in fp:
        if line.strip().startswith(','):
            line = line.replace(',','',1)
        line = line.replace(u'\xa0', u' ')
        print(line)

df = pd.read_csv('myOutFile.txt', sep=",", encoding="latin-1", header=None, names=["Company name", "Total", "Invested since-to"])

print (df)
3sun Group, £10m ,Feb 2014 
,Abacus, £4.15m ,Aug 2013 
,Accsys Group ,£12m, Mar 2017, 
Acro ,\xa0£7.8m, Nov 2015 – Sep 2017, 
ACS Clothing, £25.3m ,Jan 2014 

this is how the dataset looks like, and why in my code I am removing the first comma provided it is at the start of the column. But none of the suggested answers I tried seemed to help with removing the \xa0 part of the dataset, still giving the same result (seen above). If anyone has any clue for how I could make this work, I would be very grateful, Cheers, Jericho

Edit: While I know this would be best dealt with by pre-processing before turning it into txt file, I have no access or control of that process, and I have to work with the data I was given.


Solution

  • I suddenly stuck by this problem today and finally find a quickest and neatest solution. Say your pandas dataframe has a column with values like 'South\xa0Carolina'.

    Use the following code to remove all '\xa0'. Actually I have tried .replace("\xa0"," ") and .replace(r'\xa0',r' '), but none of them worked.

    data['col'].str.split().str.join(' ')