Search code examples
pythonpandasreplacewhere-clausenp

replace function giving extra character at the end


I'm trying to clean the data of a column in a dataframe by using the replace function. The output keeps giving me an extra character at the end. The more I run the same code, the more characters added at the end. Can anyone please help me on this, please?

owner = ['China' 'Chinese' 'Hong Kong' 'Hongkong' 'China (Taiwan)' 'Japan'
 'Sweden' 'Canada' 'HK' 'United States' 'Indian' 'American' 'Japanese'
 'U.S.' 'Taiwanese' 115 'Australia' 'HongKong' 'France' 'Taiwan'
 'Malaysia' 'Taiwan, China' 1380 'Switzerland' 'US' 'Netherlands'
 'Chinese-Hongkong' 208 447 153 151 'Ireland' 'Taiwan China' 'china'
 'Taiwan of China' 'Mainland China' 'Chinese(HK)' 'HONG KONG OF CHINA'
 'USA' 'Korea, Republic of' 'Chinses' 1834 'South Africa' 40 184 190 427
 'German' 'Singapore' 'The philipines' 193 397 'Janpan' 'Japan and Taiwan'
 48 46 1274 'Chines' 1641 89 'Korea' 50 43 380 'Hong Kong, China'
 'China (HK)' 85 'Germany' 'English' 205 'Hongkong, China' 35
 'South Korean' 'British Virgin Islands' 'China (Hong Kong)' 156 490 94 95
 138 319 'Mandarin' 'Spainish' 'South Korea' 'Hongkong,China' 'U.S.A'
 'Hongkong China' 2583 98 'Korean' 5000 'India' 100 38 'chinese' 'The USA'
 'Canadian' 'Taiwan/HK/Macao' 'Chinese(Taiwan)' 'Republic of Korea'
 'China and South Korea' 'South korea' 'China,Korean' 'Denmark and China'
 272 256 235 143 0 'UK' 73 'Sri Lanka' 240 159 275 'Tai Wan' 192
 'China Taiwan' 225 146 78 200 'Amazon report' 'Chian' 'Not provided'
 'China, Hongkong' 'Thailand' 37 97 77 191 2951 897 140 199 636 'Macao']

dict = {'Owner': owner}

df = pd.DataFrame(dict) 

df.replace('China', 'Chinese', regex=True, inplace=True)
df.replace('china', 'Chinese', regex=True, inplace=True)
df.replace('Mainland China', 'Chinese', regex=True, inplace=True)
df.replace('Chinses', 'Chinese', regex=True, inplace=True)
df.replace('Chines', 'Chinese', regex=True, inplace=True)

df['Owner new'] = np.where(df['Owner'] != 'Chinese', 'Foriegn', df['Owner'])

print(df['Owner'].unique())

Output I get:

['Chinesee' 'Hong Kong' 'Hongkong' 'Chinesee (Taiwan)' 'Japan' 'Sweden'
 'Canada' 'HK' 'United States' 'Indian' 'American' 'Japanese' 'U.S.'
 'Taiwanese' 115 'Australia' 'HongKong' 'France' 'Taiwan' 'Malaysia'
 'Taiwan, Chinesee' 1380 'Switzerland' 'US' 'Netherlands'
 'Chinesee-Hongkong' 208 447 153 151 'Ireland' 'Taiwan Chinesee'
 'Taiwan of Chinesee' 'Mainland Chinesee' 'Chinesee(HK)'
 'HONG KONG OF CHINA' 'USA' 'Korea, Republic of' 1834 'South Africa' 40
 184 190 427 'German' 'Singapore' 'The philipines' 193 397 'Janpan'
 'Japan and Taiwan' 48 46 1274 'Chinese' 1641 89 'Korea' 50 43 380
 'Hong Kong, Chinesee' 'Chinesee (HK)' 85 'Germany' 'English' 205
 'Hongkong, Chinesee' 35 'South Korean' 'British Virgin Islands'
 'Chinesee (Hong Kong)' 156 490 94 95 138 319 'Mandarin' 'Spainish'
 'South Korea' 'Hongkong,Chinesee' 'U.S.A' 'Hongkong Chinesee' 2583 98
 'Korean' 5000 'India' 100 38 'chinese' 'The USA' 'Canadian'
 'Taiwan/HK/Macao' 'Chinesee(Taiwan)' 'Republic of Korea'
 'Chinesee and South Korea' 'South korea' 'Chinesee,Korean'
 'Denmark and Chinesee' 272 256 235 143 0 'UK' 73 'Sri Lanka' 240 159 275
 'Tai Wan' 192 'Chinesee Taiwan' 225 146 78 200 'Amazon report' 'Chian'
 'Not provided' 'Chinesee, Hongkong' 'Thailand' 37 97 77 191 2951 897 140
 199 636 'Macao']

Solution

  • In your case you are also replacing substrings (as explained in my comment). As you are trying to replace whole words you should add ^ and $ at the beginning and end of the words respectively. Then, only whole words that match will be replaced. E.g.:

    The case above:

    >>> df = pd.DataFrame(["Chinese"])
    >>> df.replace("Chines", "China", regex=True)
            0
    0  Chinae
    

    Solution 1: using regex ^ and $.

    >>> df.replace("^Chines$", "China", regex=True)
             0
    0  Chinese
    

    Solution 2: set regex=False such that only whole words are matched. (regex is False by default)

    >>> df.replace("Chines", "China")
             0
    0  Chinese