Search code examples
pandasdataframeinteger

What is '\u200d1500'?


I crawled data from a website which was in string format I replaced string character and now data only contains number. But when I want to convert this column to numeric I get that error. I have two columns which first is previous_prices other is now_prices. If now the product is not on sale program fill nas with previous_prices. Previous_prices type is int64, now_prices type is object. Error is: ValueError: invalid literal for int() with base 10: '\u200d1500'.

Actually I saw a similiar question but that question is not relevant to '\u200d1500'.

now_prices_after_fillna
1450
‍1500
700
1700
2090

There are strange situation when When I change now_prices to integer and then fill na with previous_prices general data type was int. But when I want to export that data to excel I get this error. I can not understand problem.

enter image description here


Solution

  • Because \u200d is not printable character, here is solution for remove it and converting to integers:

    df = pd.DataFrame({'now_prices_after_fillna':['1450', u'\u200d1500']})
        
    print (df)
      now_prices_after_fillna
    0                    1450
    1                   ‍1500
    
    #https://stackoverflow.com/a/54451873/2901002
    import sys
    
    # build a table mapping all non-printable characters to None
    NOPRINT_TRANS_TABLE = {
        i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
    }
    
    def make_printable(s):
        """Replace non-printable characters in a string."""
    
        # the translate method on str removes characters
        # that map to None from the string
        return s.translate(NOPRINT_TRANS_TABLE)
    
    
    df['now_prices_after_fillna'] = (df['now_prices_after_fillna'].apply(make_printable)
                                                                  .astype(int))
    print (df)
       now_prices_after_fillna
    0                     1450
    1                     1500
    

    Another idea if mixed numeric with strings values add try with except statement:

    df = pd.DataFrame({'now_prices_after_fillna':['1450', u'\u200d1500', 1000]})
        
    print (df)
    
    #https://stackoverflow.com/a/54451873/2901002
    import sys
    
    # build a table mapping all non-printable characters to None
    NOPRINT_TRANS_TABLE = {
        i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
    }
    
    def make_printable(s):
        """Replace non-printable characters in a string."""
    
        # the translate method on str removes characters
        # that map to None from the string
        try:
            return s.translate(NOPRINT_TRANS_TABLE)
        except AttributeError:
            return s
    
    
    df['now_prices_after_fillna'] = (df['now_prices_after_fillna'].apply(make_printable)
                                                                  .astype(int))
    print (df)
       now_prices_after_fillna
    0                     1450
    1                     1500
    2                     1000
    

    Test your real data:

    df = pd.read_excel('your_updated_file2222.xlsx')
    
    #https://stackoverflow.com/a/54451873/2901002
    import sys
    
    # build a table mapping all non-printable characters to None
    NOPRINT_TRANS_TABLE = {
        i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
    }
    
    def make_printable(s):
        """Replace non-printable characters in a string."""
    
        # the translate method on str removes characters
        # that map to None from the string
        try:
            return s.translate(NOPRINT_TRANS_TABLE)
        except AttributeError:
            return s
    

    df['price'] = df['price'].apply(make_printable).astype(int)
    
    print (df)
         price
    0     1450
    1     1500
    2      700
    3     1700
    4     2090
    ..     ...
    206   1500
    207   1290
    208   1500
    209   1560
    210   1800
    
    [211 rows x 1 columns]