How to read in dataset containing special characters in pandas


I am trying to read in the following dataset: https://data.opensanctions.org/datasets/20230620/default/names.txt

I have run this code:

import pandas as pd

filename = "https://data.opensanctions.org/datasets/20230620/default/names.txt"

df = pd.read_csv(filename, encoding='latin1', nrows=2, header=None)
print(df)

The dataframe looks like this:

                                                   0
0                                SANAVBARI NIKITENKO
1  ÐÐÐÐÐТ Ð ÐÐÐÐÐÐÐÐÐ ÐÐ¥ÐÐÐÐ...

How can I automatically detect the character encoding, so that the special characters are read correctly, when I read in the file?


Solution

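  • The file is UTF-8 encoded, so decoding it as latin1 turns every byte of each multi-byte Cyrillic character into a separate character. It works for me after removing encoding='latin1', so that the default encoding='utf-8' is used: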

    import pandas as pd

    filename = "https://data.opensanctions.org/datasets/20230620/default/names.txt"

    # No encoding argument: read_csv falls back to its default, UTF-8
    df = pd.read_csv(filename, nrows=2, header=None)
    print(df)
                                0
    0         SANAVBARI NIKITENKO
    1  АМИНАТ РАМЗАНОВНА АХМАДОВА
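
As for detecting the encoding automatically, one possible approach is to try a short list of candidate encodings in order and keep the first one that decodes without an error. The sketch below is only an illustration, not part of pandas: the helper name decode_with_fallback and the candidate list are assumptions made for this example, and the sample bytes are built locally instead of being downloaded. UTF-8 goes first because it is strict and rejects most non-UTF-8 input, while latin1 goes last because it accepts any byte sequence and therefore never raises.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1251", "latin1")):
    """Hypothetical helper: return (text, encoding) for the first candidate
    encoding that decodes the bytes without raising an error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Stand-in for the downloaded file contents (assumption: the file is UTF-8,
# which matches the working output above).
raw = "SANAVBARI NIKITENKO\nАМИНАТ РАМЗАНОВНА АХМАДОВА\n".encode("utf-8")

text, used = decode_with_fallback(raw)
print(used)
print(text.split("\n")[1])

# latin1 maps every byte to exactly one character, so it never fails.
# On UTF-8 input it silently yields two garbage characters per Cyrillic letter.
garbled = raw.decode("latin1")
print(len(text), len(garbled))

# The damage is reversible as long as the garbled text was not modified.
repaired = garbled.encode("latin1").decode("utf-8")
```

The encoding name returned by the helper can then be passed to the encoding parameter of pd.read_csv. Keep in mind that this only proves the bytes are valid in that encoding, not that it is the right one, so the order of the candidates matters. Third-party detection libraries such as chardet or charset-normalizer make a statistical guess instead, which can help when the candidates are not known in advance.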