Search code examples
pandasdataframeencodingread-csv

Unreadable test with list of encoding in pandas


I am trying to read in this dataset

path = "https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1165013/UK_Sanctions_List.ods"

By using this code (I have seen there are quite a lot of threads/suggested solutions to this around, but the following one seems to be the most reasonable one):

encoding_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737'
                 , 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862'
                 , 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950'
                 , 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254'
                 , 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr'
                 , 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2'
                 , 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2'
                 , 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9'
                 , 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab'
                 , 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2'
                 , 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32'
                 , 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

for encoding in encoding_list:
    worked = True
    try:
        df = pd.read_csv(path, encoding=encoding, nrows=5)
        print(df)
    except:
        worked = False
    if worked:
        print(encoding, ':\n', df.head())
    

But when I print the dataframe the results look unreadable, like this:

ËäjñEÎÜ'g
«sQğøÆÿŞmÿ´;Ğ´³µÇÇm®©sbH«iw...  ¿`Ò­ìş#mxOnBXvFî&ƪPÊz1á3uoj_g
¢x>æi7¸}Z«¤õÔ3ÎílW|ùÍx¡c;PÓ©kê+_ëͪ...                                NaN

                                                          qJ|HfÆzÖ¤c[¨ÿ`ÉŞ` *ª
b¾?]ÔüR~
¾GÌOmxÜ?=v좦Í`                                                     NaN
D¾Å¢
Æ·äÎQ´
ûò£^×%óÒ·$]qÓ´În[l'ß                                        NaN
                                                                                &.
ËäjñEÎ"'g
«sQ}øÆÿ@mÿ´;!´³µ¢¢m®©sbH«iw...  ¿ýÒ­ì¦ÖmxOnBXvFî&ƪPÊz1á3uoj_g
^x>æi7¸ðZ«€õÔ3ÎílW]ùÍx¡c;PÓ©kê+_ëͪ...                                NaN

                                                          qJ]HfÆz#€cǨÿýÉ@ý *ª
b¾?ÐÔ\Rö
¾GÌOmx"?=vì^þÍý                                                     NaN
D¾Å^
Æ·äÎQ´
ûò£¬×%óÒ·ÝÐqÓ´ÎnÇl'ß                                        NaN
cp1140 :
                                                                                 &.
ËäjñEÎ"'g
«sQ}øÆÿ@mÿ´;!´³µ¢¢m®©sbH«iw...  ¿ýÒ­ì¦ÖmxOnBXvFî&ƪPÊz1á3uoj_g
^x>æi7¸ðZ«€õÔ3ÎílW]ùÍx¡c;PÓ©kê+_ëͪ...                                NaN

                                                          qJ]HfÆz#€cǨÿýÉ@ý *ª
b¾?ÐÔ\Rö
¾GÌOmx"?=vì^þÍý                                                     NaN
D¾Å^
Æ·äÎQ´
ûò£¬×%óÒ·ÝÐqÓ´ÎnÇl'ß                                        NaN

Does anybody know how I can read it in by any chance?


Solution

  • This is not a CSV file, but rather an ODS (Open Document Spreadsheet) file.

    You should use pandas.read_excel (ensuring the odpfy module is installed):

    # pip install odfpy
    
    df = pd.read_excel("UK_Sanctions_List_2.ods", skiprows=2)
    

    NB. the process is quite slow, so be patient. The original file wasn't working for me but opening and saving it in LibreOffice did the trick. Another option would be to open the data in LibreOffice and to convert to CSV from there.

    Output (first 5 rows):

      Last Updated Unique ID  OFSI Group ID UN Reference Number                                      Name 6  Name 1  Name 2  Name 3  Name 4  Name 5  ... IMO number  Current owner/operator (s)  \
    0   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
    1   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
    2   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
    3   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
    4   2022-01-12   AFG0001          12703             TAe.010  HAJI KHAIRULLAH HAJI SATTAR MONEY EXCHANGE     NaN     NaN     NaN     NaN     NaN  ...        NaN                         NaN   
    
       Previous owner/operator (s) Current believed flag of ship  Previous flags  Type of ship Tonnage of ship Length of ship Year Built Hull identification number (HIN)  
    0                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
    1                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
    2                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
    3                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN  
    4                          NaN                           NaN             NaN           NaN             NaN            NaN        NaN                              NaN