python mongodb unicode character-encoding mongoexport

MongoDB - Unexpected character encodings when using mongoexport


I am running mongoexport on a collection that contains non-ASCII characters stored as UTF-8, as well as fields containing characters that mongoexport escapes (e.g., '&'). What I'm noticing is that mongoexport writes '&' as a Unicode escape (\u0026) but leaves characters like 'ü' unescaped. This is a problem because I am trying to read the data with Python and cannot decode it properly, since two different encodings are in play.

For example (mongo query to get record):

db.Military_Handbooks.findOne({_id: ObjectId("5bf61c80e173a2a10b53ad39")}).PRIMARY_AUTHOR

[
  "Dürer, Albrecht",
  [
    [
      "http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=Dürer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order=",
      " Dürer, Albrecht"
    ]
  ]
]

Running the following mongoexport command (the result is the same when exporting to JSON):

mongoexport --db ustc --collection Military_Handbooks --type=csv -f=PRIMARY_AUTHOR --limit=1
"[""Dürer, Albrecht"",[[""http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\u0026tm_field_allauthr=Dürer, Albrecht\u0026tm_translator=\u0026tm_editor=\u0026tm_field_short_title=\u0026tm_field_imprint=\u0026tm_field_place=\u0026sm_field_year=\u0026f_sm_field_year=\u0026t_sm_field_year=\u0026sm_field_country=\u0026sm_field_lang=\u0026sm_field_format=\u0026sm_field_digital=\u0026tm_field_class=\u0026tm_field_cit_name=\u0026tm_field_cit_no=\u0026order="","" Dürer, Albrecht""]]]"

When trying to read this into Python:

In [24]: import pandas
In [25]: c = pandas.read_csv('Military_Handbooks2.csv')
In [26]: c.at[1, 'PRIMARY_AUTHOR']
Out[26]: '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\\u0026tm_field_allauthr=Dürer, Albrecht\\u0026tm_translator=\\u0026tm_editor=\\u0026tm_field_short_title=\\u0026tm_field_imprint=\\u0026tm_field_place=\\u0026sm_field_year=\\u0026f_sm_field_year=\\u0026t_sm_field_year=\\u0026sm_field_country=\\u0026sm_field_lang=\\u0026sm_field_format=\\u0026sm_field_digital=\\u0026tm_field_class=\\u0026tm_field_cit_name=\\u0026tm_field_cit_no=\\u0026order="," Dürer, Albrecht"]]]'
In [27]: c.at[1, 'PRIMARY_AUTHOR'].encode().decode('unicode-escape')
Out[27]: '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=Dürer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order="," Dürer, Albrecht"]]]'
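For comparison, `\u0026` is a valid JSON escape for '&', so the cell can be treated as a JSON string instead of as a mix of two encodings. The sketch below assumes every exported cell is valid JSON. The `cell` value is a shortened, hypothetical stand-in for the real export, not the full URL. It also shows why `unicode-escape` is a poor fit here: in Python 3 that codec reads the UTF-8 bytes as Latin-1, so a literal 'ü' typically comes back garbled.

```python
import json

# Shortened stand-in for the exported cell (hypothetical; the real URL has
# many more parameters). 'ü' is a literal character, while '&' appears as
# the JSON escape \u0026.
cell = (
    '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero'
    '?tm_fulltext=\\u0026tm_field_allauthr=Dürer, Albrecht\\u0026order=",'
    '" Dürer, Albrecht"]]]'
)

# unicode-escape resolves \u0026, but it also reinterprets the UTF-8 bytes
# of 'ü' as two Latin-1 characters.
garbled = cell.encode().decode('unicode-escape')

# json.loads resolves \u0026 and leaves literal non-ASCII characters alone.
parsed = json.loads(cell)
author = parsed[0]
url = parsed[1][0][0]

print(author)
print(url)
```

With pandas, the same idea could be applied per cell, for example `c['PRIMARY_AUTHOR'].map(json.loads)`, again assuming every value in the column parses as JSON.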

Specs:
OS: Ubuntu 18.04.1 LTS
Python: 3.6.7
MongoDB shell version v3.6.9


Solution

  • In the end, re-encoding the files while ignoring errors seems to have done the trick.

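    import codecs
    import os

    def encoding():
        for fn in os.listdir('.'):
            # Skip files whose names contain '2', 'failed', or 'decode' (e.g. earlier output).
            if '2' not in fn and 'failed' not in fn and 'decode' not in fn:
                try:
                    with codecs.open(fn, encoding='utf-8') as fd:
                        text = fd.read()
                    # Recover the raw bytes via Windows-1252, then read them as UTF-8,
                    # silently dropping anything that cannot be converted.
                    text = text.encode('Windows-1252', errors='ignore').decode('utf-8', errors='ignore')
                    # Write the result next to the original as <name>2.csv.
                    with codecs.open(fn[:fn.rfind('.')]+'2.csv', 'w', encoding='utf-8') as fd:
                        fd.write(text)
                except Exception as ex:
                    print(ex)
                    print('*'*50, '\n')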
    

    I should also note that I was linked to this post, which was helpful: how to export correctly accented words with mongoexport.
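To make the effect of that round trip concrete, here is a minimal sketch on two hypothetical sample strings (neither is taken from the actual export). It suggests the approach repairs text whose UTF-8 bytes were previously misread as Windows-1252, but that `errors='ignore'` silently drops non-ASCII characters from text that was already correct, so it is worth checking which case a given file falls into before re-encoding it.

```python
def roundtrip(text):
    # Same conversion as in the solution: back to bytes via Windows-1252,
    # then decode those bytes as UTF-8, ignoring anything invalid.
    return text.encode('Windows-1252', errors='ignore').decode('utf-8', errors='ignore')

# Mojibake: the UTF-8 bytes of 'ü' were once misread as Windows-1252.
broken = 'DÃ¼rer, Albrecht'
repaired = roundtrip(broken)

# Already-correct text: 'ü' becomes the single byte 0xFC, which is not
# valid UTF-8 and is therefore dropped.
correct = 'Dürer, Albrecht'
damaged = roundtrip(correct)

print(repaired)
print(damaged)
```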