python pandas utf-8 to-json

How to fix "OverflowError: Unsupported UTF-8 sequence length when encoding string"


I am getting the following error while converting a pandas DataFrame to JSON:

OverflowError: Unsupported UTF-8 sequence length when encoding string

This is the code:

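        import s3fs

        # Serialize the DataFrame to a JSON string, then encode it as UTF-8 bytes
        bytes_to_write = data.to_json(orient='records').encode()
        fs = s3fs.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
        # Write the bytes to the target S3 path
        with fs.open(file, 'wb') as f:
            f.write(bytes_to_write)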

The data I am trying to convert to JSON contains a lot of UTF-8 (non-ASCII) characters.

How can I solve this?


Solution

  • As this answer suggests, I converted the DataFrame with .to_json() and passed the default_handler parameter; see the pandas documentation for DataFrame.to_json.

    The key part is default_handler=str. The handler is called for any object that cannot otherwise be converted to a JSON-compatible value, so passing str serializes such values as strings instead of raising the error above. The documentation describes the parameter in detail.

    dataframe.to_json('foo.json', default_handler=str) 
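
    To see what default_handler does in isolation, here is a minimal sketch. It does not reproduce the OverflowError from the question; it only assumes a hypothetical DataFrame df with a complex-valued column, which the encoder cannot serialize on its own, and shows that default_handler=str falls back to the string form of each such value.

```python
import json

import pandas as pd

# Hypothetical data for illustration: a complex-valued column, which the
# JSON encoder cannot serialize without help.
df = pd.DataFrame({"value": [1.0, 2.0, complex(1.0, 2.0)]})

# default_handler is called for each value the encoder cannot convert;
# passing str stores those values as their string representation.
json_text = df.to_json(orient="records", default_handler=str)

# The result is valid JSON and can be parsed back.
records = json.loads(json_text)
print(records)
```

    Note that the affected values come back as strings, so anything reading the JSON has to convert them again if it needs the original type.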
    

    Also keep in mind that the function can lay out the JSON in different ways. The orient='<option>' parameter controls this, as the documentation says:

    orient: str
    Indication of expected JSON string format.
    ...
    The format of the JSON string:
    
    - ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
    - ‘records’ : list like [{column -> value}, … , {column -> value}]
    - ‘index’ : dict like {index -> {column -> value}}
    - ‘columns’ : dict like {column -> {index -> value}}
    - ‘values’ : just the values array
    - ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
      Describing the data, where data component is like orient='records'.
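
The orient options listed above can be compared side by side with a small sketch. The DataFrame df and its column names are made up for illustration; each result is parsed back with json.loads so the resulting structure is easy to inspect.

```python
import json

import pandas as pd

# Hypothetical two-row DataFrame used only for illustration.
df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# Serialize with each orient and parse the result to inspect its shape.
parsed = {
    orient: json.loads(df.to_json(orient=orient))
    for orient in ["split", "records", "index", "columns", "values", "table"]
}

for orient, value in parsed.items():
    print(orient, "->", value)
```

orient='records', the option used in the question, gives one JSON object per row, which is usually the most convenient layout when another program consumes the file row by row.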