Search code examples
pythonjsoncsvcharacter-encodingfastapi

Python FastAPI: How to return a Response with Unicode or non-ASCII characters encoded into JSON or CSV data?


I am creating a FastAPI application that triggers file downloading through the StreamingResponse class (see FastAPI docs). This part is actually ok.

My problem is that when the file contains accent (e.g., é) or another special character, it seems to not encode it well.

For example, when there is a é, in a CSV it will be transformed to é, and in a JSON to \u00e9.

My code looks something like this:

  • For JSON
# API CONTENT
# ...

return StreamingResponse(io.StringIO(json.dumps(data)), headers={"Content-Disposition": "filename=filename.json")
  • For CSV
# API CONTENT
# ...

return StreamingResponse(io.StringIO(pandas.DataFrame(data).to_csv(index=False)), headers={"Content-Disposition": f"filename=filename.csv"})

In order to fix the encoding, I also tried to:

  • Add media_type="text/csv; charset=utf-8" in the CSV part but without success.

  • Add "Content-Type": "application/octet-stream; charset=utf-8" in the header part but without success too.

  • Tried to replace StreamingResponse by Response .

Has somebody already faced this kind of problem? I would like to note that the content before adding it to StreamingResponse is well encoded.

Here is a sample of code to test:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import pandas as pd
import io
import json

app = FastAPI()

data = [
    {"éète": "test", "age": 10},
    {"éète": "test2", "age": 5},
]

@app.get("/download_json")
async def download_json():
    return StreamingResponse(io.StringIO(json.dumps(data)), headers={"Content-Disposition": "filename=data.json"})

@app.get("/download_csv")
async def download_csv():
    return StreamingResponse(io.StringIO(pd.DataFrame(data).to_csv(index=False)), headers={"Content-Disposition": "filename=data.csv"})```

Solution

  • Python's json module, by default, converts non-ASCII and Unicode characters into the \u escape sequence. To avoid having non-ASCII or Unicode characters converted in that way, when encoding your data into JSON, you could set the ensure_ascii flag of json.dumps() function to False. Similalry, when using Panda's DataFrame to_json()or to_csv() functions, you need to make sure to use them with force_ascii=False and encoding='utf-8' arguments, respectively. (Note that encoding='utf-8' is the default encoding for the to_csv() function regardless; hence, you might omit manually setting it).

    Regarding using StreamingResponse, I would suggest having a look at this answer and all the references included in it, in order to understand whether and when you should use it. In your case, as shown in the example given in your question, is not needed, but you should rather return a custom Response, as explained in the linked answer above. More related answers that you might find helpful can be found here, here, as well as here, here and here. I would also highly suggesting reading this answer, which would clear things up for you, regarding how FastAPI works inder the hood, when returning dictionary/JSON objects from an endpoint. To use faster JSON encoders than the standard json module, have a look at this answer, as well as this answer and this answer.

    Finally, as explained in this answer and this answer, you could define the Content-Disposition header, so that the data are either viewed in the browser or downloaded to the client's device, using either:

    headers = {'Content-Disposition': 'inline; filename="out.json"'}
    

    or

    headers = {'Content-Disposition': 'attachment; filename="out.json"'}
    

    Please have a look at the linked answers for more details.

    Working Example

    from fastapi import FastAPI, Response
    import pandas as pd
    import json
    
    
    app = FastAPI()
    
    
    # Exemple de données avec des caractères spéciaux
    data = [
        {"éète": "test", "age": 10},
        {"éète": "test2", "age": 5},
    ]
    
        
    @app.get("/1")
    def get_json():
        headers = {"Content-Disposition": 'inline; filename="out.json"'}
        return Response(
            json.dumps(data, ensure_ascii=False),
            headers=headers,
            media_type="application/json",
        )
    
    
    @app.get("/2")
    def get_json_from_df():
        headers = {"Content-Disposition": 'inline; filename="out.json"'}
        return Response(
            pd.DataFrame(data).to_json(orient="records", force_ascii=False),
            headers=headers,
            media_type="application/json",
        )
    
    
    # Note: "text/csv" would force the browser to download the data, regardless
    # of specifying `inline` in the `Content-Disposition` header.
    # Use `media_type="text/plain"` instead, in order to view the data in the browser.
    @app.get("/3")
    def get_csv_from_df():
        headers = {"Content-Disposition": 'inline; filename="out.csv"'}
        return Response(
            pd.DataFrame(data).to_csv(index=False, encoding="utf-8"),
            headers=headers,
            media_type="text/csv; charset=utf-8",
        )
    

    UPDATE - Downloading CSV file and making Excel automatically displaying UTF-8/UTF-16 data when double-clicking the file

    As noted in the example provided above, when calling /3 endpoint, while the file that gets downloaded is in utf-8 encoding—one can confirm that by opening the file in Notepad++ and checking the encoding used, as well as finding that all unicode/non-ascii characters are displayed as expected—when, however, double-clicking on the file to open it in Excel, unicode/non-ascii characters, such as é and è in the example above, are not displayed correctly.

    As it turns out, this is a known issue with Excel—have a look at this, this and this for more details—and the way to overcome this is to prepend a Byte Order Mark (BOM) (i.e., \uFEFF) at the beginning of the file, which would result in Excel recognizing the file as UTF-8 or UTF-16. Depending on the Excel version one is using, they should use the appropriate encoding. In some older versions, one would need to use utf-16 encoding (or utf-16-le), as shown in the example below, while in newer versions utf-8 might work as well (in which case, one would need to replace utf-16 with utf-8 in the example below). Finally, make sure to use \t delimiter (i.e., , sep='\t'), in order for the data to be displayed in separate columns in the CSV file; otherwise, using , or ; for instance, all data might end up in a single column.

    Option 1 - Using Pandas DataFrame

    @app.get("/3")
    def get_csv_from_df():
        headers = {"Content-Disposition": 'attachment; filename="out.csv"'}
        return Response(
            (u'\uFEFF' + pd.DataFrame(data).to_csv(index=False, sep='\t', encoding="utf-16")).encode('utf-16'),
            headers=headers,
            media_type="text/csv; charset=utf-16",
        )
    

    Option 2 - Using Python's built-in csv module

    Alternatively, one could use Python's built-in csv module to convert the dictionary or list of dictionaries into csv data and have them sent to the client. To avoid, however, saving the data to a file on the disk, one could use Python's NamedTemporaryFile, as demonstrated here, here and here, which would be much faster, and have it deleted at the end (or in case an exception occurs when processing the data), as shown here and here. Using this option, one wouldn't have to prepend a BOM to the data, as described earlier when using Pandas DataFrame.

    from fastapi import BackgroundTasks, HTTPException
    from fastapi.responses import FileResponse
    from tempfile import NamedTemporaryFile
    import csv
    import os
    
    @app.get("/4")
    def get_csv(background_tasks: BackgroundTasks):
        headers = {"Content-Disposition": 'attachment; filename="out.csv"'}
        temp = NamedTemporaryFile(delete=False, mode='w', encoding='utf-16', newline='')
        try:
            with temp as f:
                keys = data[0].keys()
                w = csv.DictWriter(f, fieldnames=keys, delimiter='\t')
                w.writeheader()
                w.writerows(data)
        except Exception:
            os.remove(temp.name)
            raise HTTPException(detail='There was an error processing the data', status_code=400)
    
        background_tasks.add_task(os.remove, temp.name)
        return FileResponse(temp.name, headers=headers, media_type='text/csv; charset=utf-16') 
    

    or, a variant of the above (remember to call .seek(0) to reset the cursor back to the start of the file, before reading the contents from it—see here for more details):

    from fastapi import HTTPException
    from tempfile import NamedTemporaryFile
    import csv
    import os
    
    @app.get("/5")
    def get_csv():
        headers = {"Content-Disposition": 'attachment; filename="out.csv"'}
        temp = NamedTemporaryFile(delete=False, mode='w+', encoding='utf-16', newline='')
        try:
            keys = data[0].keys()
            w = csv.DictWriter(temp, fieldnames=keys, delimiter='\t')
            w.writeheader()
            w.writerows(data)
            temp.seek(0)
            return Response(temp.read().encode('utf-16'), headers=headers, media_type='text/csv; charset=utf-16')
        except Exception:
            raise HTTPException(detail='There was an error processing the data', status_code=400)
        finally:
            temp.close()
            os.remove(temp.name)
    

    Note 1

    it should also be noted that if one didn't use any of the options provided above, they could still open the file downloaded using the standard Notepad application and change the encoding to Unicode by clicking on "File > Save as > Encoding: Unicode", and then save it. In Notepad++, you could do that by clicking on the "Encoding" tab and selecting UTF-16 LE BOM, then save it. In both cases, double-click on the file after channging the encoding should display unicode characters as expected. There is also the option of opening the Excel application first, and then importing the csv file, which would let you specify the encoding (lokk for guidelines online on how to do that). Regardless, using one of the approaches demonstrated earlier, the contents of the csv file would be shown as expected, when double-clicking the file to open it in Excel.

    Note 2

    In every option given above, the endpoints were defined with normal def, instead of async def, as I/O operations of Pandas DataFrame, Python's built-in json and csv modules, as well as NamedTemporaryFile, are all blocking operations. Thus, depending on the size of data to be written/read, as well as the number of users that might need concurrent access to your API, you might otherwise noticed delays, when you (or some other user) attempted to access the same or a different endpoint, while a request was already being processed. If an endpoint was defined with async def and you had to process (read/write) large size data, and didn't await for some async function inside the endpoint—in order to return control back to the event loop and allow other tasks/requests waiting in the event loop to run—every request to such an endpoint would have to be completely finished (i.e., exit the endpoint), before letting other tasks/requests in the event loop to run.

    When, instead, defining an endpoint with normal def, FastAPI will run that endpoint in an external ThreadPool that is then awaited; hence, avoiding blocking the event loop. Please have a look at this answer for more details and examples on that subject. That answer also provides solutions, when one needs to define their endpoint with async def, as they would have to await for some async function inside it, as well as, at the same time, they would have to execute blocking operations inside that endpoint. Such solutions include running blocking opertions in an external threadpool or processpool and awaiting it. Please have a look at the linked answer above for furhter details. If you would also like using an async version of NamedTemporaryFile, please take a look at Option 2 of this answer, which uses aiofiles.

    If the size of the data you had to process was small, and you didn't expect a large amount of clients/users to use your API at the same time, it would be just fine to define the endpoint with async def (even without having blocking operations, such as json.dumps(), Pandas.DataFrame.to_csv(), etc., run in external threadpool or processpool). It all depends on your needs.