python, django, csv

Why does csv.reader with TextIOWrapper include new line characters?


I have two functions, one downloads individual csv files and the other downloads a zip with multiple csv files.

The download_and_process_csv function works correctly with response.iter_lines() which seems to delete new line characters.

'Chicken, water, cornmeal, salt, dextrose, sugar, sodium phosphate, sodium erythorbate, sodium nitrite. Produced in a facility where allergens are present such as eggs, milk, soy, wheat, mustard, gluten, oats, dairy.'

The download_and_process_zip function, on the other hand, includes the new line characters (\n\n). I've tried passing newline='' to io.TextIOWrapper, but that just replaces them with \r\n.

'Chicken, water, cornmeal, salt, dextrose, sugar, sodium phosphate, sodium erythorbate, sodium nitrite. \n\nProduced in a facility where allergens are present such as eggs, milk, soy, wheat, mustard, gluten, oats, dairy.'

Is there a way to modify download_and_process_zip so that new line characters are excluded/replaced or do I have to iterate over all the rows and manually replace the characters?

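import csv
import io
import os
import zipfile
from contextlib import closing

from django.apps import apps

# request_exceptions and process_copy_from_csv are defined elsewhere in the project.


@request_exceptions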
def download_and_process_csv(client, url, model_class):
    with closing(client.get(url, stream=True)) as response:
        response.raise_for_status()
        response.encoding = 'utf-8'
        reader = csv.reader(response.iter_lines(decode_unicode=True))
        process_copy_from_csv(model_class, reader)


@request_exceptions
def download_and_process_zip(client, url):
    with closing(client.get(url, stream=True)) as response:
        response.raise_for_status()

        with io.BytesIO(response.content) as buffer:
            with zipfile.ZipFile(buffer, 'r') as z:
                for filename in z.namelist():
                    base_filename, file_extension = os.path.splitext(filename)
                    model_class = apps.get_model(base_filename)
                    if file_extension == '.csv':
                        with z.open(filename) as csv_file:
                            reader = csv.reader(io.TextIOWrapper(
                                csv_file,
                                encoding='utf-8',
                                # newline='',
                            ))
                            process_copy_from_csv(model_class, reader)

Solution

  • I've played around with a mock server which serves this CSV file:

    "foo
    bar"
    

    The CSV has a single field, "foo\nbar", in a single row. I call a newline in the data an embedded newline.

    When I use the iter_content method on the Response object:

    import csv

    import requests

    print("Getting CSV")
    resp = requests.get("http://localhost:8999/csv")
    x = resp.iter_content(decode_unicode=True)
    
    reader = csv.reader(x)
    for row in reader:
        print(row)
    

    I get the correct output, a single row prints out with a single field of data:

    Getting CSV
    ['foo\nbar']
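
    One caveat worth checking before relying on this: iter_content defaults to chunk_size=1, and csv.reader treats every string it receives from its iterable as a complete line. The sketch below simulates one-character chunks with a plain iterator (the 'a,b' payload is made up for illustration, and no server is involved) to show that an unquoted row can be split apart, while a file-like object opened with newline='' parses the same text as one row.

```python
import csv
import io

# Hypothetical payload: one row with two unquoted fields.
payload = 'a,b\n'

# Simulate one-character chunks, similar to what iter_content()
# produces with its default chunk_size=1.
rows_from_chunks = list(csv.reader(iter(payload)))

# Reading the same text through a file-like object keeps the row intact.
rows_from_stream = list(csv.reader(io.StringIO(payload, newline='')))

print(rows_from_chunks)   # the row is broken into several pieces
print(rows_from_stream)   # [['a', 'b']]
```

    The fully quoted single-field example above is not affected, because the reader stays inside the quoted field until the closing quote arrives.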
    

    If I change iter_content to iter_lines, I get the wrong output:

    Getting CSV
    ['foobar']
    

    I suspect, based on the name, that iter_lines splits the response at every newline-like character sequence and hands each line to the csv reader without its terminator, so the embedded newline is effectively removed. As for your result, where the newline appeared to be replaced with a space: there's no replacement going on, just deletion. Your zip output shows a space before the newlines ('sodium nitrite. \n\nProduced'), so that space was already in the data.
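
    The difference can be reproduced without a server. This is a minimal sketch, assuming a made-up payload string and using str.splitlines() as a stand-in for what iter_lines hands to the reader (lines with their terminators stripped):

```python
import csv
import io

# Hypothetical payload: one row, one quoted field with an embedded newline.
payload = '"foo\nbar"\n'

# Stand-in for iter_lines(): lines with the terminators already removed.
stripped_lines = payload.splitlines()
rows_from_stripped = list(csv.reader(stripped_lines))

# A file-like object opened with newline='' keeps the terminators,
# so the reader can put the embedded newline back into the field.
rows_from_stream = list(csv.reader(io.StringIO(payload, newline='')))

print(rows_from_stripped)  # [['foobar']]
print(rows_from_stream)    # [['foo\nbar']]
```

    With requests, one way to get the second behaviour would be csv.reader(io.StringIO(resp.text, newline='')), at the cost of holding the whole body in memory.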

    This popular SO question, "Use python requests to download CSV", asks the general question about downloading a CSV with the requests module, but every answer seems tailored to a CSV that doesn't contain embedded newlines, and so many of the answers use iter_lines. I don't know when iter_content() was added to requests, but no answer mentions it.
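
    Coming back to the zip case in the question: csv.reader over io.TextIOWrapper returns the newlines that really are inside the quoted field, so removing them has to be an explicit step. The default newline=None translates \r\n to \n, while newline='' leaves the original \r\n untouched, which would explain the \r\n you saw. A minimal sketch of a wrapper that strips them per field follows; rows_without_embedded_newlines is a hypothetical helper, and example.csv and its contents are made up for illustration.

```python
import csv
import io
import re
import zipfile


def rows_without_embedded_newlines(reader, replacement=''):
    """Yield rows from a csv reader with CR/LF runs inside each field replaced."""
    for row in reader:
        yield [re.sub(r'[\r\n]+', replacement, field) for field in row]


# Build a small in-memory zip so the sketch runs without a download (made-up data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr(
        'example.csv',
        'id,ingredients\r\n1,"salt. \r\n\r\nProduced in a facility"\r\n',
    )
buffer.seek(0)

with zipfile.ZipFile(buffer, 'r') as z:
    with z.open('example.csv') as csv_file:
        # newline='' lets the csv module handle line endings itself.
        text_file = io.TextIOWrapper(csv_file, encoding='utf-8', newline='')
        cleaned_rows = list(rows_without_embedded_newlines(csv.reader(text_file)))

print(cleaned_rows)
```

    In download_and_process_zip, the wrapped iterator could be passed to process_copy_from_csv in place of reader, assuming that function only iterates over the rows.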