I have two functions: one downloads individual CSV files and the other downloads a zip containing multiple CSV files.
The download_and_process_csv function works correctly with response.iter_lines(), which seems to delete newline characters:
'Chicken, water, cornmeal, salt, dextrose, sugar, sodium phosphate, sodium erythorbate, sodium nitrite. Produced in a facility where allergens are present such as eggs, milk, soy, wheat, mustard, gluten, oats, dairy.'
The download_and_process_zip function seems to include newline characters for some reason (\n\n). I've tried newline='' in io.TextIOWrapper, but it just replaces them with \r\n.
'Chicken, water, cornmeal, salt, dextrose, sugar, sodium phosphate, sodium erythorbate, sodium nitrite. \n\nProduced in a facility where allergens are present such as eggs, milk, soy, wheat, mustard, gluten, oats, dairy.'
Is there a way to modify download_and_process_zip so that newline characters are excluded or replaced, or do I have to iterate over all the rows and replace the characters manually?
@request_exceptions
def download_and_process_csv(client, url, model_class):
    with closing(client.get(url, stream=True)) as response:
        response.raise_for_status()
        response.encoding = 'utf-8'
        reader = csv.reader(response.iter_lines(decode_unicode=True))
        process_copy_from_csv(model_class, reader)


@request_exceptions
def download_and_process_zip(client, url):
    with closing(client.get(url, stream=True)) as response:
        response.raise_for_status()
        with io.BytesIO(response.content) as buffer:
            with zipfile.ZipFile(buffer, 'r') as z:
                for filename in z.namelist():
                    base_filename, file_extension = os.path.splitext(filename)
                    model_class = apps.get_model(base_filename)
                    if file_extension == '.csv':
                        with z.open(filename) as csv_file:
                            reader = csv.reader(io.TextIOWrapper(
                                csv_file,
                                encoding='utf-8',
                                # newline='',
                            ))
                            process_copy_from_csv(model_class, reader)
I've played around with a mock server which serves this CSV file:
"foo
bar"
The CSV has a single row with a single field, "foo\nbar". I call a newline inside the data an embedded newline.
When I use the iter_content method on the Response object:
import csv

import requests

print("Getting CSV")
resp = requests.get("http://localhost:8999/csv")
x = resp.iter_content(decode_unicode=True)
reader = csv.reader(x)
for row in reader:
    print(row)
I get the correct output, a single row prints out with a single field of data:
Getting CSV
['foo\nbar']
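One caveat that seems worth checking before relying on this: csv.reader treats each item its iterable yields as one line, while iter_content yields chunks (one character at a time with the default chunk_size=1). The sample above parses cleanly because its only field is quoted; unquoted fields would be cut at every chunk boundary. The sketch below simulates the chunks with a plain string rather than a real response, and then shows one possible alternative: wrapping a binary stream in io.TextIOWrapper with newline=''. The BytesIO object is a stand-in, and using response.raw in its place is an assumption to verify (compressed responses may need decoding first).

```python
import csv
import io

# Two rows; the second has a quoted field with an embedded newline.
data = 'a,b\n"foo\nbar",c\n'

# Simulated one-character chunks: iterating a str yields single characters,
# and csv.reader treats each of them as a complete line.
char_rows = list(csv.reader(iter(data)))

# Alternative sketch: wrap a binary stream in a text layer.
# newline='' leaves the line endings untouched so csv can interpret them.
stream = io.BytesIO(data.encode('utf-8'))  # stand-in for a streamed body
text = io.TextIOWrapper(stream, encoding='utf-8', newline='')
line_rows = list(csv.reader(text))

print(char_rows[:3])  # the first row "a,b" ends up split into fragments
print(line_rows)
```

With whole lines available, the reader keeps both the row structure and the embedded newline.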
If I change iter_content to iter_lines, I get the wrong output:
Getting CSV
['foobar']
As the name suggests, iter_lines splits the response body on newline sequences (it calls str.splitlines() when no delimiter is given) and hands each line to the csv reader without its terminator, so the embedded newline is effectively removed. As for your result, where the newline appeared to be replaced with a space: no replacement is going on. Your ZIP output shows a space already sitting before the \n\n, so deleting the newlines leaves just that space.
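The effect can be reproduced without any HTTP at all. This is a minimal sketch that uses str.splitlines() to stand in for what iter_lines hands to the reader, and io.StringIO with newline='' to stand in for a source that keeps line terminators:

```python
import csv
import io

# One CSV row whose single quoted field contains an embedded newline.
raw = '"foo\nbar"\n'

# Lines without their terminators, similar to what iter_lines yields.
stripped_lines = raw.splitlines()
rows_from_lines = list(csv.reader(stripped_lines))

# A file-like source that keeps the terminators intact.
rows_from_file = list(csv.reader(io.StringIO(raw, newline='')))

print(rows_from_lines)
print(rows_from_file)
```

The reader still joins the two physical lines into one field in both cases; it only loses the newline when the terminator was stripped before it saw the line.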
This popular SO question, "Use python requests to download CSV", asks the general question of how to download a CSV with the requests module, but every answer seems to assume that the CSV contains no embedded newlines, so many of them use iter_lines. I don't know when iter_content() was added to requests, but no answer mentions it.
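For the ZIP case in the question, the embedded newlines are real data, so the reader is behaving correctly by keeping them. If you want them gone, one option is a small generator between csv.reader and the consumer. The sketch below is only an illustration: clean_rows and the in-memory sample.csv are hypothetical names, and replacing each run of line breaks (plus surrounding whitespace) with a single space is an assumption about the output you want.

```python
import csv
import io
import re
import zipfile

# A run of line breaks together with any whitespace around it.
_NEWLINE_RUN = re.compile(r"\s*[\r\n]+\s*")


def clean_rows(reader):
    """Yield rows with embedded line breaks in each field replaced by a space."""
    for row in reader:
        yield [_NEWLINE_RUN.sub(" ", field) for field in row]


# Build a small in-memory ZIP so the sketch runs without a network call.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("sample.csv", '"nitrite. \n\nProduced in a facility",1\n')

buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as z:
    with z.open("sample.csv") as csv_file:
        # newline='' passes \n and \r\n through untranslated;
        # the regex above handles either form.
        text = io.TextIOWrapper(csv_file, encoding="utf-8", newline="")
        rows = list(clean_rows(csv.reader(text)))

print(rows)
```

In download_and_process_zip this would mean passing clean_rows(reader) to process_copy_from_csv instead of reader, assuming that function accepts any iterable of rows.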