I regularly get sent on a regular basis a csv containing 100+ columns and millions or rows. These csv files always contain certain set of columns, Core_cols = [col_1, col_2, col_3]
, and a variable number of other columns, Var_col = [a, b, c, d, e]
. The core columns are always there and there could be 0-200 of the variable columns. Sometimes one of the columns in the variable columns will contain a carriage return. I know which columns this can happen in, bad_cols = [a, b, c]
.
When import the csv with pd.read_csv these carriage returns make corrupt rows in the resultant dataframe. I can't re-make the csv without these columns.
How do I either:
My current code looks something like this:
df = pd.read_csv(data.csv, dtype=str)
I've tried things like removing the columns after the import, but the damage seems to already have been done by this point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D
in data". I don't control the source of the data so can't make the edits to that.
Pandas supports multiline CSV files if the file is properly escaped and quoted. If you cannot read a CSV file in Python using pandas or csv modules nor open it in Excel then it's probably a malformed "CSV" file.
Can either process the CSV file line by line and parse the fields with some custom code or will need to fix the CSV file to make it a valid CSV-formatted file. That means values with a delimiter (,) or new lines are quoted with double quotes ("), and embedded quotes are escaped ("").
For either approach, need to first isolate the rows causing the issue then create a small sample CSV file with some of those rows and some normal rows to work out a solution that works for both the good and bad rows.
Recommend to manually edit a sample of the CSV file and get it working so it can open with Excel and Pandas. Then recreate the steps to normalize it programmatically in Python to process the large file.
Use this code to create a sample CSV file copying first ~100 lines into a new file.
with open('bigfile.csv', "r") as csvin, open('test.csv', "w") as csvout:
line = csvin.readline()
count = 0
while line and count < 100:
csvout.write(line)
count += 1
line = csvin.readline()
Now you have a small test file to work with. If the original CSV file has millions of rows and "bad" rows are found much later in the file then you need to add some logic to find the "bad" lines. For example if "\n" appears in value for column "A".
If for example, pandas fails at line 123456 then can try something like this to capture the first 100 lines and the row at line 123456 with row before and after.
with open('bigfile.csv', "r") as csvin, open('test.csv', "w") as csvout:
for idx, line in enumerate(csvin):
if idx < 100:
csvout.write(line)
elif idx in [123455, 123456, 123457]:
csvout.write(line)
if idx == 123457:
break