I need to read a bunch of csv files to pandas dataframes, and I am running into some files which have a missing end-of-line character. It just isn't there at all. pandas.read_csv
won't read them because of the "different number of columns".
Time; A; B; C
12:00; 1; 2; 3
12:01; 4; 5; 6; 12:02; 7; 8; 9
12:03; 10; 11; 12
12:04; 13; 14; 15
Fortunately, all these errors are before the first column, so I could look for the time (there is no other column with time). In this case, I'd just need to insert a CR/LF before 12:02 and then read to the dataframe:
Time; A; B; C
12:00; 1; 2; 3
12:01; 4; 5; 6;
12:02; 7; 8; 9
12:03; 10; 11; 12
12:04; 13; 14; 15
Is there an efficient way to do this before read_csv()
? Or do I need to first open each file, look through each line, replace, close, then read with pandas?
Using: python3.5.2 , pandas 0.22.0
You can try pre-processing the csv if you have irregularities
Ex:
import pandas as pd
res = []
with open(filename) as infile: #Read CSV
header = next(infile).split(";") #Get Header
for line in infile: #Iterate each line
val = line.split(";")
if len(val) == 4: #Check if 4 elements in each line
res.append(val)
else:
res.extend( [val[i:i+4] for i in range(0, len(val), 4)] ) #Else split it.
df = pd.DataFrame(res, columns=header)
print(df)
Output:
Time A B C\n
0 12:00 1 2 3\n
1 12:01 4 5 6
2 12:02 7 8 9\n
3 12:03 10 11 12\n
4 12:04 13 14 15