Search code examples
pythonpandascsvdataframeeol

Read CSV File with some EOL missing


I need to read a bunch of csv files to pandas dataframes, and I am running into some files which have a missing end-of-line character. It just isn't there at all. pandas.read_csv won't read them because of the "different number of columns".

Time; A; B; C
12:00; 1; 2; 3
12:01; 4; 5; 6; 12:02; 7; 8; 9
12:03; 10; 11; 12
12:04; 13; 14; 15

Fortunately, all these errors are before the first column, so I could look for the time (there is no other column with time). In this case, I'd just need to insert a CR/LF before 12:02 and then read to the dataframe:

Time; A; B; C
12:00; 1; 2; 3
12:01; 4; 5; 6; 
12:02; 7; 8; 9
12:03; 10; 11; 12
12:04; 13; 14; 15

Is there an efficient way to do this before read_csv()? Or do I need to first open each file, look through each line, replace, close, then read with pandas?


Using: python3.5.2 , pandas 0.22.0


Solution

  • You can try pre-processing the csv if you have irregularities

    Ex:

    import pandas as pd
    
    res = []
    with open(filename) as infile:             #Read CSV
        header = next(infile).split(";")       #Get Header
        for line in infile:                    #Iterate each line
            val = line.split(";")
            if len(val) == 4:                  #Check if 4 elements in each line
                res.append(val)
            else:
                res.extend( [val[i:i+4] for i in range(0, len(val), 4)] )     #Else split it. 
    
    df = pd.DataFrame(res, columns=header)
    print(df)
    

    Output:

         Time    A    B    C\n
    0   12:00    1    2    3\n
    1   12:01    4    5      6
    2   12:02    7    8    9\n
    3   12:03   10   11   12\n
    4   12:04   13   14     15