Search code examples
pythonpandasdataframecsv

Why use read_fwf in Pandas if I can just use read_csv with a custom separator?


I don't see the point of using read_fwf in Pandas. Why would I ever use this instead of read_csv, which supports custom separators ? I tried testing both in terms of speed for a large fixed column width file, and read_csv is way faster on my machine:

data = ("colum1    column2222   column3333   column4\n"
        "id8141    360.242940   149.910199   11950.7\n"
        "id1594    444.953632   166.985655   11788.4\n"
        )

colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
data = data * 10000000

with open("big_file.txt", "w") as f:
    f.write(data)
start_time = time.time()
df = pd.read_csv("big_file.txt", header=None, dtype={"colum1": str, "column2222": float, "column3333": float, "column4":float}, sep="\s+")
print(f"--- {time.time() - start_time} seconds ---")
--- 4.0295188426971436 seconds ---
start_time = time.time()
df = pd.read_fwf("big_file.txt", header=None, colspecs=colspecs, 
                 dtype={"colum1": str, "column2222": float, "column3333": float, "column4":float})
print(f"--- {time.time() - start_time} seconds ---")
--- 77.41955280303955 seconds ---

Solution

  • I don't believe that the existence of the read_fwf function in the pandas API is for aesthetic purposes. The core developers have made this function accessible to us with the understanding that, in certain circumstances, it is the only effective means of properly reading a text file as a DataFrame.

    One of the examples that I can see (where read_fwf comes in handy) is the (.txt) file below :

    87        foo
         341  5
    bar  1    
     
    

    Unless you come up with a magic/regex separator, read_csv can't parse the 3 columns correctly.

    df = pd.read_fwf("file.txt", widths=[4, 5, 4], names=["A", "B", "C"])
    
    print(df)
    
         A      B    C
    0   87    NaN  foo
    1  NaN  341.0  5.0
    2  bar    1.0  NaN