I don't see the point of using read_fwf in Pandas. Why would I ever use it instead of read_csv, which supports custom separators? I tried testing both in terms of speed on a large fixed-width file, and read_csv is way faster on my machine:
data = ("colum1 column2222 column3333 column4\n"
"id8141 360.242940 149.910199 11950.7\n"
"id1594 444.953632 166.985655 11788.4\n"
)
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
data = data * 10000000
with open("big_file.txt", "w") as f:
f.write(data)
start_time = time.time()
df = pd.read_csv("big_file.txt", header=None, dtype={"colum1": str, "column2222": float, "column3333": float, "column4":float}, sep="\s+")
print(f"--- {time.time() - start_time} seconds ---")
--- 4.0295188426971436 seconds ---
start_time = time.time()
df = pd.read_fwf("big_file.txt", colspecs=colspecs,
                 dtype={"colum1": str, "column2222": float,
                        "column3333": float, "column4": float})
print(f"--- {time.time() - start_time} seconds ---")
--- 77.41955280303955 seconds ---
I don't believe the read_fwf function exists in the pandas API for aesthetic purposes. The core developers expose it because, in certain circumstances, it is the only effective way to read a text file into a DataFrame properly.

One example where read_fwf comes in handy is the .txt file below:
  87     foo
     341 5
bar 1
Unless you come up with a magic/regex separator, read_csv can't parse the three columns correctly.
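To see why, here is a quick sketch of what a plain whitespace separator does with this file (assuming it is saved as file.txt, as in the read_fwf call below). Every line collapses to just two tokens, so the positional information is lost:

import pandas as pd

# A whitespace split ignores the column positions entirely: each line
# yields only two tokens, so 341 lands in column A, foo lands in
# column B, and column C comes back empty.
df = pd.read_csv("file.txt", sep=r"\s+", header=None, names=["A", "B", "C"])
print(df)

With read_fwf and explicit field widths, on the other hand, every value ends up in its proper column: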
df = pd.read_fwf("file.txt", widths=[4, 5, 4], names=["A", "B", "C"])
print(df)
     A      B    C
0   87    NaN  foo
1  NaN  341.0  5.0
2  bar    1.0  NaN
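The same goes for the opposite extreme: a fixed-width file whose fields run together with no separator at all. The sketch below uses made-up data (the first two fields of the rows from your benchmark, with the whitespace removed); there is simply nothing for sep to split on, while explicit colspecs handle it naturally:

from io import StringIO

import pandas as pd

# Hypothetical fixed-width data with no delimiter between fields:
# the id occupies columns 0-5 and the value occupies columns 6-15.
raw = ("id8141360.242940\n"
       "id1594444.953632\n")

df = pd.read_fwf(StringIO(raw), colspecs=[(0, 6), (6, 16)],
                 names=["id", "value"])
print(df)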