I have a few thousand large CSV files (some run into gigabytes, others into megabytes). However, I'm only interested in the last n rows (say, 50 records) of each file. My question is a general one about speed and efficiency: would it be faster to read_csv each file with skiprows, slower, or would it make no difference? Thanks.
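For context, here is a minimal sketch of what I mean by the skiprows approach; the helper name and the extra line-counting pass are just illustrative assumptions, not a fixed requirement:

```python
import pandas as pd

def tail_via_skiprows(path, n=50):
    # Hypothetical helper: count lines first so we know how many to skip.
    with open(path) as f:
        total = sum(1 for _ in f)  # includes the header line
    # Keep the header (line 0) and the last n data rows.
    return pd.read_csv(path, skiprows=range(1, total - n))

df = tail_via_skiprows('large.csv')
```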
You can use the timeit module to measure how long your code takes to run. In this benchmark, read_csv() came out slightly faster with skiprows, though the difference is modest, presumably because pandas still has to read through the skipped lines:
```python
import timeit

import pandas as pd

def test():
    # Baseline: read the whole file.
    df = pd.read_csv('large.csv')

def test2():
    # Skip the first 10,000 rows.
    df = pd.read_csv('large.csv', skiprows=range(0, 10000))

if __name__ == "__main__":
    print(timeit.timeit("test()", globals=globals(), number=500))
    print(timeit.timeit("test2()", globals=globals(), number=500))
```
| iterations | without skiprows (s) | with skiprows (s) |
|---|---|---|
| 100 | 4.880708541997592 | 4.318660000004456 |
| 500 | 23.931738541999948 | 21.48539920800249 |
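Worth noting: skiprows still makes pandas scan the entire file, so for gigabyte-sized files you may get a much bigger win by collecting only the tail yourself and handing just those lines to pandas. A minimal sketch of that alternative, assuming a single header line and no newlines embedded inside quoted fields:

```python
import io
from collections import deque

import pandas as pd

def read_last_rows(path, n=50):
    # deque(maxlen=n) streams the file once and keeps only the
    # last n lines in memory, discarding older ones as it goes.
    with open(path) as f:
        header = f.readline()
        tail = deque(f, maxlen=n)
    # Re-parse just the header plus the last n lines.
    return pd.read_csv(io.StringIO(header + ''.join(tail)))

df = read_last_rows('large.csv')
```

Because only n lines are ever held in memory, this stays flat in memory use regardless of file size, whereas read_csv with skiprows still parses every line it skips.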