I see the paramter npartitions
in many functions, but I don't understand what it is good for / used for.
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
head(...)
Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.
repartition(...)
Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.
Is the number of partitions probably 5 in this case:
(Image source: http://dask.pydata.org/en/latest/dataframe-overview.html )
The npartitions
property is the number of Pandas dataframes that compose a single Dask dataframe. This affects performance in two main ways.
Generally you want a few times more partitions than you have cores. Every task takes up a few hundred microseconds in the scheduler.
You can determine the number of partitions either at data ingestion time using the parameters like blocksize=
in read_csv(...)
or afterwards by using the .repartition(...)
method.