I find that by default, Spark seems to write many small parquet files. I think it may be better if I use partitioning to reduce this?
But how do I choose a partition key? For example, for a users dataset which I frequently query by ID, do I partition by id? But then, won't it create one parquet file per user in that case?
What if I frequently query by two keys, but only one or the other and never both at the same time, is it still useful to partition by both keys? For example, let's say I usually query by id and country, do I use partitionBy('id', 'country')?
If there is no specific pattern in which I query the data but I still want to limit the number of files, do I use repartition then?
Partitioning creates a subdirectory for each value of the partition field, so if you filter by that field, instead of reading every file Spark reads only the files in the appropriate subdirectory.
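A minimal PySpark sketch of what that looks like (paths and column names here are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

users = spark.read.parquet("/data/users_raw")

# partitionBy creates one subdirectory per distinct value of the column,
# e.g. /data/users/country=US/, /data/users/country=DE/, ...
users.write.partitionBy("country").parquet("/data/users")

# A filter on the partition column is applied at the directory level,
# so only the files under country=US are read (partition pruning).
us_users = spark.read.parquet("/data/users").where("country = 'US'")
```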
You should partition when your data is too large to scan in full and you usually work with a subset of it at a time.
You should partition by a field that you both filter by frequently and that has low cardinality, i.e. one that creates a relatively small number of directories with a relatively large amount of data in each directory.
You don't want to partition by a unique id, for example. It would create lots of directories with only one row per directory; this is very inefficient the moment you need to select more than one id.
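If you are not sure whether a candidate column is low-cardinality enough, a quick check like the following (assuming your DataFrame is already loaded as df) tells you how many directories partitioning by it would create:

```python
# Number of distinct values = number of directories partitionBy would create.
# A handful to a few thousand is usually fine; millions (a unique id) is not.
print(df.select("country").distinct().count())
print(df.select("id").distinct().count())
```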
Some typical partition fields are dates if you are working with time series (daily dumps of data, for instance), geographies (country, branch, ...) or taxonomies (object type, manufacturer, etc.).
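As for your repartition question: one common pattern, sketched below with hypothetical column names, is to partition by a low-cardinality date column and call repartition on that same column before writing, so each partition directory is written by fewer tasks and you end up with fewer small files:

```python
from pyspark.sql import functions as F

events = spark.read.parquet("/data/events_raw")

# Derive a low-cardinality date column from a timestamp and partition by it.
# repartition("event_date") shuffles rows so that all rows for a given date
# land in the same tasks, which reduces the number of small output files
# written into each partition directory.
(events
    .withColumn("event_date", F.to_date("event_ts"))
    .repartition("event_date")
    .write
    .partitionBy("event_date")
    .parquet("/data/events"))
```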