
How to work around the "Too many open files" error when writing an Arrow dataset with pyarrow?


import pyarrow as pa
import pyarrow.dataset as dataset

f = 'my_partitioned_big_dataset'
ds = dataset.dataset(f, format='parquet', partitioning='hive')
s = ds.scanner()
pa.dataset.write_dataset(s.head(827981), 'here', format="arrow", partitioning=ds.partitioning)  # is ok
pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)  # fails
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-54-9160d6de8c45> in <module>
----> 1 pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)
...
OSError: [Errno 24] Failed to open local file '...'. Detail: [errno 24] Too many open files

I'm on Linux (Ubuntu). My ulimit seems OK:

$ ulimit -Hn
524288
$ ulimit -Sn
1024
$ cat /proc/sys/fs/file-max
9223372036854775807

$ ulimit -Ha
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128085
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128085
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128085
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128085
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any ideas on how to work around this? I feel I have already set my ulimit quite high, but maybe I could raise it further. Or does pyarrow have some feature to release open files on the fly?


Solution

  • There is no way to control this in the current code. This feature (max_open_files) was recently added to the C++ library, and ARROW-13703 tracks adding it to the Python library. I'm not certain whether it will make the cutoff for 6.0 (6.0 should be releasing quite soon). A sketch of how that option might look once it lands is shown after this answer.

    In the meantime, your soft limit for open files ((-n) 1024) is the default and is fairly conservative. You should be able to raise it by a couple of thousand quite safely; a snippet for doing that from Python follows below. See this question for more discussion.
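
For reference, here is a minimal sketch of what the call might look like once the Python binding for max_open_files lands (as tracked by ARROW-13703). The keyword name and the value 512 are assumptions, not the released API:

import pyarrow.dataset as dataset

ds = dataset.dataset('my_partitioned_big_dataset', format='parquet', partitioning='hive')
s = ds.scanner()
dataset.write_dataset(
    s.head(827982),
    'here',
    format='arrow',
    partitioning=ds.partitioning,
    max_open_files=512,  # assumed keyword: cap how many output files the writer keeps open at once
)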
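
Until then, raising the soft limit only needs something like ulimit -n 8192 in the shell before starting Python, since the hard limit here is already 524288. It can also be done from inside the process with the standard resource module; the target of 8192 below is just an illustrative value:

import resource

# The soft limit is what the process actually hits; the hard limit is the
# ceiling an unprivileged process may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft={soft}, hard={hard}')  # e.g. soft=1024, hard=524288

# Raise the soft limit (illustrative value, capped at the hard limit).
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))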