I'm trying to develop an application that will be running on multiple computers linked to a shared Lustre storage, performing various actions, including but not limited to:
As you can see, the basic I/O one can wish for.
Since it's concurrent for most of that, I ought to need some kind of locking to allow safely doing the different writings, but I've seen Lustre doesn't support flock(2)s by default (and I'm not sure I want to use it over fcntl(2), I guess I will if it comes to it), and I haven't seen anything about fcntl(2) to confirm its support.
Researching it mostly resulted in me reading lot of papers about I/O optimization using Lustre, but those usually explain how the structure of their hardware / software / network works rather than explaining how it's done in the code.
So, can I use fcntl(2) with Lustre? Should I use it? If not, what are other alternatives to allow different clients to perform concurrent modifications of the data?
Or is it even possible ? (I've seen in Lustre tickets that mmap is possible, so fcntl should work too (no logic behind statement), but there might be limitations I would want to be aware of.)
I'll keep on writing a test application to check it out, but I figured I should still ask in case there are better alternatives (or if there are limitations to its functionalities that I should be aware of, since my test will be limited and we don't want unknown limitations to become an issue later in the development process).
Thanks,
Edit: The base question has been properly answered by LustreOne, here I give more specific informations about my use case to allow people to add pertinent additional informations about Lustre concurrent access.
The Lustre clients will be server to other applications. Clients of those applications will each have their own set of files, but we want to support allowing clients to log to their client space from multiple machines at the same time and, for that purpose, we need to allow concurrent file read and write.
These, however, will always be a pretty small percentage of total I/O operations.
While really interesting insights were given in LustreOne's answer, not many of them apply to this use case (or rather, they do apply, but adding the complexity to the overall system might not be desired for the impact on performances).
That is, for the use case considered at present, I'm sure it can be of much help to some, and ourselves later on. However, what we are seeking right now is more of a way to easily allow two nodes or threads a node responding to two request to modify data to let one pass and detect the conflict, effectively preventing concerned client.
I believed file locking would be enough for that use case, but had a preference for byte locking since some of the most concerned file are getting appended non-stop by some clients, and read/modified up to the end by others.
However, judging from what I understood from LustreOne's answer:
That said, there is no strict requirement for this if your application knows what it is doing. Lustre will already keep non-overlapping writes consistent, and can handle concurrent O_APPEND writes as well.
The later case is already managed by Lustre out of the box.
Any opinion on what could be the best alternatives ? Will using simple flock() on complete file be enough ?
Note that some file will also have index, which can be used to determine availability of data without locking any of the data file, shall that be used or are bytes lock quick enough for us to avoid increasing codebase size to support both case?
A final mention on mmap
. I'm pretty sure it doesn't fit our use case much since we got so many files and many clients, so OST might not be able to cache much, but to be sure... shall it be used, and if so, how? ^^
Sorry for being so verbose, it's one of my bad traits. :/
Have a nice day,
You should mount all clients with the "-o flock" mount option to enable globally coherent locking. Then flock() (and I think fcntl() locking) will work.
That said, there is no strict requirement for this if your application knows what it is doing. Lustre will already keep non-overlapping writes consistent, and can handle concurrent O_APPEND writes as well. However, since Lustre has to do internal locking for appends, this can hurt write performance significantly if there are a lot of different clients appending to the same file concurrently. (Note this is not a problem if only a single client is appending).
If you are writing the application yourself, then there are a lot of things you can do to make performance better: - have some central thread assign a "write slot number" to each writer (essentially an incrementing integer), and then the client writes to offset = recordsize * slot number. Beyond assigning the slot number (which could be done in batches for better performance), there is no contention between clients. In most HPC applications the threads use the MPI rank as the slot number, since it is unique, and threads on the same node will typically be assigned adjacent slots so Lustre can further aggregate the writes. That doesn't work if you use a producer/consumer model where threads may produce variable numbers of records. - make the IO recordsize a multiple of 4KiB in size to avoid contention between threads. Otherwise, the clients or servers will be forced to do read-modify-write for the partial records in a disk block, which is inefficient. - Depending on whether your workflow allows it or not, rather than doing read and write into the same file, it will probably be more efficient to write a bunch of records into one file, then process the file as a whole and write into a second file. Not that Lustre can't do concurrent read and write to a single file, but this causes unnecessary contention that could be avoided.