I'm encountering an odd issue on an NFS v3 file system (I feel this is important) when running two processes in parallel that both do the following (going by the comment below and my own knowledge, I don't think the language should matter, and I think this is readable enough):
if {![file isdirectory $dir]} {
    if {[catch {file mkdir $dir} err]} {
        error "-E- failed to mkdir $dir: $err"
    }
}
For those not familiar, file mkdir in Tcl behaves much like mkdir -p: it should only fail if the path exists and is not a directory. I'm nearly 100% sure (there is no 100% ever) that nothing in any process creates a file at that path; only file mkdir touches it. The problem does not always happen, but often enough that while running our regressions we might hit:
Error: can't create directory "$dir": file already exists
This should only happen if $dir is an existing non-directory file while file mkdir is processing it. Two questions, the first is more important for me:
exec mkdir -p, but if I'm right this will suffer from the same problem. It's hard enough to reproduce this, so I'd rather be as sure as I can before I attempt a fix. I came here after following a hint that the NFS file system may be the issue, but I need more expert advice. I don't care if both succeed; I just don't want them to fail (on first try).
Final note
I circled back to this after a long while, and this is indeed a Tcl issue, not only on NFS, though NFS seems to make it worse!
Still looking for answers explaining why I'm seeing what I'm seeing - see answer.
Opened this as a bug
https://core.tcl.tk/tcl/tktview/270f78ca95b642fbed81ed03ad381d64a0d0f7df
Bug already fixed!
The people at Tcl core are fast: they fixed this a day after I posted the bug!
Fixed in 1c12ee9e45222d6c.
A thanks to mrcalvin for the suggestion.
The old testing attempts:
After a long while I circled back to this and ran the following tests (on ext4):
Two terminals with tclsh:
1: while {1} {file mkdir bla}
2: while {1} {file mkdir bla; file delete bla}
Error eventually on 1:
can't create directory "bla": no such file or directory
Two terminals with tclsh:
1: while {1} {exec mkdir -p bla}
2: while {1} {exec mkdir -p bla; file delete bla}
No error.
One terminal with Bash, one with tclsh:
1: while [ 1 ]; do mkdir -p bla; done
2: while {1} {file mkdir bla; file delete bla}
Eventually I get on 1:
mkdir: cannot create directory ‘bla’: File exists
but oddly enough
1: while [ 1 ]; do mkdir -p bla; rm -rf bla; done
2: while {1} {file mkdir bla}
no error (delete is the culprit?) and
1: while [ 1 ]; do mkdir -p bla; done
2: while {1} {exec mkdir -p bla; file delete bla}
much less chance of error (so delete is not as bad?). Of course, two Bash shells do not conflict:
1: while [ 1 ]; do mkdir -p bla; rm -rf bla; done
2: while [ 1 ]; do mkdir -p bla; done
On NFS, but not on ext4:
1: while {1} {file mkdir bla; exec rm -rf bla}
2: while {1} {file mkdir bla}
fails with
can't create directory "bla": file already exists
on both 1: and 2: (randomly).
Conclusion
file mkdir is not as "thin" a layer as I thought and can produce race conditions where one mkdir thinks a directory being made is a file. file delete may also have this or a similar issue; it may be contributing to the failures in my tests, but it is not involved in my original question. The matter is worse on NFS, where file mkdir alone easily reproduces the error.
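For code that has to stay on file mkdir with an unpatched Tcl, one way to tolerate the race described above might look like the sketch below. The proc name mkdir_tolerant, the attempt count and the 10 ms pause are my own assumptions for illustration, and I have not run this against NFS: it only treats "the directory is there after the failure" as success and retries a bounded number of times otherwise.

```tcl
# Hypothetical helper (not part of Tcl): create $dir, tolerating the case
# where another process creates the same directory at the same moment.
proc mkdir_tolerant {dir {attempts 5}} {
    set err ""
    for {set i 0} {$i < $attempts} {incr i} {
        # catch returns 0 when file mkdir succeeded.
        if {![catch {file mkdir $dir} err]} {
            return
        }
        # file mkdir failed, but a concurrent process may have finished
        # creating the directory in the meantime; that is good enough.
        if {[file isdirectory $dir]} {
            return
        }
        # Short pause before retrying (assumed value, tune as needed).
        after 10
    }
    error "-E- failed to mkdir $dir after $attempts attempts: $err"
}
```

If a real non-directory file sits at the path, every attempt fails and the last error from file mkdir is reported, so genuine conflicts are not hidden.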
The solution is to use exec mkdir -p. So far this is working for us across the board.
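As a minimal sketch, the original snippet rewritten around exec mkdir -p could look like this. The proc name mkdir_p is hypothetical, and it assumes a Unix-like system with mkdir on the PATH; note also that exec treats arguments starting with characters such as < or > as redirections, so unusual directory names would need extra care.

```tcl
# Hypothetical helper: delegate directory creation to the external mkdir(1)
# instead of Tcl's built-in file mkdir.
proc mkdir_p {dir} {
    # "--" stops option parsing, so a name starting with "-" is still
    # treated as a directory name by mkdir.
    if {[catch {exec mkdir -p -- $dir} err]} {
        error "-E- failed to mkdir $dir: $err"
    }
}
```

The price is one extra process per call, which matters little for occasional directory creation. Given the Bash-versus-tclsh test above, I would not assume mkdir -p is immune when another process is deleting the same directory concurrently.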