Tags: shell, parallel-processing, find, gnu-parallel, locate

Modify gupdatedb (GNU updatedb command) to insert parallel command


I am working on macOS 10.15 with the glocate and gupdatedb tools from the findutils package, installed with brew.

I would like to integrate the shell command "parallel" into the gupdatedb script in order to build the database faster.

In the original version of the gupdatedb script, I find this line:

: ${find:=${BINDIR}/gfind}

1) I tried to insert the parallel command into the line above.

Usually, with gfind, we can use the parallel command like this:

parallel --lb -j32 gfind ::: /*

Here '/*' expands to every entry directly under the root directory, so each entry (with all its subdirectories) is searched by its own gfind process.

So I tried this in the gupdatedb script:

: ${find:=/usr/local/bin/parallel -j32 ${BINDIR}/gfind}

But on execution, I get the following error, which I can't explain:

updatedb needs to be able to execute -j32, but cannot.

2) I also tried going through a variable:

    num_threads=-j32
    ${parallel:=${BINDIR}/parallel --lb $num_threads}
    : ${find:=${parallel} ${BINDIR}/gfind \{\} ::: }
    : ${frcode:=${LIBEXECDIR}/gfrcode}

But the script hangs and the database is not generated.

How can I overcome this issue so that gfind runs on multiple threads (here 8 threads)?

PS1: this post refers to another question, "parallel with find", which explains how to combine the find and parallel commands.

PS2: the gupdatedb script is relatively long, so I give below what I think are the relevant sections (I interrupted the hanging program with Ctrl+C):

# The database file to build.
: ${LOCATE_DB=/usr/local/var/locate/locatedb}

# Directory to hold intermediate files.
if test -z "$TMPDIR"; then
  if test -d /var/tmp; then
    : ${TMPDIR=/var/tmp}
  elif test -d /usr/tmp; then
    : ${TMPDIR=/usr/tmp}
  else
    : ${TMPDIR=/tmp}
  fi
fi
export TMPDIR

# The user to search network directories as.
: ${NETUSER=daemon}

# The directory containing the subprograms.
if test -n "$LIBEXECDIR" ; then
    : LIBEXECDIR already set, do nothing
else
    : ${LIBEXECDIR=/usr/local/Cellar/findutils/4.7.0/libexec}
fi

# The directory containing find.
if test -n "$BINDIR" ; then
    : BINDIR already set, do nothing
else
    : ${BINDIR=/usr/local/bin}
fi

# DEV : parallel prefix command
num_threads=-j32
${parallel:=${BINDIR}/parallel --lb $num_threads}
# The names of the utilities to run to build the database.
: ${find:=${parallel} ${BINDIR}/gfind \{\} ::: }
: ${frcode:=${LIBEXECDIR}/gfrcode}

UPDATE 1: If I comment out the line checkbinary $binary and apply my second method (see 2) above), I get the following output (I enabled set -x for debugging):

+ version='
updatedb (GNU findutils) 4.7.0
Copyright (C) 1994-2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Eric B. Decker, James Youngman, and Kevin Dalley.
'
+ LC_ALL=C
+ export LC_ALL
+ usage='Usage: /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb [--findoptions='\''-option1 -option2...'\'']
       [--localpaths='\''dir1 dir2...'\''] [--netpaths='\''dir1 dir2...'\'']
       [--prunepaths='\''dir1 dir2...'\''] [--prunefs='\''fs1 fs2...'\'']
       [--output=dbfile] [--netuser=user] [--localuser=user]
       [--dbformat] [--version] [--help]

Please see also the documentation at http://www.gnu.org/software/findutils/.
Report (and track progress on fixing) bugs in the updatedb
program via the GNU findutils bug-reporting page at
https://savannah.gnu.org/bugs/?group=findutils or, if
you have no web access, by sending email to <bug-findutils@gnu.org>.
'
+ changeto=/
+ frcode_options=
+ case "$dbformat" in
+ true
+ sort='/usr/bin/sort -z'
+ print_option=-print0
+ frcode_options=' -0'
+ :
+ : /usr/local/bin/zsh
+ : /
+ :
+ : '
/afs
/amd
/proc
/sfs
/tmp
/usr/tmp
/var/tmp
'
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ for p in '$PRUNEPATHS'
+ case "$p" in
+ test -z ''
++ echo /afs /amd /proc /sfs /tmp /usr/tmp /var/tmp
++ sed -e 's,^,\\(^,' -e 's, ,$\\)\\|\\(^,g' -e 's,$,$\\),'
+ PRUNEREGEX='\(^/afs$\)\|\(^/amd$\)\|\(^/proc$\)\|\(^/sfs$\)\|\(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)'
+ : /usr/local/var/locate/locatedb
+ test -z ''
+ test -d /var/tmp
+ : /var/tmp
+ export TMPDIR
+ : daemon
+ test -n ''
+ : /usr/local/Cellar/findutils/4.7.0/libexec
+ test -n ''
+ : /usr/local/bin
+ num_threads=-j32
+ /usr/local/bin/parallel --lb -j32
Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:

  Tange, O. (2020, July 22). GNU Parallel 20200722 ('Privacy Shield').
  Zenodo. https://doi.org/10.5281/zenodo.3956817

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#Citation-notice

To silence this citation notice: run 'parallel --citation' once.

Come on: You have run parallel 15 times. Isn't it about time
you run 'parallel --citation' once to silence the citation notice?

parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot
parallel: Warning: ::: or :::: or -a or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.
^C+ : /usr/local/bin/parallel --lb -j32 /usr/local/bin/gfind '{}' :::
+ : /usr/local/Cellar/findutils/4.7.0/libexec/gfrcode
+ : '
9P
NFS
afs
autofs
cifs
coda
devfs
devpts
ftpfs
iso9660
mfs
ncpfs
nfs
nfs4
proc
shfs
smbfs
sysfs
'
+ test -n '
9P
NFS
afs
autofs
cifs
coda
devfs
devpts
ftpfs
iso9660
mfs
ncpfs
nfs
nfs4
proc
shfs
smbfs
sysfs
'
++ echo 9P NFS afs autofs cifs coda devfs devpts ftpfs iso9660 mfs ncpfs nfs nfs4 proc shfs smbfs sysfs
++ sed -e 's/\([^ ][^ ]*\)/-o -fstype \1/g' -e 's/-o //' -e 's/$/ -o/'
+ prunefs_exp='-fstype 9P -o -fstype NFS -o -fstype afs -o -fstype autofs -o -fstype cifs -o -fstype coda -o -fstype devfs -o -fstype devpts -o -fstype ftpfs -o -fstype iso9660 -o -fstype mfs -o -fstype ncpfs -o -fstype nfs -o -fstype nfs4 -o -fstype proc -o -fstype shfs -o -fstype smbfs -o -fstype sysfs -o'
+ rm -f /usr/local/var/locate/locatedb.n
+ trap 'rm -f $LOCATE_DB.n; exit' HUP TERM
+ cd /
+ test -n /
+ '[' '' '!=' '' ']'
+ /usr/bin/sort -z
+ /usr/local/Cellar/findutils/4.7.0/libexec/gfrcode -0
+ : OK so far
+ true
+ test -s /usr/local/var/locate/locatedb.n
+ chmod 644 /usr/local/var/locate/locatedb.n
+ mv /usr/local/var/locate/locatedb.n /usr/local/var/locate/locatedb
+ exit 0

UPDATE 2:

@MarkSetchell: I simply run sudo gupdatedb in a directory.

Could you please give the full command to apply? You suggested parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS, but this doesn't seem to work.

What I have tried is: parallel -j32 --lb find {} $FINDOPTIONS * ::: */* but after a while, I get the following error: gfind: failed to read file names from file system at or below '/': No such file or directory

I would like to index all files starting from the root /, but the contents of / and /System/Volumes/Data/ are duplicated.

UPDATE 3: If the number of subdirectories is lower than the number of threads I request with parallel -j32 ..., is there a way to tell the parallel command to descend into the sub-subdirectories, and deeper, as well?

It seems that make -j32 behaves this way (I may be wrong). It would be useful not to leave a single process on a subdirectory that itself contains many sub-subdirectories, but to spread that work across all 32 processes launched by parallel -j32 .... This would avoid the time lost by not parallelizing over these sub-subdirectories or deeper levels.

UPDATE 4: I don't know what to fill in for the command suggested by @MarkSetchell. For example, if I have 3 subdirectories in the current directory:

# : A2
parallel -j 32 --lb  gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS

In particular, what should I put for BUNCH_OF_PATHS?

Do I have to pass the option --localpaths dir1/ dir2/ dir3/ instead of BUNCH_OF_PATHS? And what about $FINDOPTIONS ... with the three dots?


Solution

  • Updated Answer

    The problem is on the line after the line containing A2 in the file /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb. Currently, it is of the form:

    # : A2
    $find $SEARCHPATHS $FINDOPTIONS \( $prunefs_exp  -type d -regex "$PRUNEREGEX" \) -prune -o $print_option
    

    whereas you want it to be of the form:

    # : A2
    parallel -j 32 --lb  gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS
    

    As you haven't given the paths you wish to search in parallel, the only path at the moment is /, which means nothing can be done in parallel. You will need to run with --localpaths set to a bunch of places that are worth searching in parallel, or hack the script even more extensively. To be honest, though, I am not sure why you would want to speed this up, because it should only be run relatively rarely, and then only at times when the system is quiet.
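
    One possible way to come up with BUNCH_OF_PATHS is to split the tree at a fixed depth, so that there are more starting points than job slots. The sketch below is only an illustration under assumptions: the function names (list_shallow, list_roots, have_gnu_parallel, parallel_walk) and the FIND variable are hypothetical and not part of gupdatedb, pruning is left out, and a single sequential find is used when GNU Parallel is not installed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of gupdatedb): split a tree at a fixed depth
# and walk each entry found at that depth with its own find process.

FIND=${FIND:-find}   # assumption: set FIND=gfind to use the Homebrew GNU find

# Entries above the split depth are printed directly, since no job visits them.
list_shallow() {     # usage: list_shallow ROOT DEPTH
    "$FIND" "$1" -maxdepth "$(($2 - 1))" -print0
}

# Entries exactly at the split depth become the parallel starting points.
list_roots() {       # usage: list_roots ROOT DEPTH
    "$FIND" "$1" -mindepth "$2" -maxdepth "$2" -print0
}

# True only if the "parallel" on PATH is GNU Parallel.
have_gnu_parallel() {
    parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

# Print every entry under ROOT (NUL-separated), using up to JOBS processes.
parallel_walk() {    # usage: parallel_walk ROOT DEPTH JOBS
    list_shallow "$1" "$2"
    if have_gnu_parallel; then
        # Default grouped output keeps each job's records together.
        list_roots "$1" "$2" | parallel -0 -j "$3" "$FIND" {} -print0
    else
        # Fallback: one sequential find over the same part of the tree.
        "$FIND" "$1" -mindepth "$2" -print0
    fi
}

# Demo on a throwaway tree (safe to run anywhere).
demo=$(mktemp -d)
mkdir -p "$demo/a/x" "$demo/b/y"
touch "$demo/a/x/f1" "$demo/b/y/f2" "$demo/top"
parallel_walk "$demo" 2 4 | tr '\0' '\n' | sort
rm -rf "$demo"
```

    With DEPTH set to 1 the starting points are the top-level entries, similar to /* in the question; a larger DEPTH yields more, smaller jobs. Inside gupdatedb the stream would still have to go through sort and frcode as before.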

    Original Answer

    Go to around line 250 of the file /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb and comment out the checkbinary call with a hash sign so it looks like this:

    for binary in $find $frcode
    do
      # A lone ":" keeps the loop body non-empty, which sh requires.
      : #checkbinary $binary
    done
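
    Why the check failed in the first place can be reproduced in isolation. The sketch below is a stand-in, not the script's actual code: it assumes checkbinary is essentially a test -x applied to each word of the unquoted $find, which would explain why -j32 is the name reported in the error. It also shows that a ${var:=value} expansion needs the leading ":"; without it the shell runs the expanded value as a command, which matches the trace in the question where parallel starts and waits for input on the terminal.

```shell
#!/bin/sh
# Stand-in for the script's check. Assumption: it is essentially `test -x`.
# Unlike the real script, it prints to stdout and returns instead of exiting.
checkbinary() {
    if test -x "$1"; then
        echo "ok: $1"
    else
        echo "updatedb needs to be able to execute $1, but cannot."
        return 1
    fi
}

# A multi-word value shaped like the one in the question (paths are examples).
find='/bin/sh -j32 /bin/ls'

# Unquoted $find is split on whitespace, so every word is checked on its own
# and the option -j32 is treated as if it were a program.
check_all() {
    for binary in $find; do
        checkbinary "$binary" || return 1
    done
}
check_all || true

# With the leading ":" the expansion only assigns a default value.
# Without it, the shell would run the expanded words as a command.
unset demo
: ${demo:=echo hello}
printf 'demo is set to: %s\n' "$demo"
```

    This is why putting the whole parallel invocation into $find needs the check disabled, and why every default-value line in the script starts with ":".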