I'm trying to search my directory and count how many times each individual file is referenced within the contents of all the files in that directory.
Essentially, I want a more efficient version of what I am doing now, which is copying and pasting each filename into 'search in this folder'; there are around 400 files. As output, I think the most useful format would be a list of each search term (filename) and the number of unique files it occurs in. I am most interested in the files that have no occurrences, as these are now redundant and can probably be deleted.
My current thinking is to save a list of the filenames to a file called searchterms, and use grep -r -f searchterms to find all occurrences of each filename. I've not had much luck with this, however, as my use of -c so far has just resulted in the file being listed, not the search term.
Thanks in advance!
Example of usage:
file1:
    include file3
    include file3
file2:
    content
file3:
    content
file4:
    include file3
Search terms would be: file1, file2, file3, file4.
Returned output (in some similar form):
file1: occurs in 0 files
file2: occurs in 0 files
file3: occurs in 2 files
file4: occurs in 0 files
for f1 in *; do cnt=0; for f2 in *; do grep -qw "$f1" "$f2" && ((++cnt)); done; echo "$cnt $f1"; done
1 abc-file
0 abc.lst
1 abc0-file
1 abc_-file
0 def-file
0 fixedlen
0 num1000000
0 num128
0 num30000
0 num8
0 num_%header
0 par-test.sh
0 tsv-file.tsv
Human readable:
for f1 in *
do
    cnt=0
    for f2 in *
    do
        grep -qw "$f1" "$f2" && ((++cnt))
    done
    echo "$cnt $f1"
done
Putting the hit counter first in the output makes it easy to sort the result with a plain sort -n. For larger match counts (>9), printf would help to produce a cleanly aligned table.
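To make the sorting and alignment remark concrete, here is a minimal sketch of the same loop with printf and sort -n added. The function name count_refs, the [ -f ] check that skips directories, and the -- guard before the pattern are my own additions, not part of the loop above; the demo files are the four from the question.

```shell
# count_refs DIR: hypothetical helper wrapping the loop above.
# Prints "<count>  <name>" for every entry in DIR, lowest count first.
count_refs() (
    cd "$1" || exit 1
    for f1 in *; do
        cnt=0
        for f2 in *; do
            [ -f "$f2" ] || continue        # skip directories (my addition)
            grep -qw -- "$f1" "$f2" && cnt=$((cnt + 1))
        done
        printf '%5d  %s\n' "$cnt" "$f1"     # right-align the count
    done | sort -n
)

# Demo on the four files from the question, built in a temporary directory.
demo=$(mktemp -d)
printf 'include file3\ninclude file3\n' > "$demo/file1"
printf 'content\n' > "$demo/file2"
printf 'content\n' > "$demo/file3"
printf 'include file3\n' > "$demo/file4"
count_refs "$demo"
# Output:
#     0  file1
#     0  file2
#     0  file4
#     2  file3
rm -rf "$demo"
```

Since the unreferenced files are the interesting ones, the result could be filtered further, for example with awk '$1 == 0 {print $2}', assuming the file names contain no spaces.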
grep -m 1 stops searching after the first hit, but -q already implies this. -w is used so that looking for file3 does not match file31. For every miss, however, the file is read from beginning to end, and this is repeated for each search term. With the roughly 400 files from the question, that is 400 x 400 = 160,000 grep invocations. Depending on the number of files, this might take a significant amount of time, in which case further optimization becomes necessary.
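One possible optimization, offered only as a sketch: hand all files to a single grep per search term and let -l print the name of each matching file once, so there is one grep process per term instead of one per pair of files. Every file is still read once per term, so this mainly saves process start-up cost. The name count_refs_fast is hypothetical, and -F is my addition: it makes grep take the file name literally, so the dot in a name like abc.lst is not treated as a regular-expression wildcard. Counting with wc -l assumes file names contain no newlines.

```shell
# count_refs_fast DIR: hypothetical variant with one grep call per search term.
count_refs_fast() (
    cd "$1" || exit 1
    for f1 in *; do
        # -l prints each matching file name once, -w matches whole words,
        # -F takes the name literally. Error messages (for example about
        # subdirectories) are discarded.
        cnt=$(grep -lwF -- "$f1" * 2>/dev/null | wc -l | tr -d ' ')
        printf '%5d  %s\n' "$cnt" "$f1"
    done | sort -n
)

# Demo on the four files from the question, built in a temporary directory.
demo=$(mktemp -d)
printf 'include file3\ninclude file3\n' > "$demo/file1"
printf 'content\n' > "$demo/file2"
printf 'content\n' > "$demo/file3"
printf 'include file3\n' > "$demo/file4"
count_refs_fast "$demo"
rm -rf "$demo"
```

For the question's example this reports file3 with a count of 2 and the other three files with 0, the same result as the nested loop.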