
One parameter for multiple patterns - grep


I'm trying to search PDF files from the terminal, passing the search string on the command line. The search string can be a single word, multiple words combined with AND/OR, or an exact phrase. I would like to use a single parameter for all of these queries. I'll save the command below as a shell script and call it through an alias defined in .aliases (zsh or bash).

Following from sjr's answer, here: search multiple pdf files.

I've used sjr's answer like this:

find ${1} -name '*.pdf' -exec sh -c 'pdftotext "{}" - |
      grep -E -m'${2}' --line-buffered --label="{}" '"${3}"' '${4}'' \;

$1 takes path

$2 limits the number of results

$3 is the context parameter (it accepts -A, -B, -C, individually or combined)

$4 takes search string

The issue I am facing is with the value of $4. As mentioned above, I want this parameter to carry my search string, which can be a phrase, a single word, or multiple words in an AND/OR relation.

I am not getting the desired results. Phrase searches returned nothing until I followed Robin Green's comment, and even now the phrase results are not accurate.

Edit: Text from judgments:

The original rule was that you could not claim for psychiatric injury in 
negligence. There was no liability for psychiatric injury unless there was also 
physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried 
both about fraudulent claims and that if they allowed claims, the floodgates would 
open. 

The claimant was 15 metres away behind a tram and did not see the accident but 
later saw blood on the road. She suffered nervous shock and had a miscarriage. She 
sued for negligence. The court held that it was not reasonably foreseeable that 
someone so far away would suffer shock and no duty of care was owed.

White v Chief Constable of South Yorkshire [1998] The claimants were police
officers who all had some part in helping victims at Hillsborough and suffered 
psychiatric injury. The House of Lords held that rescuers did not have a special 
position and had to follow the normal rules for primary and secondary victims. 
They were not in physical danger and not therefore primary victims. Neither could 
they establish they had a close relationship with the injured so failed as 
secondary victims. It is necessary to define `nervous shock' which is the rather 
quaint term still sometimes used by lawyers for various kinds of 
psychiatric injury...rest of para

word1 can be: shock, (nervous shock)

word2 can be: psychiatric

exact phrase: (nervous shock)

Commands

alias s='sh /path/shell/script.sh'
export p='path/pdf/files'

In terminal:

s "$p" 10 -5 "word1/|word2"          #for OR search
s "$p" 10 -5 "word1.*word2.*word3"   #for AND search
s "$p" 10 -5  ""exact phrase""       #for phrase search

Second test sample: an example PDF file, since the command runs on PDF documents: Test-File. It's 4 pages (part of a 361-page file).

If we run the following command on it, as the solution mentions:

s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt

we'll get the relevant text and avoid going through the entire file. I thought it would be a cool way to read just what we want rather than taking the traditional approach.


Solution

  • You need to:

    • pass a double-quoted command string to sh -c in order for the embedded shell-variable references to be expanded (which then requires escaping embedded " instances as \").

    • quote the regex with printf %q for safe inclusion in the command string - note that this requires bash, ksh, or zsh as the shell.

    dir=$1
    numMatches=$2
    context=$3
    regexQuoted=$(printf %q "$4")
    
    find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - |
      grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \;
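
The effect of the outer quoting on variable expansion can be seen in isolation. This is a minimal sketch, not part of the original command; the variable name phrase is hypothetical and is assumed not to be exported to child processes.

```shell
# 'phrase' is a hypothetical, non-exported variable in the calling shell.
phrase='nervous shock'

# Single quotes: the string reaches the inner sh untouched, and the inner
# shell has no variable named phrase, so "$phrase" expands to nothing there.
single=$(sh -c 'printf "[%s]" "$phrase"')

# Double quotes: the calling shell substitutes $phrase before sh starts.
# Embedded double quotes are escaped as \" so they survive into the
# command string that the inner sh parses.
double=$(sh -c "printf \"[%s]\" \"$phrase\"")

printf 'single-quoted: %s\n' "$single"
printf 'double-quoted: %s\n' "$double"
```

Because the double-quoted form pastes the value into the command text, a value that itself contains quotes or other shell metacharacters would change the command that the inner shell parses, which is why the regex is quoted before inclusion.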
    

    The 3 invocation scenarios would then be:

    s "$p" 10 -5 'word1|word2'          #for OR search
    s "$p" 10 -5 'word1.*word2.*word3'  #for AND search
    s "$p" 10 -5 'exact phrase'         #for phrase search
    

    Note that there's no need to escape | and no need to add an extra layer of double quotes around exact phrase.
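
To see how the three pattern styles behave, here is a small sketch that runs grep -E against a few sample lines instead of pdftotext output. The sample text is loosely adapted from the question and is only an illustration; the variable name sample is hypothetical.

```shell
# Four sample lines standing in for text extracted from a PDF.
sample='She suffered nervous shock and had a miscarriage.
The claimants suffered psychiatric injury.
The term nervous shock covers kinds of psychiatric injury.
Such psychiatric injury can follow a sudden shock.'

# OR search: count lines containing either word.
printf '%s\n' "$sample" | grep -E -c 'shock|psychiatric'

# AND search: both words on the same line, in this order only.
printf '%s\n' "$sample" | grep -E -c 'shock.*psychiatric'

# AND search in either order: spell out both orderings.
printf '%s\n' "$sample" | grep -E -c 'shock.*psychiatric|psychiatric.*shock'

# Phrase search: the space inside the pattern is matched literally.
printf '%s\n' "$sample" | grep -E -c 'nervous shock'
```

Two caveats follow from grep working line by line: the word1.*word2 form only finds the words in that order on a single line, and a phrase that pdftotext happens to break across two output lines will not match. The latter may be one reason phrase results look incomplete.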

    Also note that I've replaced --line-buffered with --with-filename, as I assume that's what you meant (to have the matching lines prefixed with the PDF file path).


    Note that with the above approach a shell instance must be created for every input path, which is inefficient, so consider rewriting your command as follows, which also obviates the need for printf %q (assume regex=$4):

    find "${dir}" -type f -name '*.pdf' | 
      while IFS= read -r f; do
        pdftotext "$f" - |
          grep -E -m${numMatches} --with-filename --label="$f" ${context} "${regex}"
      done
    

    The above assumes that your filenames have no embedded newlines, which is rarely a real-world concern. If it is, there are ways to solve the problem.

    An additional advantage of this solution is that it uses only POSIX-compliant shell features, but note that the grep command uses nonstandard options.
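
For filenames with embedded newlines, one possible approach, sketched here as an assumption rather than as the method described above, is to let find hand the paths to a single sh as arguments via -exec ... {} +. Nothing is parsed line by line, the regex travels as an argument instead of being pasted into the command string, and find starts one shell per batch of files rather than one per file. The function name search_pdfs is hypothetical; the grep options are the same nonstandard ones used above.

```shell
# search_pdfs DIR MAX_MATCHES CONTEXT_OPTIONS REGEX
# Hypothetical wrapper; requires pdftotext and a grep that supports
# -m, --with-filename and --label (for example GNU grep).
search_pdfs() {
  find "$1" -type f -name '*.pdf' -exec sh -c '
    num=$1 context=$2 regex=$3
    shift 3                      # the remaining arguments are file paths
    for f in "$@"; do
      # $context is deliberately unquoted so that "-A 2 -B 1" splits
      # into separate options; -e keeps a regex starting with "-" safe.
      pdftotext "$f" - |
        grep -E -m"$num" --with-filename --label="$f" $context -e "$regex"
    done
  ' sh "$2" "$3" "$4" {} +
}
```

With the script body replaced by a call such as search_pdfs "$1" "$2" "$3" "$4", the invocations shown earlier, for example s "$p" 10 -5 'exact phrase', would stay the same.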