regex, bash, file-rename

Shell script to rename file with string from inside file


I have searched forums and Stack Overflow for this; it must be here somewhere, but I couldn't find it.
I'm on a Mac, using the terminal to run a shell script that renames some PDF files based on their content.

I have a directory full of PDFs that I'm exporting to text files using the open-source PDFBox. The resulting files have the same name as the PDF file but end in .txt. I created the text files so that I could find a string inside each file with the format Page xx Question xx; for example, Page 43 Question 2. Given this example, I would like to rename the PDF file to pg43_q2.pdf.

I think the regular expression I want is this: /Page\s+(\d+)\s+Question\s+(\d+)/ but I'm not sure how to read the two captured numbers and save them into a string that I can use as a filename.

The script I have so far is:

#!/bin/sh
PDF_FILE_PATH=$1
echo "Converting pdfs at $PDF_FILE_PATH"

find "$PDF_FILE_PATH" -name '*.pdf' -print0 | while IFS= read -r -d '' filename; do
   echo "$filename"
   java -jar pdfbox-app-1.6.0.jar ExtractText "$filename" "$filename.txt"
   NEWNAME=$(sed -n -e '/Page/s/Page\s+\(\d+\)\s+Question\s+\(\d+\).*$/pg\1_q\2/p' "$filename.txt")
   echo "Renaming pdf $filename to $NEWNAME"
   # I would do this next, but $NEWNAME is empty
   # mv "$filename" "$PDF_FILE_PATH/$NEWNAME"
done

... but the sed command is not putting anything into the NEWNAME variable.

I'm not particularly attached to sed; any suggestions would be appreciated.

Latest edit to script uses the following sed command:

newname=$(sed -nE -e '/Page/s/^.*Page[[:blank:]]+([0-9]+)[[:blank:]]+Question[[:blank:]]+([0-9]+).*$/pg\1_q\2.pdf/p' "$filename.txt")

This works about 50% of the time, but the rest of the time the newname variable is empty when I go to rename the file.

The third line of a converted file that does work:

Unit 2 Review Page 257 Question 9  a)  12 (2)(2)(3)

The third line of a converted file that doesn't work:

Unit 2 Review Page 258 Question 16  a)  (a – 4)(a + 7) = a(a + 7) – 4(a + 7)                             = a2 + 7a – 4a – 28                              = a2 + 3a – 28   b)  (2x + 3)(5x + 2) = 2x(5x + 2) + 3(5x + 2)                                 = 10x2 + 4x + 15x + 6                                 = 10x2 + 19x + 6  c)  (–x + 5)(x + 5) = –x(x + 5) + 5(x + 5)                              = –x2 – 5x + 5x + 25                              = –x2 + 25  d)  (3y + 4)2 = (3y + 4)(3y + 4)                     = 3y(3y + 4) + 4(3y + 4)                     = 9y2 + 12y + 12y + 16                     = 9y2 + 24y + 16  e)  (a – 3b)(4a – b) = a(4a – b) – 3b(4a – b)                                = 4a2 – ab – 12ab + 3b2                                = 4a2 – 13ab + 3b2  f)  (v – 1)(2v2 – 4v – 9) = v(2v2 – 4v – 9) – 1(2v2 – 4v – 9)                                      = 2v3 – 4v2 – 9v – 2v2 + 4v + 9                                      = 2v3 – 6v2 – 5v + 9

Solution

  • Removed unhelpful original answer

    echo 'Unit 2 Review Page 257 Question 9  a)  12 (2)(2)(3)'\
    | sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'
    

    output

    pg257_q9
    
    echo 'Unit 2 Review Page 258 Question 16  a)  (a  4)(a + 7) = a(a + 7)  4(a + 7)'\
    | sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'
    

    output

    pg258_q16
    

    Otherwise, you had it right!

    (Note that the sed processing is the same for both cases).
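As a rough sketch of how this could slot back into the loop from the question, something like the following might work. The function name rename_pdfs is hypothetical, and the sketch assumes that the text files already exist next to each PDF as "name.pdf.txt" (which is what the ExtractText call in the question produces) and that all PDFs sit directly in one directory. Because the p in the sed script prints the line whether or not the substitution succeeded, the result is checked against the expected shape before renaming.

```shell
# Sketch only: rename_pdfs is a hypothetical helper, not part of any tool.
# Assumes each "$dir/name.pdf" already has a converted "$dir/name.pdf.txt".
rename_pdfs() {
  dir=$1
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue      # no PDFs: the glob stayed literal
    txt="$pdf.txt"
    [ -f "$txt" ] || continue      # text has not been extracted yet

    # Same sed script as above, with .pdf appended to the replacement.
    newname=$(sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2.pdf/;p;q;}' "$txt")

    # Rename only when the result looks like pgN_qM.pdf; an empty result or
    # an unmodified input line falls through to the skip branch.
    case $newname in
      pg[0-9]*_q[0-9]*.pdf) mv "$pdf" "$dir/$newname" ;;
      *) echo "Skipping $pdf: no usable Page/Question line" >&2 ;;
    esac
  done
}

# Usage (the path is a placeholder): rename_pdfs /path/to/pdfs
```

Note that mv will overwrite an existing file of the same name, so two PDFs that yield the same page and question numbers would collide.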

    I've wrapped the commands in an initial { and a trailing ;p;q;} so the sed script will just process the first line with 'Page', print it, and then quit.
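To illustrate the effect of the braces with p and q, here is a small sketch on multi-line input. The three-line sample is made up for the demonstration; only its middle line comes from the question.

```shell
# Hypothetical three-line sample standing in for a converted .txt file.
sample='Some header line
Unit 2 Review Page 257 Question 9  a)  12 (2)(2)(3)
See also Page 300 Question 1'

# -n suppresses the default output, so lines without "Page" print nothing.
# On the first line containing "Page": substitute, print, then quit, so the
# later "Page 300" line is never read.
first_match=$(printf '%s\n' "$sample" |
  sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}')

echo "$first_match"   # prints pg257_q9
```

One thing to keep in mind: p runs whether or not the substitution matched, so a first 'Page' line that lacks the "Question N" part would be printed unchanged.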

    I've expanded the POSIX character classes to the basic terms, i.e. [[:digit:]] = [0-9], and replaced the + with a repetition of the initial character class followed by the 'zero-or-more' character '*', making [0-9][0-9]*. My personal experience, having learned sed on a Sun 3 from O'Reilly's 2nd edition sed & awk (with the comb-binding!), is that all the POSIX stuff is a distraction and a further source of errors. I'm clearly in the minority on this here on S.O. ;-), but I'm willing to admit that newer seds have some great features.
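For comparison, here is a sketch that runs the basic (BRE) form next to the extended (ERE) form with -E, which both BSD and GNU sed accept as far as I know. The variable names are only for illustration. It also hints at why the first attempt in the question printed nothing: in a basic regular expression a bare + is a literal plus sign, and \d is not a digit class in sed.

```shell
line='Unit 2 Review Page 258 Question 16  a)  (a + 7)'

# Basic regular expressions: escaped groups, "one or more" spelled as X X*.
bre=$(printf '%s\n' "$line" |
  sed -n 's/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*/pg\1_q\2/p')

# Extended regular expressions (-E): bare parentheses and + work directly.
ere=$(printf '%s\n' "$line" |
  sed -nE 's/.*Page[ ]+([0-9]+)[ ]+Question[ ]+([0-9]+).*/pg\1_q\2/p')

echo "$bre"   # prints pg258_q16
echo "$ere"   # prints pg258_q16
```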

    I hope this helps.