Bulk rename PDF files using a specific line of their content in Linux


I have multiple PDF files that I want to rename. The new name should be taken from a specific line (let's say the 5th) of each PDF's content. For example, if a file's 5th line contains "some string", then "some string" should become the name of that file. The same goes for the rest of the files: each one should be renamed to its own 5th line. I tried this in the terminal:

for pdf in *.pdf
do
   filename=`basename -s .pdf "${pdf}"`
   newname=`awk 'NR==5' "${filename}.pdf"`
   mv "${pdf}" "${newname}"
done

It renames the files, but the new name is an invalid string. I know the system doesn't see a PDF as plain text: there are images, metadata, XML tags, and so on. Is there a way to take the content from that line?


Solution

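  • Out of the box, bash and its usual utilities cannot read PDF files. However, less can often recover the text from a PDF file. This works on systems where less is set up with an input preprocessor (lesspipe, via the LESSOPEN variable) that converts PDFs to text. You could change your script as follows: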

    for pdf in *.pdf
    do
        mv "$pdf" "$(less "$pdf" | sed '5q;d').pdf"
    done
    

    Explanation:

    • less "$pdf": outputs the text part of the PDF file, preserving the document's spacing.
      • Run it on a few files first to check that less returns the desired output.
    • sed '5q;d': prints only the 5th line of its input, then quits.
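To see what the sed expression does on its own, here is a minimal sketch on plain text rather than a PDF. The six-word input and the variable name fifth are made up for illustration.

```shell
# On line 5, "q" prints the line and quits; on every other line, "d" deletes it.
fifth=$(printf '%s\n' one two three four five six | sed '5q;d')
echo "$fifth"   # prints: five
```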

    Optionally, you could use the following command to remove blank lines and collapse excess spaces:

    mv "$pdf" "$(less "$pdf" | sed -e '/^\s*$/d' -e 's/ \+/ /g' | sed '5q;d').pdf"
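If less does not convert PDFs on your system, a similar result may be possible by calling a text extractor directly. The sketch below assumes pdftotext (from poppler-utils) is installed, which the original answer does not mention. The function names nth_line_name and rename_pdfs_by_line are hypothetical helpers introduced here. Like the optional variant above, it counts only non-blank lines and collapses repeated spaces. It also replaces "/" in the extracted line, skips files with no usable line, and refuses to overwrite existing files.

```shell
#!/usr/bin/env bash
# Sketch: rename each PDF after the Nth non-blank line of its extracted text.
# Assumes pdftotext (poppler-utils) is installed; function names are illustrative.

# Read text on stdin and print its Nth non-blank line, with whitespace
# runs collapsed, edges trimmed, and "/" replaced by "-".
nth_line_name() {
    awk -v n="$1" '
        NF {                      # skip blank and whitespace-only lines
            if (++count == n) {
                $1 = $1           # rebuild the record: collapses whitespace, trims edges
                gsub("/", "-")    # "/" cannot appear in a file name
                print
                exit
            }
        }'
}

# Rename every *.pdf in the current directory using line N (default 5).
rename_pdfs_by_line() {
    local n="${1:-5}" pdf newname
    for pdf in *.pdf; do
        [ -e "$pdf" ] || continue          # the glob matched nothing
        newname=$(pdftotext "$pdf" - | nth_line_name "$n")
        if [ -z "$newname" ]; then
            echo "skipping $pdf: no usable line $n" >&2
            continue
        fi
        mv -n -- "$pdf" "$newname.pdf"     # -n: never overwrite an existing file
    done
}
```

After sourcing the script, running rename_pdfs_by_line 5 in the directory that holds the PDFs would apply the rename. It is worth trying it on copies first, since the extracted line order depends on how each PDF was produced.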