bash terminal powerpoint

Using shell script to extract the number of slides in a .ppt file


I found this implementation in Java, but I was wondering whether it is possible to get the number of slides in a ppt file using a shell script. If so, would it be similar to doing the same for pptx files?

- Look through the directory that the script file is in
- Detect and count the number of slides in a ppt file
- Take that number and append it to a CSV file

I found a bash script that does something similar, but for PDF files:

#!/bin/bash
# Count the pages of every PDF under the current directory (macOS: relies on mdls)

totalPages=0

echo "file path, number of pages" > log_3.csv
# Read NUL-delimited paths so file names containing spaces survive intact
while IFS= read -r -d '' eachFile; do
  pageCount=$(mdls "$eachFile" | grep kMDItemNumberOfPages | awk -F'= ' '{print $2}')

  if [ -z "$pageCount" ]
  then
    # these files had no entry for kMDItemNumberOfPages
    # comment out the next line to not list these files
    echo "$eachFile : ** Skipped - no page count **"
  else
    # comment out the next line if you don't want to see a count for each file
    echo "$eachFile, $pageCount" >> log_3.csv
    totalPages=$((totalPages + pageCount))
  fi
done < <(find . -name "*.pdf" -print0)

echo "Total number of pages, ${totalPages}" >> log_3.csv
echo "Total pages: $totalPages"

Could we refactor this code to make it work with ppt files?

Thanks!


Solution

  • Let me answer half of your question.
    For pptx files, you can get the number of slides with:

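    #!/bin/bash
    
    # Print the slide count recorded in a pptx file's docProps/app.xml
    function pagecount() {
        local pptx=$1
        local pagecount line
        # "|| [[ -n "$line" ]]" keeps a final line that lacks a trailing newline
        while read -r line || [[ -n "$line" ]]; do
            if [[ "$line" =~ \<Slides\>([0-9]+)\</Slides\> ]]; then
                pagecount="${BASH_REMATCH[1]}"
            fi
        done < <(unzip -j -p "$pptx" "docProps/app.xml")
        echo "$pagecount"
    }
    
    shopt -s nullglob    # run the loop zero times if no .pptx file matches
    for file in *.pptx; do
        count=$(pagecount "$file")
        echo "${file} : ${count} slides"
    done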
    

    As with other MS Office 2007+ files (docx, xlsx, ...), a pptx file is just a zip archive of XML files. You can find the slide count in the docProps/app.xml file in the form of <Slides>n</Slides>.
    The code above extracts docProps/app.xml to stdout and then parses it for the Slides property.
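
To get closer to the CSV output the question asks for, the same idea can be sketched as follows. The function name `slide_count_from_xml` and the output file `slides.csv` are made up for illustration, and the sketch assumes each file's docProps/app.xml actually carries a `<Slides>` element; files without one are reported and skipped.

```shell
#!/bin/bash
# Sketch: write "file path, number of slides" rows for every .pptx in the
# current directory. slides.csv and the function name are illustrative.

# Read app.xml on stdin and print the number inside <Slides>...</Slides>
slide_count_from_xml() {
    sed -n 's:.*<Slides>\([0-9][0-9]*\)</Slides>.*:\1:p' | head -n 1
}

shopt -s nullglob            # skip the loop entirely if no .pptx matches
out=slides.csv
total=0

echo "file path, number of slides" > "$out"
for file in *.pptx; do
    # unzip -p streams the named archive member to stdout
    count=$(unzip -p "$file" docProps/app.xml 2>/dev/null | slide_count_from_xml)
    if [[ -z "$count" ]]; then
        echo "$file : ** Skipped - no slide count **"
    else
        echo "$file, $count" >> "$out"
        total=$((total + count))
    fi
done
echo "Total number of slides, $total" >> "$out"
```

Keeping the parsing in its own function that reads stdin means it can be tried on a plain string without needing a real pptx file.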

    Regarding ppt files, the format is completely different from pptx: it is a legacy binary format rather than a zip archive of XML, so the approach above will not work on it, and you may need an external tool (wvWare or something like that) to process it.
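
Since the two formats need different handling, a script that walks a mixed directory may want to tell them apart by content rather than by extension. A minimal sketch, assuming the usual signatures (legacy Office binaries start with the OLE2 compound-file header, zip archives with `PK\003\004`); the function name `office_format` is hypothetical, and the check only identifies the container, so a .doc or any other zip would match as well:

```shell
#!/bin/bash
# Sketch: classify a presentation by its leading bytes, not its extension.
# Note: this identifies the container type only (OLE2 vs zip).
office_format() {
    local magic
    # First 8 bytes as a lowercase hex string with no separators
    magic=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \t\n')
    case "$magic" in
        d0cf11e0a1b11ae1) echo "ppt"     ;;  # OLE2 compound file (legacy binary)
        504b0304*)        echo "pptx"    ;;  # zip local-file header
        *)                echo "unknown" ;;
    esac
}

shopt -s nullglob            # skip the loop if nothing matches
for file in *.ppt *.pptx; do
    echo "$file : $(office_format "$file")"
done
```

Files reported as `pptx` can go through the app.xml approach; files reported as `ppt` would need a separate tool. One option, assuming LibreOffice is installed, is to convert them first with `soffice --headless --convert-to pptx` and then count the slides in the converted copy.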