python html bash web-scraping rename

Renaming HTML files using <title> tags


I'm relatively new to programming. I have a folder with subfolders containing several thousand HTML files that are generically named (e.g. 1006.htm, 1007.htm), and I would like to rename each one using the <title> tag from within the file.

For example, if file 1006.htm contains <title>Page Title</title>, I would like to rename it Page Title.htm. Ideally, spaces would be replaced with dashes.

I've been working in the shell with a bash script with no luck. How do I do this, with either bash or python?

This is what I have so far:

#!/usr/bin/env bash
FILES=/Users/Ben/unzipped/*
for f in $FILES
do
   if [ ${FILES: -4} == ".htm" ]
      then
    awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' $FILES
   fi
done

I've also tried

#!/usr/bin/env bash
for f in *.html;
   do
   title=$( grep -oP '(?<=<title>).*(?=<\/title>)' "$f" )
   mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}".html   
done

But I get an error from the terminal explaining how to use grep...


Solution

  • Use awk instead of grep in your bash script and it should work:

    #!/bin/bash   
    for f in *.html;
       do
       title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
       mv -i "$f" "${title//[^a-zA-Z0-9\._\- ]}".html   
    done
    

    Don't forget to change the bash path on the first line (the shebang) to match your system ;)

    EDIT: full answer with all the modifications:

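    #!/bin/bash
    # Walk all subfolders. find's output is read on fd 3 so that "mv -i" can still prompt on stdin.
    while IFS= read -r -d '' f <&3
       do
       title=$( awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' "$f" )
       # Skip files with no title; keep each renamed file in its own subfolder.
       [ -n "$title" ] && mv -i "$f" "$(dirname "$f")/${title//[ ]/-}".html
    done 3< <(find . -type f -name '*.html' -print0)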
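
Since the question also allows Python, here is a minimal sketch of the same idea in Python, not a definitive implementation. It makes several assumptions that go beyond the original post: the helper names `title_to_stem` and `rename_by_title` are made up for illustration, files are read as UTF-8 (undecodable bytes are replaced), only the first `<title>` in each file is used, each file keeps its original extension, and files with no usable title or with a clashing target name are skipped rather than overwritten.

```python
import html
import re
from pathlib import Path

# Case-insensitive match that also works when the title spans several lines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_to_stem(html_text):
    """Return a dash-separated file stem built from the first <title>, or None."""
    match = TITLE_RE.search(html_text)
    if match is None:
        return None
    # Decode entities such as &amp;, then drop everything except letters,
    # digits, dot, underscore, dash and whitespace (as in the question's attempt).
    text = re.sub(r"[^A-Za-z0-9._\s-]", "", html.unescape(match.group(1)))
    # Collapse each run of whitespace (including newlines) into one dash.
    stem = "-".join(text.split())
    return stem or None


def rename_by_title(root, dry_run=True):
    """Rename .htm/.html files under root (hypothetical helper).

    Returns a list of (old_path, new_path) pairs. With dry_run=True nothing
    is renamed; the pairs only describe what would happen.
    """
    planned = []
    seen = set()
    # sorted() materialises the listing so renames do not disturb the walk.
    for path in sorted(Path(root).rglob("*.htm*")):
        if not path.is_file() or path.suffix.lower() not in (".htm", ".html"):
            continue
        stem = title_to_stem(path.read_text(encoding="utf-8", errors="replace"))
        if stem is None:
            continue  # no usable title: leave the generic name alone
        target = path.with_name(stem + path.suffix)
        if target.exists() or target in seen:
            continue  # never overwrite: duplicate titles keep their old name
        seen.add(target)
        if not dry_run:
            path.rename(target)
        planned.append((path, target))
    return planned
```

Calling `rename_by_title("/Users/Ben/unzipped")` first only returns the planned renames, so they can be printed and checked; passing `dry_run=False` performs them. Each file stays in its own subfolder because the target is built with `with_name`.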