Tags: python, regex, coda, letters, capitalize

Python search HTML document for capital letters


So I have all these HTML documents that have strings of capital letters in various places: alt tags, title tags, link text, etc.

<li><a title='BUY FOOD' href="http://www.example.com/food.html">BUY FOOD</a></li>

What I need to do is lowercase every letter except the first letter of each word. Like so:

<li><a title='Buy Food' href="http://www.example.com/food.html">Buy Food</a></li>

Now how can I do this, either in Python or with some form of regex? I was told that my editor, Coda, could do something like this, but I can't seem to find any documentation on how.


Solution

  • I suggest you use Beautiful Soup to parse your HTML into a tree of tags, then write Python code to walk the tree and change the attribute values and body text to title case. You could use a regexp for the case change, but Python has a built-in string method that will do it:

    "BUY FOOD".title()  # returns "Buy Food"
    

    If you need a pattern to match strings that are all caps, I suggest you use: "[^a-z]*[A-Z][^a-z]*"

    This means "match zero or more of anything except a lower-case character, then a single upper-case character, then zero or more of anything except a lower-case character".

    Applied to the whole string (for example with re.fullmatch()), this pattern will correctly match "BUY 99 BEERS". It would not match "so very quiet" because that does not have even a single upper-case letter.

    P.S. You can actually pass a function to re.sub() so you could potentially do crazy powerful processing if you needed it. In your case I think Python's .title() method will do it for you, but here is another answer I posted with information about passing in a function.

    How to capitalize the first letter of each word in a string (Python)?
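
To make the P.S. concrete, here is a minimal sketch of passing a function to re.sub(). The names SHOUTED_WORD, title_case_match and fix_caps are made up for illustration, and the pattern assumes a "shouted" word is a run of two or more capital letters, which is my guess at what you want rather than anything from the question. It runs over the raw string, so it would also change capitals inside URLs or tag names if your documents have any; that is the main reason to prefer a real parser for anything non-trivial.

```python
import re

# Assumption: a "shouted" word is two or more capital letters in a row.
# Single capitals such as "A" or "I" are left alone.
SHOUTED_WORD = re.compile(r"\b[A-Z]{2,}\b")


def title_case_match(match):
    """Callback for re.sub(): return the matched word in title case."""
    return match.group(0).title()


def fix_caps(html):
    """Title-case every all-caps word found anywhere in the string."""
    return SHOUTED_WORD.sub(title_case_match, html)


snippet = '<li><a title="BUY FOOD" href="http://www.example.com/food.html">BUY FOOD</a></li>'
print(fix_caps(snippet))
```

For the sample line from the question this changes both the title attribute and the link text and leaves the lowercase tag names and URL untouched.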