How to perform a tag-agnostic text string search in an HTML file?


I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.

This also means that all reported character positions are off because LT doesn't "see" the tags.

For example, if I check the following HTML fragment:

    <p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>

LanguageTool will treat it as a plain text sentence:

    This is kind of a stupid question.

and returns the following message:

    <error category="Grammar" categoryid="GRAMMAR" context="                This is kind of a stupid question.    " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>

(In this particular example, LT has flagged "kind of a.")

Since the search string might be wrapped in tags and might occur multiple times, I can't do a simple index search.

What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)

In other words, I need a search that ignores all tags when matching but still counts them toward the character position.

In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:

    kin<b>d</b> o<i>f</i> a

Solution

  • This may not be the fastest approach, but pyparsing recognizes HTML tags in most of their forms. The code below inverts the typical scan: it creates a scanner that matches any single character, then configures it to skip HTML opening and closing tags as well as common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens plus the start and end location of each match, so it is easy to build a list that maps every character outside a tag to its original location. From there, the rest is little more than ''.join and an index into that list. See the comments in the code below:

    test = "<p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
    
    from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
    
    non_tag_text = Word(printables+' ',  exact=1).leaveWhitespace()
    non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
    
    # use scanString to get all characters outside of tags, and build list
    # of (char,loc) tuples
    char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
    
    # imagine a world without HTML tags...
    untagged = ''.join(ch for ch, loc in char_locs)
    
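    # look for our string in the untagged text, then index into the char,loc list
    # to find the original location
    search_str = 'kind of a'
    found_at = untagged.find(search_str)
    if found_at == -1:
        # find() returns -1 on a miss, which would silently index the last character
        raise ValueError('%r not found in untagged text' % search_str)
    orig_loc = char_locs[found_at][1]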
    
    # print the test string, and mark where we found the matching text
    print(test)
    print(' '*orig_loc + '^')
    
    """
    Should look like this:
    
    <p>This &nbsp;is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
                     ^
    """