Substring any kind of HTML String


I need to split any HTML string into a list of tokens, one per top-level element. For example:

"<abc/><abc/>" #INPUT
["<abc/>", "<abc/>"] #OUTPUT

or

"<abc comfy><room /></abc> <br /> <abc/> " # INPUT
 ["<abc comfy><room /></abc>", "<br />", "<abc/>"] # OUTPUT

or

"""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""" # INPUT
[
     '<meta charset="utf-8" />',
     "<title> test123 </title>",
     '<meta name="test" content="index,follow" />',
     '<meta name="description" content="Description" />',
     '<link rel="stylesheet" href="../layout/css/default.css" />',
 ] # OUTPUT

What I tried to do:

from typing import List

def split(html: str) -> List[str]:
     if html == "":
         return []

     delimiter = "/>"
     split_name = html.split(" ", maxsplit=1)[0]
     name = split_name[1:]

     delimited_list = [character + delimiter for character in html.split(delimiter) if character]

     rest = html.split(" ", maxsplit=1)[1]

     char_delim = html.find("</")

     ### Help
     print(delimited_list)
     return delimited_list

My output:

['<abc/>', '<abc/>']
['<abc comfy><room />', '</abc> <br />', ' <abc/>', ' />']

['<meta charset="utf-8" />', '<title> test123 </title><meta name="test" content="index,follow" />', '<meta name="description" content="Description" />', '<link rel="stylesheet" href="../layout/css/default.css" />']

So I tried splitting at "/>", which works for the first case but not for the others. I then tried several things, such as identifying the "name", i.e. the first identifier in the HTML string, like "abc".

Do you guys have any idea how to continue?

Thanks!

Greetings Nick


Solution

  • Use a stack while iterating over the string. Push the start position of every opening tag onto the stack. Any other tag we encounter is one of two cases:

    1. a closing tag, whose name we assume matches the name of the tag beginning at the position on the top of the stack

    2. a self-closing tag

    We also maintain a result list to save the parsed substrings.

    For 1), we pop the position on the top of the stack. If the stack is now empty, the element is a top-level one, so we save the substring from the popped position to the end of the closing tag to the result list. If the stack is not empty, the element is nested and will be saved as part of its outermost ancestor.

    For 2), we do not modify the stack. We save the self-closing tag substring to the result list only if the stack is empty, i.e. the tag is not nested inside another element.

    After encountering any tag (opening, closing, self-closing), we move the current position forward to the end of that tag (from < to the corresponding >).

    If no tag starts at the current position, we move forward by one character and try again until a tag matches.

    Here is my attempt:

    import re

    def split(html):
        if html == "":
            return []

        # Tag names start with a letter and may contain digits or hyphens (e.g. h1).
        # The optional (?:\s[^>]*)? group holds the attributes; (?<!/) rejects "/>".
        openingTagPattern = re.compile(r"<([a-zA-Z][\w-]*)(?:\s[^>]*)?(?<!/)>")
        closingTagPattern = re.compile(r"</([a-zA-Z][\w-]*)\s*>")
        selfClosingTagPattern = re.compile(r"<([a-zA-Z][\w-]*)[^>]*/>")

        result = []
        stack = []  # start positions of the currently open tags

        i = 0
        while i < len(html):
            # pattern.match(html, i) matches at position i without copying html[i:]
            match = openingTagPattern.match(html, i)
            if match: # opening tag
                stack.append(i) # push position of start of opening tag onto stack
                i = match.end()
                continue

            match = closingTagPattern.match(html, i)
            if match: # closing tag
                i = match.end()
                if stack: # ignore a stray closing tag instead of raising IndexError
                    start = stack.pop() # start of the corresponding opening tag
                    if not stack: # save only top-level elements
                        result.append(html[start:i])
                continue

            match = selfClosingTagPattern.match(html, i)
            if match: # self-closing tag
                if not stack: # nested self-closing tags belong to their parent
                    result.append(match[0])
                i = match.end()
                continue

            i += 1 # otherwise crawl until we can match a tag

        return result # reached the end of the string
    

    Usage:

    delimitedList = split("""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""")
    
    for item in delimitedList:
        print(item)
    

    Output:

    <meta charset="utf-8" />
    <title> test123 </title>
    <meta name="test" content="index,follow" />
    <meta name="description" content="Description" />
    <link rel="stylesheet" href="../layout/css/default.css" />
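
The algorithm described above assumes that every closing tag matches the most recently opened tag. As a minimal sketch of how that assumption could be checked instead, the variant below stores each tag's name on the stack next to its position and raises an error on a mismatch. The function name split_checked, the single TAG pattern, and the choice of ValueError are illustrative assumptions, not part of the original answer, and like any regex-based approach it is not a full HTML parser (comments, void elements such as a bare <br>, and ">" inside attribute values are not handled).

```python
import re
from typing import List, Tuple

# Hypothetical helper pattern: classifies every tag in one pass.
# group(1) is "/" for closing tags, group(2) the tag name, group(3) the rest.
TAG = re.compile(r"<(/?)([a-zA-Z][\w-]*)([^>]*)>")


def split_checked(html: str) -> List[str]:
    """Return the top-level elements of html, raising ValueError on mismatched tags."""
    result: List[str] = []
    stack: List[Tuple[str, int]] = []  # (tag name, start offset) of open tags

    for match in TAG.finditer(html):
        is_closing = match.group(1) == "/"
        name = match.group(2)
        is_self_closing = match.group(3).rstrip().endswith("/")

        if is_closing:
            # The closing tag must match the most recently opened tag.
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unexpected </{name}> at offset {match.start()}")
            _, start = stack.pop()
            if not stack:  # only top-level elements are saved
                result.append(html[start:match.end()])
        elif is_self_closing:
            if not stack:  # nested self-closing tags belong to their parent
                result.append(match.group(0))
        else:
            stack.append((name, match.start()))

    if stack:
        raise ValueError(f"unclosed <{stack[-1][0]}>")
    return result


print(split_checked("<abc comfy><room /></abc> <br /> <abc/> "))
```

Under these assumptions, the second example from the question yields the three top-level elements, while an input such as "<a><b></a>" raises ValueError instead of silently producing a wrong slice.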
    

    References:

    The openingTagPattern is inspired by @Kobi's answer here: https://stackoverflow.com/a/1732395/12109043