python html parsing web-crawler tokenize

Split any kind of HTML string into substrings

I need to split any kind of HTML code (a string) into a list of tokens. For example:

"<abc/><abc/>" #INPUT
["<abc/>", "<abc/>"] #OUTPUT

"<abc comfy><room /></abc> <br /> <abc/> " # INPUT
 ["<abc comfy><room /></abc>", "<br />", "<abc/>"] # OUTPUT

"""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""" # INPUT
[
     '<meta charset="utf-8" />',
     "<title> test123 </title>",
     '<meta name="test" content="index,follow" />',
     '<meta name="description" content="Description" />',
     '<link rel="stylesheet" href="../layout/css/default.css" />',
 ] # OUTPUT

What I tried:

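from typing import List

def split(html: str) -> List[str]:
    if html == "":
        return []

    delimiter = "/>"
    # Attempt to identify the name of the first tag, e.g. "abc" (not used yet)
    split_name = html.split(" ", maxsplit=1)[0]
    name = split_name[1:]

    # Split at every "/>" and re-attach the delimiter to each piece
    delimited_list = [character + delimiter for character in html.split(delimiter) if character]

    # Everything after the first space (not used yet); partition() avoids an
    # IndexError when the string contains no space
    rest = html.partition(" ")[2]

    char_delim = html.find("</")

    ### Help
    print(delimited_list)
    return delimited_list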

My output:

['<abc/>', '<abc/>']
['<abc comfy><room />', '</abc> <br />', ' <abc/>', ' />']

['<meta charset="utf-8" />', '<title> test123</title><meta name="test" content="index,follow" />', '<meta name="description" content="Description123" />', '<link rel="stylesheet" href="../xx/css/default.css" />']

So I tried splitting at "/>", which works for the first case. Then I tried several other things, such as identifying the "name", i.e. the first identifier in the HTML string, like "abc".

Does anyone have an idea how to continue?

Thanks!

Greetings Nick

Solution

You will need a stack. Iterate over the string and push the position of each opening tag onto the stack. When you encounter a tag that closes something, assume one of two cases:

1) its name matches the name of the tag that begins at the position on top of the stack
2) it is a self-closing tag

We also maintain a result list to store the parsed substrings.

For 1), pop the position from the top of the stack and append to the result list the substring from that position to the end of the closing tag.

For 2), leave the stack untouched and append only the self-closing tag itself to the result list.

After matching any tag (opening, closing, or self-closing), advance the current position by the length of that tag (from < to the corresponding >).

If the string at the current position does not start with any tag, advance the position by one character and keep crawling until a tag matches again.

Here is my attempt:

import re

def split(html):
    if html == "":
        return []

    # Compiled once; pattern.match(html, i) matches at position i without slicing a copy
    openingTagPattern = re.compile(r"<([a-zA-Z]+)(?:\s[^>]*)?(?<!/)>")
    closingTagPattern = re.compile(r"</([a-zA-Z]+)[^>]*>")
    selfClosingTagPattern = re.compile(r"<([a-zA-Z]+)[^>]*/>")

    result = []
    stack = []

    i = 0
    while i < len(html):
        match = openingTagPattern.match(html, i)
        if match: # opening tag
            stack.append(i) # push position of start of opening tag onto stack
            i = match.end()
            continue

        match = closingTagPattern.match(html, i)
        if match: # closing tag
            i = match.end()
            if stack: # ignore a stray closing tag instead of popping an empty stack
                result.append(html[stack.pop():i]) # pop position of start of corresponding opening tag from stack
            continue

        match = selfClosingTagPattern.match(html, i)
        if match: # self-closing tag
            result.append(match[0])
            i = match.end()
            continue

        i += 1 # otherwise crawl until we can match a tag

    return result # reached the end of the string

Usage:

delimitedList = split("""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""")

for item in delimitedList:
    print(item)

Output:

<meta charset="utf-8" />
<title> test123 </title>
<meta name="test" content="index,follow" />
<meta name="description" content="Description" />
<link rel="stylesheet" href="../layout/css/default.css" />
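
One caveat: the split function above saves every element it closes, including nested ones, so for the second example in the question it also returns "<room />" on its own. If only the outermost elements are wanted, as in the expected output of the question, a depth counter can replace the stack. The following is only a rough sketch under that assumption; the name split_top_level and the module-level pattern names are mine, and the regular expressions are adapted from the ones above, so they share the same limitations (for instance, tag names containing digits such as h1 are not recognised).

```python
import re
from typing import List

# Hypothetical helper names; the patterns are adapted from split() above.
OPENING = re.compile(r"<[a-zA-Z]+(?:\s[^>]*)?(?<!/)>")
CLOSING = re.compile(r"</[a-zA-Z]+[^>]*>")
SELF_CLOSING = re.compile(r"<[a-zA-Z]+[^>]*/>")


def split_top_level(html: str) -> List[str]:
    """Return only the outermost elements of an HTML fragment."""
    result = []
    depth = 0  # number of elements currently open
    start = 0  # position where the current outermost element begins
    i = 0
    while i < len(html):
        match = OPENING.match(html, i)
        if match:  # opening tag
            if depth == 0:
                start = i  # a new outermost element begins here
            depth += 1
            i = match.end()
            continue

        match = CLOSING.match(html, i)
        if match:  # closing tag
            i = match.end()
            if depth > 0:  # stray closing tags are skipped
                depth -= 1
                if depth == 0:  # the outermost element is complete
                    result.append(html[start:i])
            continue

        match = SELF_CLOSING.match(html, i)
        if match:  # self-closing tag
            i = match.end()
            if depth == 0:  # keep it only when it is not nested
                result.append(match.group(0))
            continue

        i += 1  # otherwise crawl until a tag matches

    return result


print(split_top_level("<abc comfy><room /></abc> <br /> <abc/> "))
# ['<abc comfy><room /></abc>', '<br />', '<abc/>']
```

Like the original, this does not check that a closing tag's name matches the element it closes; it only counts nesting depth, so it relies on the input being well-formed.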

References:

The openingTagPattern is inspired by @Kobi's answer here: https://stackoverflow.com/a/1732395/12109043