I need to split an arbitrary HTML string into a list of tokens. For example:
"<abc/><abc/>" #INPUT
["<abc/>", "<abc/>"] #OUTPUT
or
"<abc comfy><room /></abc> <br /> <abc/> " # INPUT
["<abc comfy><room /></abc>", "<br />", "<abc/>"] # OUTPUT
or
"""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""" # INPUT
[
'<meta charset="utf-8" />',
"<title> test123 </title>",
'<meta name="test" content="index,follow" />',
'<meta name="description" content="Description" />',
'<link rel="stylesheet" href="../layout/css/default.css" />',
] # OUTPUT
What I tried:
from typing import List

def split(html: str) -> List[str]:
    if html == "":
        return []
    delimiter = "/>"
    split_name = html.split(" ", maxsplit=1)[0]
    name = split_name[1:]
    delimited_list = [character + delimiter for character in html.split(delimiter) if character]
    rest = html.split(" ", maxsplit=1)[1]
    char_delim = html.find("</")
    ### Help
    print(delimited_list)
    return delimited_list
My output:
['<abc/>', '<abc/>']
['<abc comfy><room />', '</abc> <br />', ' <abc/>', ' />']
['<meta charset="utf-8" />', '<title> test123</title><meta name="test" content="index,follow" />', '<meta name="description" content="Description123" />', '<link rel="stylesheet" href="../xx/css/default.css" />']
So I tried splitting at "/>", which works for the first case. Then I tried several other things, such as identifying the "name", i.e. the first identifier in the HTML string, like "abc".
Do you guys have any idea how to continue?
Thanks!
Greetings Nick
You will need a stack. Iterate over the string and push the start position of every opening tag onto the stack. When you encounter a tag that ends an element, assume it is one of two cases:
1) a closing tag whose name matches the name of the tag beginning at the position on top of the stack
2) a self-closing tag
We also maintain a result list to store the parsed substrings.
For 1), pop the position from the top of the stack and append to the result list the substring from that popped position to the end of the closing tag.
For 2), leave the stack untouched and append only the self-closing tag substring to the result list.
After matching any tag (opening, closing, or self-closing), advance the current position by the length of that tag (from < to the corresponding >).
If the html string from the current position onward does not start with a tag, advance the position by one character and try again, crawling forward until a tag matches.
Here is my attempt:
import re

def split(html):
    if html == "":
        return []
    openingTagPattern = r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!\/)>"
    closingTagPattern = r"<\/([a-zA-Z]+).*?>"
    selfClosingTagPattern = r"<([a-zA-Z]+).*?\/>"
    result = []
    stack = []
    i = 0
    while i < len(html):
        match = re.match(openingTagPattern, html[i:])
        if match:  # opening tag
            stack.append(i)  # push position of start of opening tag onto stack
            i += len(match[0])
            continue
        match = re.match(closingTagPattern, html[i:])
        if match:  # closing tag
            i += len(match[0])
            result.append(html[stack.pop():i])  # pop position of start of corresponding opening tag from stack
            continue
        match = re.match(selfClosingTagPattern, html[i:])
        if match:  # self-closing tag
            start = i
            i += len(match[0])
            result.append(html[start:i])
            continue
        i += 1  # otherwise crawl until we can match a tag
    return result  # reached the end of the string
Usage:
delimitedList = split("""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""")
for item in delimitedList:
    print(item)
Output:
<meta charset="utf-8" />
<title> test123 </title>
<meta name="test" content="index,follow" />
<meta name="description" content="Description" />
<link rel="stylesheet" href="../layout/css/default.css" />
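One caveat: the function above appends every element it closes, so for nested input such as the question's second example the inner <room /> is also returned as a separate item. Below is a minimal sketch, not a definitive implementation, of a variant that keeps only top-level elements by appending to the result only when the stack is empty. The name split_top_level and the guard against stray closing tags are assumptions added for illustration; the three patterns are the ones used above. Like the original, it assumes reasonably well-formed input and does not check that closing tag names actually match.

```python
import re

# Same three patterns as in the answer above, compiled once.
OPENING_TAG = re.compile(r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!/)>")
CLOSING_TAG = re.compile(r"</([a-zA-Z]+).*?>")
SELF_CLOSING_TAG = re.compile(r"<([a-zA-Z]+).*?/>")


def split_top_level(html):
    """Return only the outermost elements of html, in document order."""
    result = []
    stack = []  # start positions of the currently open tags
    i = 0
    while i < len(html):
        match = OPENING_TAG.match(html, i)
        if match:
            stack.append(i)
            i = match.end()
            continue
        match = CLOSING_TAG.match(html, i)
        if match:
            i = match.end()
            if stack:  # ignore a stray closing tag instead of raising
                start = stack.pop()
                if not stack:  # the element just closed was top-level
                    result.append(html[start:i])
            continue
        match = SELF_CLOSING_TAG.match(html, i)
        if match:
            if not stack:  # keep it only when it is not nested
                result.append(match[0])
            i = match.end()
            continue
        i += 1  # no tag starts here, move on by one character
    return result


print(split_top_level("<abc comfy><room /></abc> <br /> <abc/> "))
# ['<abc comfy><room /></abc>', '<br />', '<abc/>']
```

Passing the position to match() instead of slicing html[i:] also avoids copying the rest of the string on every step.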
References:
The openingTagPattern is inspired by @Kobi's answer here: https://stackoverflow.com/a/1732395/12109043
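To see how the three patterns divide the work, here is a small sketch that classifies individual tags in the same order the split function tries them. The helper name classify is hypothetical and exists only for this illustration. The point to notice is the negative lookbehind (?<!\/) in the opening pattern: it rejects any tag whose final > is preceded by a slash, which is what lets self-closing tags fall through to the third pattern.

```python
import re

# The three patterns from the answer above, unchanged.
opening = re.compile(r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!\/)>")
closing = re.compile(r"<\/([a-zA-Z]+).*?>")
self_closing = re.compile(r"<([a-zA-Z]+).*?\/>")


def classify(tag):
    """Name the first pattern that matches at the start of tag, or None."""
    if opening.match(tag):
        return "opening"
    if closing.match(tag):
        return "closing"
    if self_closing.match(tag):
        return "self-closing"
    return None


for tag in ["<title>", "<abc comfy>", "</title>", "<br />", "<abc/>", "plain text"]:
    print(repr(tag), "->", classify(tag))
# '<title>' -> opening
# '<abc comfy>' -> opening
# '</title>' -> closing
# '<br />' -> self-closing
# '<abc/>' -> self-closing
# 'plain text' -> None
```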