python beautifulsoup lxml python-re

re.search only returns part of the string


I want to use Python's re module to get the contents between <script>...</script> tags. I use re.search(r'<script>[\S\s]*</script>', myhtml), where [\S\s]* is meant to match any string. But this function behaves strangely: it only returns part of the desired content. So I made a small example to show what I mean.

import re
re.search('[\S\s]*','<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E')

The desired result is '<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E', which is the original input string. However, it prints <_sre.SRE_Match object; span=(0, 56), match='<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traiti>. As can be seen, the last part of the string, i.e. 'onal//E', is missing.

Why is that? How can I extract contents between tags?

Also, some may suggest that I use lxml or BeautifulSoup instead, but I found strange behavior with them as well:

With this code:

from lxml import etree
rr='''
<script>
<div>
im here
</div>
</script>

'''
html = etree.HTML(rr, etree.HTMLParser())
print(html.xpath('//div//text()'))

The above code prints nothing. If I change <script> to <script1>, then it prints im here as expected, and BeautifulSoup has the same behavior.


Solution

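  • To add to @Selcuk's comment, you are calling .search, which returns a re.Match object. What you see printed is only that object's repr, which shortens the matched text for display; the match itself is complete, as span=(0, 56) shows. A Match object holds zero or more groups and carries more data than the results of .findall, such as the start and end index of the match.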

    >>> match = re.search('[\S\s]*','<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E')
    >>> match
    <re.Match object; span=(0, 56), match='<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traiti>
    >>> dir(match)
    ['__class__', '__class_getitem__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
    

    You can also use .findall(), which will return a list.

    I would also suggest always using r'' strings when passing regex patterns.

    >>> re.search(r'[\S\s]*','<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E').group(0)
    '<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E'
    >>> re.findall(r'[\S\s]*','<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E')
    ['<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E', '']
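
To tie this back to the original goal, here is a minimal sketch of pulling the text between the tags with a capture group. It assumes the opening tag is written exactly as <script> with no attributes, and the html string is a stand-in modeled on the snippet in the question, not real page data. Treat it as an illustration, not a robust HTML parser.

```python
import re

text = '<!DOCTYPE HTML PUBLIC "-<!DO/W3C//DTD C 1.0Traitional//E'
match = re.search(r'[\S\s]*', text)

# The repr shortens the matched text for display; group(0) returns all of it.
full_match = match.group(0)
print(match.span())        # start and end offsets of the match
print(full_match == text)  # the whole input was matched

# Stand-in document modeled on the question's example (an assumption).
html = '''
<script>
<div>
im here
</div>
</script>
'''

# Group 1 captures whatever sits between the tags. The *? quantifier is
# non-greedy, so the match stops at the first closing tag instead of the last.
script_match = re.search(r'<script>([\S\s]*?)</script>', html)
script_body = script_match.group(1) if script_match else None
print(script_body)
```

The captured text still contains the <div> markup as plain characters. That is also why the XPath query in the question found no div: HTML parsers treat the content of a script element as raw text, not as nested elements.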