
re.findall outputs blanks along with the correct matches

I'm trying to get the list output without subgroups or empty strings. I'd like to stick with a regex-only solution because my re.split and array manipulation method is really janky and sort of slow.

HTML file: (Notice that things 3 & 4 are preceded by /b/ instead of /a/.)

<!DOCTYPE html>
        <a href=""></a>
        <a href=""></a>
        <a href=""></a>
        <a href="" ><img src="/thing4.png"></a>

Python file:

import re

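# Read the whole file; the with-block closes the handle afterwards.
with open("help.html", "r") as f:
    html = f.read()
# Raw string so the regex backslashes are not treated as string escapes.
links = re.findall(r'((?<=\.com\/a\/).*(?="))|((?<=\.com\/b\/).*(?=" ><))|((?<=\.com\/b\/).*(?="><\/a))', html)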


What it outputs when I run the above Python file:

[('thing1', '', ''), ('thing2', '', ''), ('', '', 'thing3'), ('', 'thing4', '')]

What I want it to output:

[thing1, thing2, thing3, thing4]


  • You just have to remove the capturing groups. As stated in the re.findall documentation:

    Empty matches are included in the result.

    The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

    An example of a capturing group is ((?<=\.com\/a\/).*(?=")), so its outermost parentheses should be removed, and likewise for the other two groups:

    links = re.findall(r'(?<=\.com\/a\/).*(?=")|(?<=\.com\/b\/).*(?=" ><)|(?<=\.com\/b\/).*(?="><\/a)', html)


    ['thing1', 'thing2', 'thing3', 'thing4']
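
To see how the number of capturing groups changes what re.findall returns, here is a minimal, self-contained sketch. The href values are missing from the question as posted, so the example.com URLs below are assumptions made up to fit the lookbehinds; the last pattern (`single`) is a hypothetical shorter alternative, not something taken from the question.

```python
import re

# Assumed sample input: the href values are missing from the question as
# posted, so these example.com URLs are made up to fit its lookbehinds.
html = """<!DOCTYPE html>
        <a href="https://example.com/a/thing1"></a>
        <a href="https://example.com/a/thing2"></a>
        <a href="https://example.com/b/thing3"></a>
        <a href="https://example.com/b/thing4" ><img src="/thing4.png"></a>
"""

# Three capturing groups: findall returns one 3-tuple per match, with ''
# for the groups that did not take part in that match.
grouped = re.findall(
    r'((?<=\.com\/a\/).*(?="))|((?<=\.com\/b\/).*(?=" ><))|((?<=\.com\/b\/).*(?="><\/a))',
    html,
)

# No capturing groups: findall returns each whole match as a plain string.
flat = re.findall(
    r'(?<=\.com\/a\/).*(?=")|(?<=\.com\/b\/).*(?=" ><)|(?<=\.com\/b\/).*(?="><\/a)',
    html,
)

# Exactly one capturing group: findall returns just that group's text.
# Hypothetical alternative pattern, not from the question.
single = re.findall(r'\.com/[ab]/([^"]+)"', html)

print(grouped)
print(flat)
print(single)
```

If the grouping parentheses are needed for some other reason (for example to apply a quantifier), writing them as non-capturing groups, (?:...), should keep the result a flat list of strings as well.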