python, python-3.x, regex, python-re

Python re.findall Not Matching JS Variables in HTML


I am trying to extract integers and variable values defined in JavaScript inside an HTML file, using Python 3's re.findall method.

However, I am having difficulty matching digits enclosed in " with \d*, and matching an alphanumeric string enclosed in ".

Case 1:

import re

s = """
   <script>
    var i = 1636592595;
        var j = i + Number("6876" + "52907");
   </script>
"""
pattern = r'var j = i + Number(\"(\d*)\" + \"(\d*)\");'
m = re.findall(pattern, s)
print(m) # Output: []

The desired output should contain 6876 and 52907, but an empty list [] was obtained.

Case 2:

s = """
       xhr.send(JSON.stringify({
              "bm-foo": "AAQAAAAE/////4ytkgqq/oWI",
              "pow": j
          }));
"""
pattern = r'"bm-foo": \"(\w*)\",'
m = re.findall(pattern, s)
print(m) # Output: []

The desired output should contain AAQAAAAE/////4ytkgqq/oWI, but an empty list [] was obtained.

Can someone explain why my regex patterns are not matching?


Solution

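  • In the first regexp you need to escape +, (, and ), because they are regex metacharacters: unescaped, + is a quantifier and ( ) delimit a group, so none of them matches the literal character. The double quotes do not need escaping.

    In the second regexp, use [^"]* instead of \w*, since \w matches only letters, digits, and underscores, not punctuation like /.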

    import re
    
    s = """
       <script>
        var i = 1636592595;
            var j = i + Number("6876" + "52907");
       </script>
    """
    pattern = r'var j = i \+ Number\("(\d*)" \+ "(\d*)"\);'
    m = re.findall(pattern, s)
    print(m) # Output: [('6876', '52907')]
    
    s = """
           xhr.send(JSON.stringify({
                  "bm-foo": "AAQAAAAE/////4ytkgqq/oWI",
                  "pow": j
              }));
    """
    pattern = r'"bm-foo": "([^"]*)",'
    m = re.findall(pattern, s)
    print(m) # Output: ['AAQAAAAE/////4ytkgqq/oWI']
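
If escaping each metacharacter by hand feels error-prone, one alternative is to let re.escape() handle the literal parts of the pattern and write only the capture groups as regex. The following is a minimal sketch, not part of the original answer: the variable names (html, number_pattern, token_pattern) are made up for illustration, the two snippets from the question are merged into one string, and \d+ is used in place of \d* on the assumption that each number has at least one digit.

```python
import re

# Both snippets from the question, combined into one string for illustration.
html = '''
<script>
    var i = 1636592595;
    var j = i + Number("6876" + "52907");
</script>
xhr.send(JSON.stringify({
    "bm-foo": "AAQAAAAE/////4ytkgqq/oWI",
    "pow": j
}));
'''

# re.escape() backslash-escapes the regex metacharacters in a literal string,
# so "+", "(" and ")" no longer need to be escaped by hand.
prefix = re.escape('var j = i + Number("')
middle = re.escape('" + "')
suffix = re.escape('");')
number_pattern = prefix + r'(\d+)' + middle + r'(\d+)' + suffix

# With two capture groups, findall returns a list of tuples.
numbers = re.findall(number_pattern, html)
print(numbers)  # [('6876', '52907')]

# [^"]* captures everything up to the closing quote, including "/".
token_pattern = re.escape('"bm-foo": "') + r'([^"]*)"'
tokens = re.findall(token_pattern, html)
print(tokens)  # ['AAQAAAAE/////4ytkgqq/oWI']
```

The trade-off is that the pattern is split into several pieces, but the literal JavaScript text can be pasted in unchanged.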
    
