
How to extract an IP address with a regex in the spaCy Matcher


import re
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
text = "Link down , Bypass (92.33.2222.88)  is not pinging"
doc = nlp(text)

pattern = [{"TEXT": {"REGEX": r"[\(][0-9]+[\.][0-9]+[\.][0-9]*[\.][0-9]*[\)]"}}]
matcher = Matcher(nlp.vocab)
matcher.add("ip", None, pattern)
matches = matcher(doc)
print(matches)
# => []  (no match found!)

The same regex works fine with re.findall on the raw string:

re.findall(r"[\(][0-9]+[\.][0-9]+[\.][0-9]*[\.][0-9]*[\)]", text)

Output: ['(92.33.2222.88)']


Solution

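  • First of all, 92.33.2222.88 is not a valid IPv4 address: the third octet, 2222, is greater than 255.

    If you do not care about IP validity, the next problem is that ( and ) are not part of the IP token. With the address corrected to 92.33.222.88, print([(t.text, t.pos_) for t in doc]) shows ('92.33.222.88', 'NUM') as a token of its own, with ( and ) as separate tokens. The Matcher applies the REGEX to one token's text at a time, so a pattern that includes ( and ) can never match.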

    If you plan to match any token of the form digits.digits.digits.digits, you may use

    pattern = [{"TEXT": {"REGEX": r"^\d+(?:\.\d+){3}$"}}]
    matcher.add("ip", None, pattern)
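
To see why the per-token pattern succeeds where the question's pattern fails, here is a minimal sketch that uses only the standard re module. The token list is written by hand and is only assumed to mirror how spaCy splits the sentence (parentheses as separate tokens); the sketch also assumes the REGEX operator behaves like re.search on each token's text.

```python
import re

# Hand-written token texts, assumed to mirror spaCy's tokenization of the
# question's sentence; no spaCy call is made here.
tokens = ["Link", "down", ",", "Bypass", "(", "92.33.2222.88", ")",
          "is", "not", "pinging"]

# The question's regex: needs "(" and ")" in the same string as the digits.
question_rx = re.compile(r"[\(][0-9]+[\.][0-9]+[\.][0-9]*[\.][0-9]*[\)]")
# The per-token regex: four dot-separated digit groups and nothing else.
token_rx = re.compile(r"^\d+(?:\.\d+){3}$")

question_hits = [t for t in tokens if question_rx.search(t)]
token_hits = [t for t in tokens if token_rx.search(t)]

print(question_hits)  # []
print(token_hits)     # ['92.33.2222.88']
```

No single token contains both the parentheses and the digits, so the first pattern finds nothing, while the anchored pattern matches the address token by itself.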
    

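    If you want to match only valid IPv4 strings, use the following (the doubled braces in {{3}} are escapes for str.format and produce the literal quantifier {3}):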

    octet_rx = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
    pattern= [ {"TEXT": {"REGEX": r"^{0}(?:\.{0}){{3}}$".format(octet_rx)}}]
    matcher.add("ip", None, pattern)
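
The stricter pattern can be checked outside spaCy before it is handed to the Matcher. The sketch below builds the same regex with the standard re module; is_ipv4_token is a hypothetical helper name introduced only for this illustration, and the sample strings are made up for the check.

```python
import re

octet_rx = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# str.format turns {0} into the octet pattern and {{3}} into a literal {3}.
ipv4_rx = re.compile(r"^{0}(?:\.{0}){{3}}$".format(octet_rx))


def is_ipv4_token(text):
    """Hypothetical helper: True if the whole string is four octets in 0-255."""
    return ipv4_rx.search(text) is not None


samples = ["92.33.222.88", "92.33.2222.88", "256.1.1.1",
           "0.0.0.0", "255.255.255.255"]
results = {s: is_ipv4_token(s) for s in samples}
for sample, ok in results.items():
    print(sample, ok)
```

The question's original address, 92.33.2222.88, is rejected here because no alternative in the octet pattern can consume four digits.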
    

    Complete test snippet:

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    
    matcher = Matcher(nlp.vocab)
    octet_rx = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
    pattern= [ {"TEXT": {"REGEX": r"^{0}(?:\.{0}){{3}}$".format(octet_rx)}}]
    # spaCy v2 signature; in spaCy v3 use matcher.add("ip", [pattern])
    matcher.add("ip", None, pattern)
    
    doc = nlp("Link down , Bypass (92.33.222.88)  is not pinging")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match_id, string_id, start, end, span.text)
    # => 1699727618213446713 ip 5 6 92.33.222.88
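
As an alternative to maintaining the octet regex by hand, validity can also be checked with Python's standard ipaddress module. This is a swapped-in technique, not part of the answer above: one possible arrangement would be to match tokens with the loose digits.digits.digits.digits pattern and then filter the matched span texts. The helper name is_valid_ipv4 is hypothetical.

```python
import ipaddress


def is_valid_ipv4(text):
    """Hypothetical helper: True if the standard library accepts text as IPv4."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


candidates = ["92.33.222.88", "92.33.2222.88", "256.1.1.1", "not-an-ip"]
valid = [c for c in candidates if is_valid_ipv4(c)]
print(valid)  # ['92.33.222.88']
```

Note that the standard library's rules and the regex may differ on edge cases such as octets with leading zeros, so the two checks are not guaranteed to be interchangeable.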