
IP Address regex vs Numbered List


I am using the Trellix DLP solution and have an IP address classification that blocks outgoing IP address information.

My regex is \b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b

However, this also blocks documents that contain four-level numbered lists, like:

 1.blah
    1.1 blah blah
           1.1.1 blah blah blah
                1.1.1.1 blah blah blah blah (DLP thinks this is an IP address and blocks the document)

Is there any way to avoid this?


Solution

  • Regexes sometimes feel like magic, but unfortunately they are not. A regex cannot distinguish between an IP address and a numbered footnote or article.

    You can try to add some sort of intelligence (so to speak) to the regex, but you will always end up with false positives or false negatives. That intelligence comes from inspecting the characters before or after the match.

    If you go this way, start with a regular expression that matches only valid IP addresses (your regex also matches 300.1.2.3, which is not valid).

    Also determine which IP addresses you are trying to block. If you only care about private IP addresses, a regex that matches only private IP addresses has a lower chance of producing a false positive.

    If you want to catch any IP address, then skip matches that have 4 or more spaces before them (or fewer than 4 spaces counted from the beginning of the line). This is an attempt to skip indented numbered titles.
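As a rough illustration of "valid addresses only", here is a minimal Python sketch. The names `OCTET`, `VALID_IPV4` and `find_valid_ipv4` are made up for this example, and Trellix DLP may use a different regex flavour, so treat it as a starting point rather than a drop-in rule. Note that it still matches 1.1.1.1, because that is a perfectly valid address.

```python
import re

# One octet in the range 0-255, without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Four octets separated by dots, delimited by word boundaries.
VALID_IPV4 = re.compile(rf"\b{OCTET}(?:\.{OCTET}){{3}}\b")


def find_valid_ipv4(text):
    """Return every substring of text that looks like a valid IPv4 address."""
    return VALID_IPV4.findall(text)


print(find_valid_ipv4("ping 192.168.1.1"))   # valid address
print(find_valid_ipv4("bad 300.1.2.3"))      # 300 is out of range
print(find_valid_ipv4("1.1.1.1 heading"))    # still a valid address
```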
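A sketch of the "private addresses only" idea, assuming the usual RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The names `PRIVATE_IPV4` and `find_private_ipv4` are hypothetical, and the syntax is Python's; adapt it to whatever your DLP tool accepts.

```python
import re

# One octet in the range 0-255, without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

PRIVATE_IPV4 = re.compile(
    r"\b(?:"
    rf"10\.{OCTET}\.{OCTET}"                    # 10.0.0.0/8
    r"|"
    rf"172\.(?:1[6-9]|2[0-9]|3[01])\.{OCTET}"   # 172.16.0.0/12
    r"|"
    rf"192\.168\.{OCTET}"                       # 192.168.0.0/16
    rf")\.{OCTET}\b"
)


def find_private_ipv4(text):
    """Return every private IPv4 address found in text."""
    return PRIVATE_IPV4.findall(text)


print(find_private_ipv4("                1.1.1.1 blah"))
print(find_private_ipv4("gw 192.168.0.1, db 172.16.5.4, app 10.1.2.3"))
```

With this pattern the 1.1.1.1 heading from the question is no longer matched, simply because it is not in a private range. A heading such as 10.1.1.1 would still be matched.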

    (?<!^\h)(?<!^\h\h)(?<!^\h\h\h)(?<!\h\h\h\h)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b
    

    Note: use the m (multiline) modifier. If you cannot specify flags, try the regex like this:

    (?m)(?<!^\h)(?<!^\h\h)(?<!^\h\h\h)(?<!\h\h\h\h)\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b
    

    NOTE: if your tool does not support \h, replace it with [\t\p{Zs}] or [ \t]
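To see the lookbehind guard in action outside the DLP tool, here is a small Python sketch. Python's `re` module has no `\h`, so it uses the `[ \t]` replacement and passes the multiline flag explicitly. `GUARDED_IP` and the sample text are made up for this illustration.

```python
import re

# Same idea as the regex above, with \h replaced by [ \t].
GUARDED_IP = re.compile(
    r"(?<!^[ \t])(?<!^[ \t]{2})(?<!^[ \t]{3})(?<![ \t]{4})"
    r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b",
    re.MULTILINE,
)

sample = (
    " 1.blah\n"
    "    1.1 blah blah\n"
    "           1.1.1 blah blah blah\n"
    "                1.1.1.1 blah blah blah blah\n"
    "The server at 10.20.30.40 is unreachable\n"
)

# Only the address inside the sentence should be reported.
matches = [m.group(0) for m in GUARDED_IP.finditer(sample)]
print(matches)
```

The indented 1.1.1.1 heading is skipped, while the address in the running sentence is still caught. A heading that starts at column 0 with no indentation would still match, and a real IP address placed after four spaces would be missed.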

    You have a very basic demo here. Please, keep on reading before using that for production :-)

    Of course, since a negative lookbehind usually cannot be variable length (except in some specific programming languages/tools), the more cases with extra spaces you add to the negative lookbehinds, the more likely you are to skip those numbered articles and avoid a false positive.

    Also the tool must support negative lookbehinds, of course.

    You could even combine both cases: a regex that matches 172.x.x.x and 192.x.x.x private addresses without any extra constraints (leaving out 10.x.x.x private addresses, because 10 is a low number that can easily show up in numbering), or any other valid IP address with extra constraints (like the spaces).
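One possible way to write that combination in Python is sketched below. It assumes the 172.16-31.x.x and 192.168.x.x private ranges for the unconstrained branch and reuses the indentation guard for everything else. All names (`HIGH_PRIVATE`, `GUARD`, `GUARDED_ANY`, `COMBINED`, `find_ips`) are hypothetical.

```python
import re

# One octet in the range 0-255, without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Branch 1: "high" private ranges, matched regardless of indentation.
HIGH_PRIVATE = (
    rf"\b(?:172\.(?:1[6-9]|2[0-9]|3[01])|192\.168)\.{OCTET}\.{OCTET}\b"
)

# Branch 2: any valid address, but only if it is not indented.
GUARD = r"(?<!^[ \t])(?<!^[ \t]{2})(?<!^[ \t]{3})(?<![ \t]{4})"
GUARDED_ANY = rf"{GUARD}\b{OCTET}(?:\.{OCTET}){{3}}\b"

COMBINED = re.compile(rf"(?:{HIGH_PRIVATE}|{GUARDED_ANY})", re.MULTILINE)


def find_ips(text):
    """Return the addresses matched by either branch."""
    return [m.group(0) for m in COMBINED.finditer(text)]


print(find_ips("    192.168.1.1 indented"))      # caught by branch 1
print(find_ips("                1.1.1.1 blah"))  # skipped by the guard
print(find_ips("dns is 8.8.8.8"))                # caught by branch 2
```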

    Have you detected any other false positives? Try to establish similar rules for them. For example, consider that you could match footnotes like these: <<See 1.2.3.4>> or *1.2.3.4. Try to add exceptions for IP-address-like strings that start with * or end with >>, for example.

    To sum up: "You cannot", but if you insist on trying...
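The two footnote exceptions mentioned above could be expressed with one negative lookbehind and one negative lookahead. This is only a sketch in Python syntax; `FOOTNOTE_SAFE_IP` is a made-up name, and the rules would need to be extended for whatever other markers your documents use.

```python
import re

FOOTNOTE_SAFE_IP = re.compile(
    r"(?<!\*)"                                         # not preceded by *
    r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b"  # the original regex
    r"(?!>>)"                                          # not followed by >>
)

for line in ["*1.2.3.4", "<<See 1.2.3.4>>", "host 1.2.3.4 is up"]:
    found = FOOTNOTE_SAFE_IP.search(line)
    print(repr(line), "->", found.group(0) if found else None)
```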

    • Add extra 'logic' to the regex according to the false positives you find

    • Check whether the tool lacks the regex features you need (like positive/negative lookbehinds)

    • The logic may be very specific to the document in your example. If there are other documents with different formats, a generic solution for every kind of document may not be possible

    • Even if you only have a single type of document to inspect, you may still get false positives/negatives, in which case go back to the first point and repeat
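Since this is an iterative process, it can help to keep a small list of samples and re-check every candidate regex against it before deploying. Below is a minimal Python sketch of such a check; the `evaluate` helper and the sample strings are invented for illustration and are not part of any DLP product.

```python
import re


def evaluate(pattern, should_match, should_not_match):
    """Return (false_negatives, false_positives) for a candidate pattern."""
    regex = re.compile(pattern, re.MULTILINE)
    false_negatives = [s for s in should_match if not regex.search(s)]
    false_positives = [s for s in should_not_match if regex.search(s)]
    return false_negatives, false_positives


ORIGINAL = r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b"
GUARDED = r"(?<!^[ \t])(?<!^[ \t]{2})(?<!^[ \t]{3})(?<![ \t]{4})" + ORIGINAL

real_ips = ["host 10.0.0.5 is up"]
headings = ["                1.1.1.1 blah blah blah blah"]

for name, pattern in [("original", ORIGINAL), ("guarded", GUARDED)]:
    fn, fp = evaluate(pattern, real_ips, headings)
    print(name, "false negatives:", fn, "false positives:", fp)
```

Each time a new false positive or false negative turns up in production, add it to the matching list and run the check again.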