Search code examples
c#sqlregexnegative-lookbehind

Match string only if a word in the string does not exist


I am trying to extract table names from a SQL script. I have the following string:

from db1.tableX tbx1 --quantity table
inner join tableY tbx2  on tbx1.xyz=tbx2.xyz

I don't want to match this string because tabley is not prefixed with a database name. The way I am trying to do this is by detecting the word "on" in the string before the database name.

My regex is matching tablex but also matches tbx1, which I dont want my expression to match.

I would like it to match only in this situation.

from db1.tableX tbx1 --quantity table
inner join db1.tableY tbx2 on tbx1.xyz = tbx2.xyz

My regex should give me tableX and tableY because they are prefixed by a database name.

Here is my expression:

(insert\s+into|from|inner\s+join|left\s+outer\s+join|join)\s+[\sa-zA-Z0-9_$#-]*\.\s*(?<table>[a-zA-Z0-9_]+)(?=\s+[a-zA-Z0-9_$#-]+)*

Solution

  • You can simplify your expression

    (?<=\b(from|join)\s+[a-zA-Z0-9_$#-]*\.\s*)[a-zA-Z0-9_]+
    

    It will yield the table name directly, without having to use named groups. I am using the pattern

    (?<=prefix)find
    

    This returns only the part "find" as match value. As prefix I use "from" or "join" followed by the database name the dot and possibly spaces.

    \b is the beginning or end of a word.

    The last part [a-zA-Z0-9_]+ is the table name.

    It makes no difference whether you have an inner join, outer join etc., so I dropped this distinction.

    Comments are ugly to detect, beacuse they can contain about anything and can appear about everywhere. Also you have two types of comments -- Until end of line and \* ... *\.

    You could try to detect comments and spaces like this

    (\s|--.*?$|\\\*.*?\*\\)+
    

    Note that the backspace and the star have to be escaped, so \* becomes \\\* and *\ becomes \*\\.

    $ denotes the end of line. The ? after .* ensures that not several comments are skipped at once, thus skipping meaningful text between them.

    This will work in most cases; however, comment like structures could be included within string literals: SELECT 'hello -- world' AS greeting FROM .... This is not a comment! Without a full syntax analysis of the SQL text you will not be able to cover all possible cases. But you cannot do this with Regex. Regex has its limits.