Search code examples
postgresqlpostgresql-9.1postgresql-9.3postgresql-9.5

Fetch table names using regexp


Need to fetch tablename from input using sql query.I know this can be done through regex but not sure how hence any help appreciated Input:

Field1 field2
a INSERT INTO test.table2{ name, age, city, id}SELECT name, age, city, id FROM table 1
b select from test1.table3
c select *from test2.table4
d select *from test2.table4(10)

Output:

Field1 field2
a test.table2
b test1.table3
c test2.table4

Solution

  • It will be extremely difficult to achieve with just a regular expression. If you really need to do that reliably, you can check out how pgpool-II re-used PostgreSQL's own parser for their needs - that's more of a C++ intensive task, rather than a SQL one.

    Here are some more examples at the wiki, some of which you can use directly. You could even get Python sqlparse to process the query text, then .get_identifiers(), all inside the db, in a PL/Python function.

    If you only wish to cover some very basic cases, you can try to target tokens following certain keywords:

    select field1
      ,unnest(
          regexp_matches(
             field2
            ,'(?:UPDATE|INTO|FROM|JOIN|USING|TABLE)(?:\s+ONLY)*\s+([[:alpha:]_]+[[:word:]]*|"[^"]+")'
            ,'gi'
          )
       ) AS spotted_table
    from table1;
    
    field1 spotted_table
    a table2
    a table1
    b table3
    c table4
    • The ?: in (?:something) makes the parentheses non-capturing, which is why it only reports matches from the last group that doesn't have the ?:
    • the g in 'gi' makes this report all matches, and i disables case-sensitivity
    • the identifier syntax is taken from the doc, except the $ oddity:

      SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($)

    That list of keywords is by no means exhaustive, and while it covers your examples, this regular expression is extremely brittle. See some obvious blind spots in this demo.

    It gets more and more tricky if you consider

    • identifiers can be double-quoted or unquoted, which makes them fold to lowercase
    • single or dollar-quoted string constants that contain SQL-looking text - useful if you're also considering dynamic SQL, otherwise it would get you false positives
    • technically, you can name a table "select from table7 join TABLE8 on waitaminute;" (this whole thing is the name of the table, not just the table7 part): demo
    • long, comma-separated lists of relations interleaved with sources that aren't relations, especially subqueries
    • end of line // and in-line/multi-line /*...*/ comments

    If the queries you're analysing operate on the same database you keep them in, you can correlate that with information_schema.tables. Note that there are 9 different things that can be in that same spot in the body of a query, and that's not counting the fact that table is just one form of one of those, a relation. You can also select from a view, materialized view, foreign table, partitioned table, all of which act like a table only to some extent. Some views are updatable, which means you can also insert/merge/update/delete from them as if you interacted with their underlying table directly.

    If you're trying to track effective interactions, you'd have to also track the rule and trigger systems, view and matview definitions as well as routine bodies and dependencies.