Search code examples
regexgreposx-yosemite

string extraction and dupes filtering mac OS X


I have a bunch of files with sql log. I'm looking out to extract all occurrences of the following pattern

SQL log has sql that looks something like this

sel *
from DB.T1;
update DB.T1;
delete from  DB.T2;
collect stats on 
DB.T3 index (a,b,c);
sel count(*) from Db.T1;
sel count(*) from db . T2;
sel count(*) from db.t2;

I want to scan through the files starting with logs_ and extract all the unique tables followed by the string DB./db./Db./dB. As you can see there is white space after db in few instances

The output I'm expecting is a deduped list T1, T2, T3

I'm on Mac OS X.

This is what I was able to get. I could not get past this

grep -o -i 'tb.*\Z' *logs_* | uniq

This gives empty results. I was using \Z as I want till the end of the string (and not end of the line)

Need help to build the right command.


Solution

  • Something like:

    grep -E -o -i 'DB ?\. ?[A-Z0-9$_]+' | cut -d . -f 2 | tr -d ' ' | sort -u
    

    \Z is not supported by grep, as far as I can tell. And in languages that do support it, it really means until the end of the string, not the end of some "word" in the string. So you need to explicitly match the table name in your grep.

    I use -E to use grep's extended regular expressions, which makes + and ? recognized as regex metacharacters. This isn't absolutely necessary; you could leave off the -E and use \+ and \? instead.

    The regular expression DB ?\. ?[A-Z0-9$_]+ (or DB \?\. \?[A-Z0-9$_]\+ if you leave off the -E flag) matches:

    the literal characters "DB" (case insensitively, because of -i)
    an optional space
    a literal "."
    an optional space
    one or more of any ascii letters, digits, $ or _ (the characters that can appear in an unquoted mysql table name)
    

    cut removes the database name, tr removes spaces before the table name, and sort -u returns just the unique table names. (uniq by itself does not do this; it only removes lines that are duplicates of the previous line, so only would have done what you want if you sorted first.)