On Cygwin (or Windows 7), match a word, look backwards, skip a word and print x number of comma-separated words


I have a headache from trying to understand squiggly awk and grep commands and have not gotten far. I have 100 thousand files, and I'm trying to extract a single line from each. A sample set of lines from one file is:

Revenue,876.08,,9361.000,444.000,333.000,222.000,111.00,485.000,"\t\t",178.90,9008.98
EV to Revenue,6.170,0.65,3.600,2.60,1.520,1.7,"\t\t",190.9,9008.98,80.9,87

(There are two tabs between the double quotes. I'm representing them with \t here; in the files they are actual tab characters.)

I'm trying to output just this line that starts with Revenue:

Revenue,444.000,333.000,222.000,111.00

This output line contains the first word of the line and the comma (i.e. Revenue,). To build the rest, find the two tabs enclosed in double quotes, look backwards, skip the first comma-separated number (assume that instead of a number there could be nothing, i.e. just an empty field between commas), and then output the 4 comma-separated numbers before it.

Is this doable with a simple grep, awk, cut or tr command on Cygwin that won't be a bear to run on 100K files? To clarify, there are 100K files that look very similar. Each file contains lots of lines (separated by newline/carriage return). Some lines contain the word Revenue at the start, some in the middle (as in the 2nd sample line I pasted above), etc. I'm only interested in those lines that start with Revenue followed by the comma and then the sequence above. Each file will contain that specific line.

To complete this kind of task (because working on 100K files would require this too), what would have to be added to sed to also print the name of the file currently being processed? I.e. output like this:

FileName1: Revenue,444.000,333.000,222.000,111.00 [I'll post the answer here if I find it]

Thank you!

Thanks to Sputnick for editing my question so it looks neat and thanks to shellter for responding. Ed, your solution looks really good. I'm testing it out and will reply back with info plus my understanding of how that regex works. Thank you very much for taking time to write this out!


Solution

  • Since this is just a simple substitution on a single line, sed is the most appropriate tool:

    $ sed -n -r 's/(^Revenue)(,[^,]*){3}(.*),[^,]*,"\t\t".*/\1\3/p' file
    Revenue,444.000,333.000,222.000,111.00
    

    but you can do the same in awk with gensub() (gawk) or match()/substr() or similar. It will run in the blink of an eye no matter what tool you use.
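
As a rough illustration of the awk route mentioned above, here is a minimal sketch that splits each line on commas instead of using a regex. It also prefixes each result with awk's built-in FILENAME variable, which covers the file-name part of the question. The function name extract_revenue is hypothetical, and the sketch assumes the marker field is exactly a double quote, two tabs and a double quote, with at least five fields between Revenue and the marker.

```shell
# Hypothetical helper (not from the original answer): prints
# "<file>: Revenue,<4 fields>" for every file passed as an argument.
extract_revenue() {
  awk -F',' '
    $1 == "Revenue" {                    # only lines whose first field is exactly Revenue
      for (i = 2; i <= NF; i++) {
        if ($i == "\"\t\t\"") {          # assumed marker: two tabs inside double quotes
          if (i - 5 >= 2) {
            # skip field i-1, then print the four fields before it
            printf "%s: %s,%s,%s,%s,%s\n", FILENAME, $1, $(i-5), $(i-4), $(i-3), $(i-2)
          }
          break                          # stop at the first marker on the line
        }
      }
    }
  ' "$@"
}
```

Calling extract_revenue file1 file2 prints one line per matching file. Unlike the sed version, this counts fields backwards from the marker, so it does not depend on exactly three fields following Revenue. With 100K files the argument list may exceed the command-line length limit, so the files would likely need to be passed in batches (for example via find or xargs).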