I have this file:
a=1 b=2 1234j12342134h d="a v" id="y_123456" something else
a=1 b=2 1234j123421341 d="a" something else
a=1 b=2 1234j123421342 d="a D v id=" id="y_123458" something else
a=1 b=2 1234j123421344 d="a v" something else
a=1 b=2 1234j123421346 d="a.a." id="y_123410" something else
and I want to retrieve only the lines that contain 'id=', and only the value for id and the 3rd column. The final product should be
1234j12342134h id="y_123456"
1234j123421342 id="y_123458"
1234j123421346 id="y_123410"
or
1234j12342134h "y_123456"
1234j123421342 "y_123458"
1234j123421346 "y_123410"
or even
1234j12342134h y_123456
1234j123421342 y_123458
1234j123421346 y_123410
I tried grep -o with patterns for the beginning and end of the expression, but that misses the first block of ids. I tried awk, but that fails for columns with spaces.
I got it working with Java, but it is slow as the log files get bigger.
How can I do it using bash utilities?
With GNU awk (for the 3rd arg to match()):
$ gawk 'match($0,/id="[^" ]+"/,a){ print $3, a[0] }' file
1234j12342134h id="y_123456"
1234j123421342 id="y_123458"
1234j123421346 id="y_123410"
With other awks:
$ awk 'match($0,/id="[^" ]+"/){ print $3, substr($0,RSTART,RLENGTH) }' file
1234j12342134h id="y_123456"
1234j123421342 id="y_123458"
1234j123421346 id="y_123410"
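Both commands above use [^" ]+ rather than [^"]+ inside the quotes, and the space matters for the third input line, where the d="..." value itself ends in id=. A small sketch of the difference, assuming (as in the sample data) that real id values never contain spaces; the variable names line, loose and strict are just for illustration:

```shell
# Third line of the sample input: the d="..." value contains a stray id=
line='a=1 b=2 1234j123421342 d="a D v id=" id="y_123458" something else'

# Without the space in the bracket expression, the match starts inside d="..."
loose=$(printf '%s\n' "$line" | awk 'match($0,/id="[^"]+"/){ print $3, substr($0,RSTART,RLENGTH) }')

# With the space excluded, that first candidate is rejected and the real id is found
strict=$(printf '%s\n' "$line" | awk 'match($0,/id="[^" ]+"/){ print $3, substr($0,RSTART,RLENGTH) }')

printf '%s\n%s\n' "$loose" "$strict"
```

The first line printed is 1234j123421342 id=" id=" (the wrong match), the second is 1234j123421342 id="y_123458".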
If you want to strip some of the leading/trailing chars, a couple of ways would be:
$ gawk 'match($0,/id="([^" ]+)"/,a){ print $3, a[1] }' file
1234j12342134h y_123456
1234j123421342 y_123458
1234j123421346 y_123410
or:
$ awk 'match($0,/id="[^" ]+"/){ print $3, substr($0,RSTART+4,RLENGTH-5) }' file
1234j12342134h y_123456
1234j123421342 y_123458
1234j123421346 y_123410
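The +4 and -5 in the last command come from the fixed characters around the value: id=" is 4 characters, and the closing quote is 1 more. A minimal sketch on a made-up one-line input (not taken from the file above) that prints RSTART, RLENGTH and the extracted value so the arithmetic can be checked by hand:

```shell
# Hypothetical input, chosen so the positions are easy to count:
# "x" is char 1, the space is char 2, so id= starts at char 3
out=$(printf '%s\n' 'x id="abc" y' |
  awk 'match($0,/id="[^" ]+"/){ print RSTART, RLENGTH, substr($0,RSTART+4,RLENGTH-5) }')

printf '%s\n' "$out"
```

This prints 3 8 abc: the match id="abc" starts at position 3 and is 8 characters long, so the value starts at 3+4=7 and is 8-5=3 characters long.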