
Integrate counter in awk and lower value of specific column


I am trying to do two things in a single awk command. First, I want to add a lowercased copy of Col1 as a new Col2, so that each row keeps the original value in Col1 and has its lowercase form in Col2. Second, I want to number the rows from 1 to N, with the count starting and ending at the sentence markers (<s> and </s>) in my data.

The data (tab-separated) currently looks like this:

<s>
He  PRP -
could   MD  -
tell    VB  -
she PRP -
was VBD -
teasing VBG -
him PRP -
.   .   .
</s>
<s>
He  PRP -
kept    VBD -
his PRP$    -
eyes    NNS -
closed  VBD -
,   ,   -
but CC  -
he  PRP -
could   MD  -
feel    VB  -
himself PRP -
smiling VBG -
.   .   .
</s>

The ideal output would be like this:

<s>
He  he  PRP 1
could   could   MD  2
tell    tell    VB  3
she     she PRP 4
was was VBD     5
teasing teasing VBG 6
him him PRP 7
.   .   .   8
</s>
<s>
He  he  PRP 1
kept    kept    VBD 2
his his PRP$    3
eyes    eyes    NNS 4
closed  closed  VBD 5
,   ,   ,   6
but but CC  7
he  he  PRP 8
could   could   MD  9
feel    feel    VB  10
himself     himself PRP 11
smiling smiling VBG 12
.   .   .   13
</s>

The two-step awk approach I have tried, which does not work, is this:

Step 1:

awk '!NF{$0=x}1' input | awk '{$1=$1; print "<s>\n" $0 "\t.\n</s>"}' RS=  FS='\n' OFS='\t-\n' > output

Here, I do not know how to turn the "-" into a counter.

Step 2 (which immediately gives me an error):

awk '{print $1 "\t" '$1 = tolower($1)' "\t" $2 "\t" $3}' input > output

Any suggestions on (1) how to get the lowercasing and the counter working, and (2) whether it is possible to combine these two steps?

Thank you in advance


Solution

  • I would do something like:

    $ awk 'BEGIN{FS=OFS="\t"} NF>1{$1=$1 FS tolower($1); $4=++f} NF==1{f=0}1' file
    <s>
    He he PRP - 1
    could could MD - 2
    tell tell VB - 3
    she she PRP - 4
    was was VBD - 5
    teasing teasing VBG - 6
    him him PRP - 7
    . . . . 8
    </s>
    <s>
    He he PRP - 1
    kept kept VBD - 2
    his his PRP$ - 3
    eyes eyes NNS - 4
    closed closed VBD - 5
    , , , - 6
    but but CC - 7
    he he PRP - 8
    could could MD - 9
    feel feel VB - 10
    himself himself PRP - 11
    smiling smiling VBG - 12
    . . . . 13
    </s>
    

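    That is, on every line that is not a <s> or </s> marker (NF>1), append the lowercased copy to $1 and store the incremented counter in $4; on marker lines (NF==1), reset the counter. Yes, I know this resets twice per sentence, once at <s> and once at </s>, but I cannot think of something neater right now. The final 1 is an always-true condition that prints each line.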

    Note that you are doing a lot of work with print and the delimiters. It is usually cleaner to just change the fields and let awk print the record automatically on a true condition (1), using the field separators you have set. A kind of model-view-controller : )
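
The command above keeps the "-" column and puts the counter after it, whereas the ideal output in the question has the counter in its place. A minimal sketch of that variant follows. It assumes the placeholder is always the third tab-separated field and that each sentence opens with a line that is exactly <s>; the function name number_tokens is hypothetical and only used for illustration.

```shell
# Sketch: lowercased copy of column 1, plus a per-sentence counter
# written over the third column. number_tokens is a hypothetical name.
number_tokens() {
  awk 'BEGIN { FS = OFS = "\t" }
       $0 == "<s>" { n = 0 }        # sentence start: reset the counter once
       NF > 1 {                     # token lines have several tab-separated fields
           $3 = ++n                 # overwrite the placeholder with the counter
           $1 = $1 FS tolower($1)   # append the lowercased copy after column 1
       }
       1' "$@"
}

# Small tab-separated sample in the format used in the question
printf '<s>\nHe\tPRP\t-\n.\t.\t.\n</s>\n<s>\nShe\tPRP\t-\n</s>\n' | number_tokens
```

On this sample the token lines should come out as He/he/PRP/1, then ./././2, and She/she/PRP/1 after the second <s>, all tab-separated. Assigning $3 before touching $1 matters here: once $1 contains an embedded tab, the field numbers awk holds no longer match the columns you see in the output.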