
How to combine all the words of a sentence extracted with a regex?


I would like to use a Linux command, if possible, to combine all the words that start with a capital letter, excluding the one at the beginning of the line. The goal is to create edges between these words. For example:

My friend John met Beatrice and Lucio.

The result I would like to have should be:

  • John, Beatrice
  • John, Lucio
  • Beatrice, Lucio

I managed to extract all the words that start with a capital letter, excluding the word at the beginning of the line, using grep with a regex. The command is:

    cat gov.json | grep -oP "\b([A-Z][a-z']*)(\s[A-Z][a-z']*)*\b | ^(\s*.*?\s).*" > nodes.csv

This writes the nodes to the file one per line, i.e.:

  • John
  • Beatrice
  • Lucio

The goal now is to create all possible pairs of the names that start with a capital letter and write them to a file. Any suggestions?


Solution

  • Here is another awk script for the task. It builds the output while reading the input, and it needs GNU awk (gawk) because it relies on FPAT.

    script.awk, allowing duplicate names:

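    # FPAT (gawk only): a field is a space followed by a capitalized word,
    # so the first word of a line, which has no leading space, is skipped.
    BEGIN {FPAT = " [[:upper:]][[:alpha:]]+"}
    {
        for (i = 1; i <= NF; i++) {
            # Pair every earlier name with the current one, in input order
            # (a "for (name in namesArr)" loop has no guaranteed order).
            for (j = 0; j < namesCount; j++) {
                namePairsArr[pairsCount++] = namesArr[j] $i;
            }
            namesArr[namesCount++] = $i;
        }
    }
    END {for (i = 0; i < pairsCount; i++) print namePairsArr[i];}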
    

    If duplicate names are not allowed, script.awk is:

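    # FPAT (gawk only): a field is a space followed by a capitalized word,
    # so the first word of a line, which has no leading space, is skipped.
    BEGIN {FPAT = " [[:upper:]][[:alpha:]]+"}
    {
        for (i = 1; i <= NF; i++) {
            # Skip names that were already recorded.
            if (nameSeenArr[$i]) continue;
            nameSeenArr[$i] = 1;
            # Pair every earlier name with the current one, in input order
            # (a "for (name in namesArr)" loop has no guaranteed order).
            for (j = 0; j < namesCount; j++) {
                namePairsArr[pairsCount++] = namesArr[j] $i;
            }
            namesArr[namesCount++] = $i;
        }
    }
    END {for (i = 0; i < pairsCount; i++) print namePairsArr[i];}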
    

    Run it with GNU awk, which provides FPAT:

    gawk -f script.awk gov.json > nodes.csv
    

    sample input file:

    My friend John met Beatrice and Lucio
    My friend Johna met Beatricea and Lucioa
    

    sample output:

     John Beatrice
     John Lucio
     Beatrice Lucio
     John Johna
     Beatrice Johna
     Lucio Johna
     John Beatricea
     Beatrice Beatricea
     Lucio Beatricea
     Johna Beatricea
     John Lucioa
     Beatrice Lucioa
     Lucio Lucioa
     Johna Lucioa
     Beatricea Lucioa
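
Note that the scripts above accumulate names across lines, so names from different sentences are also paired. If edges should only connect names that occur in the same line, as in the question's example, something along these lines might work. This is only a sketch under a few assumptions: `name_pairs` is a hypothetical helper name, a name is taken to be a capital letter followed by at least one more ASCII letter, and the pairs are printed comma-separated to match the output asked for in the question. It avoids FPAT, so it should not require gawk.

```shell
# Hypothetical helper: print every pair of capitalized words that occur on
# the same input line, ignoring the first word of each line.
name_pairs() {
    awk '
    {
        n = 0
        rest = $0
        # Drop the first word so a sentence-initial capital is ignored.
        sub(/^[ \t]*[^ \t]+/, "", rest)
        # Collect capitalized words left to right; each must follow a
        # non-letter so that a capital inside a word is not picked up.
        while (match(rest, /[^A-Za-z][A-Z][A-Za-z]+/)) {
            names[++n] = substr(rest, RSTART + 1, RLENGTH - 1)
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Emit each pair once, in order of appearance.
        for (i = 1; i < n; i++)
            for (j = i + 1; j <= n; j++)
                print names[i] ", " names[j]
    }' "$@"
}

printf 'My friend John met Beatrice and Lucio.\n' | name_pairs
```

For the question's file this could be called as `name_pairs gov.json > edges.csv`, where `edges.csv` is just an example output name. Because the counter is reset for every line, no pairs are created between names from different lines.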