Tags: awk, sed, delimiter

AWK: add a semicolon between a line of non-capital letters and a line of capitals, if there is no semicolon already


I have a converted dictionary that I use in Pure Data. Each entry consists of three things: the word, how to pronounce it, and a semicolon to finish. In the converted dictionary some semicolons are missing, so I want AWK to find the gaps and insert the semicolons for me. I have used delimiters before, but this case is difficult for me, so any help will be appreciated.

See the text file below: the first three entries are good, the last three are wrong because the semicolon at the end is missing. I think the AWK delimiter should be the boundary between non-capital letters and capital letters, and the action should be to insert a semicolon if there is none already. How can I express this in AWK code?

ELFKIN
Elf
kin;
ELFLAND
Elf
land
;
ELFLOCK
Elf
lock
;
ELGIN
El
gin
ELICIT
E
lic
it
ELICIT
E
lic
it

I have used some delimiters before, but I do not know how to specify a boundary "between" two kinds of lines in AWK. The delimiter is the transition from non-capital letters to capital letters, and a semicolon should be put there. So the code would look something like awk 'length($0)>1 && line with all capitals: put a semicolon before this line' or awk 'line with non-capitals: if the next line is capitals, put a semicolon after this line'. I have tried this:

awk 'length($0>1) && /[:^, upper:]/{l=l";"}NR>1{print l}{l=$0}END{print l}' file2

This does not work well.

Or am I heading in the wrong direction?


Solution

  • I would use GNU AWK for this task in the following way. Let the content of file.txt be

    ELFKIN
    Elf
    kin;
    ELFLAND
    Elf
    land
    ;
    ELFLOCK
    Elf
    lock
    ;
    ELGIN
    El
    gin
    ELICIT
    E
    lic
    it
    ELICIT
    E
    lic
    it
    

    then

    awk 'BEGIN{RS=""}{print gensub(/([[:lower:]])\n([[:upper:]])/,"\\1;\n\\2","g")}' file.txt
    

    gives output

    ELFKIN
    Elf
    kin;
    ELFLAND
    Elf
    land
    ;
    ELFLOCK
    Elf
    lock
    ;
    ELGIN
    El
    gin;
    ELICIT
    E
    lic
    it;
    ELICIT
    E
    lic
    it
    

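    Explanation: setting RS to the empty string engages paragraph mode; as file.txt contains no blank lines, the whole file is treated as a single record. Then I use the gensub string function (a GNU AWK extension, applied to $0 here because no target is given) to replace all (g as in globally) occurrences of a lowercase letter followed by a newline followed by an uppercase letter with the first of those letters, a semicolon, a newline, and the second letter. Note that the final it is left without a semicolon, because no uppercase line follows it.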

    (tested in GNU Awk 5.1.0)
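
For awk implementations without gensub, the same rule can be sketched line by line: hold each line for one step, and append a semicolon to it when it ends in a lowercase letter and the next line starts with a capital. This is only a minimal sketch, not the method from the answer above. The function name add_semicolons is hypothetical, the character ranges assume ASCII letters, and the END block rests on the assumption that the last entry should be terminated too.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): terminate
# dictionary entries with a semicolon using plain line-by-line awk.
# Assumes ASCII letters; [[:lower:]] / [[:upper:]] can be substituted
# where the awk in use supports POSIX character classes.
add_semicolons() {
    awk '
        # Current line starts with a capital and the held line ends in a
        # lowercase letter: the previous entry lacks its semicolon.
        NR > 1 && /^[A-Z]/ && prev ~ /[a-z]$/ { prev = prev ";" }
        NR > 1 { print prev }   # emit the held line, possibly patched
        { prev = $0 }           # hold the current line for one step
        END {
            # Assumption: the final entry should be terminated as well.
            if (NR > 0) {
                if (prev ~ /[a-z]$/) prev = prev ";"
                print prev
            }
        }
    ' "$@"
}
```

With the function defined, a call along the lines of add_semicolons file.txt > fixed.txt would write the repaired dictionary to a new file. Because the test looks at the end of the held line, a single-capital syllable such as E directly after ELICIT does not trigger a semicolon.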