How to parse multiple strings with sed and a regular expression


I'm trying to extract information from a bunch of ovpn files in order to update my server list. I found a way to extract the information with sed and it all works, but I'm stuck when I try to extract the data needed to build the directory structure.

What I have is files inside a folder, for example:

ch101.tcp443.ovpn
ch101.udp1194.ovpn
ch102.nordvpn.com.tcp443.ovpn
ch102.nordvpn.com.udp1194.ovpn
ch102.tcp443.ovpn
ch102.udp1194.ovpn

Now I want to extract information to build the directory structure, so I wrote a regex that extracts everything I need.

It works on all the files I have and pulls the data from the file name. From "ch101.udp1194.ovpn" it extracts "ch101" and "udp" into groups 1 and 2.

But when I try to make it work with sed, it fails. I tried to break it down into steps, but even with only the first group, looking for "ch101", it doesn't work:

echo 'ch101.udp1194.ovpn' | sed -rn 's/^([a-z\-]+\d{1,4})/\1/p'

What did I miss? I'm no sed expert, but I have found similar expressions that work, and this one doesn't.

My final goal is to create a directory and store in it all the information I need, so:

for i in /opt/ovpn/*.ovpn ; do 
    [ -f "$i" ] || continue
    FIRST_ARG=$(echo $i | sed ...) # extract ch101
    SECOND_ARG=$(echo $i | sed ...) # extract udp
    FIRST_ARG_TEXT=$(echo $FIRST_ARG | sed ...) # extract text from FIRST_ARG
    FIRST_ARG_NUM=$(echo $FIRST_ARG | sed ...) # extract num from FIRST_ARG
    FIRST_ARG_NUM_4FORMAT=$(printf '%04i\n' $FIRST_ARG_NUM) # 4 digits for FIRST_ARG_NUM

    mkdir /opt/somedir/$FIRST_ARG_TEXT$FIRST_ARG_NUM_4FORMAT$SECOND_ARG
    cp ........
done

So from ch101.udp1194.ovpn I'll end up with a directory named

ch0101udp

Maybe it's not the best or cleanest way, but it seems simple to me and it's the most my knowledge can achieve.

Any idea or question is welcome.

P.S. I'm on BusyBox 1.30, so this must be sh, not bash.


Solution

  • A couple of problems: sed does not support many of the character class escape sequences such as \d, so you need to write them as [0-9].

    Also, you're replacing the matched sequence with itself, so the line would come out unchanged. You need a trailing .* to consume the rest of the line so that only the captured group is left.

    Something like this would work for your first group:

    sed -En 's/^([a-z\-]+[0-9]{1,4}).*/\1/p'
    
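    The same idea extends to the other pieces the loop in the question needs. The following is a minimal sketch, assuming every file name ends in a protocol field followed by a port and ".ovpn"; the variable names (f, first, second, text, num) are made up for illustration:

    ```shell
    f='ch101.udp1194.ovpn'

    # Server label: leading letters/hyphens followed by 1-4 digits.
    first=$(echo "$f" | sed -En 's/^([a-z-]+[0-9]{1,4}).*/\1/p')

    # Protocol: the letters of the field just before ".ovpn" (assumed layout).
    second=$(echo "$f" | sed -En 's/^.*\.([a-z]+)[0-9]+\.ovpn$/\1/p')

    # Split the label into its text and numeric parts.
    text=$(echo "$first" | sed -E 's/[0-9]+$//')
    num=$(echo "$first" | sed -E 's/^[a-z-]+//')

    printf '%s%04d%s\n' "$text" "$num" "$second"   # prints ch0101udp
    ```

    One caveat with this approach: if the numeric part ever had a leading zero (for example "008"), the shell's printf would try to read it as octal, so the sketch only fits numbers like the ones in the sample files.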

    That said, a tool built for field splitting makes this much easier. I'm not sure whether it's available in your BusyBox build, but awk could do everything you're looking for:

    echo 'ch101.udp1194.ovpn' | awk -F. '{a=$1; b=$(NF-1); gsub(/[0-9]/, "", a); gsub(/[0-9]/, "", b); gsub(/^[a-z-]+/, "", $1); printf("%s%04d%s\n", a, $1, b)}'
    

    Output from your sample data:

    ch0101tcp
    ch0101udp
    ch0102tcp
    ch0102udp
    ch0102tcp
    ch0102udp
    

    An explanation:

    awk -F. '{
        a=$1;                          # assign the first field to a
        b=$(NF-1);                     # assign the second last field to b
        gsub(/[0-9]/, "", a);          # remove numbers from a
        gsub(/[0-9]/, "", b);          # remove numbers from b
        gsub(/^[a-z-]+/, "", $1);      # remove letters from the first field
        printf("%s%04d%s\n", a, $1, b) # output in desired format, one name per line
    }'
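
    To tie this back to the loop in the question, the awk program can be wrapped in a small shell function. This is only a sketch under a few assumptions: dir_name is a hypothetical helper name, SRC and DEST are illustrative variables defaulting to the folders mentioned in the question, and the copy step stays elided as it was in the original. Note that the function is given the base name, since a leading path like /opt/ovpn/ would otherwise end up in the first field.

    ```shell
    # dir_name (hypothetical helper): prints e.g. "ch0101udp" for "ch101.udp1194.ovpn".
    dir_name() {
        echo "$1" | awk -F. '{
            a = $1; b = $(NF-1)
            gsub(/[0-9]/, "", a)        # server text, e.g. "ch"
            gsub(/[0-9]/, "", b)        # protocol, e.g. "udp"
            gsub(/^[a-z-]+/, "", $1)    # server number, e.g. "101"
            printf("%s%04d%s\n", a, $1, b)
        }'
    }

    SRC=${SRC:-/opt/ovpn}        # source folder from the question
    DEST=${DEST:-/opt/somedir}   # target folder from the question

    for i in "$SRC"/*.ovpn; do
        [ -f "$i" ] || continue
        d=$DEST/$(dir_name "$(basename "$i")")
        mkdir -p "$d"
        # the copy step elided in the question would go here
    done
    ```

    Using mkdir -p keeps the loop from failing when two files (for example ch102.tcp443.ovpn and ch102.nordvpn.com.tcp443.ovpn) map to the same directory name.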