
Trying to extract a substring and version number from a filename using bash


I'm currently trying to extract a substring and version number from a filename using bash.

There are two formats the filenames will be in:

example-substring-1.1.0.tgz
example-substring-1.1.0-branch-name.tgz

For the first scenario I was able to extract the version number using sed like so:

echo example-substring-1.1.0.tgz | sed "s/.*-\(.*\)\.[a-zA-Z0-9]\{3\}$/\1/"

However this won't work for the second scenario.

Eventually I would like to create a script that will store the first substring and version in an associative array like below.

example_array["example-substring"]="1.1.0"
example_array["example-substring"]="1.1.0-branch-name"

This is proving tricky, however, as I can't find an approach that works for both scenarios. And where the version includes the branch name, I can't know beforehand how many words the branch name will consist of.

I think parameter expansion may be the way to go, but I wasn't able to get it to output what I want.


Solution

  • With Perl

    echo "example-substring-1.1.0-branch-name.tgz" |
        perl -wne'print join " ", /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/'
    

    This prints two words:

    example-substring 1.1.0-branch-name
    

    This output is what the one-liner returns to the shell script that (I presume) invokes it, and the script can then build whatever structures it needs. I also tested it without the branch name and with a few other variations of the input string.

    Since example-substring may contain digits, and so may the branch name, the regex places no restrictions on them: the leading part and the (possible) trailing part are matched simply by .+ and .*.

    That means the version number needs a more specific pattern, and I've assumed that it always consists of three numbers separated by dots. I've also assumed that the rest of the string is fixed, namely the file extension .tgz. Both assumptions can be relaxed somewhat if needed.


    One can directly read a list (key value key value...) into an associative array. Note that this key-value list form of compound assignment needs bash 5.1 or later:

    #!/bin/bash
    
    eval declare -A ver=( $( 
        echo "example-substring-1.1.0-branch-name.tgz" | 
        perl -wnE'say join " ", /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/' ))
    
    echo "${ver[example-substring]}"
    

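    Or it may be more suitable to assign to variables first:

    declare -A ver
    str="example-substring-1.1.0-branch-name.tgz"
    
    # The trailing backslash keeps -- "$str" on the perl command line;
    # .* (not .+) after the version lets names without a branch match too
    read -r str val <<< "$( 
    perl -wE'say join " ", $ARGV[0] =~ /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/' \
        -- "$str" )"
    
    ver[$str]=$val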
    

    or even just using positional parameters:

    # The trailing backslash keeps -- "$str" on the perl command line
    set -- $(
        perl -wE'say join " ", $ARGV[0] =~ /(.+)\-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz/' \
            -- "$str" )
    
    ver[$1]=$2
    

    There are of course other ways to pass arguments to a Perl script or a command-line program ("one-liner"), and other ways to take its output in bash.

    Let me know if this Perl code needs commentary.
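For comparison, the same pattern can be tried in bash alone with the [[ =~ ]] operator and BASH_REMATCH. This is only a sketch under the same assumptions as the regex above (a three-part numeric version and a .tgz extension). The helper parse_name and the globals name and version are hypothetical names introduced here, not part of the answer.

```shell
#!/bin/bash

# parse_name (hypothetical helper): split "<name>-<x.y.z>[-branch].tgz".
# On success it sets the globals name and version; otherwise it returns 1.
parse_name() {
    # Keep the regex in a variable so bash treats it as a pattern, not a quoted string
    local re='^(.+)-([0-9]+\.[0-9]+\.[0-9]+.*)\.tgz$'
    [[ $1 =~ $re ]] || return 1
    name=${BASH_REMATCH[1]}
    version=${BASH_REMATCH[2]}
}

for f in example-substring-1.1.0.tgz example-substring-1.1.0-branch-name.tgz; do
    if parse_name "$f"; then
        printf '%s %s\n' "$name" "$version"
    fi
done
```

For the two filename formats in the question this prints example-substring 1.1.0 and example-substring 1.1.0-branch-name, one pair per line.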