regex bash date iso8601 capture-group

How can I distinguish the n-th matched pattern of the m-th capture group of a regular expression from earlier or later matches in bash?


This question pertains to regular expressions that can be processed by bash.

I have a regular expression which finds in a text all matches of a date in the notation d.m.yyyy or dd.m.yyyy or d.mm.yyyy or dd.mm.yyyy if it happens to be between tabs or at least two white spaces:

(?<=\t|\s{2,})(\d{1,2}\.\d{1,2}\.\d{4})(?=\t|\s{2,})

How can I replace all the findings of this (let's assume first) capture group by a date formatted according to ISO 8601, i.e. in the notation yyyy-mm-dd?

Since the delimiting tabs or double (or longer) runs of spaces are in lookaround conditions, they do not belong to my capture group. They should remain as they were in the original string.

The problem decomposes to:

1. How do I address the n-th match of $1?

2. How do I rearrange the three dot-separated components in this case?


Solution

  • If you want to process it with bash, would you please try the following:

    #!/bin/bash
    
    str=$'foo\t27.6.2021  bar'                      # example of the input line
    pat=$'^(.*)(\t| {2,})([0-9]{1,2})\.([0-9]{1,2})\.([0-9]{4})(\t| {2,})(.*)$'
    if [[ $str =~ $pat ]]; then
        a=("${BASH_REMATCH[@]:1}")                  # copy the captured substrings into array "a", skipping "${BASH_REMATCH[0]}" (entire match)
        y=${a[4]}; a[4]=${a[2]}; a[2]=$y            # swap year and day
        for i in 2 3 4; do a[i]=$((10#${a[i]})); done   # force base 10 so that "08" and "09" are not read as octal
        printf "%s%s%04d-%02d-%02d%s%s\n" "${a[@]}" # print the formatted result
    fi
    

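    As noted in the comments, bash's `=~` operator uses POSIX extended regular expressions, which do not support lookarounds. Instead, you need to capture the delimiters and the rest of the line as groups and print them back unchanged. Note that this pattern rewrites only one date per line: because the leading `(.*)` is greedy, it is typically the last one.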
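If a line can contain several dates, one possible approach is to drop the `^...$` anchors and consume the string from left to right in a loop. The following is only a sketch: the function name `iso_dates` and its variable names are made up for illustration, and it assumes that a date counts only when it has a tab or at least two spaces on both sides, as in the question. The trailing delimiter is pushed back onto the remaining text so that it can serve as the leading delimiter of the next date, which roughly imitates the non-consuming lookahead.

```bash
#!/bin/bash

# Hypothetical helper: convert every delimited d.m.yyyy date in $1 to yyyy-mm-dd.
iso_dates() {
    local rest=$1 out="" m prefix
    # Groups: 1 = leading delimiter, 2 = day, 3 = month, 4 = year, 5 = trailing delimiter
    local pat=$'(\t| {2,})([0-9]{1,2})\\.([0-9]{1,2})\\.([0-9]{4})(\t| {2,})'
    while [[ $rest =~ $pat ]]; do
        m=${BASH_REMATCH[0]}          # leftmost match, including both delimiters
        prefix=${rest%%"$m"*}         # text in front of that match
        # 10# forces base 10 so that "08" and "09" are not rejected as octal
        printf -v out '%s%s%s%04d-%02d-%02d' "$out" "$prefix" "${BASH_REMATCH[1]}" \
            "$((10#${BASH_REMATCH[4]}))" "$((10#${BASH_REMATCH[3]}))" "$((10#${BASH_REMATCH[2]}))"
        # keep the trailing delimiter in the unprocessed part for the next round
        rest=${BASH_REMATCH[5]}${rest#*"$m"}
    done
    printf '%s\n' "$out$rest"
}

# Example call with two dates on one line
iso_dates $'foo\t27.6.2021  bar\t08.09.2021\tbaz'
```

Each pass through the loop handles exactly one match, so a counter incremented inside the loop would be one way to treat the n-th match differently from the others.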