Parse the "Number of files" line of rsync stats with Bash only


I need to parse rsync stats like:

Number of files: 265 (reg: 189, dir: 10, link: 66)
Number of created files: 18
Number of deleted files: 4
Number of regular files transferred: 24
Total file size: 121.67K bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 9.15K
Total bytes received: 33

sent 9.15K bytes received 33 bytes 18.37K bytes/sec
total size is 121.67K speedup is 13.24

Parsing each line is rather easy using commands like this:

$(echo "$rawstats" | grep -Po '(?<=Number of files: ).*')

Now I need to parse the first line. I found a Perl solution here: Perl Parse rsync Output
but I don't want to rely on Perl, and Dan Lowe's answer won't work in all cases, since what's inside the parentheses could be any combination of reg:, dir:, link: (and possibly others I don't know about). For example:

265 (reg: 189, dir: 10, link: 66)
265 (dir: 10, link: 66)
265 (link: 66)

So I'm trying to build the right regex to pass to grep -P. So far I have:

(\d+) \((?:([a-z]+): (\d+)(?:, )?)*\)?

It matches like this:

[0] is a null string
[1]=265
[2]=link
[3]=66

The result I expected:

[1]=265
[2]=reg
[3]=189
[4]=dir
[5]=10
[6]=link
[7]=66

I can't see how to improve my result. An even better result would be a Bash associative array like:

[reg]=189
[dir]=10
[link]=66

Thanks for your help


Solution

  • Pure Bash with Grep

    I see no reason to avoid Perl, which is quite convenient for text parsing. But here is a Bash implementation (grep is used only to select the line) that builds an associative array stats from the rawstats variable containing the rsync stats output:

    declare -A stats=()
    
    label_regex='Number of files:'
    num_of_files_line=$(grep -E "$label_regex" <<< "$rawstats")
    
    regex="$label_regex ([0-9]+)"
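    if [[ $num_of_files_line =~ $regex ]]; then
        stats['total']=${BASH_REMATCH[1]}
        # Drop the label and total so "files: 265" is not parsed as a key/value pair.
        num_of_files_line=${num_of_files_line#*"${BASH_REMATCH[0]}"}
    fi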
    
    while read -r k v; do stats["$k"]="$v"; done < <( \
        regex='([a-z]+): ([0-9]+)'
        while [[ $num_of_files_line =~ $regex ]]; do
            match=${BASH_REMATCH[0]}
            printf "%s %s\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
            num_of_files_line=${num_of_files_line#*"$match"}
        done
    )
    

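    Process substitution (<( ... )) keeps the while read loop in the current shell, so the assignments to stats remain visible after the loop. Feeding the loop through a pipe would run it in a subshell, and the array filled there would be lost when that subshell exits.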
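
The difference between a pipe and process substitution can be seen in a small sketch. The input string and the array names via_pipe and via_procsub are made up for illustration, and the sketch assumes Bash with the lastpipe option off (the default in scripts):

```bash
# Hypothetical sample data, not real rsync output.
input=$'reg 189\ndir 10'

declare -A via_pipe=()
# The loop is the last stage of a pipeline, so it runs in a subshell
# (lastpipe is off by default); its assignments are discarded.
printf '%s\n' "$input" | while read -r k v; do via_pipe["$k"]="$v"; done

declare -A via_procsub=()
# The loop runs in the current shell; only the producer runs in a subprocess.
while read -r k v; do via_procsub["$k"]="$v"; done < <(printf '%s\n' "$input")

echo "pipe: ${#via_pipe[@]} entries, process substitution: ${#via_procsub[@]} entries"
```

In Bash this should report zero entries for the piped loop and two for the process-substitution loop.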

    Perl

    And here is a similar Perl implementation which I would probably use:

    declare -A stats=()
    while read -r k v; do stats["$k"]="$v"; done < <( \
      printf "%s\n" "$rawstats" | \
        perl -ne 's/.*?Number of files: (\d+)// or next; print "total $1\n"; print "$1 $2\n" while (/([a-z]+): (\d+)/g)' \
    )
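
If neither grep nor Perl is wanted, the same idea can be sketched with Bash builtins only. A single regex match keeps only the last iteration of a repeated capture group, which is why the pairs are consumed one at a time in a loop. The variable names line, rest, total_re and pair_re are illustrative, and the sample input is a shortened copy of the stats from the question:

```bash
# Sample input from the question; in practice rawstats would hold the
# captured output of rsync --stats.
rawstats='Number of files: 265 (reg: 189, dir: 10, link: 66)
Number of created files: 18
Number of deleted files: 4'

declare -A stats=()

# Pick out the "Number of files:" line with a glob match instead of grep.
line=''
while IFS= read -r candidate; do
    if [[ $candidate == 'Number of files:'* ]]; then
        line=$candidate
        break
    fi
done <<< "$rawstats"

total_re='^Number of files: ([0-9]+)'
if [[ $line =~ $total_re ]]; then
    stats[total]=${BASH_REMATCH[1]}
    # Continue after the total so "files: 265" is not mistaken for a pair.
    rest=${line#"${BASH_REMATCH[0]}"}

    pair_re='([a-z]+): ([0-9]+)'
    while [[ $rest =~ $pair_re ]]; do
        stats["${BASH_REMATCH[1]}"]=${BASH_REMATCH[2]}
        # Drop everything up to and including the pair just consumed.
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
fi

for key in total reg dir link; do
    printf '%s=%s\n' "$key" "${stats[$key]:-}"
done
```

Because the loops run in the current shell, no process substitution is needed here. Categories missing from the parentheses are simply absent from stats.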