bash awk sed grep cut

Search multiple files in subfolders and print only numerical values


I have a question about a one-line command that searches multiple files in subfolders for TWO patterns and prints only the numerical values.

Example:

Current directory: $HOME/work/A/ (where to run script)

Subfolders containing data: $HOME/work/A/trial1, trial2, trial3..

Input (each data file): eg. trial1/trial1.out

[text]
..
cutoff = 100
..
[text] 
..
! total energy= -23.4387 Ry
.. 

Need output: /A/totalenergy.txt

100   -23.4387
110   -23.2523
120   -24.0134
...

My initial plan was to use 'grep' to search each file for the patterns 'cutoff =' and '! ' to find the two desired lines, and then print only the cutoff number and the energy number.

However, so far I can only search for one pattern, '! total energy' (the more important one), and use grep | tr | cut > file to extract just the energy.

grep -e "\!" */*.out | tr -s ' ' | cut -f5 -d' ' >totalenergy.txt

Basically, I grep for '!' in the *.out files of every subfolder, squeeze repeated spaces, and keep only the numerical field.

The line that contains '! total energy' after using grep looks like this

60/C.scf_60.out:!    total energy              =     -22.78085574 Ry

So, if I can somehow also get the first number out of this line, then together with what I already have I can achieve my goal:

60  -22.78085574

I am trying to do this with a one-line command.

Thanks!


Solution

  • sed -rn -e 's/cutoff[ =]+([0-9]+)/\1/p' -e 's/.*total energy[= ]+([0-9.-]+).*/\1:/p' */*.out | tr '\n:' ' \n'
    

    Explanation:

    sed -rn -e <cmd1> -e <cmd2> */*.out
    

    I used sed instead of grep because I needed a delimiter (I chose :) to mark the end of every record (cutoff total_energy).

    sed options

    -r # short form of --regexp-extended
    

    Needed for the syntax I used, in particular ([0-9.-]+): I didn't need to escape the parentheses or the +, and I could match . and - inside the bracket expression without problems.
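
    As a small illustration of what -r changes, the sketch below runs the cutoff pattern once with -r and once in basic syntax, where the group and the + have to be backslash-escaped. It assumes GNU sed (\+ in basic syntax is a GNU extension), and the sample line is simply the one from the question.

```shell
line='cutoff = 100'

# Extended syntax (-r): ( ) and + act as operators exactly as written.
ere=$(printf '%s\n' "$line" | sed -rn 's/cutoff[ =]+([0-9]+)/\1/p')

# Basic syntax (no -r): the same operators need a backslash (GNU sed assumed).
bre=$(printf '%s\n' "$line" | sed -n 's/cutoff[ =]\+\([0-9]\+\)/\1/p')

printf 'with -r:    %s\nwithout -r: %s\n' "$ere" "$bre"
```

    Both variants should extract the same number; -r only makes the expression easier to read.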

    -n # short form of --quiet or --silent
    

    It disables the automatic printing of the pattern space, so nothing is printed unless we explicitly ask for it (with the p flag).

    -e # short of --expression
    

    Useful for combining multiple commands.

    pattern and replacement

    cutoff[ =]+([0-9]+)/\1
    .*total energy[= ]+([0-9.-]+).*/\1:
    

    In each pattern, the parenthesized group captures the value I need, and \1 in the replacement refers back to it.

    Notice that I appended a : character after the value matched for total energy. As I said, it helps me separate the records with tr.
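
    To make the role of the : marker concrete, here is a minimal sketch that runs only the sed stage on two made-up files. The temporary directory, the file names and the energy values are hypothetical stand-ins that mirror the layout in the question; GNU sed is assumed for -r.

```shell
# Hypothetical sample data mirroring the question's layout (trialN/trialN.out).
workdir=$(mktemp -d)
mkdir "$workdir/trial1" "$workdir/trial2"
printf '[text]\ncutoff = 100\n[text]\n!    total energy              =     -23.4387 Ry\n' > "$workdir/trial1/trial1.out"
printf '[text]\ncutoff = 110\n[text]\n!    total energy              =     -23.2523 Ry\n' > "$workdir/trial2/trial2.out"

# Run only the sed stage: each value lands on its own line,
# and every energy value carries a trailing ':' that closes the record.
intermediate=$(sed -rn -e 's/cutoff[ =]+([0-9]+)/\1/p' \
                       -e 's/.*total energy[= ]+([0-9.-]+).*/\1:/p' "$workdir"/*/*.out)

printf '%s\n' "$intermediate"
rm -rf "$workdir"
```

    The cutoff and the energy come out on alternating lines, which is why a later step has to join each pair.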

    sed flag

    's/../../p'
    

    I used p to print the result of each successful substitution, because I had disabled automatic printing with -n. Together they discard all the lines with no match.
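
    The interplay of -n and p can be checked on a three-line sample. This is only a sketch with made-up input in which a single line matches; GNU sed is assumed for -r.

```shell
# Three sample lines; only the middle one matches the pattern.
sample='[text]
cutoff = 100
[text]'

# Without -n: every line is printed, whether or not the substitution applied.
without_n=$(printf '%s\n' "$sample" | sed -r 's/cutoff[ =]+([0-9]+)/\1/')

# With -n and the p flag: only lines where the substitution succeeded are printed.
with_n_p=$(printf '%s\n' "$sample" | sed -rn 's/cutoff[ =]+([0-9]+)/\1/p')

printf 'without -n:\n%s\n\n' "$without_n"
printf 'with -n and p:\n%s\n' "$with_n_p"
```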


    tr '\n:' ' \n'
    

    Because sed outputs each value on a separate line, I used a delimiter (:) to mark where a newline (\n) should be written.

    characters replacement

    tr translates the characters in SET1 ('\n:') to the ones in SET2 (' \n'). Each character in SET1 is replaced by the character at the same position in SET2:

    # \n  ->  " " (space)
    # :   ->  \n
    

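    Note: every line after the first starts with a single space, because the newline that followed each : is itself translated to a space. You may want to pipe once more to clean the output, e.g. | sed 's/^ //' to strip that space (| tr -s ' ' only squeezes runs of repeated spaces).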
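
    The translation can be tried on its own with simulated sed output. The sketch below uses made-up values; the final sed 's/^ //' is one possible cleanup step chosen for illustration, not part of the original command.

```shell
# Simulated output of the sed stage: one value per line, ':' closing each record.
sed_output='100
-23.4387:
110
-23.2523:'

# The translation from the answer: newline -> space, ':' -> newline.
raw=$(printf '%s\n' "$sed_output" | tr '\n:' ' \n')

# Hypothetical cleanup step: strip the single leading space left on later lines.
clean=$(printf '%s\n' "$sed_output" | tr '\n:' ' \n' | sed 's/^ //')

# Wrap each raw line in brackets so the leading space becomes visible.
printf '%s\n' "$raw" | sed 's/.*/[&]/'
printf '%s\n' "$clean"
```

    The bracketed output makes the stray space on the second record easy to spot, which is what the cleanup step removes.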


    Another method to format the output

    A more rigorous way to format the result is to run sed again so that the output is exactly what you want:

    sed -rn -e 's/cutoff[ =]+([0-9]+)/\1/p' -e 's/.*total energy[= ]+([0-9.-]+).*/\1:/p' */*.out | tr '\n' ' ' | sed -r "s/([^:]+):[ ]*/\1\n/g"
    

    Notice that up to the first | the command is exactly the same as the one above.

    tr '\n' ' '
    

    It just replaces the newlines with spaces.

    sed -r "s/([^:]+):[ ]*/\1\n/g"
    

    It captures everything up to each : and prints it followed by a newline, dropping the : and any spaces after it.
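
    For an end-to-end check, the sketch below builds three made-up data files, runs the second pipeline unchanged, and writes totalenergy.txt. The temporary directory, file names and values are hypothetical. As a cross-check it also runs a one-pass awk variant, which is not part of the answer above: it assumes the cutoff value is the last field on its line and the energy is the second-to-last field (followed by the unit). GNU sed is assumed for -r and for \n in the replacement.

```shell
# Hypothetical sample data mirroring the question's layout (trialN/trialN.out).
workdir=$(mktemp -d)
mkdir "$workdir/trial1" "$workdir/trial2" "$workdir/trial3"
printf '[text]\ncutoff = 100\n[text]\n!    total energy              =     -23.4387 Ry\n' > "$workdir/trial1/trial1.out"
printf '[text]\ncutoff = 110\n[text]\n!    total energy              =     -23.2523 Ry\n' > "$workdir/trial2/trial2.out"
printf '[text]\ncutoff = 120\n[text]\n!    total energy              =     -24.0134 Ry\n' > "$workdir/trial3/trial3.out"

# The second pipeline from the answer, with only the file paths adapted.
sed -rn -e 's/cutoff[ =]+([0-9]+)/\1/p' \
        -e 's/.*total energy[= ]+([0-9.-]+).*/\1:/p' "$workdir"/*/*.out \
  | tr '\n' ' ' | sed -r "s/([^:]+):[ ]*/\1\n/g" > "$workdir/totalenergy.txt"
sed_result=$(cat "$workdir/totalenergy.txt")

# Alternative sketch (an assumption, not the answer's method): remember the
# cutoff when its line is seen, print the pair when the energy line is seen.
awk_result=$(awk '/cutoff/ {c = $NF} /total energy/ {print c, $(NF-1)}' "$workdir"/*/*.out)

printf 'sed pipeline:\n%s\n\nawk variant:\n%s\n' "$sed_result" "$awk_result"
rm -rf "$workdir"
```

    If the field positions in the real files differ from the sample, the awk field references would need adjusting, whereas the sed patterns key on the text around the numbers.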