Parsing sdf file in bash


I found this code for parsing an SDF file, but it does not handle the whitespace inside the tag names, which is why the Ki (nM) value does not show up in the output.

My file looks like this:


> <Ligand InChI Key>
CPZBLNMUGSZIPR-NVXWUHKLSA-N

> <BindingDB MonomerID>
50417287

> <BindingDB Ligand Name>
Aloxi::Aurothioglucose::PALONOSETRON::PALONOSETRON HYDROCHLORIDE

> <Target Name Assigned by Curator or DataSource>
5-hydroxytryptamine receptor 3A

> <Target Source Organism According to Curator or DataSource>
Homo sapiens

> <Ki (nM)>
 0.0316

> <IC50 (nM)>


> <Kd (nM)>


> <EC50 (nM)>
---------------------------
awk -v  OFS='\t' '
    /^>/ { tag=$2; next }
    NF { f[tag]=$1 }
    $0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
' P46098.sdf 

Thank you!


Solution

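  • In your script, tag=$2 keeps only the second whitespace-separated field, so the header > <Ki (nM)> produces the tag <Ki and the lookup f["<Ki (nM)>"] is always empty. Try the match() function instead to extract the whole tag, from < to > inclusive.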

    awk -v  OFS='\t' '
        /^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
        NF { f[tag]=$1 }
        $0 == "$$$$" {print f["<pH>"], f["<PMID>"], f["<Ki (nM)>"] }
    ' P46098.sdf
    
    • match($0, /<.+>/) returns a non-zero value if the regex <.+> matches $0, and sets the awk variables RSTART and RLENGTH to the start position and the length of the matched substring.
    • The regex <.+> matches a substring that starts with < and ends with >. Because . matches any character, the substring may contain whitespace.
    • substr($0, RSTART, RLENGTH) returns the substring of $0 that starts at position RSTART and is RLENGTH characters long. That substring is assigned to the variable tag.
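
To try the approach without the full P46098.sdf, here is a minimal sketch that runs the same match()/substr() logic on a small sample. The helper name parse_sdf and the sample's <PMID> value are made up for illustration; only the Ki entry is copied from the question. Unlike the script above, this variant clears the array after each "$$$$" line so that an empty field does not inherit the value from the previous record. That is an assumption about what you want, not something the original code does.

```shell
# Hypothetical sample: the Ki and IC50 entries follow the question's layout,
# the PMID value is invented, and "$$$$" ends the record.
sample=$(mktemp)
cat > "$sample" <<'EOF'
> <Ki (nM)>
 0.0316

> <IC50 (nM)>

> <PMID>
12345

$$$$
EOF

# parse_sdf (hypothetical name) prints PMID and Ki, tab-separated, per record.
parse_sdf() {
    awk -v OFS='\t' '
        # Header line: keep everything from "<" to the last ">" as the tag.
        /^>/ { match($0, /<.+>/); tag = substr($0, RSTART, RLENGTH); next }
        # End of record: print the wanted fields, then reset for the next record.
        $0 == "$$$$" { print f["<PMID>"], f["<Ki (nM)>"]; delete f; next }
        # Non-empty data line: store its first field under the current tag.
        NF { f[tag] = $1 }
    ' "$1"
}

parse_sdf "$sample"
rm -f "$sample"
```

With this sample the function prints 12345 and 0.0316 separated by a tab. Because only $1 is stored, a multi-word value such as "Homo sapiens" would be cut to its first word; store $0 instead if you need the whole line.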