I find my title quite unclear, however I do not know how to rewrite it better so feel free to edit it!
Data
I have the following (simplified) JSON:
[
{
"genes_id": "eco:b0002",
"entry_id": "b0002",
"division": "CDS",
"organism": "Escherichia coli K-12 MG1655",
"organism_code": "eco",
"organism_id": "T00007",
"name": "thrA",
"names": [
"thrA"
],
"definition": "(RefSeq) Bifunctional aspartokinase/homoserine dehydrogenase 1",
"eclinks": [
],
"orthologs": {
"K12524": "bifunctional aspartokinase / homoserine dehydrogenase 1 [EC:2.7.2.4 1.1.1.3]"
},
"pathways": {
"eco00260": "Glycine, serine and threonine metabolism",
"eco00261": "Monobactam biosynthesis",
"eco00270": "Cysteine and methionine metabolism",
"eco00300": "Lysine biosynthesis",
"eco01100": "Metabolic pathways",
"eco01110": "Biosynthesis of secondary metabolites",
"eco01120": "Microbial metabolism in diverse environments",
"eco01130": "Biosynthesis of antibiotics",
"eco01230": "Biosynthesis of amino acids"
},
"modules": {
"eco_M00016": "Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine",
"eco_M00017": "Methionine biosynthesis, apartate => homoserine => methionine",
"eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
},
"classes": [
],
"position": "337..2799",
"chromosome": null,
"gbposition": "337..2799",
"motifs": {
"Pfam": [
"Homoserine_dh",
"AA_kinase",
"NAD_binding_3",
"ACT_7",
"ACT",
"Sacchrp_dh_NADP"
]
},
"dblinks": {
"NCBI-GeneID": [
"945803"
],
"NCBI-ProteinID": [
"NP_414543"
],
"Pasteur": [
"thrA"
],
"RegulonDB": [
"ECK120000987"
],
"ECOCYC": [
"EG10998"
],
"ASAP": [
"ABE-0000008"
],
"UniProt": [
"P00561"
]
}
},
{
"genes_id": "eco:b0003",
"entry_id": "b0003",
"division": "CDS",
"organism": "Escherichia coli K-12 MG1655",
"organism_code": "eco",
"organism_id": "T00007",
"name": "thrB",
"names": [
"thrB"
],
"definition": "(RefSeq) homoserine kinase",
"eclinks": [
],
"orthologs": {
"K00872": "homoserine kinase [EC:2.7.1.39]"
},
"pathways": {
"eco00260": "Glycine, serine and threonine metabolism",
"eco01100": "Metabolic pathways",
"eco01110": "Biosynthesis of secondary metabolites",
"eco01120": "Microbial metabolism in diverse environments",
"eco01230": "Biosynthesis of amino acids"
},
"modules": {
"eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
},
"classes": [
],
"position": "2801..3733",
"chromosome": null,
"gbposition": "2801..3733",
"motifs": {
"Pfam": [
"GHMP_kinases_N",
"GHMP_kinases_C"
]
},
"dblinks": {
"NCBI-GeneID": [
"947498"
],
"NCBI-ProteinID": [
"NP_414544"
],
"Pasteur": [
"thrB"
],
"RegulonDB": [
"ECK120000988"
],
"ECOCYC": [
"EG10999"
],
"ASAP": [
"ABE-0000010"
],
"UniProt": [
"P00547"
]
}
}
]
Desired output
This is an array of two objects. I'm interested in the genes_id
and pathways
for both objects and want to obtain a tab separated file containing the following:
eco:b0002 eco00260 Glycine, serine and threonine metabolism
eco:b0002 eco00261 Monobactam biosynthesis
eco:b0002 eco00270 Cysteine and methionine metabolism
eco:b0002 eco00300 Lysine biosynthesis
eco:b0002 eco01100 Metabolic pathways
eco:b0002 eco01110 Biosynthesis of secondary metabolites
eco:b0002 eco01120 Microbial metabolism in diverse environments
eco:b0002 eco01130 Biosynthesis of antibiotics
eco:b0002 eco01230 Biosynthesis of amino acids
eco:b0003 eco00260 Glycine, serine and threonine metabolism
eco:b0003 eco01100 Metabolic pathways
eco:b0003 eco01110 Biosynthesis of secondary metabolites
eco:b0003 eco01120 Microbial metabolism in diverse environments
eco:b0003 eco01230 Biosynthesis of amino acids
What I found out
I know it's possible to extract the data in a format like:
eco:b0002: list of pathways and ids
eco:b0003: list of pathways and ids
However I want to spread the pathways out to individual rows as in the above example. I could not find any information on how to do this using jq and therefore doubt whether this is actually possible. Thus, if this is possible how could this be achieved using Jq?
Invocation: jq -rf totsv.jq input.json
Program (totsv.jq):
.[]
| .genes_id as $id
| .pathways
| to_entries[]
| [$id, .key, .value]
| @tsv
TSV is a good choice (as is jq)!