Search code examples
jsoncsvjqdata-extraction

Jq: Gather JSON data for each object


I find my title quite unclear, however I do not know how to rewrite it better so feel free to edit it!


Data
I have the following (simplified) JSON:

[
  {
    "genes_id": "eco:b0002",
    "entry_id": "b0002",
    "division": "CDS",
    "organism": "Escherichia coli K-12 MG1655",
    "organism_code": "eco",
    "organism_id": "T00007",
    "name": "thrA",
    "names": [
      "thrA"
    ],
    "definition": "(RefSeq) Bifunctional aspartokinase/homoserine dehydrogenase 1",
    "eclinks": [

    ],
    "orthologs": {
      "K12524": "bifunctional aspartokinase / homoserine dehydrogenase 1 [EC:2.7.2.4 1.1.1.3]"
    },
    "pathways": {
      "eco00260": "Glycine, serine and threonine metabolism",
      "eco00261": "Monobactam biosynthesis",
      "eco00270": "Cysteine and methionine metabolism",
      "eco00300": "Lysine biosynthesis",
      "eco01100": "Metabolic pathways",
      "eco01110": "Biosynthesis of secondary metabolites",
      "eco01120": "Microbial metabolism in diverse environments",
      "eco01130": "Biosynthesis of antibiotics",
      "eco01230": "Biosynthesis of amino acids"
    },
    "modules": {
      "eco_M00016": "Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine",
      "eco_M00017": "Methionine biosynthesis, apartate => homoserine => methionine",
      "eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
    },
    "classes": [

    ],
    "position": "337..2799",
    "chromosome": null,
    "gbposition": "337..2799",
    "motifs": {
      "Pfam": [
        "Homoserine_dh",
        "AA_kinase",
        "NAD_binding_3",
        "ACT_7",
        "ACT",
        "Sacchrp_dh_NADP"
      ]
    },
    "dblinks": {
      "NCBI-GeneID": [
        "945803"
      ],
      "NCBI-ProteinID": [
        "NP_414543"
      ],
      "Pasteur": [
        "thrA"
      ],
      "RegulonDB": [
        "ECK120000987"
      ],
      "ECOCYC": [
        "EG10998"
      ],
      "ASAP": [
        "ABE-0000008"
      ],
      "UniProt": [
        "P00561"
      ]
    }
  },
  {
    "genes_id": "eco:b0003",
    "entry_id": "b0003",
    "division": "CDS",
    "organism": "Escherichia coli K-12 MG1655",
    "organism_code": "eco",
    "organism_id": "T00007",
    "name": "thrB",
    "names": [
      "thrB"
    ],
    "definition": "(RefSeq) homoserine kinase",
    "eclinks": [

    ],
    "orthologs": {
      "K00872": "homoserine kinase [EC:2.7.1.39]"
    },
    "pathways": {
      "eco00260": "Glycine, serine and threonine metabolism",
      "eco01100": "Metabolic pathways",
      "eco01110": "Biosynthesis of secondary metabolites",
      "eco01120": "Microbial metabolism in diverse environments",
      "eco01230": "Biosynthesis of amino acids"
    },
    "modules": {
      "eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
    },
    "classes": [

    ],
    "position": "2801..3733",
    "chromosome": null,
    "gbposition": "2801..3733",
    "motifs": {
      "Pfam": [
        "GHMP_kinases_N",
        "GHMP_kinases_C"
      ]
    },
    "dblinks": {
      "NCBI-GeneID": [
        "947498"
      ],
      "NCBI-ProteinID": [
        "NP_414544"
      ],
      "Pasteur": [
        "thrB"
      ],
      "RegulonDB": [
        "ECK120000988"
      ],
      "ECOCYC": [
        "EG10999"
      ],
      "ASAP": [
        "ABE-0000010"
      ],
      "UniProt": [
        "P00547"
      ]
    }
  }
]

Desired output
This is an array of two objects. I'm interested in the genes_id and pathways for both objects and want to obtain a tab separated file containing the following:

eco:b0002   eco00260    Glycine, serine and threonine metabolism
eco:b0002   eco00261    Monobactam biosynthesis
eco:b0002   eco00270    Cysteine and methionine metabolism
eco:b0002   eco00300    Lysine biosynthesis
eco:b0002   eco01100    Metabolic pathways
eco:b0002   eco01110    Biosynthesis of secondary metabolites
eco:b0002   eco01120    Microbial metabolism in diverse environments
eco:b0002   eco01130    Biosynthesis of antibiotics
eco:b0002   eco01230    Biosynthesis of amino acids
eco:b0003   eco00260    Glycine, serine and threonine metabolism
eco:b0003   eco01100    Metabolic pathways
eco:b0003   eco01110    Biosynthesis of secondary metabolites
eco:b0003   eco01120    Microbial metabolism in diverse environments
eco:b0003   eco01230    Biosynthesis of amino acids

What I found out
I know it's possible to extract the data in a format like:

eco:b0002: list of pathways and ids
eco:b0003: list of pathways and ids

However I want to spread the pathways out to individual rows as in the above example. I could not find any information on how to do this using jq and therefore doubt whether this is actually possible. Thus, if this is possible how could this be achieved using Jq?


Solution

  • Invocation: jq -rf totsv.jq input.json

    Program (totsv.jq):

    .[]
    | .genes_id as $id
    | .pathways
    | to_entries[]
    | [$id, .key, .value]
    | @tsv
    

    TSV is a good choice (as is jq)!