Convert text from txt file into json using jq tool


I have a text file containing the output of the following command, which lists a bucket recursively and runs gsutil stat on every object: gsutil ls -r gs://bucket-test/** | while IFS= read -r key; do gsutil stat "$key"; done. The file looks like this:

gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip:
    Creation time:          Wed, 21 Dec 2022 10:39:27 GMT
    Update time:            Wed, 21 Dec 2022 10:39:27 GMT
    Storage class:          STANDARD
    Content-Length:         0
    Content-Type:           application/zip
    Hash (crc32c):          AAAAAA==
    Hash (md5):             1B2M2Y8AsgTpgAmY7PhCfg==
    ETag:                   CM30q9XCivwCEAE=
    Generation:             1671619167320653
    Metageneration:         1
gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L:
    Creation time:          Mon, 10 Apr 2023 19:09:41 GMT
    Update time:            Mon, 10 Apr 2023 19:09:41 GMT
    Storage class:          STANDARD
    Content-Disposition:    inline; filename=James_INGREDIENTS_A3.pdf
    Content-Length:         4381797
    Content-Type:           application/pdf
    Hash (crc32c):          GOzitA==
    Hash (md5):             eUSLC/z70gjDB2WQKIPOuQ==
    ETag:                   CLGPvu+BoP4CEAE=
    Generation:             1681153781106609
    Metageneration:         1
gs://bucket-test/prova.pdf:
    Creation time:          Mon, 08 May 2023 15:37:26 GMT
    Update time:            Mon, 08 May 2023 15:40:12 GMT
    Storage class:          STANDARD
    Content-Disposition:    inline; filename=James_KEY_VISUAL_A3.pdf
    Content-Language:       ace
    Content-Length:         15407
    Content-Type:           application/pdf
    Metadata:               
        meta-1:             prova 1
        meta-2:             prova 2
    Hash (crc32c):          ZIrHPA==
    Hash (md5):             oZbD+S8y35spkNozW3hUDA==
    ETag:                   CNDj09OG5v4CEAM=
    Generation:             1683560246604240
    Metageneration:         3

I need to convert this output to JSON. The unindented first line of each group should become the value of a "Key" field, each indented line should become a field of that object, and more deeply indented lines (for example those under "Metadata") should become subfields:

{
  "Key": "gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip",
  "Creation time": "Wed, 21 Dec 2022 10:39:27 GMT",
  "Update time": "Wed, 21 Dec 2022 10:39:27 GMT",
  "Storage class": "STANDARD",
  "Content-Length": "0",
  "Content-Type": "application/zip",
  "Hash (crc32c)": "AAAAAA==",
  "Hash (md5)": "1B2M2Y8AsgTpgAmY7PhCfg==",
  "ETag": "CM30q9XCivwCEAE=",
  "Generation": "1671619167320653",
  "Metageneration": "1"
},
{
  "Key": "gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L",
  "Creation time": "Mon, 10 Apr 2023 19:09:41 GMT",
  "Update time": "Mon, 10 Apr 2023 19:09:41 GMT",
  "Storage class": "STANDARD",
  "Content-Disposition": "inline; filename=James_INGREDIENTS_A3.pdf",
  "Content-Length": "4381797",
  "Content-Type": "application/pdf",
  "Hash (crc32c)": "GOzitA==",
  "Hash (md5)": "eUSLC/z70gjDB2WQKIPOuQ==",
  "ETag": "CLGPvu+BoP4CEAE=",
  "Generation": "1681153781106609",
  "Metageneration": "1"
},
{
  "Key": "gs://bucket-test/prova.pdf",
  "Creation time": "Mon, 08 May 2023 15:37:26 GMT",
  "Update time": "Mon, 08 May 2023 15:40:12 GMT",
  "Storage class": "STANDARD",
  "Content-Disposition": "inline; filename=James_KEY_VISUAL_A3.pdf",
  "Content-Language": "ace",
  "Content-Length": "15407",
  "Content-Type": "application/pdf",
  "Metadata": {
    "meta-1": "prova 1",
    "meta-2": "prova 2"
  },
  "Hash (crc32c)": "ZIrHPA==",
  "Hash (md5)": "oZbD+S8y35spkNozW3hUDA==",
  "ETag": "CNDj09OG5v4CEAM=",
  "Generation": "1683560246604240",
  "Metageneration": "3"
}

I tried the following command on a single object, but without success: gsutil stat gs://bucket-test/prova.pdf | printf %s "$(cat)" | jq -R -s 'split("\n") | map({key: split(": ")[0], value: split(": ")[1]})'

Instead of one object, it produces an array of key/value pairs that still contain the indentation and padding:

[
  {
    "key": "gs://spin8-test/prova.pdf:",
    "value": null
  },
  {
    "key": "    Creation time",
    "value": "         Mon, 08 May 2023 15:37:26 GMT"
  },
  {
    "key": "    Update time",
    "value": "           Mon, 08 May 2023 15:40:12 GMT"
  },
  {
    "key": "    Storage class",
    "value": "         STANDARD"
  },
  {
    "key": "    Content-Disposition",
    "value": "   inline; filename=James_KEY_VISUAL_A3.pdf"
  },
  {
    "key": "    Content-Language",
    "value": "      ace"
  },
  {
    "key": "    Content-Length",
    "value": "        15407"
  },
  {
    "key": "    Content-Type",
    "value": "          application/pdf"
  },
  {
    "key": "    Metadata",
    "value": "              "
  },
  {
    "key": "        meta-1",
    "value": "            prova 1"
  },
  {
    "key": "        meta-2",
    "value": "            prova 2"
  },
  {
    "key": "    Hash (crc32c)",
    "value": "         ZIrHPA=="
  },
  {
    "key": "    Hash (md5)",
    "value": "            oZbD+S8y35spkNozW3hUDA=="
  },
  {
    "key": "    ETag",
    "value": "                  CNDj09OG5v4CEAM="
  },
  {
    "key": "    Generation",
    "value": "            1683560246604240"
  },
  {
    "key": "    Metageneration",
    "value": "        3"
  }
]

Any suggestions? Thanks


Solution

  • With jq, you can read raw text with the -R flag and iterate over the lines with reduce. Start with an empty array [], then, depending on each line's indentation, add a new item (indentation 0), add a field to the last item (indentation 4), or add a field to the last item's .Metadata object (indentation 8). The indentation is measured with match, and each line is split into key and value with capture, both using regular expressions:

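    jq -Rn '
      # For each input line, record its indentation width and its key/value parts.
      reduce (inputs | {
        ind: match("^\\s*").length,
        cap: capture("\\s*(?<key>.*):(\\s+(?<value>.*))?$")
      }) as {$ind, $cap} ([];
        # Indentation 0: an object path, which starts a new item.
        if $ind == 0 then . + [$cap | {key}]
        # Indentation 4: a field of the last item (Metadata becomes an empty object).
        elif $ind == 4 then last += ([$cap | select(.key == "Metadata").value = {}] | from_entries)
        # Indentation 8: a field nested under Metadata.
        elif $ind == 8 then last.Metadata += ([$cap] | from_entries)
        else . end
      )
    '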
    

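    This produces a valid JSON array. The enclosing brackets are needed: comma-separated objects without them, as in the question's sample, would not be valid JSON. The first field of each item is named key rather than Key because {key} reuses the name of the capture group; writing {Key: .key} instead gives the name used in the question.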

    [
      {
        "key": "gs://bucket-test/4e123978-8eed-43ae-f521-8fba54c704ea.zip",
        "Creation time": "Wed, 21 Dec 2022 10:39:27 GMT",
        "Update time": "Wed, 21 Dec 2022 10:39:27 GMT",
        "Storage class": "STANDARD",
        "Content-Length": "0",
        "Content-Type": "application/zip",
        "Hash (crc32c)": "AAAAAA==",
        "Hash (md5)": "1B2M2Y8AsgTpgAmY7PhCfg==",
        "ETag": "CM30q9XCivwCEAE=",
        "Generation": "1671619167320653",
        "Metageneration": "1"
      },
      {
        "key": "gs://bucket-test/GKiSQMZ5rAqrSWwur/uploads/GENERAL/SNrQD97nzQN9eDLeA/AAZYefiL5CT8pxe4L",
        "Creation time": "Mon, 10 Apr 2023 19:09:41 GMT",
        "Update time": "Mon, 10 Apr 2023 19:09:41 GMT",
        "Storage class": "STANDARD",
        "Content-Disposition": "inline; filename=James_INGREDIENTS_A3.pdf",
        "Content-Length": "4381797",
        "Content-Type": "application/pdf",
        "Hash (crc32c)": "GOzitA==",
        "Hash (md5)": "eUSLC/z70gjDB2WQKIPOuQ==",
        "ETag": "CLGPvu+BoP4CEAE=",
        "Generation": "1681153781106609",
        "Metageneration": "1"
      },
      {
        "key": "gs://bucket-test/prova.pdf",
        "Creation time": "Mon, 08 May 2023 15:37:26 GMT",
        "Update time": "Mon, 08 May 2023 15:40:12 GMT",
        "Storage class": "STANDARD",
        "Content-Disposition": "inline; filename=James_KEY_VISUAL_A3.pdf",
        "Content-Language": "ace",
        "Content-Length": "15407",
        "Content-Type": "application/pdf",
        "Metadata": {
          "meta-1": "prova 1",
          "meta-2": "prova 2"
        },
        "Hash (crc32c)": "ZIrHPA==",
        "Hash (md5)": "oZbD+S8y35spkNozW3hUDA==",
        "ETag": "CNDj09OG5v4CEAM=",
        "Generation": "1683560246604240",
        "Metageneration": "3"
      }
    ]
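
If one object per line is preferred over a single array, and the first field should be called Key as in the question, the filter above can be adapted slightly. The following is a minimal sketch under stated assumptions, not something taken from the original answer: the sample file is a shortened, hypothetical stand-in for the saved gsutil stat output, and the names sample and result are made up for illustration.

```shell
# Hypothetical, shortened sample standing in for the saved gsutil stat output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
gs://bucket-test/prova.pdf:
    Creation time:          Mon, 08 May 2023 15:37:26 GMT
    Content-Type:           application/pdf
    Metadata:
        meta-1:             prova 1
        meta-2:             prova 2
    Metageneration:         3
EOF

# Same reduce as above, with two changes:
#   {Key: $cap.key} names the first field "Key" instead of "key";
#   the trailing .[] together with -c prints one compact object per line.
result=$(jq -Rnc '
  reduce (inputs | {
    ind: match("^\\s*").length,
    cap: capture("\\s*(?<key>.*):(\\s+(?<value>.*))?$")
  }) as {$ind, $cap} ([];
    if   $ind == 0 then . + [{Key: $cap.key}]
    elif $ind == 4 then last += ([$cap | select(.key == "Metadata").value = {}] | from_entries)
    elif $ind == 8 then last.Metadata += ([$cap] | from_entries)
    else . end
  )
  | .[]
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Each output line is then a complete JSON document on its own, which suits line-oriented tools. The array form from the original filter remains the better choice when a single JSON document is required.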
    
