Tags: json, powershell, large-files

Using PowerShell, how do I filter a JSON file to exclude certain key names?


I am trying to reduce the size of a 700 MB JSON file. It's a slightly smaller version of this: https://kaikki.org/dictionary/All%20languages%20combined/by-pos-name/kaikki_dot_org-dictionary-all-by-pos-name.json

I'm doing that by removing unnecessary information. The keys I don't need are: hypernyms,pos,categories,alt_of,inflection_templates,hyponyms,meronyms,source,wikipedia,holonyms,proverbs,head_templates,etymology_text,lang_code,hyphenation,forms,synonyms,antonyms.

I've tried

$Obj = 
  [System.IO.File]::ReadLines((Convert-Path -LiteralPath namesonly.json)) | 
  ConvertFrom-Json
$foo = Select-Object $Obj -ExcludeProperty hypernyms,pos,categories,alt_of,inflection_templates,hyponyms,meronyms,source,wikipedia,holonyms,proverbs,head_templates,etymology_text,lang_code,hyphenation,forms,synonyms,antonyms
$foo | ConvertTo-Json -Depth 100 > namesonlycleaned.json

But this results in an empty file. How do I fix it so I'll get a new JSON without those unnecessary fields?

Edit: It was suggested in the comments to add an asterisk. If I understood that correctly, then

$Obj = 
  [System.IO.File]::ReadLines((Convert-Path -LiteralPath namesonly.json)) | 
  ConvertFrom-Json
$foo = Select-Object $Obj * -ExcludeProperty hypernyms,pos,categories,alt_of,inflection_templates,hyponyms,meronyms,source,wikipedia,holonyms,proverbs,head_templates,etymology_text,lang_code,hyphenation,forms,synonyms,antonyms
$foo | ConvertTo-Json -Depth 100 > namesonlycleaned.json

returns the error

A positional parameter cannot be found that accepts argument '*'.

Solution

    • Your immediate problems are the ones pointed out by Mathias R. Jessen:

      • Unfortunately, in Windows PowerShell, using Select-Object's -ExcludeProperty on its own does not work as intended (it outputs empty objects); it must be combined with -Property *. This problem has been fixed in PowerShell (Core) 7+.

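      • Input objects must be provided to Select-Object via the pipeline. A positional argument such as $Obj is bound to -Property rather than treated as input, which is also why the added * could not be bound: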

        $Obj | Select-Object -Property * -ExcludeProperty hypernyms,pos,categories,alt_of,inflection_templates,hyponyms,meronyms,source,wikipedia,holonyms,proverbs,head_templates,etymology_text,lang_code,hyphenation,forms,synonyms,antonyms      
        
    • However, this alone will not solve your problem:

      • Judging by the linked data source and the list of properties you're trying to exclude, some of those properties belong to nested objects, i.e. you're looking to remove properties from each object's entire object graph.

      • Select-Object doesn't support this, but the custom Remove-Property function (source code in the bottom section) does.
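
To see the limitation on a small scale, here is a minimal sketch using a made-up one-entry sample (the sample JSON and the variable names are assumptions for illustration, not taken from the real file). Select-Object drops the top-level pos property, but the categories property inside the nested senses object survives:

```powershell
# Hypothetical sample entry, loosely shaped like the linked data.
$json  = '{"word":"Ada","pos":"name","senses":[{"glosses":["a given name"],"categories":["Names"]}]}'
$entry = $json | ConvertFrom-Json

# -ExcludeProperty only looks at the properties of the input object itself.
$trimmed = $entry | Select-Object -Property * -ExcludeProperty pos, categories

# Property names that remain at the top level and in the first nested object.
$topNames    = @($trimmed.psobject.Properties.Name)            # word, senses
$nestedNames = @($trimmed.senses[0].psobject.Properties.Name)  # glosses, categories
```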


    Use the following (make sure you've defined the Remove-Property function from the bottom section first):

    [System.IO.File]::ReadLines((Convert-Path -LiteralPath namesonly.json)) | 
      ConvertFrom-Json |
      Remove-Property -Recurse -Property hypernyms,pos,categories,alt_of,inflection_templates,hyponyms,meronyms,source,wikipedia,holonyms,proverbs,head_templates,etymology_text,lang_code,hyphenation,forms,synonyms,antonyms |
      ConvertTo-Json -Compress -Depth 100 > namesonlycleaned.json
    

    Note:

    • This will run for quite some time, but by using a single pipeline it avoids unnecessary memory use due to intermediate storage of results.

      • That said (at least as of PowerShell 7.4), ConvertFrom-Json reads all input up front before producing output; in terms of runtime performance, however, this part finishes fairly quickly.

    • For troubleshooting, say to limit output to the first 10 objects, you can insert Select-Object -First 10 as a pipeline segment before the ConvertTo-Json segment.
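
As a sketch of that troubleshooting step, the following uses a small throwaway file in the temp directory (the file name and its contents are made up for illustration) and leaves out the Remove-Property segment so that it runs on its own:

```powershell
# Create a hypothetical 25-line sample file with one JSON object per line.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-lines.json'
1..25 | ForEach-Object { '{"word":"w' + $_ + '","pos":"name"}' } | Set-Content -LiteralPath $tmp

# Same pipeline shape as above, limited to the first 10 parsed objects.
$preview = [System.IO.File]::ReadLines($tmp) |
  ConvertFrom-Json |
  Select-Object -First 10 |
  ConvertTo-Json -Compress -Depth 100

$previewCount = ($preview | ConvertFrom-Json).Count
Remove-Item -LiteralPath $tmp
```

Note that ConvertTo-Json collects all of its pipeline input and emits a single JSON array, so the result is one array rather than one object per line.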


    Remove-Property source code:

    function Remove-Property {
      <#
      .SYNOPSIS
      Removes properties from [pscustomobject] or dictionary objects (hashtables)
      and outputs the resulting objects.
      
      .DESCRIPTION
      Use -Recurse to remove the specified properties / entries from 
      the entire object *graph* of each input object, i.e. also from any *nested* 
      [pscustomobject]s or dictionaries.
    
      Useful for removing unwanted properties / entries from object graphs parsed
      from JSON via ConvertFrom-Json.
    
      Attempts to remove non-existent properties / entries are quietly ignored.
      
      .EXAMPLE
      [pscustomobject] @{ foo=1; bar=2 } | Remove-Property foo
    
      Removes the 'foo' property from the given custom object and outputs the result.
    
      .EXAMPLE
      @{ foo=1; bar=@{foo=10; baz=2} } | Remove-Property foo -Recurse
    
      Removes 'foo' properties (entries) from the entire object graph, i.e. from
      the top-level hashtable as well as from any nested hashtables.
      #>
      param(
        [Parameter(Mandatory, Position = 0)] [string[]] $Property,
        [switch] $Recurse,
        [Parameter(Mandatory, ValueFromPipeline)] [object] $InputObject
      )
      process {
        $isPsCustObj = $InputObject -is [System.Management.Automation.PSCustomObject]
        if (-not ($isPsCustObj -or $InputObject -is [System.Collections.IDictionary])) {
          Write-Error "Neither a [pscustomobject] nor an [IDictionary] instance: $InputObject"
          return
        }
        # Remove the requested properties from the input object itself.
        foreach ($propName in $Property) {
          # Note: In both cases, if a property / entry by a given name doesn't exist, the .Remove() call is a quiet no-op.
          if ($isPsCustObj) {
            $InputObject.psobject.Properties.Remove($propName)
          }
          else {
            # IDictionary
            $InputObject.Remove($propName)
          }
        }
        # Recurse, if requested.
        if ($Recurse) {
          if ($isPsCustObj) {
            foreach ($prop in $InputObject.psobject.Properties) {
              if ($prop.Value -is [System.Management.Automation.PSCustomObject] -or $prop.Value -is [System.Collections.IDictionary]) {
                $prop.Value = Remove-Property -InputObject $prop.Value -Recurse -Property $Property
              }
            }
          }
          else {
            # IDictionary
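            foreach ($entry in $InputObject.GetEnumerator()) {
              if ($entry.Value -is [System.Management.Automation.PSCustomObject] -or $entry.Value -is [System.Collections.IDictionary]) {
                # The nested object is modified in place, so nothing needs to be assigned back
                # to the entry (entries of generic dictionaries expose a read-only .Value).
                $null = Remove-Property -InputObject $entry.Value -Recurse -Property $Property
              }
            }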
          }
        }
        $InputObject # Output the potentially modified input object.
      }
    }
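
One thing to check against your data: as written, the -Recurse logic above only descends into property values that are themselves [pscustomobject]s or dictionaries, not into arrays of such objects. If the unwanted properties also sit inside JSON arrays, a variant along the following lines may help. This is a minimal sketch, not part of the original function: the name Remove-PropertyDeep is hypothetical, it always recurses, and it walks the graph with an explicit stack instead of calling itself.

```powershell
function Remove-PropertyDeep {
  # Hypothetical helper: removes the named properties / entries from the whole
  # object graph, including objects nested inside arrays and lists.
  param(
    [Parameter(Mandatory, Position = 0)] [string[]] $Property,
    [Parameter(Mandatory, ValueFromPipeline)] [object] $InputObject
  )
  process {
    $stack = [System.Collections.Generic.Stack[object]]::new()
    $stack.Push($InputObject)
    while ($stack.Count -gt 0) {
      $node = $stack.Pop()
      if ($node -is [System.Management.Automation.PSCustomObject]) {
        foreach ($name in $Property) { $node.psobject.Properties.Remove($name) }
        foreach ($prop in $node.psobject.Properties) {
          if ($null -ne $prop.Value) { $stack.Push($prop.Value) }
        }
      }
      elseif ($node -is [System.Collections.IDictionary]) {
        foreach ($name in $Property) { $null = $node.Remove($name) }
        # Copy the values first so the dictionary isn't enumerated while nested nodes change.
        foreach ($value in @($node.Values)) {
          if ($null -ne $value) { $stack.Push($value) }
        }
      }
      elseif ($node -is [System.Collections.IList]) {
        # Arrays and lists: queue each element for inspection.
        foreach ($item in $node) {
          if ($null -ne $item) { $stack.Push($item) }
        }
      }
    }
    $InputObject # Output the potentially modified input object.
  }
}
```

It can be dropped into the pipeline in place of the Remove-Property -Recurse segment, e.g. `... | ConvertFrom-Json | Remove-PropertyDeep pos, categories | ConvertTo-Json -Compress -Depth 100`.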