Search code examples
jsonpowershellreflectionlarge-data

From a 700MB json file, how do I list all the keys in Powershell?


I tried

$obj = [System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | ConvertFrom-Json
 
$keys = @() 
 
foreach ($key in $obj.GetEnumerator()) { 
  $keys += $key.Key 
} 
 
Write-Output $keys

But after over 24 hours it had not completed.

I need the key names so I can

  1. Delete irrelevant info and make it smaller
  2. Convert it to csv (the key names are required, otherwise PS just uses the first object and ignores keys which are not present in the first object)

The JSON is a version of this one (though 200 megs smaller): https://kaikki.org/dictionary/All%20languages%20combined/by-pos-name/kaikki_dot_org-dictionary-all-by-pos-name.json


Solution

  • You can Stream the file and process each line, avoiding Get-Content for large files and minimizing the number of cmdlet invocations:

    $allKeys = @{}
    
    [System.IO.File]::ReadLines("names.json") | ForEach-Object {
        $obj = ConvertFrom-Json $_
        $obj.PSObject.Properties.Name | ForEach-Object {
            $allKeys[$_]= $true
        }
    }
    
    $uniqueKeys = $allKeys.Keys
    
    Write-Output $uniqueKeys
    

    This approach uses the .NET ReadLines method to efficiently stream the file and process each line.

    I've finally finished downloading your massive JSON file.

    • To process the JSON further, here's a simple extraction of "glosses" from each line:
    $allGlosses = @()
    
    [System.IO.File]::ReadLines("sample.json") | ForEach-Object {
        $obj = ConvertFrom-Json $_
        $obj.senses.glosses | ForEach-Object {
            $allGlosses += $_
        }
    }
    
    Write-Output $allGlosses
    

    This will give you a list of glosses from each JSON object.

    Feedback Reply:

    Perhaps use a hashtable to store unique keys and avoid using ForEach-Object unnecessarily, and use a full path for file access:

    $allKeys = @{}
    $filePath = Resolve-Path "names.json"
    
    foreach ($line in [System.IO.File]::ReadLines($filePath)) {
        $obj = ConvertFrom-Json $line
        foreach ($key in $obj.PSObject.Properties.Name) {
            $allKeys[$key] = $true
        }
    }
    
    $uniqueKeys = $allKeys.Keys
    
    Write-Output $uniqueKeys
    

    This code avoids growing an array with +=, uses a foreach loop instead of ForEach-Object, and resolves the file path before using it in the .NET method.