I have some very large (> 500MB) JSON files that I need to map to a new format and upload to a new DB.
The old format:
{
  id: '001',
  timestamp: 2016-06-02T14:10:53Z,
  contentLength: 123456,
  filepath: 'original/...',
  size: 'original'
},
{
  id: '001',
  timestamp: 2016-06-02T14:10:53Z,
  contentLength: 24565,
  filepath: 'medium/...',
  size: 'medium'
},
{
  id: '001',
  timestamp: 2016-06-02T14:10:53Z,
  contentLength: 5464,
  filepath: 'small/...',
  size: 'small'
}
The new format:
{
  Id: '001',
  Timestamp: 2016-06-02T14:10:53Z,
  OriginalSize: {
    ContentLength: 123456,
    FilePath: 'original/...'
  },
  MediumSize: {
    ContentLength: 24565,
    FilePath: 'medium/...'
  },
  SmallSize: {
    ContentLength: 5464,
    FilePath: 'small/...'
  }
}
With small datasets I was achieving this as follows, processing the 'original' size first:
let out = data.filter(o => o.size === 'original').map(o => {
  return {
    Id: o.id,
    Timestamp: o.timestamp,
    OriginalSize: {
      ContentLength: o.contentLength,
      FilePath: o.filepath
    }
  };
});
data.filter(o => o.size !== 'original').forEach(o => {
  let orig = out.find(function (og) {
    return og.Timestamp === o.timestamp;
  });
  // capitalize to match the new format: 'medium' -> 'MediumSize'
  orig[o.size.charAt(0).toUpperCase() + o.size.slice(1) + 'Size'] = {
    ContentLength: o.contentLength,
    FilePath: o.filepath
  };
});
// out now contains the correctly-formatted objects
The problem comes with the very large datasets, where I can't load the hundreds of megabytes of JSON into memory all at once. This seems like a great time to use streams, but of course if I read the file in chunks, running .find() on a small array to find the 'original' size won't work. If I scan through the whole file to find originals and then scan through again to add the other sizes to what I've found, I end up with the whole dataset in memory anyway.
I know of JSONStream, which would be great if I was doing a simple 1-1 remapping of my objects.
Surely I can't be the first one to run into this kind of problem. What solutions have been used in the past? How can I approach this?
I think the trick is to update the database on the fly. If the JSON file is too big for memory, then I expect the resulting set of objects (out in your example) is too big for memory too.
In the comments you state the JSON file has one object per line. Therefore use Node.js's builtin fs.createReadStream and readline to get each line of the text file. Next, parse the line (a string) into a JSON object, and finally update the database.
parse.js
var readline = require('readline');
var fs = require('fs');
var jsonfile = 'text.json';
var linereader = readline.createInterface({
  input: fs.createReadStream(jsonfile)
});

linereader.on('line', function (line) {
  var obj = parseJSON(line); // convert line (string) to JSON object
  // check DB for existing id/timestamp
  if ( existsInDB({id: obj.id, timestamp: obj.timestamp}) ) {
    updateInDB(obj); // already exists, so UPDATE
  }
  else { insertInDB(obj); } // does not exist, so INSERT
});
// DUMMY functions below, implement according to your needs
function parseJSON (str) {
  str = str.replace(/,\s*$/, ""); // lose trailing comma
  return eval('(' + str + ')'); // insecure! so no unknown sources
}
function existsInDB (obj) { return true; }
function updateInDB (obj) { console.log(obj); }
function insertInDB (obj) { console.log(obj); }
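For what it's worth, the exists/update/insert trio can often collapse into a single upsert that builds the new format directly in the database. Below is a runnable sketch of that merge logic with a plain Map standing in for the real store; upsertInDB and sizeKey are my own names, not part of any driver. With e.g. MongoDB the same shape maps onto one updateOne call with upsert: true and a computed field name.

var db = new Map(); // stands in for the real database, keyed by id + timestamp

function sizeKey(size) {
  // 'original' -> 'OriginalSize', 'medium' -> 'MediumSize', etc.
  return size.charAt(0).toUpperCase() + size.slice(1) + 'Size';
}

function upsertInDB(obj) {
  var key = obj.id + '|' + obj.timestamp;
  // fetch the existing doc, or start a fresh one in the new format
  var doc = db.get(key) || { Id: obj.id, Timestamp: obj.timestamp };
  doc[sizeKey(obj.size)] = {
    ContentLength: obj.contentLength,
    FilePath: obj.filepath
  };
  db.set(key, doc);
}

upsertInDB({ id: '001', timestamp: '2016-06-02T14:10:53Z',
             contentLength: 123456, filepath: 'original/...', size: 'original' });
upsertInDB({ id: '001', timestamp: '2016-06-02T14:10:53Z',
             contentLength: 24565, filepath: 'medium/...', size: 'medium' });

console.log(db.get('001|2016-06-02T14:10:53Z'));
// one doc with Id, Timestamp, OriginalSize and MediumSize merged in

Each line that arrives just deepens the one document for its id/timestamp pair, so nothing beyond the current line needs to stay in memory.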
text.json
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 123456, filepath: 'original/...', size: 'original' },
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 24565, filepath: 'medium/...', size: 'medium' },
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 5464, filepath: 'small/...', size: 'small' }
NOTE: I needed to quote the timestamp value to avoid a syntax error. From your question and example script I expect you either don't have this problem or already have this solved, maybe another way.
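Also note that real DB calls are asynchronous, and the 'line' event can outrun them. A common throttle is to pause the reader until each write completes. A self-contained sketch of that pattern follows; upsertInDB here is a stand-in that just collects objects, and the demo writes two strict-JSON lines to a temp file so it can run on its own:

var fs = require('fs');
var os = require('os');
var path = require('path');
var readline = require('readline');

// stand-in for an async DB call (an assumption, not a real driver):
// it just collects the parsed objects
var results = [];
function upsertInDB(obj, done) {
  setImmediate(function () {
    results.push(obj);
    done();
  });
}

// demo input: two strict-JSON lines written to a temp file
var jsonfile = path.join(os.tmpdir(), 'demo-lines.json');
fs.writeFileSync(jsonfile,
  '{"id":"001","size":"original"}\n{"id":"001","size":"medium"}\n');

var linereader = readline.createInterface({
  input: fs.createReadStream(jsonfile)
});

linereader.on('line', function (line) {
  linereader.pause();                        // hold further 'line' events
  upsertInDB(JSON.parse(line), function () {
    linereader.resume();                     // read on once the write is done
  });
});
// caveat: pause() doesn't stop lines readline has already buffered,
// so a slow writer may still see a few events in flight

This keeps roughly one DB write outstanding at a time instead of firing one per line as fast as the file can be read.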
Also, my implementation of parseJSON may be different from how you are parsing the JSON. Plain old JSON.parse failed for me due to the properties not being quoted.
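For completeness, here is one way to avoid eval: quote the bare property names and swap the single quotes for double quotes, then hand the result to plain JSON.parse. The regexes assume simple identifier keys and no apostrophes inside the string values, which holds for text.json above; anything messier calls for a real relaxed-JSON parser (the json5 package, for instance).

function parseJSON(str) {
  return JSON.parse(
    str
      .replace(/,\s*$/, '')                       // lose the trailing comma
      .replace(/([{,]\s*)(\w+)\s*:/g, '$1"$2":')  // quote bare property names
      .replace(/'/g, '"')                         // single -> double quotes
  );
}

var obj = parseJSON(
  "{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 123456, filepath: 'original/...', size: 'original' },"
);
console.log(obj.id); // prints 001

Note the key regex only quotes words that follow a { or a comma, so the colons inside the timestamp value are left alone.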