We will receive up to 10k JSON files in a separate directory; each must be parsed and converted to its own .csv file. Then the file at the URL in each JSON file must be downloaded to another directory. I was planning on doing this in Automator on the Mac and calling a Python script to download the files. I have the shell-script portion that converts to CSV done, but I have no idea where to start with Python to download the URLs.
Here's what I have so far for Automator:
- Shell = /bin/bash
- Pass input = as arguments
- Code = as follows
#!/bin/bash
/usr/bin/perl -CSDA -w <<'EOF' - "$@" > ~/Desktop/out_"$(date '+%F_%H%M%S')".csv
use strict;
use JSON::Syck;
$JSON::Syck::ImplicitUnicode = 1;

# json node paths to extract
my @paths = ('/upload_date', '/title', '/webpage_url');

for (@ARGV) {
    my $json;
    open(IN, "<", $_) or die "$!";
    {
        local $/;
        $json = <IN>;
    }
    close IN;

    my $data = JSON::Syck::Load($json) or next;
    my @values = map { &json_node_at_path($data, $_) } @paths;

    {
        # output CSV spec
        # - field separator = SPACE
        # - record separator = LF
        # - every field is quoted
        local $, = qq( );
        local $\ = qq(\n);
        print map { s/"/""/og; q(").$_.q("); } @values;
    }
}

sub json_node_at_path ($$) {
    # $ : (reference) json object
    # $ : (string) node path
    #
    # E.g. Given node path = '/abc/0/def', it returns either
    #   $obj->{'abc'}->[0]->{'def'} if $obj->{'abc'} is ARRAY; or
    #   $obj->{'abc'}->{'0'}->{'def'} if $obj->{'abc'} is HASH.
    my ($obj, $path) = @_;
    my $r = $obj;
    for ( map { /(^.+$)/ } split /\//, $path ) {
        if ( /^[0-9]+$/ && ref($r) eq 'ARRAY' ) {
            $r = $r->[$_];
        }
        else {
            $r = $r->{$_};
        }
    }
    return $r;
}
EOF
I'm unfamiliar with Automator, so perhaps someone else can address that, but the Python portion is fairly simple. Downloading a file from a URL would go something like this:
import requests
r = requests.get(url)  # assuming you don't need to do any authentication
with open("my_file_name", "wb") as f:
    f.write(r.content)
Requests is a great library for handling HTTP(S), and since the content attribute of the Response is a byte string, we can open a file for writing bytes (the "wb") and write it directly. This works for executable payloads too, so be sure you know what you are downloading. If you don't already have requests installed, run pip install requests or the Mac equivalent.
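If installing requests for whichever Python Automator ends up invoking turns out to be awkward, the same download can be sketched with the standard library alone. This swaps requests for urllib.request, so it is an alternative rather than the approach above. The helper names filename_from_url and download, the "download.bin" fallback name, and the 30-second timeout are all made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, or default if there is none."""
    # The query string and fragment are not part of .path, so they are dropped.
    name = Path(unquote(urlsplit(url).path)).name
    return name or default


def download(url, dest_dir, timeout=30):
    """Stream url into dest_dir and return the path that was written."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / filename_from_url(url)
    # copyfileobj copies in chunks, so a large file is never held in memory whole.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(target, "wb") as f:
        shutil.copyfileobj(response, f)
    return target
```

Deriving the name from the URL avoids hard-coding "my_file_name" for each of the 10k downloads, though two URLs that end in the same segment would overwrite each other in this sketch.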
If you were inclined to do your whole process in Python, I would suggest you look at the json and csv packages. Both are part of the standard library and provide high-level interfaces for exactly what you are doing.
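As a rough idea of what the json and csv route could look like, here is a minimal sketch that writes one CSV per JSON file. It assumes each file holds a single top-level JSON object and reuses the three keys from the Perl script in the question. The function name json_file_to_csv, the header row, and the comma separator are my own choices (the Perl version separates fields with a space and writes no header).

```python
import csv
import json
from pathlib import Path

# Keys taken from the question's Perl script.
FIELDS = ["upload_date", "title", "webpage_url"]


def json_file_to_csv(json_path, csv_dir):
    """Write one CSV (header plus one row) for a single JSON file."""
    json_path = Path(json_path)
    csv_dir = Path(csv_dir)
    csv_dir.mkdir(parents=True, exist_ok=True)

    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)  # assumed to be a dict at the top level

    csv_path = csv_dir / (json_path.stem + ".csv")
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        # Missing keys become empty fields instead of raising KeyError.
        writer.writerow({key: data.get(key, "") for key in FIELDS})
    return csv_path
```

Looping over a directory would then be something like calling json_file_to_csv(p, "csv_out") for each p in Path("json_in").glob("*.json"), where both directory names are placeholders.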
Edit:
Here's an example if you were using the json module on a file like this:
[
    {
        "url": <some url>,
        "name": <the name of the file>
    }
]
Your Python code might look similar to this:
import requests
import json

with open("my_json_file.json", "r") as json_f:
    for item in json.load(json_f):
        # Fetch each URL and save the body under the name given in the JSON.
        r = requests.get(item["url"])
        with open(item["name"], "wb") as f:
            f.write(r.content)
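With thousands of URLs, some requests will probably fail, and in a plain loop one exception would stop the whole run. A hedged sketch of one way to keep going and collect the failures is below. It assumes the same list-of-objects layout with "url" and "name" keys; download_listed_files and its fetch parameter are hypothetical names, with fetch being any callable that takes a URL and returns bytes.

```python
import json
from pathlib import Path


def download_listed_files(json_path, dest_dir, fetch):
    """Download every entry of a JSON list of {"url", "name"} objects.

    fetch is any callable that takes a URL and returns the body as bytes.
    Returns a list of (url, error message) pairs for the entries that failed.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    with open(json_path, encoding="utf-8") as json_f:
        items = json.load(json_f)

    failures = []
    for item in items:
        try:
            body = fetch(item["url"])
            # Path(...).name drops any directory part so files stay in dest_dir.
            (dest_dir / Path(item["name"]).name).write_bytes(body)
        except Exception as exc:  # broad on purpose: record it and keep going
            failures.append((item.get("url"), str(exc)))
    return failures
```

With requests, fetch could be a small function that calls requests.get(url, timeout=30), then raise_for_status() on the response so HTTP errors count as failures, and returns its content attribute.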