Using the MediaWiki API, this gives me an output like so, for search term Tiger
https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1
Response:
{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}
How do I get an output as
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
Please can someone also tell me how to store everything in a text file? I'm a beginner here so please be nice. I need this for a project I'm doing in Bash, on a Raspberry Pi 2, with Raspbian
It's usually recommended to use JSON parser for handling JSON, one that I like is jq
% jq -r '.query.pages[].extract' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>
<p></p>
To remove the HTML tags you can do something like:
... | sed 's/<[^>]*>//g'
Which will remove HTML tags that are not on continues lines:
% jq -r '.query.pages[].extract' file | sed 's/<[^>]*>//g'
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
file
is the file the JSON is stored in, eg:
curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' > file
jq '...' file
or
jq '...' <(curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1')
You can install jq
with:
sudo apt-get install jq
For your example input you can also use grep
with -P
(PCRE). But using a proper JSON parser as above is recommended
grep -oP '(?<=extract":").*?(?=(?<!\\)")' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>