I'm trying to count the video and audio files present in a Wikipedia article. I searched the API documentation but didn't find a module for that.
I did notice that when I use the API to list the images on a specific page, an audio file with the .ogg extension appears in the list alongside the images.
I don't know whether this behaviour can be generalized, or whether I can rely on it to count video and audio files. Does anyone know another way to do this?
Basically, the API treats all file types the same way, but you can fetch the mediatype of each file and use that to pick out the video and audio files.
To get the mediatype of a file, use prop=imageinfo (this will be changed to the more accurate prop=fileinfo in future versions) for each file. Because prop=images can be used as a generator, you can get the list of files and their mediatypes in a single API call, like this:
https://ar.wikipedia.org/w/api.php?action=query&generator=images&titles=%D8%AD%D9%88%D8%AB%D9%8A%D9%88%D9%86&redirects=&prop=imageinfo&iiprop=mediatype&continue=&format=xml
Here images is used as a generator, returning a list of files, and that list is in turn fed to the imageinfo call.
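As a rough illustration of how the query above is put together, here is a minimal Python sketch that rebuilds the same URL from its parameters. The function name build_query_url and its lang argument are made up for this example, and it requests format=json rather than format=xml so that the response is easier to parse.

```python
from urllib.parse import urlencode


def build_query_url(title, lang="ar"):
    """Build a generator=images + prop=imageinfo query URL (hypothetical helper)."""
    params = {
        "action": "query",
        "generator": "images",   # list the files used on the page...
        "titles": title,
        "redirects": "",         # follow redirects to the target page
        "prop": "imageinfo",     # ...and fetch imageinfo for each of them
        "iiprop": "mediatype",   # only the mediatype is needed here
        "continue": "",
        "format": "json",        # assumption: JSON instead of the XML used above
    }
    # urlencode percent-encodes the non-ASCII title as UTF-8
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


url = build_query_url("حوثيون")
print(url)
```

The printed URL should differ from the one above only in the format parameter.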
For each file, you will get something like this (shown here as JSON, i.e. what you get with format=json instead of format=xml):
"2014232": {
    "pageid": 2014232,
    "ns": 6,
    "title": "\u0645\u0644\u0641:06-Salame-Al Aadm 001.ogg",
    "imagerepository": "local",
    "imageinfo": [
        {
            "mediatype": "AUDIO"
        }
    ]
}
The mediatype can be any of the following (copied from the manual):
UNKNOWN // unknown format
BITMAP // some bitmap image or image source (like psd, etc). Can't scale up.
DRAWING // some vector drawing (SVG, WMF, PS, ...) or image source (oo-draw, etc). Can scale up.
AUDIO // simple audio file (ogg, mp3, wav, midi, whatever)
VIDEO // simple video file (ogg, mpg, etc; do not include formats here that may contain executable sections or scripts!)
MULTIMEDIA // Scriptable Multimedia (flash, advanced video container formats, etc)
OFFICE // Office Documents, Spreadsheets (office formats possibly containing applets, scripts, etc)
TEXT // Plain text (possibly containing program code or scripts)
EXECUTABLE // binary executable
ARCHIVE // archive file (zip, tar, etc)
The default mapping of mimetype <=> mediatype is available here, though it's possible to override this for an individual wiki.
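To tie this back to the original question, here is a minimal Python sketch of the counting step. It assumes the response has already been fetched with format=json and parsed into a dict, with the pages keyed by page ID as in the sample above. The function name count_mediatypes and the two extra entries in the sample data are invented for illustration.

```python
from collections import Counter


def count_mediatypes(response):
    """Count files per mediatype in a parsed action=query JSON response."""
    counts = Counter()
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        infos = page.get("imageinfo")
        if not infos:
            # No imageinfo for this entry (e.g. the file is missing); skip it
            continue
        counts[infos[0].get("mediatype", "UNKNOWN")] += 1
    return counts


# Sample data: the first entry is the one shown above,
# the other two are hypothetical and only there for the demonstration.
sample = {
    "query": {
        "pages": {
            "2014232": {
                "pageid": 2014232,
                "ns": 6,
                "title": "\u0645\u0644\u0641:06-Salame-Al Aadm 001.ogg",
                "imagerepository": "local",
                "imageinfo": [{"mediatype": "AUDIO"}],
            },
            "-1": {"ns": 6, "title": "File:Example clip.webm",
                   "imageinfo": [{"mediatype": "VIDEO"}]},
            "-2": {"ns": 6, "title": "File:Example photo.jpg",
                   "imageinfo": [{"mediatype": "BITMAP"}]},
        }
    }
}

counts = count_mediatypes(sample)
print("audio:", counts["AUDIO"], "video:", counts["VIDEO"])
```

Note that a page with many files may not fit in one response. In that case the API returns a continue block whose values have to be sent back with the next request, and the counts from each batch added together. Raising the generator's limit (gimlimit) should reduce the number of requests needed.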