I'm learning to use the Wikipedia API to get public information about users. I found the script, "get_users.py", in MediaWiki-API-demos, which can fetch general information like edit count or email address. However, the personal description on the user page cannot be fetched this way.
(An example is shown below. I want to get text like "I'm not usually active on English Wikipedia. Please refer ...")
I found that "API: Get the contents of a page" offers an option to achieve that. Because I know nothing about PHP, may I ask is there any way we can get these textual contents using the API in Python?
Thank you a lot for your time in advance!
Update:
I'm trying to look up the user information for a list of users like the following:
If I want to fetch their personal statements, is there any way to execute the requests all at once, instead of looping over the users one by one and feeding each into the script? (The script comes from the demo: get_pages_revisions.py)
(Suppose we want to find the info for Catrope and Bob; the following attempt, made by modifying PARAMS, does not work correctly:
PARAMS = {
    "action": "query",
    "prop": "revisions",
    "titles": "User:Catrope|Bob",
    "rvprop": "timestamp|user|comment|content",
    "rvslots": "main",
    "formatversion": "2",
    "format": "json"
}
)
You don't have to know PHP to use information from "API: Get the contents of a page". Those are only URLs ending in .php, nothing more, and you can use these URLs with any language, e.g. Python. Even the code in get_users.py uses a URL ending in .php, and it doesn't use any PHP code for this.
You only have to add &format=json to get the data as JSON instead of HTML.
I don't know which URL you need for your data, but you can use it as a plain string:
import requests

# fetch the rendered HTML of the "Pet door" article via action=parse
r = requests.get("https://en.wikipedia.org/w/api.php?action=parse&page=Pet_door&prop=text&formatversion=2&format=json")
data = r.json()
print(data['parse']['text'])
Or you can write the params as a dictionary, like in get_users.py; this is more readable and makes it easier to change a param:
import requests

params = {
    'action': 'parse',
    # 'page': 'Pet_door',
    'page': 'USER:Catrope',
    # 'prop': 'text',
    'prop': 'wikitext',
    'formatversion': 2,
    'format': 'json'
}

r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
data = r.json()

#print(data.keys())
#print(data)
#print('---')
#print(data['parse'].keys())
#print(data['parse'])
#print('---')
#print(data['parse']['text'])  # if you use param 'prop': 'text'
#print('---')

print(data['parse']['wikitext'])  # if you use param 'prop': 'wikitext'
print('---')
# print all non-empty lines
for line in data['parse']['wikitext'].split('\n'):
    line = line.strip()  # remove surrounding spaces
    if line:  # skip empty lines
        print('--- line ---')
        print(line)

print('---')

# get the first line of text (with "I'm not usually active on English Wikipedia. Please refer...")
print(data['parse']['wikitext'].split('\n')[0])
Because 'prop': 'text' returns HTML, you would need lxml or BeautifulSoup to search for information in the HTML. With 'prop': 'wikitext' it gives the text without HTML tags, so it was easier to use split('\n')[0] to get the first line with
I'm not usually active on English Wikipedia. Please refer to my [[mw:User:Catrope|user page]] at [[mw:|MediaWiki.org]].
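If you do want to work with the HTML from 'prop': 'text', here is a minimal sketch with BeautifulSoup (my addition, not from the original answer; picking the first <p> tag is an assumption about where the statement sits on the rendered page):

import requests
from bs4 import BeautifulSoup

params = {
    'action': 'parse',
    'page': 'USER:Catrope',
    'prop': 'text',  # returns rendered HTML instead of wikitext
    'formatversion': 2,
    'format': 'json'
}

r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
html = r.json()['parse']['text']

soup = BeautifulSoup(html, 'html.parser')
first_p = soup.find('p')  # assumption: the statement is in the first paragraph
if first_p:
    print(first_p.get_text(strip=True))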
EDIT: action=parse doesn't have a method to get all pages in one request, so you have to use a for loop with 'page': 'USER:{}'.format(name):
import requests

for name in ['Catrope', 'Barek']:
    print('name:', name)

    params = {
        'action': 'parse',
        'page': 'USER:{}'.format(name),  # build the page title
        # 'prop': 'text',
        'prop': 'wikitext',
        'formatversion': 2,
        'format': 'json'
    }

    r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
    data = r.json()

    #print(data['parse']['text'])
    print(data['parse']['wikitext'])
    print('---')
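One caveat (my addition, not part of the original answer): if a user page doesn't exist, action=parse returns an "error" object instead of a "parse" key, so data['parse'] would raise a KeyError. A minimal guard inside the loop could look like:

    # sketch of a guard for missing pages; the API reports code "missingtitle"
    if 'parse' in data:
        print(data['parse']['wikitext'])
    else:
        print('no page for', name, '-', data.get('error', {}).get('info'))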
EDIT: For a query with prop=revisions you have to use full titles:
"titles": "User:Catrope|User:Bob|User:Barek",
But not all titles give results, so you have to check whether there is a "revisions" key in each page's data:
import requests

S = requests.Session()

URL = "https://www.mediawiki.org/w/api.php"

PARAMS = {
    "action": "query",
    "prop": "revisions",
    "titles": "User:Catrope|User:Bob|User:Barek",
    "rvprop": "timestamp|user|comment|content",
    "rvslots": "main",
    "formatversion": "2",
    "format": "json"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

PAGES = DATA["query"]["pages"]

for page in PAGES:
    if "revisions" in page:
        for rev in page["revisions"]:
            print(rev['slots']['main']['content'])
    else:
        print(page)
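To get back to your original goal, here is a small follow-up sketch (my addition, assuming the PAGES list from the snippet above) that maps each title to the first non-empty line of its latest revision, i.e. the personal statement:

statements = {}

for page in PAGES:
    if "revisions" in page:
        content = page["revisions"][0]["slots"]["main"]["content"]
        # first non-empty line of the wikitext, as with split('\n')[0] above
        lines = [line.strip() for line in content.split('\n') if line.strip()]
        statements[page["title"]] = lines[0] if lines else ''

print(statements)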