I'm trying to fetch public user information from Wikipedia using the API (with the script get_pages_revisions.py). After fetching the revisions, I used BeautifulSoup to strip all the HTML tags. However, the remaining text is still quite messy.
For example, when I fetched the text of the page User:(aeropagitica), the result looked like this (a small part of it):
{{administrator}}
{{divbox|gray||Wikipedia is currently working on {{NUMBEROFARTICLES}} articles. The local time at the Wikipedia servers is '''{{CURRENTTIME}}''' on {{CURRENTDAYNAME}} {{CURRENTDAY}} {{CURRENTMONTHNAME}}, {{CURRENTYEAR}}.}}
• '''[[:WP:AIV|AIV]]''' •
'''[[Wikipedia:Articles for deletion/Log/{{CURRENTYEAR}} {{CURRENTMONTHNAME}} {{CURRENTDAY}}|AfD]]''' • '''[[User:(aeropagitica)/RFA summary|RfA]]''' • '''[[:Category:Candidates for speedy deletion|CSD]]''' • '''[[Wikipedia:Template messages|tpl]]''' • '''[[Wikipedia:Template_messages/User_talk_namespace|user talk tpl]]''' • '''[[Special:Newpages|new]]''' • '''[[Wikipedia:Stubs|stubs]]''' • '''[[Wikipedia:Copyright problems|(c)]]''' • '''[[Wikipedia:Manual of Style|MoS]]''' • '''[[User:Interiot/Tool2|edits (interiot)]]''' • '''[[Wikipedia:Proposed_deletion|prod]]''' • '''[[Special:Log/Newusers|newusers]]''' • '''[http://tools.wikimedia.de/~essjay/edit_count/Count.php? PHP interiot's tool]''' • '''[http://tools.wikimedia.de/~interiot/cgi-bin/Tool1/wannabe_kate Interiot's tool 1]''' • '''[[:Wikipedia:Article Creation and Improvement Drive|Article Improvement]]'''
{{purge|Purge server cache}}
I was [[Wikipedia:Requests_for_adminship/%28aeropagitica%29|nominated for adminship]] by [[User:King of Hearts|King of Hearts]] on February 27th 2006. The vote achieved consensus and I was accepted for the role with a score of '''40/10/5''' on March 7th 2006.
When I am not working on Wikipedia pages, I enjoy learning to play acoustic fingerstyle guitar, photography, learning languages (Spanish and French) and travel.
''Userboxes''
{| style="text-align:center; border: 1px solid #000000; background-color:#00cc99; width:100%; -moz-border-radius: 15px;"
|- padding:5em;padding-top:0.5em;"
|{{user en}}
May I ask what to do about style="....", cellpadding="...." and similar strings here? Can I remove all formatting strings like these at once?
{{Userbox|#77E0E8|#D0F8FF|{{CURRENTDAY}}|It is currently a [[{{CURRENTDAYNAME}}]]. I don't like {{CURRENTDAYNAME}}s.}}
The information after "It is .." is what we need, but the text before it, Userbox|#77E0E8, is also used for the web layout definition and should be removed. Is there any way to remove the first half of this line?
(Userbox is just one kind of these; there are many other types such as User: and Category:, so it would be quite hard to remove them with custom re rules.)
(I'm a beginner with BeautifulSoup and web parsing, so any suggestions or hints would be valuable. Thank you in advance for your help!)
You're using the Revisions API, which only returns the page content as wikitext. That's the "messy" text you're seeing.
You can instead use the Parse API to get the rendered HTML of the page, which you can then feed into a local DOM parser of your choice, or simply strip the HTML tags if that works for you.
See the MediaWiki API documentation for details, including examples on how to request the parsed contents of a page.
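As a rough illustration of the Parse API approach, here is a minimal sketch using only the Python standard library. The helper names (build_parse_url, html_to_text, fetch_page_text) and the User-Agent string are made up for this example, and it assumes the English Wikipedia endpoint with formatversion=2, where the rendered HTML is returned as a plain string under parse.text. Treat it as a starting point, not a finished scraper.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Assumption: English Wikipedia; other wikis use a different host.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title):
    """Build a Parse API request URL for the rendered HTML of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",        # only the rendered HTML
        "format": "json",
        "formatversion": "2",  # HTML comes back as a plain string
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Strip tags from an HTML string and return the visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_page_text(title):
    """Request the parsed page and return its text (needs network access)."""
    request = urllib.request.Request(
        build_parse_url(title),
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "example-script/0.1 (contact: you@example.com)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return html_to_text(data["parse"]["text"])
```

Calling fetch_page_text("User:(aeropagitica)") should then give the text of the rendered page, with templates such as {{Userbox|...}} already expanded by the server. If you prefer to keep BeautifulSoup, you can pass the returned HTML string to BeautifulSoup(html, "html.parser").get_text() instead of html_to_text.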