python, wikipedia-api, pywikibot

What regex matches any English word in pywikibot's find-and-replace mode?


I wrote the following program to delink the English words in a ta.wikipedia page. Delinking means removing the square brackets before and after the English words. I am new to PAWS (pywikibot). It seems that the removal can be done with a regex (A-Z, a-z). How?

    import pywikibot
    import re

    site = pywikibot.Site('ta', 'wikipedia')
    page = pywikibot.Page(site, title)
    page.text = page.text.replace('[[Eudicots]]','Eudicots')
    page.save()

Sorry for my English; it is a bridge language for me. I am not asking for debugging help, but for a way to avoid the following kind of repeated code. For example, the following 26 lines (one per letter of the alphabet) remove the [[ brackets.

    page.text = page.text.replace('[[A','A')
    page.text = page.text.replace('[[B','B')
    page.text = page.text.replace('[[C','C')
    # (likewise for every letter from A to Z)
    page.text = page.text.replace('[[X','X')
    page.text = page.text.replace('[[Y','Y')
    page.text = page.text.replace('[[Z','Z')

Then I have to remove the ]] brackets that follow the last letter of each word. Every word ends in a lowercase letter, so to remove them I have to write the following code:

    page.text = page.text.replace('a]]','a')
    page.text = page.text.replace('b]]','b')
    page.text = page.text.replace('c]]','c')
    page.text = page.text.replace('d]]','d')
    # (likewise for all 26 English letters)
    page.text = page.text.replace('x]]','x')
    page.text = page.text.replace('y]]','y')

I think this is not good coding, so I want to use a regex. I hope I have explained what the Wikimedia project needs.

In other words, I want to remove only the brackets around the English words, not the English words themselves.


Solution

  • Some PCRE-compatible regular expression libraries can match characters by their Unicode properties (e.g. \p{Latin} matches any character of the Latin script), but Python's re module cannot. There are other Python modules you could use instead (this answer has the details), but as long as you are only looking for ASCII letters it is easier to build your own character class: [A-Za-z] matches a single character within those ranges, and re.sub('([A-Za-z])]]', '\\1', text) keeps that character and discards the closing brackets. The opening brackets can be handled the same way with re.sub('\\[\\[([A-Za-z])', '\\1', text).
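
The character-class approach can be sketched with the standard re module alone. The helper names delink_edges and delink_whole are hypothetical, chosen only for this illustration. The second pattern is an assumption about what is wanted: it delinks only links made up entirely of ASCII letters and spaces, which may or may not fit every page.

```python
import re

# Keep the ASCII letter that follows "[[" and drop the brackets.
OPENING = re.compile(r'\[\[([A-Za-z])')
# Keep the ASCII letter that precedes "]]" and drop the brackets.
CLOSING = re.compile(r'([A-Za-z])\]\]')
# Stricter variant (an assumption, not part of the answer above):
# match a whole link that contains only ASCII letters and spaces.
WHOLE_LINK = re.compile(r'\[\[([A-Za-z][A-Za-z ]*)\]\]')


def delink_edges(text):
    """Replace the 26 + 26 str.replace calls with two substitutions."""
    text = OPENING.sub(r'\1', text)
    return CLOSING.sub(r'\1', text)


def delink_whole(text):
    """Delink only links whose entire content is ASCII letters and spaces."""
    return WHOLE_LINK.sub(r'\1', text)


if __name__ == '__main__':
    sample = '[[Eudicots]] மற்றும் [[தாவரம்]]'
    print(delink_edges(sample))
    print(delink_whole(sample))
```

Either function could be applied to page.text before page.save() in the program from the question. The two differ on piped links such as [[Rosa|ரோசா]]: delink_edges strips only the opening brackets and leaves an unbalanced ]], while delink_whole leaves the link untouched.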