I'm trying to clean a text so that it keeps only letters, numbers, and the most common punctuation marks. For example, I sometimes have '''words''' or ''words'', and I want to strip those runs of single quotes. So far I've chosen to use two regexes:
import re
tqre=re.compile('\'\'\'[^\']*\'\'\'') #for triple quotes
dqre=re.compile('\'\'[^\']*\'\'') #for "double" quotes
Then I strip the quotes from each match:
res1=tqre.sub(self.quoteExtract,text)
res2=dqre.sub(self.quoteExtract,res1)
where:
def quoteExtract(self, match):
    # Return the matched text with its surrounding single quotes removed
    return match.group().strip("'")
It seems to work well for triple quotes, but many doubled single quotes still get through; they don't seem to be caught. Is it because they are not really single quotes but some lookalike character? Is there another way to handle them?
Ex: In * ''Esquisse d'une grammaire comparée de l'arménien classique'', 1903.
the regex does not match.
It doesn't match because there is a ' between the double quotes (in l'arménien, and also in d'une), but your pattern uses [^']*, which cannot cross a single quote.
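A quick check against the example line from the question illustrates this. It is only a sketch: the variable names `line`, `failed` and `ok` are made up for the illustration, and the pattern is the question's double-quote regex rewritten with a double-quoted string literal.

```python
import re

# Pattern from the question: two quotes, a run of non-quote characters, two quotes.
dqre = re.compile("''[^']*''")

line = "In * ''Esquisse d'une grammaire comparée de l'arménien classique'', 1903."

# [^']* stops at the apostrophe in "d'une", so the closing '' is never reached
# and search() returns None.
failed = dqre.search(line)

# The same pattern does match when the quoted text contains no apostrophe.
ok = dqre.search("a ''simple'' title")
```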
This kind of regex is best expressed using the lazy quantifier:
tqre = re.compile("'''.*?'''")
dqre = re.compile("''.*?''")
Here .*? means: match any string, and when several matches are possible, choose the shortest one.
Piece by piece: . = any character except a newline, * = zero or more, ? after the star = non-greedy match.
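Putting it together, here is a minimal sketch of the whole clean-up with the lazy patterns. Two things in it are my own assumptions rather than part of the code above: the helper name `strip_quote_markup` is hypothetical, and it uses a capture group with a `\1` replacement instead of the `quoteExtract` callback, so that only the wrapping quotes are removed.

```python
import re

# Lazy patterns; the capture group keeps the text between the quote runs.
tqre = re.compile("'''(.*?)'''")
dqre = re.compile("''(.*?)''")

def strip_quote_markup(text):
    """Remove ''' and '' wrappers while keeping the inner text."""
    # Handle triple quotes first, as in the question, then double quotes.
    text = tqre.sub(r"\1", text)
    return dqre.sub(r"\1", text)

line = "In * ''Esquisse d'une grammaire comparée de l'arménien classique'', 1903."
cleaned = strip_quote_markup(line)
print(cleaned)
```

Because . does not match a newline by default, a quoted span that runs across several lines would not be caught by these patterns unless they are compiled with re.DOTALL.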