As a beginner in Python, I am trying to build my own named entity recognizer, and word shape features are known to be particularly important in NER. Are there any libraries where these features are already defined? For example, one version of these features replaces lower-case letters with x, upper-case letters with X, and digits with d, while retaining punctuation, so it maps DC10-30 to XXdd-dd and I.M.F to X.X.X.
I am looking for a library that would improve my recognizer by applying these well-known features. If there is no such library, how can I extract the word shape of a word myself, for example:
wordshape("D-Day") = X-Xxx
Thanks in advance.
You can solve this problem with regular expressions (regex). Python's standard-library module for regex is re.
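The function below does what you want:
import re

def wordshape(text):
    # Replace ASCII upper-case letters first, so the 'X' markers
    # are not touched by the later substitutions.
    t1 = re.sub('[A-Z]', 'X', text)
    # Then replace ASCII lower-case letters and digits.
    t2 = re.sub('[a-z]', 'x', t1)
    return re.sub('[0-9]', 'd', t2)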
>>> wordshape("DC10-30")
'XXdd-dd'
>>> wordshape("D-Day")
'X-Xxx'
>>> wordshape('I.M.F')
'X.X.X'
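The regex version above only matches ASCII letters and digits, so a word like "Élan" would keep its "É" unchanged. If that matters for your data, a single-pass variant based on the built-in str methods is one option. This is only a sketch: the name wordshape_unicode and the collapse parameter are my own inventions for illustration, not part of any library. With collapse=True, runs of the same shape symbol are squeezed into one, which gives a coarser feature that some NER feature sets use alongside the full shape.

```python
import re

def wordshape_unicode(text, collapse=False):
    """Hypothetical helper: map a token to its word shape in one pass."""
    chars = []
    for ch in text:
        if ch.isupper():
            chars.append("X")   # any upper-case letter, including non-ASCII
        elif ch.islower():
            chars.append("x")   # any lower-case letter, including non-ASCII
        elif ch.isdigit():
            chars.append("d")   # any character Python treats as a digit
        else:
            chars.append(ch)    # punctuation and everything else is kept
    shape = "".join(chars)
    if collapse:
        # Squeeze repeated shape symbols, e.g. 'XXdd-dd' becomes 'Xd-d'.
        shape = re.sub(r"([Xxd])\1+", r"\1", shape)
    return shape

print(wordshape_unicode("DC10-30"))
print(wordshape_unicode("DC10-30", collapse=True))
print(wordshape_unicode("Élan"))
```

For plain ASCII tokens the default mode should give the same shapes as the regex function. Characters that are neither upper-case, lower-case, nor digits (punctuation, or letters from scripts without case) pass through unchanged.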