python string nlp named-entity-recognition

Is there a word shape feature library for NER in Python?


As a beginner in Python, I am trying to build my own named entity recognizer, and word shape features are known to be particularly important in NER. Are there any libraries where these features are defined? For example, one version of these features denotes lower-case letters by x, upper-case letters by X, and digits by d, and retains punctuation, so it maps DC10-30 to XXdd-dd and I.M.F to X.X.X.

So I am looking for a library that will improve my recognizer by applying these well-known features. If there is no such library, how can I extract the word shape of a word, like this:

wordshape("D-Day") = X-Xxx

Thanks in advance.


Solution

  • You can solve this problem with regular expressions (regex). Python's standard-library module for regex is re.

    The function below does what you want:

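    import re

    def wordshape(text):
        # Replace uppercase letters first, then lowercase letters, then digits.
        # Digits go last so the inserted 'd' is not itself rewritten to 'x'.
        # Punctuation and any other characters are left unchanged.
        t1 = re.sub(r'[A-Z]', 'X', text)
        t2 = re.sub(r'[a-z]', 'x', t1)
        return re.sub(r'[0-9]', 'd', t2)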
    
    >>> wordshape("DC10-30")
    'XXdd-dd'
    >>> wordshape("D-Day")
    'X-Xxx'
    >>> wordshape('I.M.F')
    'X.X.X'