python regex text-mining information-extraction

Information extraction with Python using a huge list of entity names


I have a large collection of multilingual HTML files from which I'd like to extract structured data. I also have a huge list (5M+) of entity names that occur in the corpus (multi-word names of persons, organisations, places, ...), which could help.

I'm looking for a Python library that can quickly tag text with entity names (and perhaps, though not necessarily, perform other tasks such as POS tagging and elementary NER). The result should be searchable with simple regex-like expressions augmented with tags, for example: ".+? [last_name] (is|was)( best)? CEO of [organisation_name]".

I've tried to find this functionality in NLTK and CLiPS Pattern (pattern.search is similar) but failed. The closest open-source library with such functionality is GATE, but it is written in Java and seems like overkill for this task.

Thanks,

Davor


Solution

  • You can try htql.RegEx from http://htql.net. Here is the example from the website:

    import htql

    address = '88-21 64th st , Rego Park , New York 11374'
    states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
        'Delaware', 'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
        'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
        'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
        'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
        'Oregon', 'PALAU', 'Pennsylvania', 'PUERTO RICO', 'Rhode Island', 'South Carolina', 'South Dakota',
        'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin',
        'Wyoming']

    # Register the list of names under the set name 'states'.
    a = htql.RegEx()
    a.setNameSet('states', states)

    # Search a plain string: a name from the 'states' set, then separators, then a 5-digit ZIP.
    # A raw string keeps \s and \d from being treated as string escapes.
    state_zip1 = a.reSearchStr(address, r"&[s:states][,\s]+\d{5}", case=False)[0]
    # state_zip1 = 'New York 11374'

    # Search a list of tokens: judging by the result, each <...> matches one list element.
    state_zip2 = a.reSearchList(address.split(), r"&[ws:states]<,>?<\d{5}>", case=False)[0]
    # state_zip2 = ['New', 'York', '11374']
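
If installing htql is not an option, the same idea can be approximated with the standard library alone. The sketch below is not htql's implementation and not a full NER system: it assumes whitespace tokenisation and case-insensitive matching, and the names `build_trie`, `tag_tokens` and `names_by_tag` are hypothetical helpers made up for illustration. It builds a token-level trie from the name lists, replaces the longest matching name at each position with a `[tag]` token, and then runs an ordinary `re` pattern (with the brackets escaped) over the tagged text.

```python
import re


def build_trie(names_by_tag):
    """Build a token-level trie; each node maps a lowercased token to a child node."""
    root = {}
    for tag, names in names_by_tag.items():
        for name in names:
            node = root
            for token in name.lower().split():
                node = node.setdefault(token, {})
            node[None] = tag  # the None key marks the end of a complete name
    return root


def tag_tokens(tokens, trie):
    """Replace the longest entity name starting at each position with one [tag] token."""
    out = []
    i = 0
    while i < len(tokens):
        node = trie
        best_end, best_tag = None, None
        j = i
        # Walk the trie as far as the tokens allow, remembering the last complete name.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if None in node:
                best_end, best_tag = j, node[None]
        if best_tag is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append("[%s]" % best_tag)
            i = best_end
    return out


# Toy data standing in for the real multi-million-entry lists (illustrative only).
names_by_tag = {
    "last_name": ["Smith", "Jones"],
    "organisation_name": ["Acme Corp", "Acme Corp International"],
}
trie = build_trie(names_by_tag)

text = "John Smith was CEO of Acme Corp International until 2010"
tagged = " ".join(tag_tokens(text.split(), trie))

# The pattern from the question, with the tag brackets escaped for the re module.
pattern = re.compile(r".+? \[last_name\] (is|was)( best)? CEO of \[organisation_name\]")
match = pattern.search(tagged)
print(tagged)
print(match.group(0) if match else None)
```

Each token is looked up at most once per candidate name, so tagging cost depends on the text length and the longest name rather than on the size of the list. With 5M+ names a dict-of-dicts trie may use a lot of memory, and an Aho-Corasick automaton is a common alternative. This sketch also discards the original entity text; to recover the matched names, `tag_tokens` would need to return token spans alongside the tags.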