c# parsing text

How can personal and place names be extracted from text using C#?


Is there any C# algorithm by which personal and place names can be extracted from text?

e.g., given the following text:

St. Mark died at Alexandria, in Egypt.  He was martyred, I think.
However, that has nothing to do with my legend.  About the founding of
the city of Venice--

(taken from "The Innocents Abroad" by Mark Twain)

...is there any way to extract:

St. Mark
Alexandria (or better yet, "Alexandria, Egypt")
Venice

?

I realize that there is no way to get 100% accuracy (where all place names and personal names are captured, and no "false positives" are added), but 80% accuracy could be very valuable.

I understand that each word could be compared with an encyclopedia or some such, but there must be a better way. Also, how could the algorithm know to combine "St." and "Mark" and to see "Alexandria, in Egypt" as "Alexandria, Egypt"?


Solution

  • You are best off using an API that can perform this kind of entity matching, because what you are asking is potentially very complex and requires some degree of semantic textual analysis backed by a large database. I'd recommend looking at APIs such as:

    OpenCalais - English Semantic Metadata: Entity/Fact/Event Definitions and Descriptions web-service

    Calais supports a rich set of semantic metadata, including entities, events and facts.

    Alchemy API - Entity Extraction API

    AlchemyAPI is capable of identifying people, companies, organizations, cities, geographic features, and other typed entities within your HTML, text, or web-based content. We employ sophisticated statistical algorithms and natural language processing technology to analyze your information, extracting the semantic richness embedded within.
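
To illustrate why a dedicated service is the better route, here is a minimal sketch of the naive approach hinted at in the question: treat runs of capitalized words, optionally preceded by a title such as "St.", as candidate names. The class name `NaiveNameExtractor`, the list of titles, and the stop list are all assumptions made up for this example and are not part of any library. It cannot tell people from places and will produce false positives on real text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not a real NER implementation.
public static class NaiveNameExtractor
{
    // An optional title ("St.", "Mr.", "Mrs.", "Dr.") followed by one or more
    // capitalized words separated by whitespace. The title list is an assumption.
    private static readonly Regex Candidate = new Regex(
        @"\b(?:(?:St|Mr|Mrs|Dr)\.\s+)?[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*");

    // Assumed stop list: common words that are capitalized only because
    // they start a sentence. A real system would need far more than this.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "He", "She", "It", "The", "However", "About"
    };

    // Returns the distinct candidates in order of first appearance.
    public static List<string> Extract(string text)
    {
        return Candidate.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(v => !StopWords.Contains(v))
            .Distinct()
            .ToList();
    }
}
```

Calling `NaiveNameExtractor.Extract(...)` on the Mark Twain passage from the question yields "St. Mark", "Alexandria", "Egypt" and "Venice". That handles joining "St." with "Mark", but it does not know that "Alexandria, in Egypt" describes a single place, and it only works here because the stop list happens to cover the sentence-initial words in this passage. Closing that gap is what the statistical models and large databases behind the services above are for.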