text, filtering, classification, corpus

List of proper names?


I'm trying to filter names out of text blobs. Currently I'm just generating a word list and filtering it by hand, but I've got ~8k words to go, so I'm looking for a better way. I could grab a dictionary and drop every word that appears in it, but that would also cull names like smith and cliff, which are ordinary words too.

What I need is either of the following:

  • a list of common names (I'd need at least the 5k most common)
  • a list of names that also happen to be words

I figure between them, I can do a combined blacklist/whitelist to get what I need.
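
For illustration, the combined blacklist/whitelist idea could look something like the sketch below. Everything in it is an assumption for the sake of the example: the function name `split_candidates`, the tiny sample sets, and the decision to route words found in both lists to a manual-review bucket rather than deciding automatically.

```python
import re

def split_candidates(words, known_names, dictionary_words):
    """Sort words into names, non-names, and ambiguous ones.

    known_names and dictionary_words are lowercase sets. A word found in
    both (e.g. "smith", "cliff") is set aside for manual review.
    """
    names, non_names, ambiguous = [], [], []
    for word in words:
        key = word.lower()
        is_name = key in known_names
        is_word = key in dictionary_words
        if is_name and is_word:
            ambiguous.append(word)   # name that is also a dictionary word
        elif is_name:
            names.append(word)       # only on the name list
        else:
            non_names.append(word)   # not on the name list at all
    return names, non_names, ambiguous

# Hypothetical sample data; real lists would be loaded from files.
known_names = {"smith", "cliff", "alice", "bob"}
dictionary_words = {"smith", "cliff", "the", "walked", "home"}

text = "Alice Smith walked home past the cliff with Bob"
words = re.findall(r"[A-Za-z']+", text)
names, non_names, ambiguous = split_candidates(words, known_names, dictionary_words)
```

With this split, only the ambiguous bucket still needs checking by hand, which should be far smaller than the full word list.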


Solution

  • US Census name list: http://www.census.gov/genealogy/www/

    That should get you one angle on the problem, anyway.

    Edited: changed the URL, per the comment below about the page moving. Does nobody believe in HTTP 302 anymore?
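
A rough sketch of turning such a name list into a lookup set follows. It assumes, without having checked the actual download, that the file is plain text with one name per line in the first whitespace-separated column, sorted from most to least common. The function name `load_top_names` and the sample rows (including their numbers) are made up for illustration.

```python
import io

def load_top_names(lines, limit=5000):
    """Return a lowercase set of the first `limit` names.

    Assumes each non-blank line starts with the name and that the
    lines are already ordered from most to least common.
    """
    names = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        names.append(fields[0].lower())   # first column is the name
        if len(names) >= limit:
            break
    return set(names)

# Invented sample rows standing in for a real file; the numbers are placeholders.
sample = io.StringIO(
    "SMITH 1.0 1.0 1\n"
    "JOHNSON 0.8 1.8 2\n"
    "\n"
    "WILLIAMS 0.7 2.5 3\n"
)
top_two = load_top_names(sample, limit=2)
```

For a real file, pass an open file object instead of the `io.StringIO` sample, and adjust the column handling if the layout turns out to be different.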