I'm trying to filter names out of text blobs. Currently I'm just generating a word list and filtering it by hand, but I've got ~8k words to go, so I'm looking for a better way. I could grab a dictionary and filter out every word that appears in it, but that would also cull names like smith and cliff, which are ordinary words too.
What I need is either of the following:
I figure between them, I can do a combined blacklist/whitelist to get what I need.
US Census name list: http://www.census.gov/genealogy/www/
That should get you one angle on the problem, anyway.
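As a rough sketch of the combined blacklist/whitelist idea, assuming you have already loaded the name list and a dictionary word list into lowercase Python sets (the function name `split_names` and the tiny sample sets here are made up for illustration, not taken from the Census files), something like this would leave only the overlapping words for manual review:

```python
import re

def split_names(text, names, common_words):
    """Partition the words in text into likely names, ambiguous words, and everything else.

    names and common_words are assumed to be sets of lowercase strings.
    """
    likely, ambiguous, other = set(), set(), set()
    for word in re.findall(r"[A-Za-z']+", text):
        key = word.lower()
        if key in names and key in common_words:
            ambiguous.add(key)   # e.g. "smith", "cliff": needs a human decision
        elif key in names:
            likely.add(key)      # on the name list only
        else:
            other.add(key)       # not on the name list at all
    return likely, ambiguous, other

# Hypothetical sample data, standing in for the real name list and dictionary.
names = {"smith", "cliff", "garcia"}
common_words = {"smith", "cliff", "the", "walked", "to", "with"}

likely, ambiguous, other = split_names(
    "Garcia walked to the cliff with Smith", names, common_words
)
print(sorted(likely), sorted(ambiguous), sorted(other))
```

The point of the three-way split is that only the ambiguous bucket still needs hand filtering, which should be much smaller than the full ~8k words.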
Edited: changed the URL, per the comment below about the page moving. Nobody believes in HTTP 302 anymore?