nlp, nltk

Identifying the gender and nationality of a list of names?


I have a list of names that I've extracted from articles, and I'm trying to guess demographic information about them (gender and nationality).

The list looks like:

Šefik Džaferović
Miloš Zeman
Abdel Fattah el-Sisi
სალომე ზურაბიშვილი
Michael D. Higgins
Maia Sandu
محمد السادس
Стево Пендаровски

with each list item including at least a first and second name.

Any advice on where to start?


Solution

  • You could get lists of names from different countries; most countries keep records of their most common first names, for example.

    Once you have that data, you can build a mapping from each name to the countries it is used in. The mapping should be probabilistic, since many names (probably most) occur in several countries but are more common in some than in others. For example, many names of Turkish origin are used in Germany because of the sizable Turkish communities living there.

    When you then get a name, you can look it up in that map and get a likelihood for each nationality. Keeping separate maps for first and last names may make the result more precise, but be aware that there is no absolute certainty.

    With gender it would work the same way (helpfully, many lists of baby names are split by gender), but some names are ambiguous (Alex, Jan, Sam, Leslie, ...).
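
The lookup described above could be sketched roughly as follows. This is only an illustration: the count tables, the country codes, the `floor` value, and the function names are made up for the example, and real counts would have to come from whatever name records you manage to collect. Multiplying the first-name and last-name likelihoods is a simple assumption (it treats the two as independent), not the only way to combine them.

```python
# Hypothetical toy counts: how often each name appears in each country's
# records. These numbers are invented purely for illustration.
FIRST_NAME_COUNTS = {
    "maia": {"MD": 120, "GE": 300, "RO": 50},
    "michael": {"IE": 900, "US": 5000, "DE": 2100},
}
LAST_NAME_COUNTS = {
    "sandu": {"MD": 400, "RO": 600},
    "higgins": {"IE": 700, "US": 1300},
}
# Hypothetical counts of first names by recorded gender.
GENDER_COUNTS = {
    "maia": {"F": 500, "M": 2},
    "alex": {"F": 300, "M": 700},
}


def normalize(counts):
    """Turn raw counts into likelihoods that sum to 1 (empty if no data)."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


def country_likelihoods(first, last, floor=0.01):
    """Estimate a likelihood per country from separate first/last name maps."""
    first_p = normalize(FIRST_NAME_COUNTS.get(first.lower(), {}))
    last_p = normalize(LAST_NAME_COUNTS.get(last.lower(), {}))
    if not first_p or not last_p:
        # Only one part of the name is known (or neither): use what we have.
        return first_p or last_p
    countries = set(first_p) | set(last_p)
    # A country missing from one map gets a small floor instead of zero,
    # so a single missing record does not rule the country out entirely.
    combined = {
        c: first_p.get(c, floor) * last_p.get(c, floor) for c in countries
    }
    return normalize(combined)


def gender_likelihoods(first):
    """Return a likelihood per gender; ambiguous names stay ambiguous."""
    return normalize(GENDER_COUNTS.get(first.lower(), {}))


if __name__ == "__main__":
    print(country_likelihoods("Maia", "Sandu"))
    print(gender_likelihoods("Alex"))
```

Returning the whole distribution rather than a single label keeps the uncertainty visible, so names like Alex can be flagged as ambiguous instead of being forced into one category.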