machine-learning nlp artificial-intelligence text-mining

How to create a feature that detects age from text in different languages?


I have a text classification task in several languages. What approach should I use to create a feature that extracts age from text, given that the possible classes are 18-24, 25-34, 35-49 and 50-xx, and that my only corpus is tweets? I have already tried using all the tweets, but performance was very low (0.66). Any ideas on how to approach this task? Thanks in advance.


Solution

  • Since this is still a research problem, here are several links to scientific papers (the links and the summary below are mostly taken from the 'related work' section of our paper; unfortunately it is in Russian, so I lightly edited the Google translation).

    So, take a look at these works (marked by year): 2009, 2010, 2011, 2013, 2014.

    In summary: you should find or create tagged corpora and use supervised machine learning with the following features:

    1. text features: n-grams over words and characters,
    2. stylistic features: parts of speech, slang, the average sentence length, punctuation, acronyms, emoticons, etc.
    3. social network features: the number of friends a user has, the number of posts displayed on the user's page, the user's total number of posts, the average number of comments per post by the user.
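
As a rough illustration of the first two feature groups, here is a minimal sketch in plain Python. It is not the method from any of the papers above. The function names, the emoticon pattern, and the choice of character 3-grams and word 2-grams are all assumptions made for the example. Social network features are left out because they come from account metadata rather than from the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately small emoticon pattern; a real system
# would likely use a fuller lexicon.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
WORD_RE = re.compile(r"\w+")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (works for any language)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams over a simple \\w+ tokenization."""
    words = WORD_RE.findall(text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def stylistic_features(text):
    """A few stylistic signals: sentence length, punctuation, emoticons, caps."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD_RE.findall(text)
    return {
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_count": sum(text.count(c) for c in ".,!?;:"),
        "emoticon_count": len(EMOTICON_RE.findall(text)),
        "all_caps_tokens": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }


def extract_features(text):
    """Merge n-gram counts and stylistic features into one feature dict."""
    features = {}
    for gram, count in char_ngrams(text).items():
        features["char3=" + gram] = count
    for gram, count in word_ngrams(text).items():
        features["word2=" + gram] = count
    features.update(stylistic_features(text))
    return features


if __name__ == "__main__":
    feats = extract_features("LOL so tired :) See you tomorrow!")
    print(feats["emoticon_count"], feats["avg_sentence_len"])
```

Feature dicts of this shape can be turned into vectors and passed to any supervised classifier trained on an age-tagged corpus. Character n-grams are a reasonable starting point for a multilingual setting because they need no language-specific tokenizer.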