
How to automatically determine text quality?


A lot of Natural Language Processing (NLP) algorithms and libraries have a hard time working with random texts from the web, usually because they presuppose clean, articulate writing. I can understand why that would be easier than parsing YouTube comments.

My question is: given a random piece of text, is there a process to determine whether it is well written and a good candidate for use in NLP? What is the general name for this class of algorithms?

I would appreciate links to articles, algorithms or code libraries, but I would settle for good search terms.


Solution

  • 'Well written' and 'good for NLP' may go together, but don't have to. For a text to be 'good for NLP', it should probably contain whole sentences with a verb and a full stop at the end, and it should convey some meaning. For a text to be well written, it should also be well-structured, cohesive, and coherent, and it should use pronouns in place of nouns correctly, etc. What you need depends on your application.

    The chances that a sentence will be properly processed by an NLP tool can often be estimated with a few simple heuristics: Is it too long (>20 or 30 words, depending on the language)? Too short? Does it contain many weird characters? Does it contain URLs or email addresses? Does it have a main verb? Is it just a list of something? To my knowledge there is no general name for this kind of filtering, nor any particular algorithm for it - it usually just falls under 'preprocessing'. (A minimal filter along these lines is sketched at the end of this answer.)

    As to a sentence being well written: some work has been done on automatically evaluating readability, cohesion, and coherence, e.g. the articles by Miltsakaki ("Evaluation of text coherence for electronic essay scoring systems" and "Real-time web text classification and analysis of reading difficulty") or Higgins ("Evaluating multiple aspects of coherence in student essays"). These approaches are all based on one theory of discourse structure or another, such as Centering Theory. The articles are rather theory-heavy and assume knowledge of both centering theory and machine learning. Nonetheless, some of these techniques have been successfully applied by ETS to automatically score students' essays, and I think that is quite similar to what you are trying to do; at the least, you may be able to adapt a few ideas. (A simple readability baseline is also sketched below.)

    All this being said, I believe that within the next few years NLP will have to develop techniques to process language which is not well-formed by current standards. There is a massive amount of extremely valuable data out there on the web, consisting of exactly the kinds of text you mention: YouTube comments, chat messages, Twitter and Facebook status updates, etc. All of them potentially contain very interesting information. So, who should adapt - the people writing that way, or NLP?
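
Here is a minimal sketch of the kind of heuristic filter described above, in Python. The thresholds (word counts, the "weird character" ratio) and the regexes are illustrative assumptions, not established values; tune them for your language and corpus. A main-verb check would additionally need a POS tagger (e.g. NLTK or spaCy) and is left out here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def looks_processable(sentence, min_words=3, max_words=30, max_weird_ratio=0.1):
    """Rough heuristic: is this sentence a reasonable candidate for NLP?"""
    words = sentence.split()

    # Too short or too long for a typical parser to handle well.
    if not (min_words <= len(words) <= max_words):
        return False

    # URLs and email addresses usually signal non-prose content.
    if URL_RE.search(sentence) or EMAIL_RE.search(sentence):
        return False

    # Too high a proportion of "weird" characters (neither letters,
    # digits, whitespace, nor ordinary punctuation).
    normal = ".,;:!?'\"()-"
    weird = sum(1 for c in sentence
                if not (c.isalnum() or c.isspace() or c in normal))
    if weird / len(sentence) > max_weird_ratio:
        return False

    return True

for s in ["The cat sat quietly on the mat.",
          "lol ##### visit www.example.com NOW!!!"]:
    print(looks_processable(s), "-", s)
```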
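
And, on the readability side, a crude surface-level baseline: the classic Flesch Reading Ease score. This is far simpler than the discourse-based approaches in the papers cited above, and the vowel-group syllable counter below is an approximation I am assuming for illustration (not a proper syllabifier), but it can serve as a quick first signal.

```python
import re

def count_syllables(word):
    # Approximation: count runs of consecutive vowels as syllables.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher means easier to read
    (roughly 0-100 for ordinary English prose)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat. It purred."))
```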