python, nlp, data-analysis, reddit

I have parsed Reddit posts using their API. How can I extract only the questions from these posts using NLTK?


I want only the questions from these posts, so that I can analyze which topics are asked about the most. Based on that analysis, I will create podcasts on only those topics. For example, suppose I want to know which stock market topics people are asking about on Reddit. From a relevant subreddit I want to extract questions like "What is an ETF?", and then create a podcast that answers them. I want to extract such questions from the posts using NLTK. How can I do that?

Sample: I am fetching JSON from the Reddit API and extracting the post titles from it. Now I want to know which of these titles are interrogative. A regex is a good option, but people sometimes write things like "what a lovely day", which starts with a question word without being a question, so the conditions fail there. Can you suggest a more appropriate method?


Solution

  • I work a lot on text clustering, classification and so on, and can give several pieces of advice:

    1. Use a regexp and check for keywords such as How, What, Where (as Gaurav Taneja suggested in the comments). It is a good start, and you can improve the method manually by adding specific conditions. For example: the question keyword must be the first word of the sentence (or the second, as in "And how can I...?"). A ? must be at the end of the sentence, although this will not always hold: someone may simply skip punctuation, or split the question across two sentences ("I want to classify text. How?"). You can also skip very short questions (those consisting of 2 words).

    2. Another interesting option is morphological analysis (for example, part-of-speech tagging). The idea is that we need well-formed questions in order to get their topics. So a question must contain not only a question keyword and a ? symbol but also additional nouns, which we can pick out and try to classify (there are a lot of methods for that, but it is a separate question). Questions without such nouns are general questions without a specific topic. See more info here.

    3. One more interesting way: label a first sample of questions manually and train a classifier to find other questions in the corpus automatically. You can find a simple example here (section 2.2). There are some pitfalls: for example, if the labeled sample contains no examples of a certain (specific) type of question, the classifier will not find questions of that type. So it is useful to skim the corpus for new question types and add them to the labeled sample.
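
The keyword and punctuation checks from point 1 can be sketched roughly as follows. This is only a minimal illustration using the standard `re` module: the keyword set, the function names and the `min_words` / `require_qmark` parameters are my own assumptions, not something fixed by NLTK or by the Reddit API, and the rules are deliberately naive.

```python
import re

# Assumed keyword list for illustration; extend it for your own corpus.
QUESTION_WORDS = {
    "what", "how", "where", "when", "why", "who", "which",
    "can", "could", "should", "would", "is", "are", "do", "does", "did",
}


def looks_like_question(sentence, min_words=3, require_qmark=True):
    """Heuristic check: keyword in the first two words, '?' at the end, not too short."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    if len(words) < min_words:
        return False  # skip very short questions such as "How so?"
    # The keyword may be first, or second as in "And how can I ...?"
    has_keyword = any(word in QUESTION_WORDS for word in words[:2])
    if not has_keyword:
        return False
    if require_qmark and not sentence.rstrip().endswith("?"):
        return False  # rejects "what a lovely day"
    return True


def extract_questions(title):
    """Split a title into sentences and keep those that look like questions."""
    sentences = re.split(r"(?<=[.?!])\s+", title.strip())
    return [s for s in sentences if looks_like_question(s)]


if __name__ == "__main__":
    titles = [
        "What is an ETF?",
        "what a lovely day",
        "I bought my first stock today. How do index funds work?",
    ]
    for title in titles:
        print(title, "->", extract_questions(title))
```

Setting `require_qmark=False` catches questions where the author skipped the punctuation, at the price of also accepting "what a lovely day", which is exactly the trade-off described above.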
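
For point 3, a small sketch of the classifier idea with `nltk.NaiveBayesClassifier`. The labeled titles below are invented purely for illustration, and the feature function `title_features` is a hypothetical choice of mine; in practice you would label a few hundred real titles and experiment with the features. The code uses a plain regex instead of an NLTK tokenizer, so it should not need any extra NLTK data downloads.

```python
import re

import nltk

# Tiny hand-labeled sample, invented for illustration only.
TRAIN = [
    ("What is an ETF?", "question"),
    ("How do index funds work?", "question"),
    ("Is now a good time to buy bonds?", "question"),
    ("Should I sell my shares", "question"),
    ("What a lovely day", "statement"),
    ("I bought my first stock today", "statement"),
    ("The market closed higher.", "statement"),
    ("How time flies", "statement"),
]


def title_features(title):
    """Turn a title into a feature dict (hypothetical feature choice)."""
    words = re.findall(r"[a-z']+", title.lower())
    return {
        "first_word": words[0] if words else "",
        "second_word": words[1] if len(words) > 1 else "",
        "ends_with_qmark": title.rstrip().endswith("?"),
    }


def train_question_classifier(labeled_titles):
    """Train a Naive Bayes classifier on (title, label) pairs."""
    featuresets = [(title_features(t), label) for t, label in labeled_titles]
    return nltk.NaiveBayesClassifier.train(featuresets)


classifier = train_question_classifier(TRAIN)

if __name__ == "__main__":
    for title in ["What is a dividend?", "I sold my bonds today"]:
        print(title, "->", classifier.classify(title_features(title)))
```

As noted above, the classifier only knows the question types present in the labeled sample, so it is worth checking its mistakes on real titles and adding the missed types to the training data.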