Search code examples
sqlselectgroup-bygoogle-bigquery

Querying Stackoverflow public dataset on BigQuery on Q&A SQL


i have a home work assignment : We want to find all questions about the python pandas library, as well as their answers. Write a query that retrieves all the questions for which the title contains the word "pandas" from the posts_questions table, as well as all the appropriate answers for each such question from the post_answers , where each row in the returned table will represent a pair of (question + answer). If the question has a number Answers, the same question will appear in multiple rows in the returned table. returned and the of the question as well as the following fields: the id, title, tag, answer_count score, creation time (creation_date) and the body of the text (the body) of both the question and the answer. For the body, all slash characters must be removed the line '\n'.

for this i wrote the following SQL code:

SELECT  tb1.id as q_id,tb1.title as q_title,tb1.tags as q_tags
,tb1.creation_date as q_creation_date,tb1.score as q_score,tb1.answer_count as q_answer_count
,REPLACE(tb1.body,'\n',' ') as body_qustion,REPLACE(tb2.body,'\n',' ') as body_answer
from  `bigquery-public-data.stackoverflow.posts_questions` as tb1 
left join  `bigquery-public-data.stackoverflow.posts_answers`  as tb2 
on tb1.id=tb2.id
 where( tb1.title like "%pandas%" or tb1.title like "%Pandas%" or tb1.title like "%PANDAS%")
 group by tb1.id ,tb1.title ,tb1.tags,tb1.creation_date,tb1.score
,tb1.answer_count,body_qustion,body_answer

but the problem is that when for example for a question i have 3 answers i expect it to return 3 rows for the question instead it returns only one and i dont know what is the problem .

the data is : bigquery-public-data.stackoverflow.posts_questions

enter image description here

and bigquery-public-data.stackoverflow.posts_answers :

enter image description here


Solution

  • You have joined with the wrong ID of the answer table. In the answer table ID column represents the ID of the answer itself whereas parent_id represents the question id. You can play with the below query to have more understanding.

    Query:

    SELECT
      q.id AS q_id   #id of the question in question table
      ,
      a.id AS a_id  #id of the answer in answer table
      ,
      q.title AS q_title,
      q.tags AS q_tags,
      q.creation_date AS q_creation_date,
      q.score AS q_score,
      q.answer_count AS q_answer_count,
      REPLACE(q.body,'\n',' ') AS body_qustion,
      REPLACE(a.body,'\n',' ') AS body_answer
    FROM
      `bigquery-public-data.stackoverflow.posts_questions` q
    LEFT JOIN
      `bigquery-public-data.stackoverflow.posts_answers` a
    ON
      q.id = a.parent_id #Joining with quesiton Ids
    WHERE
      LOWER(q.title) LIKE '%pandas%'
      AND q.creation_date BETWEEN '2021-01-01'
      AND '2021-01-31'
      AND q.answer_count >1
    

    Output:

    enter image description here