Search code examples
variablessascreation

SAS calculating new variable for a new dataset from an existing Dataset


This is a portion of given dataset, projet.details_etest:

 "survey_instance_id"  "user_id"   "question_id"    "Item_correct"
'"2008"                "14389"        "4243"           "0"
'"2008"                "14489"        "4243"           "1"
'"2008"                "14499"        "4253"           "0"
'"2008"                "1669"         "4253"           "1"

I want to create a new dataset called projet.resume_question which contains the dataset details sort by question_id, containing the variables:

  • survey_instance_id
  • question_id
  • nb_correct_answers
  • nb_incorrect answers
  • nb_omitted_answers
  • nb_total_with_omitted_answers
  • nb_total_without_omitted_answers

The variable nb_omitted_answers is the total number of participants minus nb_correct_answers, the number of correct answers per question, minus nb_incorrect_answers, the number of incorrect answers per question.

The variable nb_total_with_omitted_answers is the total number of participants who have participated in the test.

The variable nb_total_without_omitted_answers is the total number of participants who have answered each question.

Here is what I did:

   data projet.resume_question;
set projet.details_etest;
by question_id;
keep survey_instance_id question_id nb_correct_answers nb_incorrect_answers;
retain nb_correct_answers 0 nb_incorrect_answers 0;
if Item_correct =1 then correct_answers= Item_correct;
else if Item_correct =0 then incorrect_answers= Item_correct;
nb_correct_answers = sum (correct_answers);
nb_incorrect_answers= sum (incorrect_answers);
run;
proc print data=projet.resume_question;
run;

I start this way and what I found seems wrong to me when I printed it. Can someone help me please?


Solution

  • First sort the dataset by survey, question, participant.

    proc sort data = projet.details_etest out = details;
        by survey_instance_id question_id user_id;
    run;
    

    Now get the number of participants for each survey.

    proc sql;
        create table participated as
        select survey_instance_id,
            count(distinct user_id) as nb_total_with_omitted_answers
        from details
        group by survey_instance_id;
    quit;
    

    Compute the aggregates by survey, question.

    data aggregated;
        set details;
        by survey_instance_id question_id;
    
        retain nb_total_without_omitted_answers
               nb_correct_answers nb_incorrect_answers 0;
    
        if first.question_id then do;
            nb_total_without_omitted_answers = 0;
            nb_correct_answers = 0;
            nb_incorrect_answers = 0;
        end;
    
        if item_correct in (0, 1) then nb_total_without_omitted_answers + 1;
    
        if item_correct = 1 then nb_correct_answers + 1;
        else if item_correct = 0 then nb_incorrect_answers + 1;
    
        if last.question_id then output;
    
        drop user_id item_correct;
    run;
    

    Lastly, compute the number of omitted answers per question.

    data projet.resume_question;
        merge participated aggregated;
        by survey_instance_id;
    
        nb_omitted_answers = nb_total_with_omitted_answers -
            nb_correct_answers - nb_incorrect_answers;
    run;
    

    This should get you what you need.