I was working on a problem that involved creating dummy variables, but I'm getting missing values for the dummy variables in the reference category even though the dataset has no missing values. Even if I select one of the categories as the reference category, shouldn't its dummy variable values be zero? I had the same issue even when I did not account for missing values. I've included my code, log, output, and the contents of the text file for context and so that my question will be clearer.
The part of the homework assignment that I'm having issues with is the following:
Fibromyalgia is a syndrome of widespread body pain that is often treated by rheumatologists. One way of measuring the impact of fibromyalgia on patients is the Fibromyalgia Impact Questionnaire (FIQ). On the FIQ, high values show greater impact of disease (bad) and low values show lesser impact of disease (good). We have data on women with fibromyalgia who attended one of two types of disease self-management classes or who received standard care (the control group).
Data from this study are in the file fibr03_sum18.txt on the BS 805 web site in the Assignments section for Class 6. The variables in the data file are:
FIQ score (3.1 format) taken after the classes
Group (1 = class 1, 2 = class 2, 3 = standard care)
Disease Severity (On a scale of 1 to 6) before the classes
Age (years)
Since the data were entered into this file, information on a new patient and a correction to the data have been found. The new patient is in the control group, has FIQ = 8.2, Disease Severity = 2, and Age = 25 years. The correction is that the second subject in class 1 was 17 rather than 18 years old.
A) Create a temporary SAS data set using these data. In the data set, create a set of indicator variables that code for group membership. Use PROC PRINT to list the data.
I read in the text file using column input, but I think it can be read in using list input as well? The text file, called fibr03_sum18.txt, contained the data below.
3.1 1 6 21
1.8 1 6 18
3.3 1 5 22
2.9 1 4 15
4.3 1 3 24
4.8 1 3 22
4.9 1 2 17
6.4 1 2 18
5.7 2 5 17
6.1 2 5 25
8.5 2 3 31
7.1 2 2 17
7.7 2 1 25
9.8 2 1 22
5.1 3 4 23
7.2 3 1 15
8.3 3 1 22
6.7 3 2 20
My code for reading in the data and creating the temporary dataset with the dummy variables was:
*Part A: Reading in Data and Creating a Temporary Dataset;
libname HW6 'C:\Users\jackz\Desktop\SAS';
filename HW6new 'C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt';
proc format;
value grpf 1='class 1' 2='class 2' 3='standard care';
run;
data one;
infile HW6new;
input @1 FIQ 3.1 @5 grp 1. @7 disev 1. @9 age 2.;
*Creating Dummy Variables;
if grp=1 then classc1=1; else if grp=2 then classc1=0;
if grp=2 then classc2=1; else if grp=1 then classc2=0;
if grp=. then classc1=.;
if grp=. then classc2=.;
label FIQ='FIQ Score'
grp='Group'
disev='Disease Severity'
age='Age';
format grp grpf.;
run;
*Printout of Dataset one;
proc print data=one label;
run;
My log for this code was:
NOTE: Copyright (c) 2016 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M5)
Licensed to BOSTON UNIVERSITY - SFA T&R, Site 70009029.
NOTE: This session is executing on the W32_10HOME platform.
NOTE: Updated analytical products:
SAS/STAT 14.3
SAS/ETS 14.3
SAS/OR 14.3
SAS/IML 14.3
SAS/QC 14.3
NOTE: Additional host information:
W32_10HOME WIN 10.0.16299 Workstation
NOTE: SAS initialization used:
real time 0.96 seconds
cpu time 0.95 seconds
1 *Part A: Reading in Data and Creating a Temporary Dataset;
2 libname HW6 'C:\Users\jackz\Desktop\SAS';
NOTE: Libref HW6 was successfully assigned as follows:
Engine: V9
Physical Name: C:\Users\jackz\Desktop\SAS
3 filename HW6new 'C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt';
4 proc format;
5 value grpf 1='class 1' 2='class 2' 3='standard care';
NOTE: Format GRPF has been output.
6 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
7 data one;
8 infile HW6new;
9 input @1 FIQ 3.1 @5 grp 1. @7 disev 1. @9 age 2.;
10 *Creating Dummy Variables;
11 if grp=1 then classc1=1; else if grp=2 then classc1=0;
12 if grp=2 then classc2=1; else if grp=1 then classc2=0;
13 if grp=. then classc1=.;
14 if grp=. then classc2=.;
15 label FIQ='FIQ Score'
16 grp='Group'
17 disev='Disease Severity'
18 age='Age';
19 format grp grpf.;
20 run;
NOTE: The infile HW6NEW is:
Filename=C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt,
RECFM=V,LRECL=32767,File Size (bytes)=214,
Last Modified=15Jun2018:12:56:26,
Create Time=15Jun2018:12:56:26
NOTE: 18 records were read from the infile HW6NEW.
The minimum record length was 10.
The maximum record length was 10.
NOTE: The data set WORK.ONE has 18 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
21 *Printout of Dataset one;
22 proc print data=one label;
NOTE: Writing HTML Body file: sashtml.htm
23 run;
NOTE: There were 18 observations read from the data set WORK.ONE.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.27 seconds
cpu time 0.06 seconds
Here is the output, although it is not lined up:
The SAS System
Obs FIQ Score Group Disease Severity Age classc1 classc2
1 3.1 class 1 6 21 1 0
2 1.8 class 1 6 18 1 0
3 3.3 class 1 5 22 1 0
4 2.9 class 1 4 15 1 0
5 4.3 class 1 3 24 1 0
6 4.8 class 1 3 22 1 0
7 4.9 class 1 2 17 1 0
8 6.4 class 1 2 18 1 0
9 5.7 class 2 5 17 0 1
10 6.1 class 2 5 25 0 1
11 8.5 class 2 3 31 0 1
12 7.1 class 2 2 17 0 1
13 7.7 class 2 1 25 0 1
14 9.8 class 2 1 22 0 1
15 5.1 standard care 4 23 . .
16 7.2 standard care 1 15 . .
17 8.3 standard care 1 22 . .
18 6.7 standard care 2 20 . .
You can see that there are missing values for the dummy variables classc1 and classc2 in the standard care rows even though there are no missing values in the original dataset. Shouldn't those values be 0, since group 3 does not fall in either grp=1 or grp=2?
Can anyone give me any hints as to what I have done wrong, if I have done anything wrong? Thanks for all of your help!
The output shows that the rows where the flag variables are missing all have group = 3 (standard care). Those values are not set to missing by any of the if statements. They are missing because variables created by assignment in a data step are implicitly reset to missing at the start of each iteration of the implicit loop.
When grp=3, no if statement changes the flag variables from that initial reset-to-missing value. Writing the values to the log before and after the if statements shows this:
* when grp=3 neither classc1 nor classc2 is changed from its initial missing value;
* the variables are named explicitly because a prefix list such as classc: only matches variables already defined at that point;
put 'NOTE: ' _n_= classc1= classc2=;
if grp=1 then classc1=1; else if grp=2 then classc1=0;
if grp=2 then classc2=1; else if grp=1 then classc2=0;
if grp=. then classc1=.;
if grp=. then classc2=.;
put 'NOTE: ' _n_= classc1= classc2=;
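One way to get 0 for the reference group is to assign both flags for every non-missing value of grp instead of only for groups 1 and 2. The following is a minimal sketch, not the only way to do it. It assumes the values can be read with list input because they are blank-separated, it uses three rows copied from the question's data as inline datalines, and the data set name one_fixed is just an illustrative name.

```sas
/* Sketch: the reference group (grp=3) gets 0 for both flags, not missing */
data one_fixed;                 /* hypothetical data set name */
  input FIQ grp disev age;      /* list input, assuming blank-separated values */
  if missing(grp) then do;
    /* keep the flags missing only when the group itself is missing */
    classc1 = .;
    classc2 = .;
  end;
  else do;
    /* a comparison evaluates to 1 when true and 0 when false */
    classc1 = (grp = 1);
    classc2 = (grp = 2);
  end;
  datalines;
3.1 1 6 21
5.7 2 5 17
5.1 3 4 23
;
run;

proc print data=one_fixed;
run;
```

With this coding, the standard care row has classc1 and classc2 both equal to 0, which is what makes it the reference category.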