
Confusion with the output of the function str


The data set birth.csv, collected at the Baystate Medical Center, Springfield, USA, during 1986, has the following format:

[image: format of birth.csv]

After I imported the CSV file (using read.csv() with a colClasses specification), the output of str() didn't match that of head(). For example, the first 6 values of the column low were supposed to be 0, but the sample printed by str() showed them as 1:

'data.frame':   189 obs. of  9 variables:
 $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...  # shouldn't they be 0 0 0 0... instead?
 $ age  : num  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : num  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : Factor w/ 3 levels "0","1","2": 1 3 2 3 1 1 2 2 2 1 ...

A data.frame: 6 × 9
    low age lwt race smoke  ptl ht  ui  ftv
    <fct><dbl><dbl><fct><fct><fct><fct><fct><fct>
1   0   19  182 2    0      0   0   1   0
2   0   33  155 3    0      0   0   0   2
3   0   20  105 1    1      0   0   0   1
4   0   21  108 1    1      0   0   1   2
5   0   18  107 1    1      0   0   1   0
6   0   21  124 3    0      0   0   0   0

Could someone please explain what happened? If I built a logistic model for that imported dataset, would the result be wrong?


Solution

  • Factors (categorical variables, shown as <fct> in the column class labels of your head() output) are stored internally in R as integers, with 1 being the first level (or category), 2 the second level, etc., along with a lookup table mapping the integer values to their labels/levels.

    str() shows a few of the levels and then the integer values. Most other functions print the labels, not the integer values.

    It's extra confusing in your case because your labels are (character-class) integers starting at 0, so the label "0" is stored as integer code 1 and the label "1" as code 2. For a somewhat clearer example, let's look at a factor with letters as the labels:

    x = factor(c("a", "b", "a", "c"))
    
    x
    # [1] a b a c
    # Levels: a b c
    
    str(x)
    # Factor w/ 3 levels "a","b","c": 1 2 1 3
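
The same distinction can be sketched with labels like those in the question. Here `low` is a small made-up vector standing in for the real column, not the actual birth.csv data, so treat it as an illustration only:

```r
# Hypothetical stand-in for the `low` column: a factor with labels "0" and "1"
low <- factor(c("0", "0", "1", "0"))

levels(low)
# [1] "0" "1"

as.integer(low)     # internal integer codes, which is what str() displays
# [1] 1 1 2 1

as.character(low)   # labels, which is what head() and print() display
# [1] "0" "0" "1" "0"

# To recover the original numbers, convert the labels, not the codes
as.numeric(as.character(low))
# [1] 0 0 1 0
```

So the data were imported correctly; str() is only showing the codes. As for the model, glm() with family = binomial documents that a factor response is read as "first level = failure, all other levels = success", so a response with levels "0" and "1" should be modelled as intended, though it is worth confirming against ?family for your own setup.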