Search code examples
rsplinetidymodels

How to set the number of knots in a regression spline


Someone had this same question but they were using the splines library whereas I am using tidymodels.

I want to fit a cubic spline and split the domain of the independent variable into, say, 6 bins (i.e. make 5 cuts in its domain).

I believe this is done with step_bs() (or step_ns() in the case of natural splines).

I was unable to find which parameter sets the number of knots in the documentation. Moreover, it seems that splines::ns() can be passed to the options parameter, but the Readme is not available.


Solution

  • You may find this answer helpful in understanding the relationship between knots and degrees of freedom. You can set both deg_free and degree (the polynomial degree) in step_bs():

    library(recipes)
    #> Loading required package: dplyr
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    data(biomass, package = "modeldata")
    
    biomass_tr <- biomass[biomass$dataset == "Training",]
    biomass_te <- biomass[biomass$dataset == "Testing",]
    
    rec <- recipe(HHV ~ carbon + hydrogen + oxygen,
                  data = biomass_tr) %>%
        step_bs(carbon, deg_free = 7, degree = 4)
    
    ## training data
    prep(rec) %>% bake(new_data = biomass_tr)
    #> # A tibble: 456 × 10
    #>    hydrogen oxygen   HHV carbon_bs_1 carbon_bs_2 carbon_bs_3 carbon_bs_4
    #>       <dbl>  <dbl> <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
    #>  1     5.64   42.9  20.0 0           0                 0.489       0.494
    #>  2     5.7    41.3  19.2 0           0.000000158       0.502       0.484
    #>  3     5.8    46.2  18.3 0           0.000812          0.575       0.421
    #>  4     4.97   35.6  18.2 0.000196    0.0256            0.669       0.305
    #>  5     5.4    40.7  18.4 0.000000163 0.00476           0.619       0.375
    #>  6     5.75   40.2  18.5 0.000102    0.0202            0.663       0.317
    #>  7     5.99   38.2  18.7 0           0.00263           0.603       0.393
    #>  8     5.7    39.7  18.3 0.0000470   0.0156            0.655       0.329
    #>  9     5.5    40.9  18.6 0           0.0000451         0.532       0.460
    #> 10     5.9    40    18.9 0           0.00293           0.606       0.390
    #> # … with 446 more rows, and 3 more variables: carbon_bs_5 <dbl>,
    #> #   carbon_bs_6 <dbl>, carbon_bs_7 <dbl>
    
    ## testing data
    prep(rec) %>% bake(new_data = biomass_te)
    #> # A tibble: 80 × 10
    #>    hydrogen oxygen   HHV carbon_bs_1 carbon_bs_2 carbon_bs_3 carbon_bs_4
    #>       <dbl>  <dbl> <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
    #>  1     5.67   47.2  18.3  0.00000387   0.00795         0.635      0.357 
    #>  2     5.5    48.1  17.6  0.00261      0.0730          0.687      0.237 
    #>  3     5.5    49.1  17.2  0.00431      0.0907          0.685      0.220 
    #>  4     6.1    37.3  18.9  0.00000294   0.00750         0.633      0.359 
    #>  5     6.32   42.8  20.5  0            0.0000535       0.534      0.458 
    #>  6     5.5    41.7  18.5  0.000751     0.0434          0.682      0.274 
    #>  7     5.23   54.1  15.1  0.0358       0.229           0.610      0.124 
    #>  8     4.66   33.8  16.2  0.00687      0.111           0.680      0.201 
    #>  9     4.4    31.1  11.1  0.294        0.396           0.224      0.0160
    #> 10     3.77   23.7  10.8  0.339        0.376           0.175      0.0107
    #> # … with 70 more rows, and 3 more variables: carbon_bs_5 <dbl>,
    #> #   carbon_bs_6 <dbl>, carbon_bs_7 <dbl>
    

    Created on 2022-04-18 by the reprex package (v2.0.1)