Elastic Net SearcheR Datasets

ensr package version 0.1.0.9001

Peter E. DeWitt

2020-03-02

This vignette provides descriptions of the example data sets in the ensr package. We provide information about how the data sets were built and also about each individual variable.

The packages needed to constuct the data sets are:

library(data.table)
library(magrittr)
library(qwraps2)
library(digest)
set.seed(42)

library(ensr)
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 3.0-2

Traumatic Brain Injuries

A synthetic data set based based on a Traumatic Brain Injury (TBI) study has been included with the ensr package. All data points in this data set have been randomly generated and have no association with any real patient.

The columns of this data set are:

Column(s) Meaning
age Age of the patient, in days, at time of hospital admission
female An integer flag for sex, 1 == female, 0 == male.
los Length of stay: number of days between hospital admission and discharge.
pcode1, …, pcode6 Indicator columns for the presence or absence of a procedure code from a billing database.
ncode1, …, ncode6 Indicator columns for the presence or absence of the same procedure code, but from a trauma database.
injury1, …, injury3 Indicator columns for the presence or absence of three different types of TBI.

Construction of the synthetic data set:

TBI_SUBJECTS <- 1323L

tbi <- data.table(age    = round(rweibull(TBI_SUBJECTS, shape = 1.9, scale = 2000)),
                  female = rbinom(TBI_SUBJECTS, 1, 0.10),
                  los    = round(rexp(TBI_SUBJECTS, rate = 1/13)))

The pcode and ncode variables are difficult to simulate. There are 64 permutations for each set. We built two vectors with relative frequencies of the codes.

pcodes <-
  c(`0.0.0.0.0.0` = 1230, `1.0.0.0.0.0` = 130, `0.1.0.0.0.0` = 40,
    `1.1.0.0.0.0` = 7, `0.0.1.0.0.0` = 120, `1.0.1.0.0.0` = 4, `0.1.1.0.0.0` = 20,
    `1.1.1.0.0.0` = 2, `0.0.0.1.0.0` = 4, `1.0.0.1.0.0` = 4, `0.1.0.1.0.0` = 20,
    `1.1.0.1.0.0` = 4, `0.0.1.1.0.0` = 20, `1.0.1.1.0.0` = 2, `0.1.1.1.0.0` = 2,
    `1.1.1.1.0.0` = 7, `0.0.0.0.1.0` = 5, `1.0.0.0.1.0` = 4, `0.1.0.0.1.0` = 10,
    `1.1.0.0.1.0` = 9, `0.0.1.0.1.0` = 10, `1.0.1.0.1.0` = 7, `0.1.1.0.1.0` = 1,
    `1.1.1.0.1.0` = 7, `0.0.0.1.1.0` = 10, `1.0.0.1.1.0` = 6, `0.1.0.1.1.0` = 5,
    `1.1.0.1.1.0` = 8, `0.0.1.1.1.0` = 5, `1.0.1.1.1.0` = 2, `0.1.1.1.1.0` = 6,
    `1.1.1.1.1.0` = 9, `0.0.0.0.0.1` = 40, `1.0.0.0.0.1` = 7, `0.1.0.0.0.1` = 40,
    `1.1.0.0.0.1` = 10, `0.0.1.0.0.1` = 10, `1.0.1.0.0.1` = 1, `0.1.1.0.0.1` = 20,
    `1.1.1.0.0.1` = 5, `0.0.0.1.0.1` = 10, `1.0.0.1.0.1` = 2, `0.1.0.1.0.1` = 6,
    `1.1.0.1.0.1` = 1, `0.0.1.1.0.1` = 8, `1.0.1.1.0.1` = 1, `0.1.1.1.0.1` = 7,
    `1.1.1.1.0.1` = 2, `0.0.0.0.1.1` = 9, `1.0.0.0.1.1` = 1, `0.1.0.0.1.1` = 1,
    `1.1.0.0.1.1` = 2, `0.0.1.0.1.1` = 6, `1.0.1.0.1.1` = 9, `0.1.1.0.1.1` = 6,
    `1.1.1.0.1.1` = 4, `0.0.0.1.1.1` = 2, `1.0.0.1.1.1` = 9, `0.1.0.1.1.1` = 3,
    `1.1.0.1.1.1` = 1, `0.0.1.1.1.1` = 7, `1.0.1.1.1.1` = 4, `0.1.1.1.1.1` = 4,
    `1.1.1.1.1.1` = 3)

ncodes <-
  c(`0.0.0.0.0.0` = 1240, `1.0.0.0.0.0` = 200, `0.1.0.0.0.0` = 130,
    `1.1.0.0.0.0` = 4, `0.0.1.0.0.0` = 90, `1.0.1.0.0.0` = 9, `0.1.1.0.0.0` = 20,
    `1.1.1.0.0.0` = 7, `0.0.0.1.0.0` = 20, `1.0.0.1.0.0` = 3, `0.1.0.1.0.0` = 10,
    `1.1.0.1.0.0` = 8, `0.0.1.1.0.0` = 7, `1.0.1.1.0.0` = 9, `0.1.1.1.0.0` = 8,
    `1.1.1.1.0.0` = 8, `0.0.0.0.1.0` = 6, `1.0.0.0.1.0` = 9, `0.1.0.0.1.0` = 10,
    `1.1.0.0.1.0` = 3, `0.0.1.0.1.0` = 9, `1.0.1.0.1.0` = 7, `0.1.1.0.1.0` = 6,
    `1.1.1.0.1.0` = 8, `0.0.0.1.1.0` = 6, `1.0.0.1.1.0` = 2, `0.1.0.1.1.0` = 9,
    `1.1.0.1.1.0` = 6, `0.0.1.1.1.0` = 2, `1.0.1.1.1.0` = 2, `0.1.1.1.1.0` = 5,
    `1.1.1.1.1.0` = 7, `0.0.0.0.0.1` = 20, `1.0.0.0.0.1` = 6, `0.1.0.0.0.1` = 4,
    `1.1.0.0.0.1` = 1, `0.0.1.0.0.1` = 2, `1.0.1.0.0.1` = 4, `0.1.1.0.0.1` = 7,
    `1.1.1.0.0.1` = 6, `0.0.0.1.0.1` = 4, `1.0.0.1.0.1` = 1, `0.1.0.1.0.1` = 8,
    `1.1.0.1.0.1` = 6, `0.0.1.1.0.1` = 1, `1.0.1.1.0.1` = 4, `0.1.1.1.0.1` = 2,
    `1.1.1.1.0.1` = 7, `0.0.0.0.1.1` = 3, `1.0.0.0.1.1` = 7, `0.1.0.0.1.1` = 7,
    `1.1.0.0.1.1` = 9, `0.0.1.0.1.1` = 5, `1.0.1.0.1.1` = 1, `0.1.1.0.1.1` = 6,
    `1.1.1.0.1.1` = 5, `0.0.0.1.1.1` = 2, `1.0.0.1.1.1` = 3, `0.1.0.1.1.1` = 1,
    `1.1.0.1.1.1` = 5, `0.0.1.1.1.1` = 6, `1.0.1.1.1.1` = 5, `0.1.1.1.1.1` = 4,
    `1.1.1.1.1.1` = 2)

We generated the code variables by taking a random sample of the names of the above objects, splitting the names by the ., and coercing the result to a data.table.

pcodes <-
  sample(names(pcodes), TBI_SUBJECTS, replace = TRUE, prob = pcodes) %>%
  strsplit("\\.") %>%
  lapply(as.integer) %>%
  do.call(rbind, .) %>%
  as.data.table %>%
  setNames(., paste0("pcode", 1:6))

pcodes
##       pcode1 pcode2 pcode3 pcode4 pcode5 pcode6
##    1:      0      0      1      1      0      0
##    2:      0      0      0      0      0      0
##    3:      0      0      0      0      0      0
##    4:      0      0      0      0      0      0
##    5:      0      0      0      0      0      0
##   ---                                          
## 1319:      0      0      1      1      0      1
## 1320:      0      0      0      0      0      0
## 1321:      0      0      0      0      0      0
## 1322:      0      0      0      0      0      0
## 1323:      0      0      0      0      0      1

ncodes <-
  sample(names(ncodes), TBI_SUBJECTS, replace = TRUE, prob = ncodes + 0.1) %>%
  strsplit("\\.") %>%
  lapply(as.integer) %>%
  do.call(rbind, .) %>%
  as.data.table %>%
  setNames(., paste0("ncode", 1:6))

ncodes
##       ncode1 ncode2 ncode3 ncode4 ncode5 ncode6
##    1:      0      1      0      0      0      0
##    2:      0      0      0      0      0      0
##    3:      0      0      0      0      0      0
##    4:      0      0      0      0      0      0
##    5:      0      0      1      1      1      1
##   ---                                          
## 1319:      0      0      0      0      0      0
## 1320:      1      1      0      1      1      0
## 1321:      0      0      0      0      0      0
## 1322:      1      0      0      0      0      0
## 1323:      0      0      0      0      0      0

The last step in building the predictor variables is to bind the three data.tables together:

tbi <- cbind(tbi, pcodes, ncodes)

The three injury categories are constructed by fitting a simplified logistic regression model. The regression coefficient vectors are defined next.

injury1_coef <-
  matrix(c(-5.80494, -3.03946e-05, 0.773355, 0.00556597, -1.04436, -0.849594,
           0.770122, 0.968153, 16.6315, 0.411286, 4.07926, 5.16926, 2.84976,
           -17.1038, -14.7382, 4.30086),
         ncol = 1)

injury2_coef <-
  matrix(c(-427.083, 0.0742405, -342.072, -8.09704, 299.132, 991.75, -85.0155,
           -170.344, -57.408, 123.201, 161.41, -568.483, -767.944, 706.95,
           10.2964, 148.551),
         ncol = 1)

injury3_coef <-
  matrix(c(-3.54036, -0.00054294, 1.75962, -0.0475071, -17.515, -60.4276,
           4.58458, 0.58551, -19.7312, -1.16923, 2.16091, 63.7699, 39.3569,
           -37.8554, -14.1002, -14.8469),
         ncol = 1)

injury_expression <-
  quote(
        c("injury1", "injury2", "injury3") :=
        list(
             sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury1_coef),
                    function(p) rbinom(1, 1, p)),
             sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury2_coef),
                    function(p) rbinom(1, 1, p)),
             sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury3_coef),
                    function(p) rbinom(1, 1, p))
             )
        )

tbi[, eval(injury_expression)]

To prove that the tbi object constructed in this vignette is the same as the one provided in via calling data(tbi, package = "ensr") we show the sha1 for both data sets here:

ensr_env <- new.env()
data(tbi, package = "ensr", envir = ensr_env)

all.equal(digest::sha1(tbi),
          digest::sha1(ensr_env$tbi))
## [1] TRUE

Water Percolation Through A Landfill

The file landfill.csv contains results from a computer simulation of water percolation through five layers in a landfill over one year. The raw data file is provided so the end user can explore their own pre-processing of the data.

landfill <-
  fread(file = system.file("extdata/landfill.csv", package = "ensr"), sep = ",")

This landfill consists of five layers:

  1. Topsoil
  2. Geomembrane Liner
  3. Clay
  4. Weathered Lavery Till (WLT)
  5. Unweathered Lavery Till (ULT)

The outcomes modeled by the computer program are:

Outcome Variable Name(s)
evaporation evap
runoff runoff
percolation perc_topsoil, perc_liner, perc_clay, perc_wlt, perc_ult
drainage drain_topsoil
water content wc_topsoil, wc_clay, wc_wlt, wc_ult

The set of predictors included in the data set are:

Predictor Variable Name(s) Notes
Porosity topsoil_porosity, clay_porosity, wlt_porosity, ult_porosity
van Genuchten \(\alpha\) topsoil_alpha, clay_alpha, wlt_alpha, ult_alpha related to the inverse of the air entry suction
van Genuchten \(\theta_r\) topsoil_thetar, clay_thetar, wlt_thetar, ult_thetar residual water content
van Genuchten N topsoil_n, clay_n, wlt_n, ult_n pore size
Saturated hydraulic conductivity topsoil_ks, clay_ks, wlt_ks, ult_ks
Relative Humidity rh
Leaf Area Index lai
Grow Start growstart First day of the growing season
Grow End growend Last day of the growing season
Wind wind Average wind speed in mph
Solar Radiation rmw, rmd Mean of daily solar radiation for wet days, or dry days
Amplitude of daily solar radiation ar
SCS AMCII Curve Number cn
Plant root beta beta Used for calculating evaporation zone depth
Linear Defects liner_defects Number of defects per acre of liner
Linear Pinholes liner_pinholes Number of pinholes per acre of liner
100 year average Solar Radiation weather_solrad
100 year average Precipitation weather_precip
100 year average Temperature weather_temp

Landfill Data Preparation

The elastic net method suggests that all predictors be centered and scaled. In the case of multi-variable responses, it is recommended that the outcomes be centered and scaled as well. For calls to glmnet and cv.glmnet, the argument standardized = TRUE (default) will center and scale the values. If you have already centered and scaled the data you may prefer to set standardized = FALSE.

If you want to explicitly standardize your data, the base R function scale is adequate for most situations. However, with the landfill data the simple centering and scaling could be considered inappropriate. scale will use the mean and standard deviation of the input. However, the landfill data contains measurements with the same values. Centering and scaling should be based on the mean and standard deviation of just the unique values. The ensr function standardize will, by default, standardize a numeric vector based on just the unique values using either the mean and standard deviation or the median and interquartile range (IQR).

Here is an example. There are 974 rows in the landfill data set. There are 83 unique values for the van Genuchten \(\alpha\) value. Here is how the standardization results are affected by non-unique predictor values:

nrow(landfill)
## [1] 974
length(unique(landfill$topsoil_alpha))
## [1] 83

# Standard scaling, using the mean and standard deviation for the full vector.
scaled_topsoil_alpha_v1 <- as.vector(scale(landfill$topsoil_alpha))

scaled_topsoil_alpha_v2 <-
  (landfill$topsoil_alpha - mean(landfill$topsoil_alpha)) / sd(landfill$topsoil_alpha)

all.equal(target  = scaled_topsoil_alpha_v1,
          current = scaled_topsoil_alpha_v2,
          check.attributes = FALSE)
## [1] TRUE

# Center and scale using the mean and standard deviation of only the unique
# values of the vector.
scaled_topsoil_alpha_v3 <-
  as.vector(scale(landfill$topsoil_alpha,
                  center = mean(unique(landfill$topsoil_alpha)),
                  scale  = sd(unique(landfill$topsoil_alpha))))

scaled_topsoil_alpha_v4 <- standardize(landfill$topsoil_alpha)

all.equal(target  = scaled_topsoil_alpha_v3,
          current = scaled_topsoil_alpha_v4,
          check.attributes = FALSE)
## [1] TRUE

# Center and scale using the median and IQR.
scaled_topsoil_alpha_v5 <-
  as.vector(scale(landfill$topsoil_alpha,
                  center = median(unique(landfill$topsoil_alpha)),
                  scale  = IQR(unique(landfill$topsoil_alpha))))

scaled_topsoil_alpha_v6 <-
  standardize(landfill$topsoil_alpha,
              stats = list(center = "median", scale = "IQR"))

all.equal(target  = scaled_topsoil_alpha_v5,
          current = scaled_topsoil_alpha_v6,
          check.attributes = FALSE)
## [1] TRUE

For the full landfill data set we can center and scale all the columns using:

landfill <- standardize(landfill)

Note that the values used for centering and scaling are retained as attributes for each variable in the data set.

str(landfill[, 1:2] )
## Classes 'data.table' and 'data.frame':   974 obs. of  2 variables:
##  $ topsoil_porosity: num  0.244 -0.171 -0.301 -0.223 -0.586 ...
##   ..- attr(*, "center")= num 0.296
##   ..- attr(*, "scale")= num 0.0385
##  $ topsoil_alpha   : num  -0.735 -0.455 -0.455 0.146 -0.535 ...
##   ..- attr(*, "center")= num 0.0143
##   ..- attr(*, "scale")= num 0.0025
##  - attr(*, ".internal.selfref")=<externalptr>

The landfill object generated in this vignette is the same as the one generated using data(landfill, package = "ensr").

data(landfill, package = "ensr", envir = ensr_env)
all.equal(digest::sha1(landfill),
          digest::sha1(ensr_env$landfill))
## [1] TRUE

Session Info

print(sessionInfo(), local = FALSE)
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ensr_0.1.0.9001    glmnet_3.0-2       Matrix_1.2-18      digest_0.6.25     
## [5] magrittr_1.5       data.table_1.12.8  qwraps2_0.4.2.9002 knitr_1.28        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3       codetools_0.2-16 lattice_0.20-38  foreach_1.4.8   
##  [5] grid_3.6.2       evaluate_0.14    rlang_0.4.5      stringi_1.4.6   
##  [9] rmarkdown_2.1    iterators_1.0.12 tools_3.6.2      stringr_1.4.0   
## [13] xfun_0.12        yaml_2.2.1       compiler_3.6.2   shape_1.4.4     
## [17] htmltools_0.4.0