This vignette provides descriptions of the example data sets in the ensr package. We provide information about how the data sets were built and also about each individual variable.
The packages needed to constuct the data sets are:
library(data.table)
library(magrittr)
library(qwraps2)
library(digest)
set.seed(42)
library(ensr)
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 3.0-2
A synthetic data set based based on a Traumatic Brain Injury (TBI) study has been included with the ensr package. All data points in this data set have been randomly generated and have no association with any real patient.
The columns of this data set are:
Column(s) | Meaning |
---|---|
age |
Age of the patient, in days, at time of hospital admission |
female |
An integer flag for sex, 1 == female, 0 == male. |
los |
Length of stay: number of days between hospital admission and discharge. |
pcode1 , …, pcode6 |
Indicator columns for the presence or absence of a procedure code from a billing database. |
ncode1 , …, ncode6 |
Indicator columns for the presence or absence of the same procedure code, but from a trauma database. |
injury1 , …, injury3 |
Indicator columns for the presence or absence of three different types of TBI. |
Construction of the synthetic data set:
TBI_SUBJECTS <- 1323L
tbi <- data.table(age = round(rweibull(TBI_SUBJECTS, shape = 1.9, scale = 2000)),
female = rbinom(TBI_SUBJECTS, 1, 0.10),
los = round(rexp(TBI_SUBJECTS, rate = 1/13)))
The pcode
and ncode
variables are difficult to simulate. There are 64 permutations for each set. We built two vectors with relative frequencies of the codes.
pcodes <-
c(`0.0.0.0.0.0` = 1230, `1.0.0.0.0.0` = 130, `0.1.0.0.0.0` = 40,
`1.1.0.0.0.0` = 7, `0.0.1.0.0.0` = 120, `1.0.1.0.0.0` = 4, `0.1.1.0.0.0` = 20,
`1.1.1.0.0.0` = 2, `0.0.0.1.0.0` = 4, `1.0.0.1.0.0` = 4, `0.1.0.1.0.0` = 20,
`1.1.0.1.0.0` = 4, `0.0.1.1.0.0` = 20, `1.0.1.1.0.0` = 2, `0.1.1.1.0.0` = 2,
`1.1.1.1.0.0` = 7, `0.0.0.0.1.0` = 5, `1.0.0.0.1.0` = 4, `0.1.0.0.1.0` = 10,
`1.1.0.0.1.0` = 9, `0.0.1.0.1.0` = 10, `1.0.1.0.1.0` = 7, `0.1.1.0.1.0` = 1,
`1.1.1.0.1.0` = 7, `0.0.0.1.1.0` = 10, `1.0.0.1.1.0` = 6, `0.1.0.1.1.0` = 5,
`1.1.0.1.1.0` = 8, `0.0.1.1.1.0` = 5, `1.0.1.1.1.0` = 2, `0.1.1.1.1.0` = 6,
`1.1.1.1.1.0` = 9, `0.0.0.0.0.1` = 40, `1.0.0.0.0.1` = 7, `0.1.0.0.0.1` = 40,
`1.1.0.0.0.1` = 10, `0.0.1.0.0.1` = 10, `1.0.1.0.0.1` = 1, `0.1.1.0.0.1` = 20,
`1.1.1.0.0.1` = 5, `0.0.0.1.0.1` = 10, `1.0.0.1.0.1` = 2, `0.1.0.1.0.1` = 6,
`1.1.0.1.0.1` = 1, `0.0.1.1.0.1` = 8, `1.0.1.1.0.1` = 1, `0.1.1.1.0.1` = 7,
`1.1.1.1.0.1` = 2, `0.0.0.0.1.1` = 9, `1.0.0.0.1.1` = 1, `0.1.0.0.1.1` = 1,
`1.1.0.0.1.1` = 2, `0.0.1.0.1.1` = 6, `1.0.1.0.1.1` = 9, `0.1.1.0.1.1` = 6,
`1.1.1.0.1.1` = 4, `0.0.0.1.1.1` = 2, `1.0.0.1.1.1` = 9, `0.1.0.1.1.1` = 3,
`1.1.0.1.1.1` = 1, `0.0.1.1.1.1` = 7, `1.0.1.1.1.1` = 4, `0.1.1.1.1.1` = 4,
`1.1.1.1.1.1` = 3)
ncodes <-
c(`0.0.0.0.0.0` = 1240, `1.0.0.0.0.0` = 200, `0.1.0.0.0.0` = 130,
`1.1.0.0.0.0` = 4, `0.0.1.0.0.0` = 90, `1.0.1.0.0.0` = 9, `0.1.1.0.0.0` = 20,
`1.1.1.0.0.0` = 7, `0.0.0.1.0.0` = 20, `1.0.0.1.0.0` = 3, `0.1.0.1.0.0` = 10,
`1.1.0.1.0.0` = 8, `0.0.1.1.0.0` = 7, `1.0.1.1.0.0` = 9, `0.1.1.1.0.0` = 8,
`1.1.1.1.0.0` = 8, `0.0.0.0.1.0` = 6, `1.0.0.0.1.0` = 9, `0.1.0.0.1.0` = 10,
`1.1.0.0.1.0` = 3, `0.0.1.0.1.0` = 9, `1.0.1.0.1.0` = 7, `0.1.1.0.1.0` = 6,
`1.1.1.0.1.0` = 8, `0.0.0.1.1.0` = 6, `1.0.0.1.1.0` = 2, `0.1.0.1.1.0` = 9,
`1.1.0.1.1.0` = 6, `0.0.1.1.1.0` = 2, `1.0.1.1.1.0` = 2, `0.1.1.1.1.0` = 5,
`1.1.1.1.1.0` = 7, `0.0.0.0.0.1` = 20, `1.0.0.0.0.1` = 6, `0.1.0.0.0.1` = 4,
`1.1.0.0.0.1` = 1, `0.0.1.0.0.1` = 2, `1.0.1.0.0.1` = 4, `0.1.1.0.0.1` = 7,
`1.1.1.0.0.1` = 6, `0.0.0.1.0.1` = 4, `1.0.0.1.0.1` = 1, `0.1.0.1.0.1` = 8,
`1.1.0.1.0.1` = 6, `0.0.1.1.0.1` = 1, `1.0.1.1.0.1` = 4, `0.1.1.1.0.1` = 2,
`1.1.1.1.0.1` = 7, `0.0.0.0.1.1` = 3, `1.0.0.0.1.1` = 7, `0.1.0.0.1.1` = 7,
`1.1.0.0.1.1` = 9, `0.0.1.0.1.1` = 5, `1.0.1.0.1.1` = 1, `0.1.1.0.1.1` = 6,
`1.1.1.0.1.1` = 5, `0.0.0.1.1.1` = 2, `1.0.0.1.1.1` = 3, `0.1.0.1.1.1` = 1,
`1.1.0.1.1.1` = 5, `0.0.1.1.1.1` = 6, `1.0.1.1.1.1` = 5, `0.1.1.1.1.1` = 4,
`1.1.1.1.1.1` = 2)
We generated the code variables by taking a random sample of the names of the above objects, splitting the names by the .
, and coercing the result to a data.table.
pcodes <-
sample(names(pcodes), TBI_SUBJECTS, replace = TRUE, prob = pcodes) %>%
strsplit("\\.") %>%
lapply(as.integer) %>%
do.call(rbind, .) %>%
as.data.table %>%
setNames(., paste0("pcode", 1:6))
pcodes
## pcode1 pcode2 pcode3 pcode4 pcode5 pcode6
## 1: 0 0 1 1 0 0
## 2: 0 0 0 0 0 0
## 3: 0 0 0 0 0 0
## 4: 0 0 0 0 0 0
## 5: 0 0 0 0 0 0
## ---
## 1319: 0 0 1 1 0 1
## 1320: 0 0 0 0 0 0
## 1321: 0 0 0 0 0 0
## 1322: 0 0 0 0 0 0
## 1323: 0 0 0 0 0 1
ncodes <-
sample(names(ncodes), TBI_SUBJECTS, replace = TRUE, prob = ncodes + 0.1) %>%
strsplit("\\.") %>%
lapply(as.integer) %>%
do.call(rbind, .) %>%
as.data.table %>%
setNames(., paste0("ncode", 1:6))
ncodes
## ncode1 ncode2 ncode3 ncode4 ncode5 ncode6
## 1: 0 1 0 0 0 0
## 2: 0 0 0 0 0 0
## 3: 0 0 0 0 0 0
## 4: 0 0 0 0 0 0
## 5: 0 0 1 1 1 1
## ---
## 1319: 0 0 0 0 0 0
## 1320: 1 1 0 1 1 0
## 1321: 0 0 0 0 0 0
## 1322: 1 0 0 0 0 0
## 1323: 0 0 0 0 0 0
The last step in building the predictor variables is to bind the three data.tables together:
The three injury categories are constructed by fitting a simplified logistic regression model. The regression coefficient vectors are defined next.
injury1_coef <-
matrix(c(-5.80494, -3.03946e-05, 0.773355, 0.00556597, -1.04436, -0.849594,
0.770122, 0.968153, 16.6315, 0.411286, 4.07926, 5.16926, 2.84976,
-17.1038, -14.7382, 4.30086),
ncol = 1)
injury2_coef <-
matrix(c(-427.083, 0.0742405, -342.072, -8.09704, 299.132, 991.75, -85.0155,
-170.344, -57.408, 123.201, 161.41, -568.483, -767.944, 706.95,
10.2964, 148.551),
ncol = 1)
injury3_coef <-
matrix(c(-3.54036, -0.00054294, 1.75962, -0.0475071, -17.515, -60.4276,
4.58458, 0.58551, -19.7312, -1.16923, 2.16091, 63.7699, 39.3569,
-37.8554, -14.1002, -14.8469),
ncol = 1)
injury_expression <-
quote(
c("injury1", "injury2", "injury3") :=
list(
sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury1_coef),
function(p) rbinom(1, 1, p)),
sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury2_coef),
function(p) rbinom(1, 1, p)),
sapply(qwraps2::invlogit(cbind(1, as.matrix(tbi)) %*% injury3_coef),
function(p) rbinom(1, 1, p))
)
)
tbi[, eval(injury_expression)]
To prove that the tbi
object constructed in this vignette is the same as the one provided in via calling data(tbi, package = "ensr")
we show the sha1 for both data sets here:
The file landfill.csv
contains results from a computer simulation of water percolation through five layers in a landfill over one year. The raw data file is provided so the end user can explore their own pre-processing of the data.
This landfill consists of five layers:
The outcomes modeled by the computer program are:
Outcome | Variable Name(s) |
---|---|
evaporation | evap |
runoff | runoff |
percolation | perc_topsoil, perc_liner, perc_clay, perc_wlt, perc_ult |
drainage | drain_topsoil |
water content | wc_topsoil, wc_clay, wc_wlt, wc_ult |
The set of predictors included in the data set are:
Predictor | Variable Name(s) | Notes |
---|---|---|
Porosity | topsoil_porosity, clay_porosity, wlt_porosity, ult_porosity | |
van Genuchten \(\alpha\) | topsoil_alpha, clay_alpha, wlt_alpha, ult_alpha | related to the inverse of the air entry suction |
van Genuchten \(\theta_r\) | topsoil_thetar, clay_thetar, wlt_thetar, ult_thetar | residual water content |
van Genuchten N | topsoil_n, clay_n, wlt_n, ult_n | pore size |
Saturated hydraulic conductivity | topsoil_ks, clay_ks, wlt_ks, ult_ks | |
Relative Humidity | rh | |
Leaf Area Index | lai | |
Grow Start | growstart | First day of the growing season |
Grow End | growend | Last day of the growing season |
Wind | wind | Average wind speed in mph |
Solar Radiation | rmw, rmd | Mean of daily solar radiation for wet days, or dry days |
Amplitude of daily solar radiation | ar | |
SCS AMCII Curve Number | cn | |
Plant root beta | beta | Used for calculating evaporation zone depth |
Linear Defects | liner_defects | Number of defects per acre of liner |
Linear Pinholes | liner_pinholes | Number of pinholes per acre of liner |
100 year average Solar Radiation | weather_solrad | |
100 year average Precipitation | weather_precip | |
100 year average Temperature | weather_temp |
The elastic net method suggests that all predictors be centered and scaled. In the case of multi-variable responses, it is recommended that the outcomes be centered and scaled as well. For calls to glmnet
and cv.glmnet
, the argument standardized = TRUE
(default) will center and scale the values. If you have already centered and scaled the data you may prefer to set standardized = FALSE
.
If you want to explicitly standardize your data, the base R function scale
is adequate for most situations. However, with the landfill data the simple centering and scaling could be considered inappropriate. scale
will use the mean and standard deviation of the input. However, the landfill data contains measurements with the same values. Centering and scaling should be based on the mean and standard deviation of just the unique values. The ensr function standardize
will, by default, standardize a numeric vector based on just the unique values using either the mean and standard deviation or the median and interquartile range (IQR).
Here is an example. There are 974 rows in the landfill data set. There are 83 unique values for the van Genuchten \(\alpha\) value. Here is how the standardization results are affected by non-unique predictor values:
nrow(landfill)
## [1] 974
length(unique(landfill$topsoil_alpha))
## [1] 83
# Standard scaling, using the mean and standard deviation for the full vector.
scaled_topsoil_alpha_v1 <- as.vector(scale(landfill$topsoil_alpha))
scaled_topsoil_alpha_v2 <-
(landfill$topsoil_alpha - mean(landfill$topsoil_alpha)) / sd(landfill$topsoil_alpha)
all.equal(target = scaled_topsoil_alpha_v1,
current = scaled_topsoil_alpha_v2,
check.attributes = FALSE)
## [1] TRUE
# Center and scale using the mean and standard deviation of only the unique
# values of the vector.
scaled_topsoil_alpha_v3 <-
as.vector(scale(landfill$topsoil_alpha,
center = mean(unique(landfill$topsoil_alpha)),
scale = sd(unique(landfill$topsoil_alpha))))
scaled_topsoil_alpha_v4 <- standardize(landfill$topsoil_alpha)
all.equal(target = scaled_topsoil_alpha_v3,
current = scaled_topsoil_alpha_v4,
check.attributes = FALSE)
## [1] TRUE
# Center and scale using the median and IQR.
scaled_topsoil_alpha_v5 <-
as.vector(scale(landfill$topsoil_alpha,
center = median(unique(landfill$topsoil_alpha)),
scale = IQR(unique(landfill$topsoil_alpha))))
scaled_topsoil_alpha_v6 <-
standardize(landfill$topsoil_alpha,
stats = list(center = "median", scale = "IQR"))
all.equal(target = scaled_topsoil_alpha_v5,
current = scaled_topsoil_alpha_v6,
check.attributes = FALSE)
## [1] TRUE
For the full landfill data set we can center and scale all the columns using:
Note that the values used for centering and scaling are retained as attributes for each variable in the data set.
str(landfill[, 1:2] )
## Classes 'data.table' and 'data.frame': 974 obs. of 2 variables:
## $ topsoil_porosity: num 0.244 -0.171 -0.301 -0.223 -0.586 ...
## ..- attr(*, "center")= num 0.296
## ..- attr(*, "scale")= num 0.0385
## $ topsoil_alpha : num -0.735 -0.455 -0.455 0.146 -0.535 ...
## ..- attr(*, "center")= num 0.0143
## ..- attr(*, "scale")= num 0.0025
## - attr(*, ".internal.selfref")=<externalptr>
The landfill
object generated in this vignette is the same as the one generated using data(landfill, package = "ensr")
.
print(sessionInfo(), local = FALSE)
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ensr_0.1.0.9001 glmnet_3.0-2 Matrix_1.2-18 digest_0.6.25
## [5] magrittr_1.5 data.table_1.12.8 qwraps2_0.4.2.9002 knitr_1.28
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 codetools_0.2-16 lattice_0.20-38 foreach_1.4.8
## [5] grid_3.6.2 evaluate_0.14 rlang_0.4.5 stringi_1.4.6
## [9] rmarkdown_2.1 iterators_1.0.12 tools_3.6.2 stringr_1.4.0
## [13] xfun_0.12 yaml_2.2.1 compiler_3.6.2 shape_1.4.4
## [17] htmltools_0.4.0