--- title: "Implementation of a Simulation Study in R" author: "M Schnitzer" date: "06/06/2021" output: html_document: default pdf_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Introduction Simulation studies are used in statistics and computer science to evaluate the performance of estimators against a set of criteria. Typically, one will generate a synthetic data set using fully specified distributions. Then the estimation method(s) of interest is(are) applied to the synthetic data and the results recorded. These two phases are independently repeated $M$ times. The results over the $M$ runs are then summarized to give the results. It is important to identify the specific goal of your simulation study. If the goal is parameter estimation, then you must first derive or estimate the true value of the parameter of interest, which depends on the distribution used to generate the data. Another potential goal in a simulation study is to evaluate the performance of a prediction method. This tutorial will focus on simulation studies for the evaluation of an estimator of a parameter from a frequentist perspective. ## A simple example As a simple example, suppose that we are interested in comparing the performance of a linear regression with a spline regression to estimate the expected value of an outcome, $Y$, for a certain value of a covariate, $X$. Our parameter of interest will be $E(Y\mid X=2)$. To conduct the simulation study, we will first choose the distributions of variables $X$ and $Y$. We will generate 1000 independent realizations (or draws) of these two variables. We will apply the linear regression estimator then the spline estimator for the quantity of interest and use ```predict``` to get an estimate of the expectation of $Y$ at the value $X=2$, along with a standard error of the prediction. Then, we will record the estimate and standard error by writing to a file. We will use a loop to repeat this procedure $M=200$ times. We will demonstrate how to obtain the true value of $E(Y\mid X=2)$ using an analytical approach and a numerical method. ## A single run ```{r setseed, include=FALSE} set.seed(3434) ``` Let's set both distributions to be Gaussian and take 1000 independent draws of $(X,Y)$: ```{r datagen} n = 1000 X = rnorm(n) Y = rnorm(n,mean=(0.5*X+X^2)) ``` In order to estimate the parameter of interest on successive runs, we can define functions that output the information that we need as a list: ```{r functions} linreg<-function(X,Y){ #linear regression, main term X predobj<-predict(lm(Y~X),newdata=data.frame(X=2),se.fit=T) return(list(est=predobj$fit,se=predobj$se.fit)) } splinereg<-function(X,Y){ #spline regression, degree 3 predobj<-predict(lm(Y~poly(X,3)),newdata=data.frame(X=2),se.fit=T) return(list(est=predobj$fit,se=predobj$se.fit)) } ``` We then run the functions on the synthetic data and save the results. To write to file on the local directory, we can use ```{r results} reslm<-linreg(X,Y) ressp<-splinereg(X,Y) write(c(reslm$est,reslm$se,ressp$est,ressp$se),"results_single.txt",ncol=10,append=T) ``` ## Getting the true value of the parameter of interest We want to compare our results with the true value of the parameter of interest. In this example, calculating $E(Y\mid X=2)=0.5\times 2+2^2=5$ is trivial. When the true value is less easily derived (through algebra or numerical integration), one can estimate this value numerically. ```{r numerical} n.s = 100000 #large number X.s = 2 Y.s = rnorm(n.s,mean=(0.5*X.s+X.s^2)) mean(Y.s) ``` ```{r clearmemory,include=FALSE} rm(Y.s) ``` If the resulting value from ```mean(Y.s)``` is unstable, the sample size can be increased for the calculation and/or this procedure can be looped and the results averaged. Now we set the true value to ```{r settruth} truth=5 ``` The errors from one simulated dataset can be calculated as: ```{r error} reslm$est-truth #error for the linear regression ressp$est-truth #error for the spline regression ``` We see that the linear regression model gives more error than the spline model. This makes sense because the true model for $E(Y\mid X)$, which contains a squared term of $X$, is not contained in the linear regression model but it is contained in the spline model of degree 3. ```{r plots} l1<-predict(lm(Y~X)) l2<-predict(lm(Y~poly(X,3))) j <- order(X) plot(Y ~ X, pch=19,cex=0.1) lines(X[j],l1[j],col="red",lwd=4) lines(X[j],l2[j],col="blue",lwd=2) abline(v=2,lty="dashed") ``` From the above plot, we'd expect that the linear regression (red, thicker curve) will not perform well as we move outside of the centre mass of the data. The spline regression of degree 3 (blue, thinner curve) does a better job, in particular near the value of interest, $X=2$ (vertical dashed line). ## Reproducible simulated datasets In order to evaluate the performance of the estimators over multiple draws, we will loop the above procedure. But we will make sure that this procedure is reproducible by setting seeds for each iteration. We generate 200 random seeds and then store them to a file: ```{r setseedforseeds} set.seed(111) ``` ```{r generateseeds} write(sample(1:10000000,size=200),file="seeds.txt",ncolumns=1) ``` This is only done once and the file is stored permanently. Any time we want to run the simulation study, we then read in the seeds stored in this file: ```{r readseeds} seeds<-read.table("seeds.txt",header=F)$V1 ``` and set each seed when we generate the data in each loop. This way, the same datasets will be generated each time we run the loop. ## Running the loop The complete simulation study can be conducted as follows (using only 200 draws for the purposes of this tutorial though normally we would use 1000+ draws). Here, we use a sample size of $n=1000$. This emulates a study where we have 1000 units or participants. ```{r completeloop} n = 1000 for (i in 1:200){ #generate data set.seed(seeds[i]) #sets the seed to the ith element in the vector seeds X = rnorm(n) Y = rnorm(n,mean=(0.5*X+X^2)) #get estimates reslm<-linreg(X,Y) ressp<-splinereg(X,Y) #write to file write(c(i,reslm$est,reslm$se,ressp$est,ressp$se),"results.txt",ncol=10,append=T) } ``` Now we should have a file with a 200x5 matrix of results. The first column is the iteration number. The second and fourth columns are the estimates for the linear and spline regressions, respectively. And the third and fifth columns are the standard errors for the respective estimators. The results are imported using: ```{r loadres} dat1<-read.table("results.txt",header=F) #load results dim(dat1) #check dimension ``` We can summarize the estimated bias (or average error) using ```{r error1} colMeans(dat1[,c(2,4)])-truth ``` and the average standard error using ```{r error2} sqrt(colMeans(dat1[,c(3,5)]^2)) ``` Note that this takes the squared root of the mean of the estimation variances produced by the ```lm``` function in each run. To automatically generate a latex table, we could use the package ```xtable```. ```{r table} bias<-colMeans(dat1[,c(2,4)])-truth avg.se<-sqrt(colMeans(dat1[,c(3,5)]^2)) datfr<-data.frame(cbind(bias,avg.se)) names(datfr)<-c("Bias","Avg SE") rownames(datfr)<-c("Linear regression","Spline regression deg 3") datfr library(xtable) xtable(datfr) ``` From this, we see that the linear regression had bias but the spline regression of degree 3 did not. The linear regression also had more average standard error than the spline regression. ## Adding a comparator Now suppose that we want to add a third comparator. For example, we can investigate the performance of a spline of degree 2. Below we add predictions from a spline of degree 2 (thicker black curve) to the previous plot. It overlaps the spline of degree 3 in the region of interest, so we'd expect the performance to be similar. ```{r plot2} l1<-predict(lm(Y~X)) l2<-predict(lm(Y~poly(X,3))) j <- order(X) plot(Y ~ X, pch=19,cex=0.1) lines(X[j],l1[j],col="red",lwd=4) lines(X[j],l2[j],col="blue",lwd=2) abline(v=2,lty="dashed") l3<-predict(lm(Y~poly(X,2))) lines(X[j],l3[j],col="black",lwd=4) ``` Because we can reproduce the same simulated datasets, we can run a separate loop and don't need to rerun the other estimators in order to make a fair comparison. ```{r comparator} splinereg2<-function(X,Y){ predobj<-predict(lm(Y~poly(X,2)),newdata=data.frame(X=2),se.fit=T) return(list(est=predobj$fit,se=predobj$se.fit)) } for (i in 1:200){ #generate data set.seed(seeds[i]) #sets the seed to the ith element in the vector seeds X = rnorm(n) Y = rnorm(n,mean=(0.5*X+X^2)) #get estimates ressp2<-splinereg2(X,Y) #write to file write(c(i,ressp2$est,ressp2$se),"results2.txt",ncol=10,append=T) } ``` Now we can add these results to the table. ```{r addres} dat2<-read.table("results2.txt",header=F) #load results dim(dat2) #check dimension datfr<-rbind(datfr,c(mean(dat2[,2])-truth,sqrt(mean(dat2[,3]^2)))) rownames(datfr)[3]<-"Spline regression deg 2" datfr xtable(datfr) ``` Thus, the spline of degree 2 is similarly unbiased and has lower standard error than either of the other estimators, on average. ## Adding summary information to the table We may also be interested in evaluating how the estimated standard error compares to the Monte Carlo standard error (MCSE) and the coverage of the 95% confidence intervals derived from the standard errors (SE) using $est \pm 1.96\times SE$. We combine the results ```dat1``` and ```dat2``` and calculate these quantities for all three estimators: ```{r addinfo} dat_res<-cbind(dat1,dat2) MCSE<-apply(dat_res[c(2,4,7)],2,sd) cover<-colMeans(dat_res[,c(2,4,7)]-1.96*dat_res[,c(3,5,8)]<=truth & dat_res[,c(2,4,7)]+1.96*dat_res[,c(3,5,8)]>=truth) datfr$"MCSE"<-MCSE datfr$"Coverage"<-cover datfr xtable(datfr) ``` Now we can see that the linear regression method underestimated the standard error of the method. The splines performed better, leading to close to optimal coverage. ## Version control A simple way to achieve version control is to save copies of the simulation study R document(s), seeds, and results in a separate folder with the date and version number. Note which version you use in a presentation or paper. Then you can always refer to the exact code and data used to obtain the results if needed. ## Conclusion In this tutorial, we presented an approach to conducting simulation studies in R. In particular, one can retrieve any given synthetic dataset by calling the ```i```th seed and re-generating the data. Because the saved results are indexed by ```i```, this allows you to further investigate anomalous individual results to better understand the performance of your estimator. There are many considerations in simulation studies that were not addressed here, including choosing interesting or relevant data generating distributions and varying the dataset sample size $n$. Such considerations should be guided by scientific interest that can stem from an application or methods performance in general.