Etienne Low-Décarie
12 July 2018
Workshop
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.
Sir Ronald Fisher, Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14-17.
Good question / good hypothesis
Recognizing that not all studies are aimed at hypothesis testing.
A quantitative phenotype differs significantly between European-derived and Asian-derived populations
Common genetic variants account for differences in gene expression among ethnic groups
“Interestingly, the arrays used to measure expression for the CEU [European-derived] individuals were primarily processed from 2003 to 2004, whereas the arrays used to measure expression for the ASN [Asian-derived] individuals were all processed in 2005–2006. […] microarray batch effects appear to be completely confounded with population effects.”
On the design and analysis of gene expression studies in human populations
Sediment suppresses herbivory across a coral reef depth gradient
Stratified, with split plots. Treatment: air gun and air lift (vacuuming). Control: plastic sheet protection.
Accounts for variability
Challenge with:
The effect of elevated CO2 on growth and competition in experimental phytoplankton communities
Dependent \ Independent | Continuous | Categorical |
---|---|---|
Continuous | Regression | ANOVA |
Categorical | Logistic regression | Tabular |
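As a rough map from this table to R calls (a sketch; the data frame d and its variables are hypothetical):

# Hypothetical data frame d with a continuous response y, a continuous predictor x,
# a categorical predictor group and a categorical (binary) response outcome
lm(y ~ x, data = d)                    # continuous ~ continuous: regression
aov(y ~ group, data = d)               # continuous ~ categorical: ANOVA
glm(outcome ~ x, data = d,
    family = binomial)                 # categorical ~ continuous: logistic regression
chisq.test(table(d$outcome, d$group))  # categorical ~ categorical: tabular (contingency table)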
Many are optimal experimental designs if all required experimental units are available
Park grass experiment:
- started by Lawes and Gilbert in 1856 at Rothamsted
- 162 years old
Park grass:
Redesign park grass experiment:
**What other famous failures of experimental design do you know?**
30 minutes
Analysis despite flaws in experimental design:
Determinants of Species Richness in the Park Grass Experiment
Park grass-inspired experiment using proper experimental design:
Effects of grassland management on plant C:N:P stoichiometry: implications for soil element cycling and storage (example publication, dozens of related publications)
pwr
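For simple designs, analytical power calculations are available through the pwr package; a minimal sketch (the effect size, alpha and power below are illustrative values, not from the workshop):

require(pwr)
# sample size per group needed to detect a medium effect (d = 0.5)
# with 80% power at alpha = 0.05 in a two-sample t-test
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)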
Create a spreadsheet simulation of an experiment testing:
and/or
and/or
What issues did you identify?
Throwing a six-sided die
Create the die
die <- seq(from=1, to=6, by=1)
Throw the die
sample(x=die, size=1, replace=T)
[1] 4
Throw the die 1000 times and keep the results
die_results <- sample(x=die,
                      size=1000,
                      replace=T)
Plot the results
hist(die_results,
     breaks=seq(from=0.5,
                to=6.5,
                by=1))
# load TeachingDemos
require(TeachingDemos)

# create the die
rgl.die()

# roll the regular die
roll.rgl.die(steps=10)
TeachingDemos also contains a dice() function that throws ndice dice, rolls times.
dice(rolls=3,
     ndice=4,
     sides=6,
     plot.it=T,
     load=rep(1, 6))
die10 <- seq(from=1, to=10, by=1)
coin <- c("head", "tails")
p_dice <- c(1, 1, 1, 1, 1, 5)
die_results <- sample(x=die,
                      size=1000,
                      replace=T,
                      prob=p_dice)
Normal distribution
rnorm(n = number of observations,
      mean = vector of means,
      sd = vector of standard deviations)
Normal distribution
sample_norm <- rnorm(n = 10000,
                     mean = 10,
                     sd = 1)
hist(sample_norm)
Available in base R: beta (beta), binomial (binom), Cauchy (cauchy), chi-squared (chisq), exponential (exp), Fisher F (f), gamma (gamma), geometric (geom), hypergeometric (hyper), logistic (logis), lognormal (lnorm), negative binomial (nbinom), normal (norm), Poisson (pois), Student t (t), uniform (unif), Weibull (weibull).
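Each root name combines with the prefixes r (random draws), d (density), p (cumulative probability) and q (quantile); for example, for the normal distribution:

rnorm(n = 5, mean = 0, sd = 1)      # r: random draws
dnorm(x = 0, mean = 0, sd = 1)      # d: density at x
pnorm(q = 1.96, mean = 0, sd = 1)   # p: cumulative probability up to q
qnorm(p = 0.975, mean = 0, sd = 1)  # q: quantile for a given probability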
Additional challenges:
1.
dolphins <- rpois(n = 1000,
                  lambda = 3)
hist(dolphins)
2.
species <- rlnorm(n = 1000,
                  meanlog = 25,
                  sdlog = 1)
hist(species)
In R, create a new column containing simulated values using a more sensible distribution of error.
60 minutes
Design setup “manually”
Example: 2 factors (one with 2 levels, one with 3), 3 replicates, full factorial
one_replicate <- data.frame(factorA=factor(rep(c("FacA_Level1",
                                                 "FacA_Level2"),
                                               3)),
                            factorB=factor(rep(c("FacB_Level1",
                                                 "FacB_Level2",
                                                 "FacB_Level3"),
                                               each=2)))
head(one_replicate)
      factorA     factorB
1 FacA_Level1 FacB_Level1
2 FacA_Level2 FacB_Level1
3 FacA_Level1 FacB_Level2
4 FacA_Level2 FacB_Level2
5 FacA_Level1 FacB_Level3
6 FacA_Level2 FacB_Level3
Alternatives:
expand.grid()
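For instance, a sketch of the same two-factor full factorial built with expand.grid() (one_replicate_alt is an illustrative name):

# same full factorial as above, built with expand.grid()
one_replicate_alt <- expand.grid(factorA = c("FacA_Level1", "FacA_Level2"),
                                 factorB = c("FacB_Level1", "FacB_Level2",
                                             "FacB_Level3"))
head(one_replicate_alt)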
require(cowplot)
one_replicate$response <- with(one_replicate,
                               as.numeric(factorA) *
                                 as.numeric(factorB))
Function to create scaled numeric predictors from all factors:
num_scale_fact <- function(data_frame){
  ind <- sapply(data_frame, is.factor)
  data_frame[paste0("num_", names(data_frame[ind]))] <-
    lapply(data_frame[ind],
           function(x) scale(as.numeric(x)))
  return(data_frame)
}
one_replicate <- num_scale_fact(one_replicate)

one_replicate$response <- with(one_replicate,
                               num_factorA *
                                 num_factorB)
experiment1 <- NULL
for(i in LETTERS[1:3]){
  one_replicate$replicate <- factor(i)
  one_replicate$response <- with(one_replicate,
                                 num_factorA *
                                   num_factorB +
                                   rnorm(n = nrow(one_replicate), sd=2))
  experiment1 <- rbind(experiment1, one_replicate)
}
Avoid for loops in this context using the aptly named replicate(), as sketched below.
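A possible sketch, assuming one_replicate and its scaled numeric predictors from above; this rebuilds experiment1 without the explicit loop:

# evaluate the simulation three times; each evaluation returns one replicate
rep_list <- replicate(3, {
  one_replicate$response <- with(one_replicate,
                                 num_factorA * num_factorB +
                                   rnorm(n = nrow(one_replicate), sd = 2))
  one_replicate
}, simplify = FALSE)

# bind the replicates together and label them A, B, C
experiment1 <- do.call("rbind", rep_list)
experiment1$replicate <- factor(rep(LETTERS[1:3], each = nrow(one_replicate)))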
1.
require(ggplot2)
p <- qplot(data=experiment1,
           x=factorA,
           y=response,
           colour=factorB,
           geom="boxplot")
print(p)
2.
fit <- aov(data = experiment1,
           response~factorA*factorB)
summary(fit)
                Df Sum Sq Mean Sq F value   Pr(>F)
factorA          1  0.000    0.00    0.00        1
factorB          2  0.000    0.00    0.00        1
factorA:factorB  2 12.500    6.25   24.98 5.28e-05 ***
Residuals       12  3.002    0.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Experimental setup using tools
require(AlgDesign)
experiment2 <- gen.factorial(levels=c(3,2,2),
                             nVars=3,
                             varNames=c("A","B","C"),
                             factors="all")
Create an expanded simulated data set and analyse each trial
Create the experimental design
experiment_design <- gen.factorial(levels=c(3,2,2),
                                   nVars=3,
                                   varNames=c("A","B","C"),
                                   factors="all")
experiment_design <- num_scale_fact(experiment_design)

trial_rep <- expand.grid(trial=as.factor(1:1000),
                         rep=as.factor(1:5))

# expand.grid for data frames, adapted from
# https://stackoverflow.com/questions/11693599/alternative-to-expand-grid-for-data-frames
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))

rep_exp1 <- expand.grid.df(trial_rep, experiment_design)

rep_exp1$response <- with(rep_exp1,
                          num_A + num_B + num_B*num_C + num_A*num_B*num_C +
                            rnorm(nrow(rep_exp1), sd=2))
p <- qplot(data=rep_exp1,
           x=A,
           y=response,
           colour=B,
           alpha=I(0.02),
           geom="jitter") +
  geom_line(aes(group=paste(trial, rep, B)),
            alpha=0.02) +
  facet_grid(. ~ C, labeller = label_both)
print(p)
require(broom)
require(dplyr)
fit_results <- rep_exp1 %>%
  group_by(trial) %>%
  do(tidy(aov(response~A*B*C, data=.)))
require(ggplot2)
p <- qplot(data = fit_results,
           x = p.value,
           binwidth=0.05) +
  facet_wrap(~term)
require(sjstats)
fit_results <- rep_exp1 %>%
  group_by(trial) %>%
  do(omega_sq(aov(response~A*B*C, data=.)))
Fit a model to a single experiment and use simulate
simulate.lm: draws new responses from a normal distribution with the same MSE as the original model
Create a base experiment
exp_d_rep <- do.call("rbind", replicate(3, experiment_design, simplify = FALSE))

exp_resp <- exp_d_rep
exp_resp$response <- with(exp_resp,
                          num_A + num_B + num_B*num_C + num_A*num_B*num_C +
                            rnorm(nrow(exp_resp), sd=2))

fit <- lm(response~A*B*C, data=exp_resp)
exp_sim <- cbind(exp_d_rep, simulate(fit, nsim = 10))
require(tidyr)
exp_sim <- gather(exp_sim, contains("sim_"), key = trial, value = response)
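Each simulated trial can then be analysed just as above; a possible sketch (sim_fit_results is an illustrative name):

require(broom)
require(dplyr)
# one ANOVA per simulated response column (sim_1, sim_2, ...)
sim_fit_results <- exp_sim %>%
  group_by(trial) %>%
  do(tidy(aov(response~A*B*C, data=.)))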
Sensitivity analysis of design: simulate a series of experimental design scenarios and see how they affect inference from the study.
Let's get super meta: set up an experimental design for the sensitivity analysis of your experiment.
sens_setup <- expand.grid(trial=as.factor(1:100),
                          replication=c(3:10),
                          error=c(0.1, 0.5, 1, 5, 10))
Set up a function
f_rep_error <- function(rep, error){
  rep_exp <- expand.grid.df(data.frame(rep=rep, replicate=1:rep),
                            experiment_design)
  rep_exp$response <- with(rep_exp,
                           num_A + num_B + num_B*num_C + num_A*num_B*num_C +
                             rnorm(nrow(rep_exp), sd=error))
  return(rep_exp)
}
Apply the function to individual trials
trial_rep_error <- sens_setup %>%
  group_by(trial, replication, error) %>%
  do(with(., f_rep_error(replication, error)))
require(broom)
require(dplyr)
fit_results <- trial_rep_error %>%
  group_by(trial, rep, error) %>%
  do(tidy(aov(response~A*B*C, data=.)))
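One way to summarise the sensitivity analysis, assuming fit_results as built above, is the proportion of trials in which each term reaches significance (empirical power) across replication levels and error settings (power_summary is an illustrative name):

require(dplyr)
require(ggplot2)
# proportion of trials in which each term reaches p < 0.05,
# for every combination of replication level and error SD
power_summary <- fit_results %>%
  filter(term != "Residuals") %>%
  group_by(rep, error, term) %>%
  summarise(power = mean(p.value < 0.05))

p <- qplot(data = power_summary,
           x = rep,
           y = power,
           colour = factor(error),
           geom = "line") +
  facet_wrap(~term)
print(p)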
Optimize the design using tools
For certain types of design, the allocation of experimental units can be optimized using established algorithms.
For example, in some cases, a full factorial design can be reduced to a carefully selected subset of cases with only a marginal effect on the estimated parameters.
See vignette("AlgDesign") in R.
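A minimal sketch of such a reduction with AlgDesign's optFederov, asking for a D-optimal subset of the full factorial (the formula and number of trials are illustrative choices, not from the workshop):

require(AlgDesign)
# candidate set: the same 3 x 2 x 2 full factorial used above
candidates <- gen.factorial(levels = c(3, 2, 2),
                            nVars = 3,
                            varNames = c("A", "B", "C"),
                            factors = "all")
# D-optimal subset of 8 of the 12 treatment combinations,
# optimized for a main-effects model
opt_design <- optFederov(frml = ~ A + B + C,
                         data = candidates,
                         nTrials = 8,
                         criterion = "D")
opt_design$design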
Challenge: starting from a full factorial produced with gen.factorial, create an optimized incomplete factorial design using the same setup as rep_exp1 and analyse it with aov.
30 minutes
Dojo: deliberate practice
Alternative: pair/team coding on each other's experiments (taking turns focusing on each experiment in the team).
Produce an optimal design for the treatments in the Park Grass experiment
Design an experiment to:
Redesign the experiment from your favourite recent publication