6  S: Point Estimates and Their Uncertainty

Load R-libraries

Code
# Load the library "rstanarm" to fit Bayesian linear models using the function 
# stan_glm() 

library(rstanarm)

Topics

  • Point estimates
  • Interval estimates
    • Compatibility interval. A range of parameter values compatible with the observed data and the assumptions of the chosen statistical model. The term “compatibility interval” serves as an umbrella for intervals such as “confidence intervals” and “credible intervals.”
    • Resist dichotomous thinking! Whether “… intervals include or exclude zero should play no role in their interpretation, …”, Amrhein et al. (2019); “Forget about p-values, and forget about whether your confidence intervals exclude zero.” Gelman et al. (2021) (p. 493)
  • Type M and Type S errors.
  • Sampling distribution and standard error
    • Confidence intervals. The classical (frequentist) compatibility interval, derived analytically or using simulation (bootstrapping).
  • Likelihood function
    • Credible intervals. The Bayesian compatibility interval, derived from the likelihood function weighted by the prior probability.
  • Bayesian inference
    • General idea: Method for updating beliefs in light of new data using Bayes’ rule
    • Prior distribution
    • Likelihood function
    • Posterior distribution

Readings.

  • Amrhein et al. (2019)
  • Gelman et al. (2021)
    • Ch 3: Section 3.5, focus on normal distribution, linear transformation, and binomial distribution. See also “comparing distributions”, for use of percentiles
    • Ch 4: You may skip section 4.6
    • Ch 5: Section 5.1, for a simulation involving the binomial distribution.
    • Ch 5, section 5.4 on bootstrapping.
    • Appendix B “Forget about statistical significance”
  • Wilkinson (1999), in particular the sections on hypothesis tests, effect sizes, and interval estimates.

6.1 Estimation

Read Wilkinson (1999)!

Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval. Never use the unfortunate expression “accept the null hypothesis.” Always provide some effect size estimate when reporting a p value.

Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d). It helps to add brief comments that place these effect sizes in a practical and theoretical context.

Interval estimates. Interval estimates should be given for any effect sizes involving principal outcomes. Provide intervals for correlations and other coefficients of association or variation whenever possible.

Wilkinson (1999), p. 599.


Example from Eriksson et al. (2014)

From the inspirational article for Data set 3. The table shows regression coefficients (point estimates) with 95 % confidence intervals.


Example from Van Hedger et al. (2019)

This is the inspirational article for Data set 1. The figure shows mean z-scores (point estimates) with standard errors of the mean (i.e., 68 % confidence intervals).



What is estimated?

Unknown and typically unobservable parameter value to be estimated:

  • Single parameter value: Population proportion, mean, median, mode, … For example, the median income in a population.
  • Contrast: Difference, ratio or some other comparison of two parameter values that may not be related to a causal research question. For example, estimating the difference in income between males and females.
  • Measure of association: Degree of association between variables in the population. For example, the population coefficient of correlation between age and income in a population.
  • Effect size: A population contrast or measure of association (in the population) related to a specific causal research question. Note: The term “effect size” is typically used in a broader sense than this, including contrasts between variables that we do not believe to be causally related (see also notes on measurement, ch. 5). In this course, we will restrict our use of the term “effect size” to estimates of causal effect sizes (which makes a lot of sense given that “effect” is a causal term). The Population average treatment effect (PATE, see Gelman et al., 2021, p. 343) is an example of a population effect size defined as a contrast between unknown parameters. The Sample average treatment effect (SATE, see Gelman et al., 2021, p. 343) may also be included here. Although it is about properties of a sample, it is defined as a contrast between unknown parameters: unobserved single-unit causal effects among sample units (see the simulation sketch below).
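
To make the SATE concrete, here is a minimal simulation sketch (all numbers are invented for illustration): each unit has two potential outcomes, and the SATE is the average of the unit-level causal effects, a quantity we can compute in a simulation but never observe in real data, where only one potential outcome per unit is seen.

Code
# Hypothetical potential-outcomes simulation (illustration only)
set.seed(1)
n <- 100
y0 <- rnorm(n, mean = 50, sd = 10)  # potential outcomes under control
y1 <- y0 + 5                        # potential outcomes under treatment (true effect = 5)
mean(y1 - y0)                       # SATE: mean of the single-unit causal effects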


Point and interval estimate

Point Estimate

  • Sample statistic: proportion, mean, median, mode, …
  • Sample contrast or association:
    • Unstandardized contrast: Difference between means (or medians), regression coefficient, …
    • Relative proportions: Relative risk, Risk difference, Odds ratio …
    • Standardized contrast: Cohen’s d, Glass’ \(\delta\), Hedge’s g, …
    • Standardized association: Pearson’s r, Spearman’s \(\rho\), \(R^2\), \(\eta^2\), …
    • (Case-wise proportions: Probability of superiority, … not part of the course)
  • Sample effect size: Sample contrast or association related to a specific causal research question. Again: The term sample “effect size” is typically used in a broader sense than this, including measures of contrasts between variables that we do not believe to be causally related. But I will try to restrict the use of “effect” to statements about causal effects.


Interval

Point estimates should be accompanied by interval estimates. Three general types of intervals are

  1. Confidence intervals, from the frequentist perspective, derived from the sampling distribution, e.g., \(95\%\;CI \approx \bar{X} \pm 2 \times SE_{\bar{X}}\), or by simulation (bootstrapping).
  2. Credible intervals, from the Bayesian perspective, derived from the posterior distribution (often using simulation methods), which is estimated from the prior distribution and the likelihood function.
  3. Likelihood intervals, similar to credible intervals, but without incorporating prior beliefs (i.e., assuming “flat priors”).

The good news: For the same data, confidence intervals, credible intervals, and likelihood intervals tend to be quite similar, especially when derived from large datasets.
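
As a quick sketch of this claim with simulated data (sample size and parameter values are arbitrary), the analytic t-based confidence interval and a bootstrap percentile interval for the same mean come out nearly identical:

Code
set.seed(1)
y <- rnorm(100, mean = 1, sd = 2)

# Analytic 95 % confidence interval, based on the t-distribution
confint(lm(y ~ 1))

# Bootstrap percentile interval, based on 10,000 resamples
boot_means <- replicate(1e4, mean(sample(y, replace = TRUE)))
quantile(boot_means, probs = c(0.025, 0.975))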


Compatibility interval: a generic term that has recently become popular:

“… we can think of the confidence interval as a ‘compatibility interval’ …, showing effect sizes most compatible with the data according to their P-values, under the model used to compute the interval. Likewise, we can think of a … Bayesian ‘credible interval’ as a compatibility interval showing effect sizes most compatible with the data, under the model and prior distribution used to compute the interval …” (from Amrhein et al., 2019)


Gelman et al. (2021) discuss the term on pp. 110-111 (Gelman seems to prefer the term “uncertainty interval”, but I prefer “compatibility interval” as it makes no claims about our state of certainty; “compatibility” refers only to the relation between, on the one hand, possible parameter values, and, on the other, our data and statistical model).


Type M (magnitude) and Type S (sign) errors

Gelman et al. (2021), p. 59, defines:

  • Type S error occurs when the sign of the estimated effect is of the opposite direction to the true effect.
  • Type M error occurs when the magnitude of the estimated effect is much different from the true effect.

This classification of errors seems more relevant than the old Type I and Type II error classification, which was based on null hypothesis significance testing and its dichotomization of results as either statistically significant or non-significant.

“We do not generally use null hypothesis significance testing in our work. In the field in which we work, we do not generally think null hypotheses can be true: in social science and public health, just about every treatment one might consider will have some effect, and no comparison or regression coefficient of interest will be exactly zero. We do not find it particularly helpful to formulate and test null hypotheses that we know ahead of time cannot be true. Testing null hypotheses is just a matter of data collection: with sufficient sample size, any hypothesis can be rejected, and there is no real point in gathering a mountain of data just to reject a hypothesis that we did not believe in the first place.”

Gelman et al. (2021), p. 59.

Variability versus Compatibility

It is important to distinguish between intervals that describe the variability in our data (e.g., \(\pm 1 SD\)) and intervals that describe the uncertainty of our parameter estimates (e.g., \(\pm 1 SE\)). I will call the former variability intervals (a new term that I just invented) and the latter compatibility intervals, for reasons discussed above.

  • Variability intervals, e.g., \(\pm 1 SD\), \(IQR\), \(Max-Min\), describe the variability (dispersion) of the data: the wider the interval, the larger the variability in the data. The size of the interval does not depend on sample size (but may vary a lot for small samples due to sampling error).
  • Compatibility intervals, e.g., \(\pm 1 SE\), \(95\%\;CI\), tell us about the compatibility of possible parameter values with our data, given our statistical model. The compatibility interval says little about the variability of the data, because the width of the interval depends strongly on sample size.


Simple simulation of variability of scores and compatibility of mean scores in a small and a large study, sampled from the same population of test takers:

Code
errbar_plot <- function(y){
  # Plot point estimate +/- 1 SD
  plot(NA, xlim = c(0, 3), ylim = c(-2, 4), axes = FALSE, xlab = '', ylab = '')
  arrows(x0 = 1, x1 = 1, y0 = mean(y) - sd(y), y1 = mean(y) + sd(y),
       length = 0.01, angle = 90, code = 3)
  points(1, mean(y), pch = 21, bg = 'grey')
  text(x = 0.8, y = -2.0, "+/- 1 SD", pos = 4)

  # Add point estimate with 95 % CI
  ci <- confint(lm(y ~ 1))
  arrows(x0 = 2, x1 = 2, y0 = ci[1], y1 = ci[2],
       length = 0.01, angle = 90, code = 3)
  points(2, mean(y), pch = 21, bg = 'grey')
  text(x = 1.8, y = -2.0, "95 % CI", pos = 4)
  
  # Add y-axis
  axis(2, las = 2, pos = 0.5, tck = 0.01)
  mtext(text = "outcome variable", side = 2, line = -2.5, cex = 1)
  
  # Add text sample size
  stext <- paste("Sample size =", as.character(length(y)))
  mtext(text = stext, side = 3, cex = 1)
}
Code
set.seed(123)

# Small sample study
n <- 20
y <- rnorm(n, mean = 1, sd = 2)  # Difference score in pre-post experiment
errbar_plot(y)

Code
# Large sample study
n <- 800
y <- rnorm(n, mean = 1, sd = 2)  # Difference score in pre-post experiment
errbar_plot(y)




6.2 Interpretation of compatibility interval

Resist dichotomous thinking!

“… whether [compatibility] intervals include or exclude zero should play no role in their interpretation, because even with only random variation the intervals from different studies can easily be very different” (Amrhein et al., 2019)

Amrhein et al. (2019)


Example in Amrhein et al. (2019): Result from a study on unintended effects of anti-inflammatory drugs: risk ratio (RR) of 1.2, with a 95 % interval ranging from a 3 % decrease to a 48 % increase in risk. This non-significant result (p = 0.091) led the authors to conclude that exposure to the drugs was not associated with the outcome (new-onset atrial fibrillation), in contrast to results from an earlier study (RR = 1.2) with a statistically significant outcome.

Compare this to Amrhein et al’s interpretation of the result:

“Like a previous study, our results suggest a 20% increase in risk of new-onset atrial fibrillation in patients given the anti-inflammatory drugs. Nonetheless, a risk difference ranging from a 3% decrease, a small negative association, to a 48% increase, a substantial positive association, is also reasonably compatible with our data, given our assumptions.” (Amrhein et al., 2019)

This is a good example of how to interpret point estimates and intervals!



6.3 Sampling distribution and Standard error

A sampling distribution is the distribution of a sample statistic, for example the mean, that would be observed if we could repeat the sampling procedure infinitely many times, each time calculating the statistic. The standard error of the statistic is the standard deviation of its sampling distribution.

Standard error of the mean

Here is a simulation of random samples drawn from a normal distribution with mean = 100 and sd = 15. The figure shows the distribution of sample means for three sample sizes.

Code
# Create a function that does what you want to simulate one time
my_sample <- function(n = 25) {
  x <- rnorm(n, mean = 100, sd = 15)
  out <- mean(x)
  out
}

# Repeat my_sample() many times with the replicate() function
n25 <- replicate(n = 1e4, my_sample(n = 25))  # Replicate 10,000 times
n100 <- replicate(n = 1e4, my_sample(n = 100))
n900 <- replicate(n = 1e4, my_sample(n = 900))

# Plots
par(mfrow = c(1, 3))
hist(n25, breaks = 200, xlim = c(85, 115), main = "n = 25")
hist(n100, breaks = 200, xlim = c(85, 115), main = "n = 100")
hist(n900, breaks = 200, xlim = c(85, 115), main = "n = 900")


The standard deviation of the sampling distribution of means is the standard error of the mean: \(\sigma/\sqrt{n}\), where \(\sigma\) is the standard deviation in the population and \(n\) is the sample size.

The simulations were based on 10,000 random samples, so the standard deviation from the simulations would be a good approximation of the standard deviation (i.e., standard error) of the true sampling distribution (based on infinitely many random draws).

Code
# Standard errors (sd of sampling distributions)
se <- c(se_n25_approx = sd(n25), se_n100_approx = sd(n100), se_n900_approx = sd(n900))
round(se, 3)
 se_n25_approx se_n100_approx se_n900_approx 
         2.951          1.495          0.499 
Code
se_exact <- 15 / sqrt(c(25, 100, 900))  # SE = SD/sqrt(n)
names(se_exact) <- c("se_n25", "se_n100", "se_n900")
round(se_exact, 3)
 se_n25 se_n100 se_n900 
    3.0     1.5     0.5 


Typically, we don’t know \(\sigma\), and therefore use the sample standard deviation to estimate the (true) standard error. In most cases when people, including myself, use the term “standard error” (SE) they refer to an estimate of the true standard error. The estimate of the standard error of the mean, \(SE_{\bar{X}}\), is

\(SE_{\bar{X}} = \frac{SD(X)}{\sqrt{n}}\),

where \(X\) is a sample of observations \(x_1, x_2, ..., x_n\), \({\bar{X}}\) is the sample mean, and \(SD(X)\) is the sample standard deviation.
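
For example, a small sketch computing the estimated SE from a single simulated sample (the parameter values are arbitrary):

Code
set.seed(1)
x <- rnorm(50, mean = 100, sd = 15)     # one sample, n = 50
c(estimated = sd(x) / sqrt(length(x)),  # SE estimated from the sample
  true = 15 / sqrt(50))                 # true SE = sigma/sqrt(n)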

Note: The term “standard error” may also be used with a more general meaning:
“… in our current practice we usually summarize uncertainty using simulation, and we give the term ‘standard error’ a looser meaning to cover any measure of uncertainty that is comparable to the posterior standard deviation”, Gelman et al. (2021), p. 51, on the variability of the posterior distribution estimated using Bayesian statistics. Gelman et al. advocate the median absolute deviation (MAD) as a measure of standard error (in this looser meaning of the word), as this measure of dispersion is more robust to extreme values.
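
A small sketch of why the MAD is attractive here: with a single extreme value in the data, sd() is strongly inflated, whereas mad() (which R scales to agree with the SD for normal data) is barely affected.

Code
set.seed(1)
x <- c(rnorm(99), 50)        # 99 standard-normal values plus one extreme value
c(sd = sd(x), mad = mad(x))  # sd is inflated; mad stays close to 1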


Simulating a sampling distribution

Here is a simulation of a sampling distribution: samples of 25 observations drawn independently from a normal distribution with \(\mu = 100\) and \(\sigma = 15\). The standard deviation of the distribution of sample means, \(\sigma_{\bar x}\), is called the standard error:

\(\sigma_{\bar x} = \frac{\sigma_x}{\sqrt{n}}\)

So for this simulation with \(\sigma = 15\) and \(n = 25\), the estimated standard error from our simulation should be close to its true value:

\(\sigma_{\bar x} = \frac{15}{\sqrt{25}} = 3\).

Code
# Create a function that does one simulation
my_sample <- function(n = 25) {
  x <- rnorm(n, mean = 100, sd = 15)
  out <- mean(x)
  out
}

# Repeat my_sample() many times with the replicate() function
my_simul <- replicate(n = 1e5, my_sample())  # Replicate 100,000 times

# Illustrate results and calculate some stats
hist(my_simul, breaks = 200, main = "Sample means, n = 25")

Code
my_stats <- c(mean = mean(my_simul), se = sd(my_simul), 
              median = median(my_simul), mad = mad(my_simul))
round(my_stats, 1)
  mean     se median    mad 
   100      3    100      3 


And here is what happens if you sample from a non-normal distribution, in this example a uniform distribution between 50 and 150:

\(X \sim Unif(a = 50, b = 150)\),

for which \(E(X) = 100\), \(SD(X) = 100/\sqrt{12} \approx 28.9\). With \(n = 25\), the standard error would be \(28.9/5 \approx 5.8\).
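
These values are easy to verify numerically; here is a quick check using a large simulated sample (one million draws is an arbitrary choice):

Code
set.seed(1)
x <- runif(1e6, min = 50, max = 150)
c(mean = mean(x), sd = sd(x))  # should be close to 100 and 28.9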

Code
# As above, but with a uniform distribution between 50 and 150
my_sample <- function(n = 25) {
  x <- runif(n, min = 50, max = 150)
  out <- mean(x)
  out
}

# Plot uniform distribution
plot(20:180, dunif(20:180, min = 50, max = 150), type = 'l', ylim = c(0, 0.015),
     xlab = "x", ylab = "probability density")

Code
# Repeat my_sample() many times with the replicate() function
my_simul <- replicate(n = 1e5, my_sample())  # Replicate 100,000 times

# Illustrate results and calculate some stats
hist(my_simul, breaks = 200, main = "Sample means, n = 25")

Code
my_stats <- c(mean = mean(my_simul), se = sd(my_simul), 
              median = median(my_simul), mad = mad(my_simul))
round(my_stats, 1)
  mean     se median    mad 
 100.0    5.8  100.0    5.8 


Confidence intervals

The conventional confidence interval around the mean is calculated using the t-distribution, with n-1 degrees of freedom, df (Gelman et al. (2021), p. 53). The t-distribution with few df has heavier tails than the normal distribution, but already at moderate sample sizes, say df = 49, the two distributions are very similar. The standard analytic method for calculating the confidence interval is:

\(CI = \bar{X} \pm t_{\alpha/2}SE_{\bar{X}}\),

where \(t_{\alpha/2}\) is the critical t-value for a specified \(\alpha\) level. For example, if \(\alpha = .05\), the critical t-value defines the two cut-off points between which the middle 95 % of the probability is located. This is roughly 2 SEs on either side of the mean, which is why \(\pm 2SE\) is often a good approximation of a 95 % confidence interval. For example, the cutoffs would be \(-1.96SE\) and \(1.96SE\) for the normal distribution, and \(-2.01SE\) and \(2.01SE\) for a sample size of n = 50 (df = 49).
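
The cutoffs are easy to obtain in R with qnorm() and qt(); a quick sketch:

Code
qnorm(0.975)        # 1.96, normal distribution
qt(0.975, df = 49)  # 2.01, t-distribution, n = 50
qt(0.975, df = 9)   # 2.26, heavier tails for small samples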

Example: Sample mean = 101, sd = 21, sample size = 50.

\(95\%\;CI = \bar{X} \pm t_{.05/2}SE_{\bar{X}}\)

\(= 101 \pm 2.01 \times \frac{21}{\sqrt{50}}\)

\(= [95, 107]\).

Here is how you may use lm() in R to obtain the interval.

Code
set.seed(879)
# Simulate sample with mean = 101, sd = 21, and n = 50
m = 101  # sample mean
s = 21  # sample sd
n = 50  # sample size
x <- rnorm(n)  # 50 observations
z <- (x - mean(x))/sd(x)  # transform to z-scores: mean = 0, sd = 1
y <- (z*s) + m  # transform the z-scores to get mean = 101, sd = 21

# Use lm() to derive 95 % confidence interval, i.e., do
# linear regression without predictor, only the intercept = mean is estimated.
mfit <- lm(y ~ 1)
confint(mfit)
               2.5 %   97.5 %
(Intercept) 95.03187 106.9681

Note: Try t.test(y); it gives the same interval!


Bootstrapping

Another common method for estimating confidence intervals is bootstrapping. This is a very powerful and general method that may be used to derive confidence intervals when no analytic solution exists or when the assumptions behind analytic solutions seem hard to accept. See Gelman et al. (2021), chapter 5, on how bootstrapping is used to simulate a sampling distribution. Here is code to do it for the data simulated in the previous section.

Code
bootstrap <- function(y) {
  # Random draw from the sample with replacement, same n
  b <- sample(y, size = length(y), replace = TRUE)
  meanb <- mean(b)
  meanb
}

# Draw 10,000 bootstraps
bb <- replicate(1e4, bootstrap(y))

hist(bb, breaks = seq(84.5, 115.5, by = 1), main = "10k bootstraps",
     xlab = "Mean of bootstrapped sample")
bootstrap_summary <- c(mean = mean(bb), 
                       quantile(bb, probs = c(0.025, 0.975)))
lines(bootstrap_summary[c(2, 3)], c(40, 40), col = "blue")
points(bootstrap_summary[1], 40, pch = 21, bg = "blue")

Code
round(bootstrap_summary, 1)
 mean  2.5% 97.5% 
101.0  95.0 106.7 

6.4 Likelihood function

The likelihood is the probability of the observed data considered as a function of the unknown model parameter(s) \(\theta\).

Example: Likelihood function of the success probability \(\theta\) (theta) in a binomial experiment (“coin tosses”) with \(s = 6\) successes in \(n = 9\) trials. To make it more realistic, you may think of a sample of 9 students from a specific student population, 6 of whom answered correctly on a given exam question. Your goal is to use this data to estimate the proportion of correct answers had the full population been tested.

Code
# Function that derives the likelihood function using brute force
binom_likelihood <- function(n, s) {
  # Grid approximation
  theta <- seq(from = 0, to = 1, by = 1/300) # Grid of parameters
  likelihood <- dbinom(x = s, size = n, prob = theta) # Likelihood
  ylab = 'likelihood'

  # Plot likelihood function
  plot(theta, likelihood, type = 'l', axes = FALSE, ylab = ylab)
  axis(1, pos = 0)
  axis(2, pos = 0, las = 2)
}

# Data
n <- 9
s <- 6

# Plot likelihood
binom_likelihood(n = n, s = s)
mtext(text = "Data fixed: 6 successes in 9 trials", side = 3, cex = 0.8)
lines(x = c(6/9,6/9), y = c(0, dbinom(x = 6, size = 9, prob = 6/9)),
      lty = 2, col = "blue")

The blue line shows the maximum of the likelihood function (the maximum likelihood estimate), found at \(\theta = 6/9\). So the data is most compatible with a true proportion of 6/9, but also highly compatible with proportions around 6/9.
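
As a check, we can locate the maximum of the likelihood function numerically with optimize(); a small sketch:

Code
# Numerical maximum of the binomial likelihood for s = 6, n = 9
optimize(function(p) dbinom(x = 6, size = 9, prob = p),
         interval = c(0, 1), maximum = TRUE)
# $maximum is close to 6/9 (approx. 0.667)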


A likelihood function and a sampling distribution are different things, yet based on the same probability distribution: the likelihood function treats the data as fixed and the parameter as variable, whereas the sampling distribution treats the parameter as fixed and the data as variable. Here are sampling distributions for \(\theta\) = 1/10, 1/2, and 9/10:

Code
# Probabilities for all possible outcomes, s = 0, 1, ..., n
n = 9
par(mfrow = c(1, 3))

# Function for plotting sampling distribution figures
sdist <- function(prob) {
  p <- dbinom(0:n, size = n, prob = prob)
  plot(0:n, p, type = 'h', xlab = 'data', ylab = 'P(data|theta)')
}

sdist(prob = 1/10)
mtext(text = "Parameter fixed, theta = 1/10", side = 3, cex = 0.7)

sdist(prob = 1/2)
mtext(text = "Parameter fixed, theta = 1/2", side = 3, cex = 0.7)

sdist(prob = 9/10)
mtext(text = "Parameter fixed, theta = 9/10", side = 3, cex = 0.7)


Advanced: Likelihood for a 2-parameter distribution

In the above example, there was only one parameter, \(\theta\), representing the proportion of correct answers to an exam question had the full population of students been tested. Assume instead that we have a sample of exam scores from 16 students, and we want to estimate the mean and standard deviation had we tested the full population of students. We may treat exam scores as a continuous variable and assume that they are normally distributed,
\(scores \sim N(\mu, \sigma)\). Here are our 16 exam scores.

Code
set.seed(123)
score <- rnorm(16, 50, 10) # Simulate exam scores from 16 students
round(score)
 [1] 44 48 66 51 51 67 55 37 43 46 62 54 54 51 44 68
Code
round(c(mean = mean(score), sd = sd(score)), 2)
 mean    sd 
52.55  9.13 


The likelihood function is 2-dimensional, because we have two parameters, \(\mu\) and \(\sigma\). (Code only for nerds.)

Code
# Use grid approximation to obtain likelihood function

# Define grid
mu <- seq(from = 20, to = 80, length.out = 500)
sigma <- seq(from = 2, to = 18, length.out = 500)
grid <- expand.grid(mu = mu, sigma = sigma)

# Calculate likelihood (some tricks needed here using log = TRUE)
grid$loglike <- sapply(1:nrow(grid), function(i) sum(dnorm(
                                       score, 
                                       mean = grid$mu[i],
                                       sd = grid$sigma[i],
                                       log = TRUE)))
# Normalize to max = 1, and antilog
grid$likelihood <- exp((grid$loglike) - max(grid$loglike))

# Plot likelihood function, using contour_xyz from library rethinking
rethinking::contour_xyz(grid$mu, grid$sigma, grid$likelihood, 
            xlim = c(45, 60), ylim = c(5, 15), xlab = 'mu', ylab = 'sigma')
mtext(text = "Data fixed: n = 16, mean = 52.5, sd = 9.1", side = 3, cex = 0.9)

This contour plot shows the two-dimensional likelihood function. The maximum likelihood estimate is the mode (highest point) of this function.
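
As a sketch, we can also locate the mode numerically on the grid computed above (note that the maximum likelihood estimate of sigma divides by n rather than n - 1, so it falls slightly below the sample sd):

Code
# 'grid' was created in the previous code block
grid[which.max(grid$likelihood), c("mu", "sigma")]
# mu close to the sample mean (52.55); sigma a bit below the sample sd (9.13)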

Credible intervals

Whereas the previously discussed methods for deriving confidence intervals rely heavily on sampling distributions, Bayesian methods (covered below) combine the likelihood function with a prior to estimate the corresponding interval, often referred to as a credible interval.

This code shows how to derive a credible interval in R using rstanarm, using the same simulated data as above, i.e., sample mean = 101, sd = 21, and n = 50.

Code
# Use library(rstanarm) and function stan_glm() to derive 95 % credible interval. 
# Variable y was created in the previous code block. 

y <- data.frame(y)  # stan_glm wants data in a data frame
sfit <- stan_glm(y ~ 1, data = y, seed = 123, refresh = 0)
print(sfit)
stan_glm
 family:       gaussian [identity]
 formula:      y ~ 1
 observations: 50
 predictors:   1
------
            Median MAD_SD
(Intercept) 101.0    3.0 

Auxiliary parameter(s):
      Median MAD_SD
sigma 21.2    2.1  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg


Estimated mean = 101 with a standard error (in the loose meaning of the term) of 3.0. So a rough 95 % interval would be \(101 \pm 2 \times 3\), i.e., \([95, 107]\). We can also ask stan_glm for the 95 % interval of the posterior distribution:

Code
posterior_interval(sfit, prob = 0.95)
                2.5%     97.5%
(Intercept) 94.90063 107.33378
sigma       17.54435  26.45381


95 % interval around the mean: \([94.9, 107.3]\)

Note that stan_glm also provides a compatibility interval for the standard deviation (sigma): \([17.5, 26.5]\), point estimate = \(21.2\). Note also that this interval is not symmetric around the point estimate, because the posterior distribution of sigma is not symmetric. This is an example of where the simple rule \(\pm 2 SE\) may be misleading.
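
Here is a quick sketch of this point, comparing the symmetric ±2 MAD_SD approximation with the asymmetric interval from the posterior (sfit was fitted in the code above):

Code
# Symmetric approximation: median +/- 2 MAD_SD
c(lower = 21.2 - 2 * 2.1, upper = 21.2 + 2 * 2.1)

# Asymmetric interval from the posterior
posterior_interval(sfit, prob = 0.95)["sigma", ]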

Posterior distribution of sigma estimated using stan_glm()

Code
# Extract samples from the posterior
samples <- as.data.frame(sfit)
sigma <- samples$sigma

# Plot density curve for sigma
plot(density(sigma), xlab = "Sigma", 
     main = "Posterior probability: sigma, with 95 and 50 % CI")

# Derive point estimate and compatibility interval
point_estimate <- median(sigma)
ci95 <- quantile(sigma, prob = c(0.025, 0.975))
ci50 <- quantile(sigma, prob = c(0.25, 0.75))

# Plot point estimate and compatibility interval
arrows(x0 = ci95[1], x1 = ci95[2], y0 = 0.01, y1 = 0.01, 
       length = 0, col = "blue")
arrows(x0 = ci50[1], x1 = ci50[2], y0 = 0.01, y1 = 0.01, 
       length = 0, col = "darkblue", lwd = 3)
points(point_estimate, 0.01, pch = 21, bg = 'lightblue')  # Add point estimate

Here I also added a 50 % CI (thick line) to the 95 % interval (thin line), to emphasize that values closer to the point estimate are more compatible with the data than those near the ends of the 95 % interval (as always: given our statistical model). See Gelman et al. (2021), Fig. 19.4, p. 367, for an example of how they use combined 50 and 95 % intervals.



6.5 Probability 3: Bayesian inference

The general goal of Bayesian inference is to update prior beliefs (hypotheses) in light of data to arrive at posterior beliefs (updated hypotheses). The key idea is to think about uncertainty as probabilities and use probability theory to be consistent in the way beliefs in hypotheses are updated.


Bayes’ rule for estimating a discrete parameter

Remember the twin problem:

A woman is pregnant with twins, and her doctor tells her that about one third of twin births are identical and the remaining two thirds are fraternal. On the next visit to the doctor, the sonogram shows that she is pregnant with twin boys. Her doctor tells her that about half of identical twins are twin boys and the other half are twin girls, whereas for fraternal twins about one quarter are twin boys, one quarter are twin girls, and the rest are one of each sex. How should she update her belief in identical versus fraternal twins given this information? (old Stat1 exam question)


Here is the information in frequency format (think of 600 twin births):

            BB   Mix    GG   Marginal
Identical  100     0   100        200
Fraternal  100   200   100        400
Marginal   200   200   200        600


Solve using Bayes’ rule:

  • \(H_0\): Fraternal twins
  • \(H_1\): Identical twins
  • \(D\): Twin boys
  • \(Pr(H_0) = 2/3\), Prior probability fraternal
  • \(Pr(H_1) = 1/3\), Prior probability identical
  • \(Pr(D|H_0) = 1/4\), Likelihood fraternal
  • \(Pr(D|H_1) = 1/2\), Likelihood identical
  • \(Pr(D) = 1/3\), Probability of data
  • Bayes’ rule: \(Pr(H_i|D) = \frac{Pr(H_i)Pr(D|H_i)}{Pr(D)}\)
  • \(Pr(H_0|D) = (2/3 * 1/4)/(1/3) = 1/2\), Posterior probability fraternal
  • \(Pr(H_1|D) = (1/3 * 1/2)/(1/3) = 1/2\), Posterior probability identical


Solve with the relative version of Bayes’ rule:

  • Bayes’ rule: \(Pr(H_i|D) \propto Pr(H_i)Pr(D|H_i)\)
  • \(Pr(H_0|D) \propto (2/3 * 1/4) = 1/6\), Relative posterior fraternal
  • \(Pr(H_1|D) \propto (1/3 * 1/2) = 1/6\), Relative posterior identical

So the posterior probability is equal for the two alternatives, although we know that the absolute posterior probability is not 1/6 (it is 1/2, as computed above).

Here is a graphical illustration of the result:

Code
# Plot posterior, prior, likelihood
par(mfrow = c(1, 3))

prior <- c(2/3, 1/3)
likelihood <- c(1/4, 1/2)
relposterior <- c(1/6, 1/6)

# Make plot
mybar <- function(x, main, ylab) {
  plot(NULL, xlim = c(-0.3, 1.3), ylim = c(0, 1), axes = FALSE, 
     xlab = "Parameter", ylab = ylab, main = main)
  points(c(0, 1), x, type = "h", lwd = 2)
  axis(1, at = c(-0.3, 0, 1, 1.2), labels = c("", "Fraternal", "Identical", ""),
       cex.axis = 0.8, pos = 0)
  axis(2, las = 2, pos = -0.3)
}

# Draw plot
mybar(relposterior, main = "Relative posterior", ylab = "Relative posterior probability")
mybar(prior, main = "Prior", ylab = "Prior probability")
mybar(likelihood, main = "Likelihood", ylab = "Likelihood")

Note: The shape of the posterior is all we need; the absolute heights of the bars in the left-hand panel are not used for inference, only their relative heights. In this case, the two bars have the same height, meaning that the two outcomes have the same posterior probability.

In this example, it was fairly easy to derive the absolute posterior because we could calculate \(Pr(D)\). But for most problems, \(Pr(D)\) is very hard or impossible to calculate, and that is why Bayesian inference typically only uses the relative version of Bayes’ rule.
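
For this problem we can normalize: dividing the relative posteriors by their sum recovers the absolute posterior probabilities, and the sum itself equals \(Pr(D)\) (by the law of total probability). A one-line sketch:

Code
relposterior <- c(fraternal = 2/3 * 1/4, identical = 1/3 * 1/2)
relposterior / sum(relposterior)  # both 0.5; sum(relposterior) = Pr(D) = 1/3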


Bayes’ rule for estimating a continuous parameter

In the above example, we updated our beliefs in two discrete outcomes. More often we want to update our belief in a continuous parameter, for example the probability of success in a binomial experiment with a known number of trials.

Here is a simple example: A logical problem that can be scored as either correct or incorrect. We have given the test to nine individuals, six of whom answered correctly. We want to use this data to estimate the proportion of individuals who would answer correctly in the population from which the individuals were drawn. We will model the problem using the binomial distribution, with parameters n = 9 (known) and p (unknown, to be estimated): \(y \sim Binom(n = 9, p)\)

Bayes’ rule: posterior \(\propto\) prior \(\times\) likelihood

We will first work with a “flat” prior, essentially saying that before we saw the data we would consider every possible value of \(p\) as equally likely.

Code
# Data (fixed)
n <- 9
s <- 6

# Function to draw three plots
bayes_plot <- function(pgrid, prior) {
  # Likelihood
  likelihood <- dbinom(x = s, size = n, prob = pgrid)

  # Relative posterior
  relative_posterior <- prior * likelihood 

  # Plot posterior, prior, likelihood
  par(mfrow = c(1, 3))
  plot(pgrid, relative_posterior, type = 'l', 
     xlab = "Parameter, p", ylab = "Relative posterior probability",
     main = "Reltive posterior")
  plot(pgrid, prior, type = 'l', xlab = "Parameter value, p", ylab = "Prior probability density",
       main = "Prior", xlim = c(-0.01, 1.01))
  plot(pgrid, likelihood, type = 'l', xlab = "Parameter value, p", ylab = "Likelihood",
       main = "Likelihood")
}

pgrid <- seq(from = 0, to = 1, by = 0.01) # Parameters grid
flat <- rep(1, length(pgrid))  # Prior (flat)
bayes_plot(pgrid = pgrid, prior = flat)


Here with a weakly informative prior: the true proportion is probably around 0.5, and values smaller or larger than this are also plausible, but we know that the true proportion is neither zero nor one (maybe because we know of at least one individual in the population who can solve the problem, and at least one who could not).

Code
# Data (fixed)
n <- 9
s <- 6

# Prior 
beta22 <- dbeta(pgrid, shape1 = 2 , shape2 = 2)
bayes_plot(pgrid = pgrid, prior = beta22) # pgrid defined above

Here with a stronger prior: the true proportion is probably around 0.2 (maybe based on previous test results), but values smaller or larger than this are also plausible.

Code
# Data (fixed)
n <- 9
s <- 6

# Prior 
beta28 <- dbeta(pgrid, shape1 = 2 , shape2 = 8)
bayes_plot(pgrid = pgrid, prior = beta28) # pgrid defined above


Here another informed prior: maybe the test was a multiple-choice question with two alternatives, so the proportion correct could not be below 50 % (assuming that everyone did their best), but we consider every value of \(p\) between 0.5 and 1 as equally likely.

Code
# Data (fixed)
n <- 9
s <- 6

# Prior 
lim50 <- ifelse(pgrid < 0.5, 0, 2)  # pgrid defined above
bayes_plot(pgrid = pgrid, prior = lim50)

Advanced: Sampling from the posterior

The relative posterior distribution may be hard or impossible to estimate analytically. However, it may be estimated using clever algorithms that draw a large number of random samples from the posterior and then use these samples to estimate properties of the posterior, such as its median, median absolute deviation, and 95 % compatibility interval. stan_glm() uses an algorithm called Hamiltonian Monte Carlo. The code below shows how to extract samples and use them to derive statistics of the posterior. The example uses the same simulated data as above: 16 test scores modeled as independent random draws from a normal distribution with unknown mean and standard deviation.

Test scores, n = 16.

Code
set.seed(123)
score <- rnorm(16, 50, 10) # Simulate exam scores from 16 students
round(score)
 [1] 44 48 66 51 51 67 55 37 43 46 62 54 54 51 44 68
Code
round(c(mean = mean(score), sd = sd(score)), 2)
 mean    sd 
52.55  9.13 


Fit model:

\(y_i \sim Normal(\mu, \sigma)\)
\(\mu = b_0\)

using stan_glm(), and print output:

Code
dd <- data.frame(score)  # stan_glm() likes data in a data frame
mfit <- stan_glm(score ~ 1, data = dd, refresh = 0, seed = 123, iter = 1e4)
Code
print(mfit)
stan_glm
 family:       gaussian [identity]
 formula:      score ~ 1
 observations: 16
 predictors:   1
------
            Median MAD_SD
(Intercept) 52.5    2.3  

Auxiliary parameter(s):
      Median MAD_SD
sigma 9.3    1.6   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg


Extract samples, plot them, and do some stats to estimate properties of the posterior.

Code
# Save samples in data frame
samples <- as.data.frame(mfit)
# Rename columns so "(Intercept)" is called b0 (not needed, just convenient)
names(samples) <- c("b0", "sigma")
head(samples)  # Show first rows of data frame
        b0     sigma
1 52.32408 10.830253
2 48.21844  8.267484
3 54.49953 13.032090
4 55.00462 13.487413
5 53.30425 14.204507
6 52.14543 13.041902
Code
# Plot samples, using transparent colors. stan_glm() point estimates: white circle
plot(samples$b0, samples$sigma, pch = 19, col = rgb(0, 0, 1, 0.1),
     xlim = c(45, 60), ylim = c(5, 15))
points(median(samples$b0), median(samples$sigma), pch = 21, bg = "white", cex = 1)

Code
par(mfrow = c(1, 2))
# plot distribution for b0
plot(density(samples$b0), main = "Estimated posterior b0", xlab = "b0",
     ylab = "Posterior probability")
# plot distribution for sigma
plot(density(samples$sigma), main = "Estimated posterior sigma", xlab = "sigma",
     ylab = "Posterior probability")


Calculate the median and median absolute deviation (MAD) for b0 and sigma, compare with the stan_glm() output, and derive 95 % compatibility intervals for the parameters. Note: “(Intercept)” is parameter b0.

Code
# Median and MAD
mdn_b0 <- median(samples$b0)
mad_b0 <- mad(samples$b0)
mdn_sigma <- median(samples$sigma)
mad_sigma <- mad(samples$sigma)
stats <- c(mdn_b0 = mdn_b0, mad_b0 = mad_b0, 
           mdn_sigma = mdn_sigma, mad_sigma = mad_sigma)
round(stats, 1)
   mdn_b0    mad_b0 mdn_sigma mad_sigma 
     52.5       2.3       9.3       1.6 
Code
# Compare with stan_glm() output
print(mfit)
stan_glm
 family:       gaussian [identity]
 formula:      score ~ 1
 observations: 16
 predictors:   1
------
            Median MAD_SD
(Intercept) 52.5    2.3  

Auxiliary parameter(s):
      Median MAD_SD
sigma 9.3    1.6   

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Code
# Compatibility intervals, 95 %
b0_ci <- quantile(samples$b0, prob = c(0.025, 0.975))
sigma_ci <- quantile(samples$sigma, prob = c(0.025, 0.975))
round(c(b0_ci = b0_ci, sigma_ci = sigma_ci), 1)
    b0_ci.2.5%    b0_ci.97.5%  sigma_ci.2.5% sigma_ci.97.5% 
          47.7           57.3            6.8           13.8 


Practice

The practice problems are labeled Easy (E), Medium (M), and Hard (H), as in McElreath (2020).


Easy

6E1.

Given the information in this table, would you say that previous research is in agreement or disagreement with regard to the effect of physical activity on working memory capacity? Motivate your answer.

(This is an old exam question)


6E2. Meta-analysis is widely considered the best method for the accumulation of scientific knowledge. If a meta-analysis were conducted on the studies in 6E1, what use would it have for the p-values in the fourth column of the table above?


6E3. I simulated data from a randomized experiment with a control group and a treatment group (n = 30 per group). The conditions were nature sound (treatment) versus no sound exposure (control) during a test of directed attention. The test has been used in previous research and is known to have a mean around 50 with a standard deviation around 10; a difference greater than 5 units on this scale is generally considered a substantial difference, whereas a difference of less than 1 unit is generally considered negligible. The between-group difference was 5 units in favor of the treatment group with 95 % confidence interval of [-1, 12].

In a single sentence, summarize the result for the treatment effect, with reference to the point estimate and the compatibility interval. Phrase your sentence in line with the recommendations of Amrhein et al. (2019), and with reference to the general understanding of the outcome measure described above.


6E4. The estimated treatment effect in 6E3 was not considered statistically significant (p > .05). However, it would be a mistake to conclude from this that the data supported no effect of treatment. Explain.


6E5. Which of the following tends to decrease as sample size increases: standard deviation, standard error, median absolute deviation, 95 % confidence interval, interquartile range?

Medium

6M1. The code below simulates data from a randomized experiment with one control group and one experimental group. Run the code to obtain the data, then:

  1. Calculate a 50 % and 90 % compatibility interval around each group mean. Feel free to use whatever method you like in R to calculate point estimates and intervals (t.test(), lm(), glm(), stan_glm(), …)
  2. Visualize point estimates and intervals (50 and 90 %) for the two groups in a single graph (cf. Gelman et al. (2021), Fig. 4.2).
  3. Calculate a 50 % and 90 % compatibility interval around the between-group difference.
  4. Visualize point estimate and intervals (50 and 90 %) for the between-group difference in a single graph.

R code:

Code
set.seed(987)
control <- rnorm(30, mean = 20, sd = 2)
treatment <- rnorm(30, mean = 22, sd = 2)


6M2. In your own words: What is the difference between a sampling distribution and a likelihood function?


6M3. Below is the binomial probability distribution for the number of successes in 7 trials, \(X \sim Binom(n = 7, p = 0.3)\). Does this remind you of a sampling distribution or a likelihood function? Motivate your answer.

Code
plot(0:7, dbinom(0:7, size = 7, prob = 0.3), type = "h", xlab = "x = Number of successes",
     ylab = "Pr(X = x)")


6M4. Look at the figure in 6M3, and estimate by eye:

  1. Pr(X = 3)
  2. Pr(X < 3)
  3. Pr(X > 3)
  4. Pr(1 < X < 4)

Hard

6H1. Run the simulation and stan_glm() analysis from above, section “Advanced: Sampling from the posterior”, and calculate a 50 % and 90 % compatibility interval.


6H2.
(a) Search the internet to find the difference between “percentile interval” (PI) and “highest density interval” (HDI). Explain the difference in your own words. (b) Repeat 6H1, but now calculate a HDI. (The R package HDInterval may be useful.)


6H3.

  1. Explain, in your own words, step by step, how you would derive a 90 % confidence interval for a Pearson correlation coefficient using bootstrapping.
  2. Suggest R-code for each step, and evaluate using simulated data.

(Session Info)

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=Swedish_Sweden.utf8  LC_CTYPE=Swedish_Sweden.utf8   
[3] LC_MONETARY=Swedish_Sweden.utf8 LC_NUMERIC=C                   
[5] LC_TIME=Swedish_Sweden.utf8    

time zone: Europe/Stockholm
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstanarm_2.32.1 Rcpp_1.0.14    

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     dplyr_1.1.4          farver_2.1.2        
 [4] loo_2.8.0            fastmap_1.2.0        tensorA_0.36.2.1    
 [7] shinystan_2.6.0      promises_1.3.3       shinyjs_2.1.0       
[10] digest_0.6.37        mime_0.13            lifecycle_1.0.4     
[13] StanHeaders_2.32.10  processx_3.8.6       survival_3.7-0      
[16] magrittr_2.0.3       posterior_1.6.1      compiler_4.4.2      
[19] rlang_1.1.6          tools_4.4.2          igraph_2.1.4        
[22] yaml_2.3.10          knitr_1.50           htmlwidgets_1.6.4   
[25] pkgbuild_1.4.8       curl_6.4.0           cmdstanr_0.9.0.9000 
[28] plyr_1.8.9           RColorBrewer_1.1-3   dygraphs_1.1.1.6    
[31] abind_1.4-8          miniUI_0.1.2         grid_4.4.2          
[34] stats4_4.4.2         xts_0.14.1           xtable_1.8-4        
[37] inline_0.3.21        ggplot2_3.5.2        rethinking_2.42     
[40] scales_1.4.0         gtools_3.9.5         MASS_7.3-61         
[43] mvtnorm_1.3-3        cli_3.6.5            rmarkdown_2.29      
[46] reformulas_0.4.1     generics_0.1.4       RcppParallel_5.1.10 
[49] rstudioapi_0.17.1    reshape2_1.4.4       minqa_1.2.8         
[52] rstan_2.32.7         stringr_1.5.1        shinythemes_1.2.0   
[55] splines_4.4.2        bayesplot_1.13.0     parallel_4.4.2      
[58] matrixStats_1.5.0    base64enc_0.1-3      vctrs_0.6.5         
[61] V8_6.0.4             boot_1.3-31          Matrix_1.7-1        
[64] jsonlite_2.0.0       crosstalk_1.2.1      glue_1.8.0          
[67] nloptr_2.2.1         ps_1.9.1             codetools_0.2-20    
[70] distributional_0.5.0 DT_0.33              shape_1.4.6.1       
[73] stringi_1.8.7        gtable_0.3.6         later_1.4.2         
[76] QuickJSR_1.8.0       lme4_1.1-37          tibble_3.3.0        
[79] colourpicker_1.3.0   pillar_1.10.2        htmltools_0.5.8.1   
[82] R6_2.6.1             Rdpack_2.6.4         evaluate_1.0.3      
[85] shiny_1.11.1         lattice_0.22-6       markdown_2.0        
[88] rbibutils_2.3        backports_1.5.0      threejs_0.3.4       
[91] httpuv_1.6.16        rstantools_2.4.0     coda_0.19-4.1       
[94] gridExtra_2.3        nlme_3.1-166         checkmate_2.3.2     
[97] xfun_0.52            zoo_1.8-14           pkgconfig_2.0.3