When confidence intervals stop meaning what you think they mean

confidence intervals
inference
visualization
On asymmetric confidence intervals and the limits of inference by eye
Author: Daniel Koska

Published: January 3, 2026

Every once in a while, I come across the same piece of statistical advice:

“Don’t judge differences by whether confidence intervals overlap”.

At first glance, that sounds like one of those annoyingly technical rules statisticians like to repeat without explaining properly. For a long time, I more or less accepted it, but never really sat down to work through why it is true, when it matters, and what exactly goes wrong. This post is my attempt to do that.

Asymmetric confidence intervals are everywhere

Confidence intervals are often asymmetric. This is not some strange edge case. It is completely routine. Think of:

  • odds ratios

  • hazard ratios

  • rate ratios

  • quantiles

  • bootstrap intervals

  • profile-likelihood intervals

These intervals are often perfectly fine. They are computed correctly, reported correctly, and for a single estimate they are usually interpreted correctly. In particular, asking whether such an interval includes the null value is often a perfectly sensible inferential step.

Where things quietly go off the rails

Whether symmetric or not, the overlap between two separate confidence intervals is often used as a visual shortcut for deciding whether two effects differ. But that shortcut can easily disagree with the result of a corresponding statistical test.

The reason is simple: the statistical test does not ask whether two displayed intervals overlap. It asks whether the contrast between the two effects is compatible with zero on the relevant scale, for example whether \(A - B = 0\), whether \(\log(A/B) = 0\), or, in a subgroup analysis, whether an interaction term is zero. That is a different inferential question from “do the two displayed intervals overlap?”.

Let’s have a closer look at this. Suppose a subgroup analysis reports something like this:

Subgroup A: \(\mathrm{HR} = 0.72\) (95% CI 0.57–0.92)

Subgroup B: \(\mathrm{HR} = 0.92\) (95% CI 0.64–1.31)

\(p_{\text{interaction}} = 0.23\)

Formally, the message is straightforward: there is no convincing evidence that the treatment effect differs between subgroups. Visually, however, the message feels different. One interval excludes 1, the other does not. That creates a strong pull toward a different interpretation, something like:

“Looks like the treatment works in subgroup A, but not in subgroup B.”

That conclusion is very tempting. It is also not what the interaction test says. And this is where the trouble starts. The eye is informally comparing two separate confidence intervals, while the formal analysis is answering a different question.
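We can make the formal question concrete. As a rough check (assuming the reported intervals are Wald intervals on the log hazard-ratio scale), the standard errors can be recovered from the CI widths and the contrast tested directly:

```r
# Recover log-scale standard errors from the reported 95% CIs
# (assumes Wald intervals on the log-HR scale; illustrative numbers)
se_A <- (log(0.92) - log(0.57)) / (2 * 1.96)
se_B <- (log(1.31) - log(0.64)) / (2 * 1.96)

# Wald test of the contrast log(HR_A) - log(HR_B)
z <- (log(0.72) - log(0.92)) / sqrt(se_A^2 + se_B^2)
p <- 2 * pnorm(-abs(z))
round(p, 2)  # roughly 0.26
```

This back-of-envelope reconstruction will not exactly reproduce a model-based interaction test (here it gives roughly 0.26 rather than the reported 0.23), but it makes the point: the relevant statistic is built from the contrast, not from the two separate intervals.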

The key mistake

A confidence interval for effect A and a confidence interval for effect B are not the same thing as a confidence interval for the contrast between them. That contrast might be:

  • \(A - B\),
  • \(\log(A/B)\),
  • or some other difference depending on the model and effect measure.

These are different inferential objects. And this is the real reason why CI overlap is unreliable. Even for symmetric 95% confidence intervals, overlap does not correspond neatly to a 5% test of equality. Once intervals are asymmetric on the displayed scale, the visual shortcut becomes even less trustworthy.
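One way to see why the 5% correspondence fails: for two independent estimates with equal standard error \(s\), two 95% intervals just touch when the estimates are \(2 \times 1.96\,s\) apart, while the standard error of their difference is only \(s\sqrt{2}\). A quick calculation shows what significance level non-overlap actually corresponds to in this idealized case:

```r
# Two independent estimates with equal SE s: intervals just touch when
# |theta1 - theta2| = 2 * 1.96 * s, so the z statistic for the contrast is
z_touch <- 2 * 1.96 / sqrt(2)   # ~2.77, since SE of the difference is s * sqrt(2)
2 * pnorm(-z_touch)             # ~0.006, far stricter than 0.05
```

So demanding non-overlap of two 95% intervals implicitly tests at roughly the 0.5% level, not 5%. With unequal standard errors or asymmetric intervals, even this rough correspondence breaks down.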

Why asymmetry makes the problem worse

To be clear: asymmetry does not destroy inference. What it destroys is the comforting illusion that inference by eye is geometrically straightforward. Once intervals are asymmetric on the plotted scale, several things become harder to interpret visually:

  • the point estimate is no longer centered in the interval,
  • left and right distances no longer mean the same thing,
  • the apparent distance between intervals depends on the chosen scale,
  • and overlap is no longer tied to any simple testing rule.

That last point matters a lot.

For ratio measures such as odds ratios or hazard ratios, intervals are often constructed on the log scale and then back-transformed. On the log scale, the interval may be perfectly symmetric. On the original scale, it becomes asymmetric. So even the visual geometry depends on the parameterization. That alone should make us suspicious of the idea that overlap on the displayed scale has some stable inferential meaning. Usually, it does not.
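A minimal sketch of that back-transformation, using the hazard ratio 0.72 from the earlier example and an illustrative standard error of 0.12 on the log scale:

```r
theta <- log(0.72)  # point estimate on the log scale
se    <- 0.12       # illustrative standard error (assumed, not from any real model)

ci_log   <- theta + c(-1.96, 1.96) * se  # symmetric around theta
ci_ratio <- exp(ci_log)                  # asymmetric around exp(theta)

exp(theta) - ci_ratio[1]  # left distance on the ratio scale: ~0.15
ci_ratio[2] - exp(theta)  # right distance on the ratio scale: ~0.19
```

The interval is perfectly symmetric on the log scale, yet on the ratio scale the point estimate sits visibly off-center. Nothing is wrong with the interval; only the displayed geometry has changed.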

A simple simulation

To make this more concrete, let us simulate a situation in which two subgroup-specific effects are estimated independently.

I simulate two log-effects, as one might obtain from two subgroup analyses. On the log scale, standard Wald inference is straightforward. I then exponentiate the estimates and intervals to obtain ratio measures with asymmetric confidence intervals on the original scale.

The question is: how often do two 95% confidence intervals overlap on the ratio scale even though the formal test comparing the two log-effects is statistically significant?

Code
library(tidyverse)
library(knitr)

set.seed(1)

n_sim <- 20000

# True effects on the log scale
mu1 <- log(0.60)
mu2 <- log(1.10)

# Standard errors of the subgroup-specific log-effects
se1 <- 0.17
se2 <- 0.17

sim <- tibble(
  theta1 = rnorm(n_sim, mu1, se1),
  theta2 = rnorm(n_sim, mu2, se2)
) %>%
  mutate(
    est1 = exp(theta1),
    est2 = exp(theta2),

    l1 = exp(theta1 - 1.96 * se1),
    u1 = exp(theta1 + 1.96 * se1),
    l2 = exp(theta2 - 1.96 * se2),
    u2 = exp(theta2 + 1.96 * se2),

    overlap = pmax(l1, l2) <= pmin(u1, u2),

    z_diff = (theta1 - theta2) / sqrt(se1^2 + se2^2),
    p_diff = 2 * pnorm(-abs(z_diff)),
    sig_diff = p_diff < 0.05
  )

summary_table_num <- sim %>%
  summarise(
    `Intervals overlap` = mean(overlap),
    `Formal comparison is significant` = mean(sig_diff),
    `Overlap despite significant difference` = mean(overlap & sig_diff),
    `Overlap among significant comparisons` = mean(overlap & sig_diff) / mean(sig_diff)
  )

summary_table <- summary_table_num %>%
  mutate(across(everything(), ~ paste0(round(100 * .x, 1), "%")))

kable(summary_table, align = "cccc")
Table 1: Simulation results (20,000 runs)

  Intervals overlap:                        59.7%
  Formal comparison is significant:         71.1%
  Overlap despite significant difference:   30.9%
  Overlap among significant comparisons:    43.4%

The key result in Table 1 is Overlap despite significant difference. Here, it is 30.9%. So in nearly one third of all simulated datasets, the two asymmetric confidence intervals overlap even though the formal test indicates a statistically significant difference between effects.

That is not a small technical exception. It means that visual overlap can easily coexist with formal evidence for a difference, so “the intervals overlap” is clearly not a reliable shorthand for “there is no difference”.

The final column, Overlap among significant comparisons, shows the same issue from another angle. Although the formal comparison is significant in 71.1% of simulations, 43.4% of those significant cases still display overlapping confidence intervals. Put differently, even when the data support a difference statistically, the visual impression from the two separate intervals is often ambiguous or misleading.

One plotted example

We can also extract one concrete simulation run where this happens and plot it.

Code
example_row <- sim %>%
  filter(overlap, sig_diff) %>%
  slice(1)

example_plot_data <- tibble(
  subgroup = c("A", "B"),
  estimate = c(example_row$est1, example_row$est2),
  lower = c(example_row$l1, example_row$l2),
  upper = c(example_row$u1, example_row$u2)
) %>%
  mutate(across(-subgroup, ~ round(.x, 3)))

example_p <- example_row$p_diff[[1]]

knitr::kable(
  example_plot_data,
  align = c("c", "c", "c", "c"),
  caption = "One simulated example with overlapping asymmetric confidence intervals despite a statistically significant difference test."
)
One simulated example with overlapping asymmetric confidence intervals despite a statistically significant difference test.

  subgroup   estimate   lower   upper
  A          0.619      0.444   0.864
  B          1.147      0.822   1.600
Code
ggplot(example_plot_data, aes(x = estimate, y = subgroup)) +
  geom_point(size = 2.8) +
  geom_errorbar(
    aes(xmin = lower, xmax = upper),
    orientation = "y",
    width = 0.12
  ) +
  geom_vline(xintercept = 1, linetype = 2) +
  labs(
    x = "Effect estimate (ratio scale)",
    y = NULL,
    title = "Overlapping asymmetric confidence intervals",
    subtitle = paste0(
      "But the formal test comparing effects gives p = ",
      formatC(example_p, digits = 3, format = "f")
    )
  ) +
  theme_minimal(base_size = 12)

The table presents one concrete simulated dataset in which the two subgroup-specific confidence intervals overlap, even though the formal test comparing the effects is statistically significant. So while confidence intervals can often be interpreted against their null value, they generally should not be interpreted against each other by eye.

That is true in general, and it becomes especially important when the intervals are asymmetric on the displayed scale. The more asymmetric the displayed intervals become, the easier it is to forget that the visual comparison is happening on a scale whose geometry may have little to do with the inferential question we actually care about.

What to do instead

If the real scientific question is whether two effects differ, then the right response is not to stare harder at two separate confidence intervals. Instead, compute one of the following:

  • a confidence interval for the difference in effects,
  • a confidence interval for the ratio of effects,
  • or an explicit interaction test.

In other words: answer the question you actually care about with the inferential object that actually corresponds to it.
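For the subgroup example from earlier, one such object is a confidence interval for the ratio of the two hazard ratios. A sketch with the reported numbers, again assuming Wald intervals on the log scale:

```r
# Standard errors recovered from the reported subgroup 95% CIs
se_A <- (log(0.92) - log(0.57)) / (2 * 1.96)
se_B <- (log(1.31) - log(0.64)) / (2 * 1.96)

# 95% CI for the ratio of effects, HR_A / HR_B
contrast <- log(0.72) - log(0.92)
se_c     <- sqrt(se_A^2 + se_B^2)
exp(contrast + c(-1.96, 1.96) * se_c)  # roughly 0.51 to 1.20
```

This single interval answers the comparative question directly: it comfortably includes 1, in line with the unremarkable interaction p-value, even though one of the two subgroup intervals excludes 1.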

Asymmetric confidence intervals do not break inference. But they do break the intuition that visual overlap is telling us something simple. And that is exactly why they deserve a bit more caution than they usually get.