P Values, Confidence Intervals and Bias Explained Simply

P values, confidence intervals, and bias explained simply for FRCEM Critical Appraisal — plain-English definitions and the SBA-style traps to watch for.

TL;DR: A p value is the probability of seeing a result at least this extreme if the null hypothesis is true. A confidence interval shows the precision around the estimate. Both matter; neither proves causation.

Last updated: 30 May 2026


Critical Appraisal concepts at a glance: 3 things FRCEM tests

  1. p-values and CIs
     • p less than 0.05 = statistically significant
     • 95% CI crossing 1 for ratios, or 0 for differences = NOT significant
  2. NNT and NNH
     • NNT = 1 divided by ARR; lower NNT = better treatment
     • NNH = 1 divided by ARI; lower NNH = more harm
  3. Common biases
     • Selection bias
     • Recall bias
     • Attrition bias
     • Confounding
     • Observer bias
     • Publication bias

The 3 concept groups every FRCEM candidate must master for the Critical Appraisal paper.

P values, confidence intervals and bias are core critical appraisal topics in MRCEM SBA, FRCEM SBA and FRCEM OSCE journal appraisal stations. In emergency medicine, they matter because papers on diagnostics, treatments, prognostic tools and service redesign often look persuasive on first reading. The exam is not testing advanced statistics. It is testing whether you can interpret results safely, recognise weak evidence, and decide whether a study should change practice in a UK ED.

The safest approach is simple: never interpret a p value alone. Always combine effect size, confidence interval, study validity, bias, confounding and applicability to your patients.

Why P Values and Confidence Intervals Matter in FRCEM

Emergency clinicians regularly read studies on:

  • diagnostic pathways such as troponin rule-out strategies
  • clinical decision rules such as head injury or PE pathways
  • treatments such as analgesia, bronchodilators, sedation or sepsis interventions
  • service delivery changes such as streaming, ambulatory pathways or observation units

A statistically significant result does not automatically mean the intervention works, is clinically important, or is safe to adopt in the NHS. A non-significant result does not automatically mean there is no effect. Small studies may miss important benefit or harm. Large studies may detect trivial differences that do not matter to patients or departments.
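The interaction between sample size and statistical significance can be sketched numerically. The example below is a minimal illustration, assuming a pooled two-proportion z-test with a normal approximation; the function name and both sets of trial counts are hypothetical and are not taken from any real study.

```python
import math


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value from a pooled two-proportion z-test (normal approximation)."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (risk_a - risk_b) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical small trial: 20% vs 10% event rate, 50 patients per arm.
p_small = two_proportion_p_value(10, 50, 5, 50)

# Hypothetical large trial: 10% vs 9% event rate, 20,000 patients per arm.
p_large = two_proportion_p_value(2000, 20000, 1800, 20000)

print(f"small trial, 10-point difference: p = {p_small:.3f}")
print(f"large trial, 1-point difference:  p = {p_large:.4f}")
```

Under these assumptions the small trial does not reach p<0.05 despite a 10 percentage point difference, while the large trial does for a 1 percentage point difference. Whether either difference matters clinically is a separate judgement.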

In FRCEM and MRCEM, common candidate errors are:

  • saying p<0.05 means the result is important
  • saying p>0.05 proves no difference
  • misreading a confidence interval that crosses the null value
  • ignoring bias because the result is statistically significant
  • confusing bias with confounding
  • failing to comment on clinical significance and applicability

Key Definitions

Use these exam-safe definitions.

Term | Safe exam definition | What it does not tell you
P value | The probability of observing results this extreme, or more extreme, if the null hypothesis were true. | It does not tell you the probability that the null hypothesis is true, the size of the effect, or whether the result is clinically important.
95% confidence interval | A range of values most compatible with the data and model assumptions; in exam practice, a range of plausible values for the true effect. | It is not a guarantee that the true value has a 95% probability of lying within that exact interval.
Null hypothesis | The assumption that there is no true difference or no true association. | It is not the same as proving treatments are equivalent.
Bias | Systematic error in study design, conduct, measurement, analysis or reporting that moves the result away from the truth. | It is not fixed by increasing sample size.
Confounding | Distortion of an exposure-outcome relationship by a third factor associated with both exposure and outcome. | It is not the same as random error.
Precision | How certain the estimate is, usually judged by the width of the confidence interval. | Precise does not mean correct if the study is biased.
Type I error | False positive: concluding there is a difference when none exists. | Not the same as bias.
Type II error | False negative: failing to detect a real difference. | Does not prove no effect.
Power | The probability that a study will detect a prespecified effect size if that effect truly exists. | High power does not rescue a flawed study.

Exam-safe wording:

  • “The p value suggests how compatible the data are with the null hypothesis.”
  • “The confidence interval shows direction, size and precision of the estimate.”
  • “Bias may explain an apparently significant result.”
  • “A non-significant result does not prove equivalence or absence of effect.”

Essential Pathophysiology

This is a statistics topic rather than a disease process, but there is still a useful underlying framework: observed study results are shaped by chance, bias and confounding.

Concept | Meaning | Practical implication
Chance | Random variation in samples | Produces uncertainty; assessed partly by p values and confidence intervals
Bias | Systematic error | Can make a result wrong even if p is very small and the CI is narrow
Confounding | Mixing of effects from another variable | Common in observational studies; association may not be causal

Think of study interpretation as asking three questions:

  • Could this result be due to chance?
  • Could this result be due to bias?
  • Could this result be due to confounding?

If the answer to the second or third question is yes, a statistically significant result may still be unreliable.

Clinical Presentation

In exam terms, this topic usually presents as a critical appraisal task rather than a clinical syndrome. Typical formats are:

  • a trial abstract with p values and confidence intervals
  • a diagnostic paper reporting sensitivity and specificity
  • a forest plot or summary table
  • a short stem asking what a p value or confidence interval means
  • an OSCE station asking whether a paper should change ED practice

The examiner usually wants a structured interpretation, not a mathematical derivation.

A strong answer usually covers:

  • study design and internal validity
  • primary outcome
  • effect estimate
  • whether the confidence interval crosses the null
  • precision
  • clinical importance
  • bias and confounding
  • applicability to UK ED practice

Red Flags and High-Risk Features

These are the appraisal red flags that should make you cautious.

  • Primary outcome negative, but paper emphasises positive secondary outcomes
  • Multiple subgroup analyses with isolated significant findings
  • Wide confidence intervals despite a “positive” headline
  • Large relative effect but tiny absolute benefit
  • Observational study implying causation without adequate adjustment
  • High loss to follow-up or missing outcome data
  • Poorly described randomisation or allocation concealment
  • No blinding where outcome assessment is subjective
  • Outcome switching from protocol to publication
  • Selective reporting of favourable outcomes only
  • Diagnostic study with weak reference standard or partial verification
  • Study population unlike a UK ED case-mix

Exam phrase:

“Even if statistically significant, the result may not be reliable if there is important risk of bias, confounding or selective reporting.”

Differential Diagnosis

When a paper reports a striking result, the differential diagnosis for that result is:

  • true effect
  • chance finding
  • bias
  • confounding
  • measurement error
  • selective reporting or p-hacking
  • random high estimate from a small study

This is a useful OSCE mindset. Do not assume the observed effect is real just because the p value is small.

Initial ED Assessment

For exam appraisal, use a rapid first-pass structure.

30-second approach

  • Identify the study design
  • Find the primary outcome
  • Find the main effect estimate
  • Check the 95% confidence interval
  • Ask whether it crosses the null value
  • Comment on width and clinical importance
  • Look for obvious bias or confounding
  • Decide whether it applies to your ED patients

Two-minute OSCE structure

  1. State the design and whether it is appropriate for the question
  2. Comment on internal validity
  3. Interpret the effect estimate and confidence interval
  4. State whether the result is statistically significant
  5. Comment on precision and clinical significance
  6. Discuss bias, confounding and reporting issues
  7. Comment on external validity and whether practice should change

Safe model phrase:

“The result appears statistically significant because the 95% confidence interval does not cross the null value, but I would still want to assess the size of effect, precision, risk of bias, and applicability before changing practice.”

Investigations

In this context, the “investigations” are the statistical outputs and methodological features you should inspect.

1. Know the null value

Measure | Null value | Interpretation rule
Mean difference | 0 | If the 95% CI crosses 0, it is usually not statistically significant in a matching two-sided analysis
Risk difference / absolute risk reduction | 0 | If the 95% CI crosses 0, the result is compatible with no effect
Relative risk | 1 | If the 95% CI crosses 1, it is usually not statistically significant in a matching two-sided analysis
Odds ratio | 1 | Same rule as relative risk
Hazard ratio | 1 | Same rule as relative risk

This rule applies when the confidence interval and p value come from the same two-sided analysis.
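The null-value rule can be written as a small lookup. This is a minimal sketch: the dictionary, the function name and the two example intervals are hypothetical, and the check treats an interval that touches the null value exactly as crossing it.

```python
# Null value for each effect measure: 0 for differences, 1 for ratios.
NULL_VALUES = {
    "mean difference": 0.0,
    "risk difference": 0.0,
    "relative risk": 1.0,
    "odds ratio": 1.0,
    "hazard ratio": 1.0,
}


def crosses_null(measure, ci_lower, ci_upper):
    """Return True if the confidence interval includes the null value for this measure."""
    null_value = NULL_VALUES[measure]
    return ci_lower <= null_value <= ci_upper


# Hypothetical hazard ratio with 95% CI 0.55 to 0.90: does not cross 1.
print(crosses_null("hazard ratio", 0.55, 0.90))

# Hypothetical risk difference with 95% CI -0.01 to 0.05: crosses 0.
print(crosses_null("risk difference", -0.01, 0.05))
```

The most common slip this guards against is applying the ratio rule to a difference, or the reverse: an interval of 0.55 to 0.90 excludes 1 but would be read very differently if it were a mean difference.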

2. Interpret p values safely

If p is small:

  • the data are relatively incompatible with the null hypothesis
  • this supports a real difference or association only if the study is valid and analysis appropriate

If p is large:

  • the study did not show a statistically significant difference
  • do not say there is definitely no effect
  • consider low power, few events and imprecision

Do not say:

  • “p=0.03 means there is a 97% chance the treatment works”
  • “p>0.05 means there is no difference”

Say instead:

  • “p=0.03 suggests the observed data would be relatively unlikely if the null hypothesis were true”
  • “p>0.05 means the study did not demonstrate a statistically significant difference”

3. Interpret confidence intervals properly

Ask three questions in order:

  1. Does it cross the null value?
  2. How wide is it?
  3. Are all plausible values clinically important, or do they include trivial effect, important benefit, or harm?

Examples:

  • RR 0.78, 95% CI 0.61 to 0.99: statistically significant, but effect may be modest
  • OR 0.84, 95% CI 0.65 to 1.10: not statistically significant; compatible with benefit or no effect
  • Mean difference 1.2, 95% CI -0.3 to 2.7: not statistically significant and imprecise
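The three examples above can be cross-checked by back-calculating an approximate two-sided p value from each interval. This is a rough sketch, assuming the interval was built from a normal approximation (on the log scale for ratios); the function names are hypothetical, and published p values may differ slightly because of rounding or a different test.

```python
import math

Z_95 = 1.96  # standard normal quantile used for a 95% interval


def _two_sided_p(z):
    """Two-sided tail area of a standard normal for test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def approx_p_from_ratio_ci(estimate, ci_lower, ci_upper):
    """Approximate p value for a ratio measure (RR, OR, HR), working on the log scale."""
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * Z_95)
    return _two_sided_p(math.log(estimate) / se)


def approx_p_from_difference_ci(estimate, ci_lower, ci_upper):
    """Approximate p value for a difference measure (mean or risk difference)."""
    se = (ci_upper - ci_lower) / (2 * Z_95)
    return _two_sided_p(estimate / se)


p_rr = approx_p_from_ratio_ci(0.78, 0.61, 0.99)
p_or = approx_p_from_ratio_ci(0.84, 0.65, 1.10)
p_md = approx_p_from_difference_ci(1.2, -0.3, 2.7)

print(f"RR 0.78 (0.61 to 0.99): p is about {p_rr:.3f}")
print(f"OR 0.84 (0.65 to 1.10): p is about {p_or:.3f}")
print(f"MD 1.2 (-0.3 to 2.7):   p is about {p_md:.3f}")
```

Only the first interval excludes its null value, and only the first back-calculated p value falls below 0.05, which is the agreement the matching two-sided rule predicts.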

4. Absolute versus relative effects

Relative measures can exaggerate apparent importance. Always look for absolute risk reduction and, where relevant, number needed to treat.

Measure | Why it matters
Relative risk reduction | Can sound impressive but may hide a tiny absolute benefit
Absolute risk reduction | Shows actual difference in event rates
NNT / NNH | Helps judge practical value and harm

Example:

A treatment reduces admission from 2% to 1%. Relative risk reduction is 50%, which sounds large. Absolute risk reduction is 1 percentage point, so NNT is 100 (1 divided by 0.01). That may or may not be worthwhile depending on cost, harms and context.
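The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch: the function name and the event counts (20 of 1,000 versus 10 of 1,000, chosen to match 2% and 1%) are hypothetical, and exact fractions are used to avoid floating-point rounding when the NNT is rounded up.

```python
from fractions import Fraction
import math


def effect_summary(control_events, control_n, treatment_events, treatment_n):
    """Return absolute risk reduction, relative risk reduction and NNT."""
    control_risk = Fraction(control_events, control_n)
    treatment_risk = Fraction(treatment_events, treatment_n)
    arr = control_risk - treatment_risk   # absolute risk reduction
    rrr = arr / control_risk              # relative risk reduction
    # NNT is conventionally rounded up; it is undefined if there is no reduction.
    nnt = math.ceil(1 / arr) if arr > 0 else None
    return {"arr": arr, "rrr": rrr, "nnt": nnt}


# Admission falls from 2% (20/1000) to 1% (10/1000).
summary = effect_summary(20, 1000, 10, 1000)
print(f"ARR = {float(summary['arr']):.1%}")
print(f"RRR = {float(summary['rrr']):.0%}")
print(f"NNT = {summary['nnt']}")
```

The same 50% relative reduction applied to a baseline risk of 20% would give an ARR of 10 percentage points and an NNT of 10, which is why the baseline risk always needs to be stated alongside a relative effect.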

5. Primary outcome versus secondary outcomes

In appraisal, the primary outcome matters most.

  • If the primary outcome is negative but a secondary outcome is positive, be cautious
  • Secondary outcomes are more vulnerable to chance findings, especially if multiple are tested
  • Subgroup analyses are hypothesis-generating unless clearly prespecified and biologically plausible
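The reason multiple secondary or subgroup tests invite chance findings can be shown with a one-line calculation. This is a simplified sketch that assumes the tests are independent and that every null hypothesis is true; real secondary outcomes are often correlated, so the figure is illustrative only. The function name is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests: chance of at least one false positive = {rate:.0%}")
```

With ten independent tests at the 0.05 level, the chance of at least one spurious "significant" result is about 40%, which is why an isolated positive secondary outcome deserves caution.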

6. Adjusted versus unadjusted results

In observational studies, adjusted analyses are usually more informative than crude unadjusted comparisons, but adjustment only works for measured confounders. Residual confounding may remain.

Exam phrase:

“The adjusted analysis may reduce confounding, but it cannot fully remove bias from unmeasured or poorly measured confounders.”
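A toy cohort shows how a crude comparison can mislead and how adjusting for a measured confounder changes the answer. This is a sketch only: the counts are invented, severity is the single measured confounder, and stratification with a Mantel-Haenszel pooled risk ratio stands in for whatever adjustment method a real paper uses.

```python
def risk_ratio(events_treated, n_treated, events_untreated, n_untreated):
    """Risk ratio: risk in treated divided by risk in untreated."""
    return (events_treated / n_treated) / (events_untreated / n_untreated)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel pooled risk ratio across strata.

    Each stratum is (events_treated, n_treated, events_untreated, n_untreated).
    """
    numerator = sum(a * n_u / (n_t + n_u) for a, n_t, c, n_u in strata)
    denominator = sum(c * n_t / (n_t + n_u) for a, n_t, c, n_u in strata)
    return numerator / denominator


# Hypothetical cohort: sicker patients are more likely to receive the treatment.
severe = (40, 80, 10, 20)   # 50% mortality in both treated and untreated
mild = (2, 20, 8, 80)       # 10% mortality in both treated and untreated

# Crude analysis ignores severity and pools everyone together.
crude = risk_ratio(40 + 2, 80 + 20, 10 + 8, 20 + 80)
adjusted = mantel_haenszel_rr([severe, mild])

print(f"crude RR    = {crude:.2f}")
print(f"adjusted RR = {adjusted:.2f}")
```

In this invented cohort the treatment appears to more than double mortality in the crude analysis, yet has no effect within either severity stratum. The adjustment works only because severity was measured; an unmeasured confounder would leave the crude distortion in place.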

7. Diagnostic test studies

For a single sensitivity or specificity estimate, there is no null line equivalent to 0 or 1 in the same way as comparative effect measures. Focus on:

  • point estimate
  • confidence interval width
  • whether the lower bound is clinically acceptable
  • reference standard quality
  • spectrum of patients studied
  • whether the test is used alone or within a pathway

Likelihood ratios are often more clinically useful than sensitivity and specificity alone.

Metric | Use
Sensitivity | How often the test is positive when disease is present
Specificity | How often the test is negative when disease is absent
Positive likelihood ratio | How much a positive result increases probability of disease
Negative likelihood ratio | How much a negative result reduces probability of disease
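Likelihood ratios follow directly from sensitivity and specificity, and they convert a pre-test probability into a post-test probability through odds. The sketch below uses the standard formulas; the function names and the example values (sensitivity 92%, specificity 80%, pre-test probability 10%) are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert pre-test probability to post-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(sensitivity=0.92, specificity=0.80)
after_positive = post_test_probability(0.10, lr_pos)
after_negative = post_test_probability(0.10, lr_neg)

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"post-test probability after a positive test: {after_positive:.1%}")
print(f"post-test probability after a negative test: {after_negative:.1%}")
```

Because the calculation starts from the pre-test probability, the same test gives a different post-test probability in a low-prevalence ED population than in a referred specialist cohort.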

Example:

Sensitivity 92%, 95% CI 88% to 96% means the best estimate is 92%, but the true sensitivity could plausibly be as low as 88%. In a time-critical rule-out pathway, ask whether that lower bound is safe enough for the acceptable miss rate.
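A confidence interval for a sensitivity estimate can be computed from the raw counts. The sketch below uses the Wilson score interval, which is one common choice for proportions and not necessarily the method used in any given paper; the counts (92 of 100 diseased patients testing positive) and the 90% acceptability threshold are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (default z gives about 95% coverage)."""
    proportion = successes / total
    denominator = 1 + z**2 / total
    centre = (proportion + z**2 / (2 * total)) / denominator
    half_width = (
        z
        * math.sqrt(proportion * (1 - proportion) / total + z**2 / (4 * total**2))
        / denominator
    )
    return centre - half_width, centre + half_width


# Hypothetical diagnostic study: 92 true positives among 100 patients with disease.
lower, upper = wilson_interval(92, 100)
print(f"sensitivity 92%, 95% CI {lower:.1%} to {upper:.1%}")

# Hypothetical threshold: the pathway needs sensitivity of at least 90%.
minimum_acceptable = 0.90
print("lower bound acceptable:", lower >= minimum_acceptable)
```

With only 100 diseased patients the lower bound sits at roughly 85%, so a headline sensitivity of 92% would not by itself support a rule-out pathway that requires at least 90%.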

Management in the Emergency Department

For this topic, “management” means how to manage a paper, a result, or an exam question.

Step-by-step critical appraisal approach

Step 1: Check validity before numbers

Always start with internal validity.

For randomised controlled trials, ask:

  • Was randomisation truly random?
  • Was allocation concealed?
  • Were groups similar at baseline?
  • Were patients, clinicians and outcome assessors blinded where possible?
  • Was follow-up complete?
  • Was analysis by intention to treat?
  • Was the primary outcome prespecified?

For observational studies, ask:

  • How were participants selected?
  • Could selection bias explain the result?
  • Were important confounders measured and adjusted for?
  • Was exposure measured accurately?
  • Was outcome assessment objective and complete?

For diagnostic studies, ask:

  • Was there an appropriate reference standard?
  • Did all or most patients receive the same reference standard?
  • Was interpretation blinded?
  • Was the study population representative of ED practice?
  • Was there spectrum bias or verification bias?

Step 2: Identify the effect estimate

Do not focus first on the p value. Find the actual result:

  • mean difference
  • risk difference
  • relative risk
  • odds ratio
  • hazard ratio
  • sensitivity, specificity or likelihood ratios

Step 3: Assess precision

Use the confidence interval.

  • Does it cross the null?
  • Is it narrow or wide?
  • Does it include clinically important benefit or harm?

Step 4: Look for bias and confounding

Problem | Meaning | Example in EM research
Selection bias | Groups differ because of how patients were chosen | Convenience sampling of low-risk chest pain patients in office hours only
Performance bias | Groups receive different co-interventions apart from the study intervention | One sedation group gets more senior supervision
Detection bias | Outcome assessment differs between groups | Unblinded assessor rates pain scores
Attrition bias | Loss to follow-up differs between groups | More missing 30-day outcomes in one arm
Reporting bias | Selective reporting of favourable outcomes | Paper highlights positive secondary outcomes after a neutral primary outcome
Verification bias | Not all patients receive the reference standard | Only test-positive patients get CT or angiography
Spectrum bias | Study population is unrepresentative | Diagnostic rule tested only in obvious disease and obvious non-disease
Confounding | Third factor distorts association | Sicker patients preferentially receive a treatment in a cohort study

Important distinction:

  • Bias is systematic error
  • Confounding is distortion by another variable
  • Chance is random variation

Step 5: Decide whether it should change practice

Even a valid statistically significant result may not justify a change in ED practice unless:

  • the effect is clinically important
  • benefits outweigh harms
  • the outcome is patient-centred
  • the intervention is feasible in an NHS ED
  • the population resembles your patients
  • the result fits with wider evidence and guidance

Immediate versus later care

Immediate exam response:

  • state significance correctly
  • comment on precision
  • identify obvious bias
  • avoid overclaiming

Later, fuller appraisal:

  • review protocol and prespecified outcomes
  • check sample size calculation
  • compare absolute and relative effects
  • look for systematic reviews and guideline context
  • consider implementation, cost and harms

Disposition, Referral and Follow-Up

For a paper rather than a patient, disposition means what you do with the evidence.

  • Adopt cautiously if the study is valid, effect clinically important, precision acceptable, and findings consistent with wider evidence
  • Do not change practice on the basis of a single biased or underpowered study
  • Escalate to local governance, guideline or specialty discussion before implementing major pathway changes
  • In exams, conclude with a cautious practice statement rather than a binary yes or no

Safe conclusion phrases:

  • “This study suggests possible benefit, but limitations in precision and risk of bias mean it is insufficient on its own to change practice.”
  • “The result is statistically significant and reasonably precise, but I would still consider clinical importance, harms and applicability to our ED population.”
  • “This is a neutral study rather than proof of no effect, because the confidence interval still includes clinically important benefit and harm.”

Special Groups

This is not a patient-group topic in the usual sense, but applicability often differs across populations.

Paediatrics

  • Adult evidence may not apply to children
  • Decision rules and diagnostic thresholds often differ
  • Small paediatric studies may be underpowered

Pregnancy

  • Pregnant patients are often excluded from trials
  • External validity may therefore be poor
  • Diagnostic pathways may differ because of imaging and risk considerations

Older adults

  • Frailty, multimorbidity and polypharmacy may reduce applicability of trial results
  • Outcomes important to older adults may differ from trial endpoints

Immunosuppressed or complex patients

  • Often under-represented in trials
  • Diagnostic test performance may differ because disease spectrum differs

Exam phrase:

“External validity is limited if important ED subgroups such as older adults, pregnant patients or immunosuppressed patients were excluded or under-represented.”

Common Pitfalls

  • Equating statistical significance with clinical importance
  • Saying a non-significant result proves no effect
  • Ignoring confidence interval width
  • Forgetting the null value differs by measure
  • Assuming a large sample automatically means a trustworthy study
  • Ignoring absolute risk reduction and focusing only on relative effects
  • Accepting subgroup findings uncritically
  • Confusing association with causation in observational studies
  • Failing to distinguish bias from confounding
  • Ignoring missing data and loss to follow-up
  • Using sensitivity and specificity without considering prevalence, likelihood ratios and pathway use
  • Overlooking whether the primary outcome was actually positive

FRCEM and MRCEM Exam Tips

What the examiner wants to hear

A concise, structured interpretation. For most questions, include:

  • what the p value means
  • what the confidence interval means
  • whether it crosses the null
  • what the width says about precision
  • whether the effect is clinically important
  • whether bias or confounding could explain the finding
  • whether the result applies to UK ED practice

Model OSCE answer stem

“The study reports a statistically significant result because the 95% confidence interval does not cross the null value. However, significance alone is not enough. I would also consider the size of effect, the width of the confidence interval, whether the primary outcome was prespecified and positive, and the risk of bias or confounding. If the interval is wide, precision is limited. Even if the result is statistically significant, the effect may be clinically trivial or not applicable to our ED population, so this would not automatically justify a change in practice.”

High-yield one-line rules

  • Never interpret the p value alone
  • Confidence intervals are usually more informative than p values
  • Crosses 0 for differences, crosses 1 for ratios
  • Wide confidence interval means imprecision
  • Non-significant does not mean no effect
  • Statistical significance does not prove causation
  • Bias can invalidate a precise significant result
  • Primary outcome matters more than secondary outcomes
  • Absolute effects matter more than relative headlines

Unsafe versus safe phrases

Unsafe phrase | Safer phrase
There is no effect | The study did not demonstrate a statistically significant difference
The treatment works | The results are compatible with benefit, subject to study validity and bias
The null hypothesis is false | The data are relatively incompatible with the null hypothesis
This proves equivalence | This does not show a significant difference; equivalence requires an appropriate design and margin
The result is important because p<0.05 | The result is statistically significant, but clinical importance depends on effect size, precision and harms

How This Appears in SBA Questions

Typical question stems

  • “A trial reports RR 0.82, 95% CI 0.68 to 0.99. Which is the best interpretation?”
  • “A study finds p=0.08 for the primary outcome. What is the most appropriate conclusion?”
  • “Which statement about a 95% confidence interval is correct?”
  • “Which bias is most likely if only test-positive patients receive the reference standard?”
  • “A large observational study shows a statistically significant association. What is the main limitation?”

Key discriminator clues

  • A confidence interval that crosses the null usually means the result is not statistically significant in a matching two-sided analysis
  • Wide confidence interval means uncertainty, not reassurance
  • Observational association does not prove causation
  • Diagnostic studies need reference standard and verification checks
  • Primary outcome outweighs post-hoc subgroup excitement

Common wrong answer traps

  • Interpreting p as the probability the null is true
  • Calling a non-significant study “negative” without checking the confidence interval
  • Ignoring absolute risk reduction
  • Assuming adjustment removes all confounding
  • Accepting selective secondary outcomes as definitive

Mini SBA examples

Question 1

A treatment trial reports an odds ratio for admission of 0.74 with 95% CI 0.52 to 1.05. Which is the best interpretation?

  • A. The treatment definitely reduces admission
  • B. The result is statistically significant
  • C. The study did not show a statistically significant reduction in admission
  • D. There is no possible benefit

Best answer: C.

Why: The CI crosses 1, so the result is not statistically significant in the usual two-sided interpretation. The interval still includes possible benefit.

Question 2

A diagnostic study reports sensitivity 97% with 95% CI 89% to 99%. Which is the most appropriate comment?

  • A. The test is safe to rule out disease in all settings
  • B. The lower confidence limit may still be too low for a time-critical rule-out pathway
  • C. The result is invalid because sensitivity has no p value
  • D. The test is superior because sensitivity is above 95%

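Best answer: B.

Why: The point estimate is high, but the true sensitivity could plausibly be as low as 89%, which may miss too many cases for a time-critical rule-out pathway.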

Question 3

An observational study finds a statistically significant association between early antibiotics and lower mortality in sepsis. Which limitation most threatens causal interpretation?

  • A. Type I error only
  • B. Confounding by indication
  • C. Confidence interval width only
  • D. Lack of a p value

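Best answer: B.

Why: In an observational study, the decision to give early antibiotics is linked to how unwell the patient appears, so the association with mortality may reflect who was treated early rather than the effect of the treatment itself.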

Key Takeaways

  • A p value tells you how compatible the data are with the null hypothesis, not whether the treatment works or matters clinically.
  • A 95% confidence interval shows direction, size and precision of the estimate.
  • For differences, the null value is 0. For ratios, the null value is 1.
  • If a confidence interval crosses the null, the result is usually not statistically significant in a matching two-sided analysis.
  • Wide confidence intervals mean imprecision and uncertainty.
  • Non-significant does not mean no effect, no harm, or equivalence.
  • Statistical significance is not the same as clinical significance.
  • Bias can make a result wrong even if p is very small.
  • Confounding is especially important in observational studies.
  • Primary outcomes and prespecified analyses matter more than secondary or post-hoc findings.
  • Always look at absolute effects, harms and applicability to NHS ED practice.
  • In the exam, use the triad: effect size, confidence interval, bias.

Further Reading

  • NICE. Developing NICE guidelines: the manual.
  • RCEM Learning and RCEM examination resources on critical appraisal and evidence-based medicine.
  • SIGN 50. A guideline developer’s handbook.
  • CEBM, University of Oxford. Critical appraisal tools and evidence resources.
  • CONSORT 2010 Statement for reporting randomised trials.
  • STARD 2015 Statement for reporting diagnostic accuracy studies.
  • STROBE Statement for reporting observational studies.
