What does a p-value of 0.05 actually mean?

A p-value of 0.05 means there would be a 5% probability of seeing a result at least this extreme if the null hypothesis (no real effect) were true. It is not the probability that the null is true, nor that the alternative is true.

How do you interpret a 95% confidence interval?

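A 95% CI gives a range of plausible values for the true population value: if the study were repeated many times, about 95% of the intervals calculated this way would contain the true value. If the CI crosses the null value (1 for ratios, 0 for differences), the result is usually not statistically significant at the 5% level. Wide CIs indicate imprecision.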

What is the difference between selection bias and confounding?

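Selection bias arises when the study sample is not representative of the target population, or when the groups being compared differ because of how patients were chosen. Confounding arises when a third variable independently influences both exposure and outcome, distorting the apparent effect. Randomisation tackles confounding by balancing known and unknown confounders between groups on average.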

P Values, Confidence Intervals and Bias Explained Simply

TL;DR: A p-value is the probability of seeing a result at least this extreme if the null hypothesis is true. The CI shows the precision around the estimate. Both matter; neither proves causation.

Last updated: 30 May 2026

Critical Appraisal concepts at a glance

Critical Appraisal: 3 things FRCEM tests

1. p-values and CIs
p less than 0.05 = statistically significant
95% CI crossing 1 for ratios or 0 for differences = NOT significant

2. NNT and NNH
NNT = 1 divided by ARR; lower NNT = better treatment
NNH = 1 divided by ARI; lower NNH = more harm

3. Common biases
Selection bias
Recall bias
Attrition bias
Confounding
Observer bias
Publication bias

The 3 concept groups every FRCEM candidate must master for the Critical Appraisal paper.

P values, confidence intervals and bias are core critical appraisal topics in MRCEM SBA, FRCEM SBA and FRCEM OSCE journal appraisal stations. In emergency medicine, they matter because papers on diagnostics, treatments, prognostic tools and service redesign often look persuasive on first reading. The exam is not testing advanced statistics. It is testing whether you can interpret results safely, recognise weak evidence, and decide whether a study should change practice in a UK ED.

The safest approach is simple: never interpret a p value alone. Always combine effect size, confidence interval, study validity, bias, confounding and applicability to your patients.

Why P Values and Confidence Intervals Matter in FRCEM

Emergency clinicians regularly read studies on:

diagnostic pathways such as troponin rule-out strategies
clinical decision rules such as head injury or PE pathways
treatments such as analgesia, bronchodilators, sedation or sepsis interventions
service delivery changes such as streaming, ambulatory pathways or observation units

A statistically significant result does not automatically mean the intervention works, is clinically important, or is safe to adopt in the NHS. A non-significant result does not automatically mean there is no effect. Small studies may miss important benefit or harm. Large studies may detect trivial differences that do not matter to patients or departments.

In FRCEM and MRCEM, common candidate errors are:

saying p<0.05 means the result is important
saying p>0.05 proves no difference
misreading a confidence interval that crosses the null value
ignoring bias because the result is statistically significant
confusing bias with confounding
failing to comment on clinical significance and applicability

Key Definitions

Use these exam-safe definitions.

Term	Safe exam definition	What it does not tell you
P value	The probability of observing results this extreme, or more extreme, if the null hypothesis were true.	It does not tell you the probability that the null hypothesis is true, the size of the effect, or whether the result is clinically important.
95% confidence interval	A range of values most compatible with the data and model assumptions; in exam practice, a range of plausible values for the true effect.	It is not a guarantee that the true value has a 95% probability of lying within that exact interval.
Null hypothesis	The assumption that there is no true difference or no true association.	It is not the same as proving treatments are equivalent.
Bias	Systematic error in study design, conduct, measurement, analysis or reporting that moves the result away from the truth.	It is not fixed by increasing sample size.
Confounding	Distortion of an exposure-outcome relationship by a third factor associated with both exposure and outcome.	It is not the same as random error.
Precision	How certain the estimate is, usually judged by the width of the confidence interval.	Precise does not mean correct if the study is biased.
Type I error	False positive: concluding there is a difference when none exists.	Not the same as bias.
Type II error	False negative: failing to detect a real difference.	Does not prove no effect.
Power	The probability that a study will detect a prespecified effect size if that effect truly exists.	High power does not rescue a flawed study.

Exam-safe wording:

“The p value suggests how compatible the data are with the null hypothesis.”
“The confidence interval shows direction, size and precision of the estimate.”
“Bias may explain an apparently significant result.”
“A non-significant result does not prove equivalence or absence of effect.”
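To make the p value definition concrete, here is a minimal Python sketch built on an invented example: a hypothetical study in which 15 of 20 patients prefer treatment A over treatment B. Under the null hypothesis each patient is equally likely to prefer either, so the p value is the probability of a split at least this lopsided. The numbers and the function name `two_sided_sign_test_p` are illustrative assumptions, not taken from any real trial.

```python
from math import comb

def two_sided_sign_test_p(successes: int, n: int) -> float:
    """Exact two-sided p value for a 50:50 null (binomial sign test)."""
    # Use whichever side of the split is larger as the observed extreme.
    k = max(successes, n - successes)
    # Probability of k or more out of n when each outcome has probability 0.5.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null distribution is symmetric, so the two-sided value doubles the tail.
    return min(1.0, 2 * upper_tail)

p = two_sided_sign_test_p(15, 20)
print(f"p = {p:.3f}")
```

This prints p = 0.041. That figure describes how surprising the data would be if there were no true preference. It says nothing about how large or clinically important the preference is, and it is not the probability that the null hypothesis is true.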

Essential Pathophysiology

This is a statistics topic rather than a disease process, but there is still a useful underlying framework: observed study results are shaped by chance, bias and confounding.

Concept	Meaning	Practical implication
Chance	Random variation in samples	Produces uncertainty; assessed partly by p values and confidence intervals
Bias	Systematic error	Can make a result wrong even if p is very small and the CI is narrow
Confounding	Mixing of effects from another variable	Common in observational studies; association may not be causal

Think of study interpretation as asking three questions:

Could this result be due to chance?
Could this result be due to bias?
Could this result be due to confounding?

If the answer to the second or third question is yes, a statistically significant result may still be unreliable.
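The difference between chance and bias can be illustrated with a small simulation. This is a sketch under stated assumptions: the true effect is set to zero, every measurement carries the same invented systematic offset (`BIAS`), and the noise is normally distributed. A larger sample narrows the confidence interval but does not move the estimate back towards the truth.

```python
import math
import random
import statistics

TRUE_EFFECT = 0.0   # assumed: the intervention truly does nothing
BIAS = 0.5          # assumed: systematic error favouring the intervention

def mean_and_ci(sample):
    """Sample mean with an approximate 95% confidence interval."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, mean - 1.96 * se, mean + 1.96 * se

rng = random.Random(1)  # fixed seed so the sketch is repeatable
results = {}
for n in (50, 5000):
    # Every measurement carries the same systematic offset plus random noise.
    sample = [rng.gauss(TRUE_EFFECT + BIAS, 1.0) for _ in range(n)]
    results[n] = mean_and_ci(sample)
    mean, lower, upper = results[n]
    print(f"n={n}: estimate {mean:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With the larger sample the interval is much narrower, yet it sits around the biased value and excludes the true effect of zero: a precise, statistically significant and wrong result.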

Clinical Presentation

In exam terms, this topic usually presents as a critical appraisal task rather than a clinical syndrome. Typical formats are:

a trial abstract with p values and confidence intervals
a diagnostic paper reporting sensitivity and specificity
a forest plot or summary table
a short stem asking what a p value or confidence interval means
an OSCE station asking whether a paper should change ED practice

The examiner usually wants a structured interpretation, not a mathematical derivation.

A strong answer usually covers:

study design and internal validity
primary outcome
effect estimate
whether the confidence interval crosses the null
precision
clinical importance
bias and confounding
applicability to UK ED practice

Red Flags and High-Risk Features

These are the appraisal red flags that should make you cautious.

Primary outcome negative, but paper emphasises positive secondary outcomes
Multiple subgroup analyses with isolated significant findings
Wide confidence intervals despite a “positive” headline
Large relative effect but tiny absolute benefit
Observational study implying causation without adequate adjustment
High loss to follow-up or missing outcome data
Poorly described randomisation or allocation concealment
No blinding where outcome assessment is subjective
Outcome switching from protocol to publication
Selective reporting of favourable outcomes only
Diagnostic study with weak reference standard or partial verification
Study population unlike a UK ED case-mix

Exam phrase:

“Even if statistically significant, the result may not be reliable if there is important risk of bias, confounding or selective reporting.”

Differential Diagnosis

When a paper reports a striking result, the differential diagnosis for that result is:

true effect
chance finding
bias
confounding
measurement error
selective reporting or p-hacking
random high estimate from a small study

This is a useful OSCE mindset. Do not assume the observed effect is real just because the p value is small.

Initial ED Assessment

For exam appraisal, use a rapid first-pass structure.

30-second approach

Identify the study design
Find the primary outcome
Find the main effect estimate
Check the 95% confidence interval
Ask whether it crosses the null value
Comment on width and clinical importance
Look for obvious bias or confounding
Decide whether it applies to your ED patients

Two-minute OSCE structure

State the design and whether it is appropriate for the question
Comment on internal validity
Interpret the effect estimate and confidence interval
State whether the result is statistically significant
Comment on precision and clinical significance
Discuss bias, confounding and reporting issues
Comment on external validity and whether practice should change

Safe model phrase:

“The result appears statistically significant because the 95% confidence interval does not cross the null value, but I would still want to assess the size of effect, precision, risk of bias, and applicability before changing practice.”

Investigations

In this context, the “investigations” are the statistical outputs and methodological features you should inspect.

1. Know the null value

Measure	Null value	Interpretation rule
Mean difference	0	If the 95% CI crosses 0, it is usually not statistically significant in a matching two-sided analysis
Risk difference / absolute risk reduction	0	If the 95% CI crosses 0, benefit may include no effect
Relative risk	1	If the 95% CI crosses 1, it is usually not statistically significant in a matching two-sided analysis
Odds ratio	1	Same rule
Hazard ratio	1	Same rule

This rule applies when the confidence interval and p value come from the same two-sided analysis.
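The table above can be expressed as a small lookup. The following is a minimal sketch; the function name `crosses_null` and the example intervals are invented for illustration.

```python
NULL_VALUES = {
    "mean difference": 0.0,
    "risk difference": 0.0,
    "relative risk": 1.0,
    "odds ratio": 1.0,
    "hazard ratio": 1.0,
}

def crosses_null(measure: str, lower: float, upper: float) -> bool:
    """True if the confidence interval includes the null value for this measure."""
    null = NULL_VALUES[measure]
    return lower <= null <= upper

# Illustrative intervals (invented for this sketch).
print(crosses_null("relative risk", 0.70, 0.95))    # False: excludes 1
print(crosses_null("mean difference", -1.0, 3.0))   # True: includes 0
```

The check only answers the significance question. Width and clinical importance still need separate judgement.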

2. Interpret p values safely

If p is small:

the data are relatively incompatible with the null hypothesis
this supports a real difference or association only if the study is valid and analysis appropriate

If p is large:

the study did not show a statistically significant difference
do not say there is definitely no effect
consider low power, few events and imprecision

Do not say:

“p=0.03 means there is a 97% chance the treatment works”
“p>0.05 means there is no difference”

Say instead:

“p=0.03 suggests the observed data would be relatively unlikely if the null hypothesis were true”
“p>0.05 means the study did not demonstrate a statistically significant difference”
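The warning about low power can be made concrete with a rough calculation. The sketch below uses a standard normal approximation for comparing two proportions; the event rates (2% versus 1%) and the sample sizes are assumptions chosen for illustration, and `approx_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control: float, p_treat: float, n_per_group: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test comparing two proportions."""
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treat * (1 - p_treat) / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Normal approximation; ignores the negligible chance of a significant
    # result in the wrong direction.
    return NormalDist().cdf(abs(p_control - p_treat) / se - z_crit)

for n in (200, 5000):
    print(f"n per group = {n}: power is about {approx_power(0.02, 0.01, n):.2f}")
```

Under these assumptions the study with 200 patients per group would usually fail to detect a real halving of the event rate, so a large p value from such a study is weak evidence of no effect.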

3. Interpret confidence intervals properly

Ask three questions in order:

Does it cross the null value?
How wide is it?
Are all plausible values clinically important, or do they include trivial effect, important benefit, or harm?

Examples:

RR 0.78, 95% CI 0.61 to 0.99: statistically significant, but effect may be modest
OR 0.84, 95% CI 0.65 to 1.10: not statistically significant; compatible with benefit or no effect
Mean difference 1.2, 95% CI -0.3 to 2.7: not statistically significant and imprecise
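For ratio measures, the link between the confidence interval and the p value can be checked approximately by working on the log scale. The sketch below back-calculates a two-sided p value from the two ratio examples above. It assumes a normal approximation and that the published interval is a standard 95% CI; `p_from_ratio_ci` is a hypothetical helper name, and the result is only as accurate as the rounded figures fed into it.

```python
from math import erfc, log, sqrt

def p_from_ratio_ci(estimate: float, lower: float, upper: float) -> float:
    """Approximate two-sided p value from a ratio and its 95% CI."""
    # Ratios are analysed on the log scale, where the CI is roughly symmetric.
    se = (log(upper) - log(lower)) / (2 * 1.96)
    z = log(estimate) / se
    # Two-sided tail area of the standard normal distribution.
    return erfc(abs(z) / sqrt(2))

print(f"RR 0.78 (0.61 to 0.99): p is about {p_from_ratio_ci(0.78, 0.61, 0.99):.2f}")
print(f"OR 0.84 (0.65 to 1.10): p is about {p_from_ratio_ci(0.84, 0.65, 1.10):.2f}")
```

The interval that excludes 1 gives a p value just under 0.05, and the interval that crosses 1 gives a p value above 0.05, which is the agreement the crossing-the-null rule relies on.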

4. Absolute versus relative effects

Relative measures can exaggerate apparent importance. Always look for absolute risk reduction and, where relevant, number needed to treat.

Measure	Why it matters
Relative risk reduction	Can sound impressive but may hide a tiny absolute benefit
Absolute risk reduction	Shows actual difference in event rates
NNT / NNH	Helps judge practical value and harm

Example:

A treatment reduces admission from 2% to 1%. Relative risk reduction is 50%, which sounds large. Absolute risk reduction is 1 percentage point, so NNT is 100. That may or may not be worthwhile depending on cost, harms and context.
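The arithmetic in this example can be sketched in a few lines of Python. The function name is hypothetical, and the second scenario (20% falling to 10%) is an added assumption to show how the same relative reduction gives a very different NNT at a higher baseline risk.

```python
def absolute_and_relative_effects(control_risk: float, treated_risk: float) -> dict:
    """ARR, RRR and NNT from two event rates (undefined if the rates are equal)."""
    arr = control_risk - treated_risk
    rrr = arr / control_risk
    nnt = 1 / arr  # by convention usually rounded up to a whole patient
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}

for control, treated in ((0.02, 0.01), (0.20, 0.10)):
    e = absolute_and_relative_effects(control, treated)
    print(f"{control:.0%} to {treated:.0%}: "
          f"ARR = {e['ARR']:.1%}, RRR = {e['RRR']:.0%}, NNT = {e['NNT']:.0f}")
```

Both scenarios have a relative risk reduction of 50%, but the NNT is 100 in the first and 10 in the second.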

5. Primary outcome versus secondary outcomes

In appraisal, the primary outcome matters most.

If the primary outcome is negative but a secondary outcome is positive, be cautious
Secondary outcomes are more vulnerable to chance findings, especially if multiple are tested
Subgroup analyses are hypothesis-generating unless clearly prespecified and biologically plausible
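The reason multiple secondary or subgroup analyses are vulnerable to chance can be shown with one line of arithmetic. This sketch assumes the tests are independent and that no true effect exists in any of them, which is a simplification; `chance_of_any_false_positive` is a hypothetical helper name.

```python
def chance_of_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent tests is 'significant' by chance."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20):
    print(f"{n_tests} tests: {chance_of_any_false_positive(n_tests):.0%}")
```

Under these assumptions, testing 20 outcomes or subgroups gives roughly a two in three chance of at least one p value below 0.05 even when nothing is going on.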

6. Adjusted versus unadjusted results

In observational studies, adjusted analyses are usually more informative than crude unadjusted comparisons, but adjustment only works for measured confounders. Residual confounding may remain.

Exam phrase:

“The adjusted analysis may reduce confounding, but it cannot fully remove bias from unmeasured or poorly measured confounders.”
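A minimal sketch of confounding by indication follows, using an invented cohort in which sicker patients are more likely to receive the treatment. Stratifying by severity is used here as the simplest form of adjustment; all counts and names are assumptions for illustration only.

```python
# Invented cohort: (deaths, patients) by illness severity and treatment.
cohort = {
    "mild":   {"treated": (5, 100),   "untreated": (45, 900)},
    "severe": {"treated": (270, 900), "untreated": (30, 100)},
}

def risk(deaths: int, patients: int) -> float:
    return deaths / patients

def relative_risk(treated, untreated) -> float:
    return risk(*treated) / risk(*untreated)

def pooled(arm: str):
    """Add up deaths and patients across severity strata for one arm."""
    deaths = sum(stratum[arm][0] for stratum in cohort.values())
    patients = sum(stratum[arm][1] for stratum in cohort.values())
    return deaths, patients

crude_rr = relative_risk(pooled("treated"), pooled("untreated"))
print(f"Crude RR: {crude_rr:.2f}")
for severity, arms in cohort.items():
    rr = relative_risk(arms["treated"], arms["untreated"])
    print(f"RR within {severity} stratum: {rr:.2f}")
```

The crude comparison makes the treatment look harmful, while within each severity stratum the risk is identical in treated and untreated patients. The correction is only possible because severity was recorded; an unmeasured confounder would leave the crude distortion in place.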

7. Diagnostic test studies

For a single sensitivity or specificity estimate, there is no null line equivalent to 0 or 1 in the same way as comparative effect measures. Focus on:

point estimate
confidence interval width
whether the lower bound is clinically acceptable
reference standard quality
spectrum of patients studied
whether the test is used alone or within a pathway

Likelihood ratios are often more clinically useful than sensitivity and specificity alone.

Metric	Use
Sensitivity	How often the test is positive when disease is present
Specificity	How often the test is negative when disease is absent
Positive likelihood ratio	How much a positive result increases probability of disease
Negative likelihood ratio	How much a negative result reduces probability of disease
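Likelihood ratios follow directly from sensitivity and specificity, and they convert a pre-test probability into a post-test probability via odds. The sketch below uses assumed values (sensitivity 90%, specificity 80%, pre-test probability 10%) purely for illustration; the function names are hypothetical.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert pre-test probability to post-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Assumed values for illustration only.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")
print("Post-test probability after a negative test: "
      f"{post_test_probability(0.10, lr_neg):.1%}")
```

The same test gives a different post-test probability when the pre-test probability changes, which is why prevalence and pathway use matter when interpreting diagnostic accuracy.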

Example:

Sensitivity 92%, 95% CI 88% to 96% means the best estimate is 92%, but the true sensitivity could plausibly be as low as 88%. In a time-critical rule-out pathway, ask whether that lower bound is safe enough for the acceptable miss rate.
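To see what the lower bound means in practice, the sketch below converts the point estimate and lower confidence limit from this example into expected missed cases per 1,000 patients who truly have the disease. The per-1,000 framing and the function name are illustrative assumptions.

```python
def missed_per_1000_with_disease(sensitivity: float) -> int:
    """Expected false negatives per 1,000 patients who truly have the disease."""
    return round((1 - sensitivity) * 1000)

for label, sensitivity in (("point estimate", 0.92), ("lower 95% limit", 0.88)):
    print(f"{label} ({sensitivity:.0%}): "
          f"{missed_per_1000_with_disease(sensitivity)} missed per 1,000 with disease")
```

At the point estimate about 80 cases per 1,000 would be missed; at the lower limit the figure is about 120. Whether either is acceptable depends on the condition and the pathway.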

Management in the Emergency Department

For this topic, “management” means how to manage a paper, a result, or an exam question.

Step-by-step critical appraisal approach

Step 1: Check validity before numbers

Always start with internal validity.

For randomised controlled trials, ask:

Was randomisation truly random?
Was allocation concealed?
Were groups similar at baseline?
Were patients, clinicians and outcome assessors blinded where possible?
Was follow-up complete?
Was analysis by intention to treat?
Was the primary outcome prespecified?

For observational studies, ask:

How were participants selected?
Could selection bias explain the result?
Were important confounders measured and adjusted for?
Was exposure measured accurately?
Was outcome assessment objective and complete?

For diagnostic studies, ask:

Was there an appropriate reference standard?
Did all or most patients receive the same reference standard?
Was interpretation blinded?
Was the study population representative of ED practice?
Was there spectrum bias or verification bias?

Step 2: Identify the effect estimate

Do not focus first on the p value. Find the actual result:

mean difference
risk difference
relative risk
odds ratio
hazard ratio
sensitivity, specificity or likelihood ratios

Step 3: Assess precision

Use the confidence interval.

Does it cross the null?
Is it narrow or wide?
Does it include clinically important benefit or harm?

Step 4: Look for bias and confounding

Problem	Meaning	Example in EM research
Selection bias	Groups differ because of how patients were chosen	Convenience sampling of low-risk chest pain patients in office hours only
Performance bias	Groups receive different co-interventions apart from the study intervention	One sedation group gets more senior supervision
Detection bias	Outcome assessment differs between groups	Unblinded assessor rates pain scores
Attrition bias	Loss to follow-up differs between groups	More missing 30-day outcomes in one arm
Reporting bias	Selective reporting of favourable outcomes	Paper highlights positive secondary outcomes after a neutral primary outcome
Verification bias	Not all patients receive the reference standard	Only test-positive patients get CT or angiography
Spectrum bias	Study population is unrepresentative	Diagnostic rule tested only in obvious disease and obvious non-disease
Confounding	Third factor distorts association	Sicker patients preferentially receive a treatment in a cohort study

Important distinction:

Bias is systematic error
Confounding is distortion by another variable
Chance is random variation

Step 5: Decide whether it should change practice

Even a valid statistically significant result may not justify a change in ED practice unless:

the effect is clinically important
benefits outweigh harms
the outcome is patient-centred
the intervention is feasible in an NHS ED
the population resembles your patients
the result fits with wider evidence and guidance

Immediate versus later care

Immediate exam response:

state significance correctly
comment on precision
identify obvious bias
avoid overclaiming

Later, fuller appraisal:

review protocol and prespecified outcomes
check sample size calculation
compare absolute and relative effects
look for systematic reviews and guideline context
consider implementation, cost and harms

Disposition, Referral and Follow-Up

For a paper rather than a patient, disposition means what you do with the evidence.

Adopt cautiously if the study is valid, effect clinically important, precision acceptable, and findings consistent with wider evidence
Do not change practice on the basis of a single biased or underpowered study
Escalate to local governance, guideline or specialty discussion before implementing major pathway changes
In exams, conclude with a cautious practice statement rather than a binary yes or no

Safe conclusion phrases:

“This study suggests possible benefit, but limitations in precision and risk of bias mean it is insufficient on its own to change practice.”
“The result is statistically significant and reasonably precise, but I would still consider clinical importance, harms and applicability to our ED population.”
“This is a neutral study rather than proof of no effect, because the confidence interval still includes clinically important benefit and harm.”

Special Groups

This is not a patient-group topic in the usual sense, but applicability often differs across populations.

Paediatrics

Adult evidence may not apply to children
Decision rules and diagnostic thresholds often differ
Small paediatric studies may be underpowered

Pregnancy

Pregnant patients are often excluded from trials
External validity may therefore be poor
Diagnostic pathways may differ because of imaging and risk considerations

Older adults

Frailty, multimorbidity and polypharmacy may reduce applicability of trial results
Outcomes important to older adults may differ from trial endpoints

Immunosuppressed or complex patients

Often under-represented in trials
Diagnostic test performance may differ because disease spectrum differs

Exam phrase:

“External validity is limited if important ED subgroups such as older adults, pregnant patients or immunosuppressed patients were excluded or under-represented.”

Common Pitfalls

Equating statistical significance with clinical importance
Saying a non-significant result proves no effect
Ignoring confidence interval width
Forgetting the null value differs by measure
Assuming a large sample automatically means a trustworthy study
Ignoring absolute risk reduction and focusing only on relative effects
Accepting subgroup findings uncritically
Confusing association with causation in observational studies
Failing to distinguish bias from confounding
Ignoring missing data and loss to follow-up
Using sensitivity and specificity without considering prevalence, likelihood ratios and pathway use
Overlooking whether the primary outcome was actually positive

FRCEM and MRCEM Exam Tips

What the examiner wants to hear

A concise, structured interpretation. For most questions, include:

what the p value means
what the confidence interval means
whether it crosses the null
what the width says about precision
whether the effect is clinically important
whether bias or confounding could explain the finding
whether the result applies to UK ED practice

Model OSCE answer stem

“The study reports a statistically significant result because the 95% confidence interval does not cross the null value. However, significance alone is not enough. I would also consider the size of effect, the width of the confidence interval, whether the primary outcome was prespecified and positive, and the risk of bias or confounding. If the interval is wide, precision is limited. Even if the result is statistically significant, the effect may be clinically trivial or not applicable to our ED population, so this would not automatically justify a change in practice.”

High-yield one-line rules

Never interpret the p value alone
Confidence intervals are usually more informative than p values
Crosses 0 for differences, crosses 1 for ratios
Wide confidence interval means imprecision
Non-significant does not mean no effect
Statistical significance does not prove causation
Bias can invalidate a precise significant result
Primary outcome matters more than secondary outcomes
Absolute effects matter more than relative headlines

Unsafe versus safe phrases

Unsafe phrase	Safer phrase
There is no effect	The study did not demonstrate a statistically significant difference
The treatment works	The results are compatible with benefit, subject to study validity and bias
The null hypothesis is false	The data are relatively incompatible with the null hypothesis
This proves equivalence	This does not show a significant difference; equivalence requires an appropriate design and margin
The result is important because p<0.05	The result is statistically significant, but clinical importance depends on effect size, precision and harms

How This Appears in SBA Questions

Typical question stems

“A trial reports RR 0.82, 95% CI 0.68 to 0.99. Which is the best interpretation?”
“A study finds p=0.08 for the primary outcome. What is the most appropriate conclusion?”
“Which statement about a 95% confidence interval is correct?”
“Which bias is most likely if only test-positive patients receive the reference standard?”
“A large observational study shows a statistically significant association. What is the main limitation?”

Key discriminator clues

Crossing the null usually means not statistically significant in a matching two-sided analysis
Wide confidence interval means uncertainty, not reassurance
Observational association does not prove causation
Diagnostic studies need reference standard and verification checks
Primary outcome outweighs post-hoc subgroup excitement

Common wrong answer traps

Interpreting p as the probability the null is true
Calling a non-significant study “negative” without checking the confidence interval
Ignoring absolute risk reduction
Assuming adjustment removes all confounding
Accepting selective secondary outcomes as definitive

Mini SBA examples

Question 1

A treatment trial reports an odds ratio for admission of 0.74 with 95% CI 0.52 to 1.05. Which is the best interpretation?

A. The treatment definitely reduces admission
B. The result is statistically significant
C. The study did not show a statistically significant reduction in admission
D. There is no possible benefit

Best answer: C.

Why: The CI crosses 1, so the result is not statistically significant in the usual two-sided interpretation. The interval still includes possible benefit.

Question 2

A diagnostic study reports sensitivity 97% with 95% CI 89% to 99%. Which is the most appropriate comment?

A. The test is safe to rule out disease in all settings
B. The lower confidence limit may still be too low for a time-critical rule-out pathway
C. The result is invalid because sensitivity has no p value
D. The test is superior because sensitivity is above 95%

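Best answer: B.

Why: The point estimate is high, but the true sensitivity could plausibly be as low as 89%. Whether that is safe depends on the acceptable miss rate for the pathway.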

Question 3

An observational study finds a statistically significant association between early antibiotics and lower mortality in sepsis. Which limitation most threatens causal interpretation?

A. Type I error only
B. Confounding by indication
C. Confidence interval width only
D. Lack of a p value

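Best answer: B.

Why: In an observational study, patients who receive early antibiotics may differ systematically from those who do not, for example in illness severity, so the association may not be causal.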

Key Takeaways

A p value tells you how compatible the data are with the null hypothesis, not whether the treatment works or matters clinically.
A 95% confidence interval shows direction, size and precision of the estimate.
For differences, the null value is 0. For ratios, the null value is 1.
If a confidence interval crosses the null, the result is usually not statistically significant in a matching two-sided analysis.
Wide confidence intervals mean imprecision and uncertainty.
Non-significant does not mean no effect, no harm, or equivalence.
Statistical significance is not the same as clinical significance.
Bias can make a result wrong even if p is very small.
Confounding is especially important in observational studies.
Primary outcomes and prespecified analyses matter more than secondary or post-hoc findings.
Always look at absolute effects, harms and applicability to NHS ED practice.
In the exam, use the triad: effect size, confidence interval, bias.
