P Values, Confidence Intervals and Bias Explained Simply
TL;DR: A p value is the probability of seeing a result this extreme, or more extreme, if the null hypothesis is true. The confidence interval gives the precision around the estimate. Both matter; neither proves causation.
Last updated: 30 May 2026
[Infographic: Critical appraisal concepts at a glance. 3 things FRCEM tests. Recoverable labels: statistically significant versus not significant; null value of 0 for differences; lower NNT = better treatment; lower NNH = more harm.]
P values, confidence intervals and bias are core critical appraisal topics in MRCEM SBA, FRCEM SBA and FRCEM OSCE journal appraisal stations. In emergency medicine, they matter because papers on diagnostics, treatments, prognostic tools and service redesign often look persuasive on first reading. The exam is not testing advanced statistics. It is testing whether you can interpret results safely, recognise weak evidence, and decide whether a study should change practice in a UK ED.
The safest approach is simple: never interpret a p value alone. Always combine effect size, confidence interval, study validity, bias, confounding and applicability to your patients.
Why P Values and Confidence Intervals Matter in FRCEM
Emergency clinicians regularly read studies on:
- diagnostic pathways such as troponin rule-out strategies
- clinical decision rules such as head injury or PE pathways
- treatments such as analgesia, bronchodilators, sedation or sepsis interventions
- service delivery changes such as streaming, ambulatory pathways or observation units
A statistically significant result does not automatically mean the intervention works, is clinically important, or is safe to adopt in the NHS. A non-significant result does not automatically mean there is no effect. Small studies may miss important benefit or harm. Large studies may detect trivial differences that do not matter to patients or departments.
In FRCEM and MRCEM, common candidate errors are:
- saying p<0.05 means the result is important
- saying p>0.05 proves no difference
- misreading a confidence interval that crosses the null value
- ignoring bias because the result is statistically significant
- confusing bias with confounding
- failing to comment on clinical significance and applicability
Key Definitions
Use these exam-safe definitions.
| Term | Safe exam definition | What it does not tell you |
|---|---|---|
| P value | The probability of observing results this extreme, or more extreme, if the null hypothesis were true. | It does not tell you the probability that the null hypothesis is true, the size of the effect, or whether the result is clinically important. |
| 95% confidence interval | A range of values most compatible with the data and model assumptions; in exam practice, a range of plausible values for the true effect. | It is not a guarantee that the true value has a 95% probability of lying within that exact interval. |
| Null hypothesis | The assumption that there is no true difference or no true association. | It is not the same as proving treatments are equivalent. |
| Bias | Systematic error in study design, conduct, measurement, analysis or reporting that moves the result away from the truth. | It is not fixed by increasing sample size. |
| Confounding | Distortion of an exposure-outcome relationship by a third factor associated with both exposure and outcome. | It is not the same as random error. |
| Precision | How certain the estimate is, usually judged by the width of the confidence interval. | Precise does not mean correct if the study is biased. |
| Type I error | False positive: concluding there is a difference when none exists. | Not the same as bias. |
| Type II error | False negative: failing to detect a real difference. | Does not prove no effect. |
| Power | The probability that a study will detect a prespecified effect size if that effect truly exists. | High power does not rescue a flawed study. |
Exam-safe wording:
- “The p value suggests how compatible the data are with the null hypothesis.”
- “The confidence interval shows direction, size and precision of the estimate.”
- “Bias may explain an apparently significant result.”
- “A non-significant result does not prove equivalence or absence of effect.”
Essential Pathophysiology
This is a statistics topic rather than a disease process, but there is still a useful underlying framework: observed study results are shaped by chance, bias and confounding.
| Concept | Meaning | Practical implication |
|---|---|---|
| Chance | Random variation in samples | Produces uncertainty; assessed partly by p values and confidence intervals |
| Bias | Systematic error | Can make a result wrong even if p is very small and the CI is narrow |
| Confounding | Mixing of effects from another variable | Common in observational studies; association may not be causal |
Think of study interpretation as asking three questions:
- Could this result be due to chance?
- Could this result be due to bias?
- Could this result be due to confounding?
If the answer to the second or third question is yes, a statistically significant result may still be unreliable.
Clinical Presentation
In exam terms, this topic usually presents as a critical appraisal task rather than a clinical syndrome. Typical formats are:
- a trial abstract with p values and confidence intervals
- a diagnostic paper reporting sensitivity and specificity
- a forest plot or summary table
- a short stem asking what a p value or confidence interval means
- an OSCE station asking whether a paper should change ED practice
The examiner usually wants a structured interpretation, not a mathematical derivation.
A strong answer usually covers:
- study design and internal validity
- primary outcome
- effect estimate
- whether the confidence interval crosses the null
- precision
- clinical importance
- bias and confounding
- applicability to UK ED practice
Red Flags and High-Risk Features
These are the appraisal red flags that should make you cautious.
- Primary outcome negative, but paper emphasises positive secondary outcomes
- Multiple subgroup analyses with isolated significant findings
- Wide confidence intervals despite a “positive” headline
- Large relative effect but tiny absolute benefit
- Observational study implying causation without adequate adjustment
- High loss to follow-up or missing outcome data
- Poorly described randomisation or allocation concealment
- No blinding where outcome assessment is subjective
- Outcome switching from protocol to publication
- Selective reporting of favourable outcomes only
- Diagnostic study with weak reference standard or partial verification
- Study population unlike a UK ED case-mix
Exam phrase:
“Even if statistically significant, the result may not be reliable if there is important risk of bias, confounding or selective reporting.”
Differential Diagnosis
When a paper reports a striking result, the differential diagnosis for that result is:
- true effect
- chance finding
- bias
- confounding
- measurement error
- selective reporting or p-hacking
- random high estimate from a small study
This is a useful OSCE mindset. Do not assume the observed effect is real just because the p value is small.
Initial ED Assessment
For exam appraisal, use a rapid first-pass structure.
30-second approach
- Identify the study design
- Find the primary outcome
- Find the main effect estimate
- Check the 95% confidence interval
- Ask whether it crosses the null value
- Comment on width and clinical importance
- Look for obvious bias or confounding
- Decide whether it applies to your ED patients
Two-minute OSCE structure
- State the design and whether it is appropriate for the question
- Comment on internal validity
- Interpret the effect estimate and confidence interval
- State whether the result is statistically significant
- Comment on precision and clinical significance
- Discuss bias, confounding and reporting issues
- Comment on external validity and whether practice should change
Safe model phrase:
“The result appears statistically significant because the 95% confidence interval does not cross the null value, but I would still want to assess the size of effect, precision, risk of bias, and applicability before changing practice.”
Investigations
In this context, the “investigations” are the statistical outputs and methodological features you should inspect.
1. Know the null value
| Measure | Null value | Interpretation rule |
|---|---|---|
| Mean difference | 0 | If the 95% CI crosses 0, it is usually not statistically significant in a matching two-sided analysis |
| Risk difference / absolute risk reduction | 0 | If the 95% CI crosses 0, the data are compatible with no effect as well as with benefit |
| Relative risk | 1 | If the 95% CI crosses 1, it is usually not statistically significant in a matching two-sided analysis |
| Odds ratio | 1 | Same rule |
| Hazard ratio | 1 | Same rule |
This rule applies when the confidence interval and p value come from the same two-sided analysis.
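The null-value rule in the table can be sketched as a small lookup. This is a minimal illustration rather than a statistical test: the dictionary and function names are hypothetical, and the intervals passed in are illustrative numbers only.

```python
# Null values for common effect measures, as listed in the table above.
NULL_VALUES = {
    "mean difference": 0.0,
    "risk difference": 0.0,
    "relative risk": 1.0,
    "odds ratio": 1.0,
    "hazard ratio": 1.0,
}


def ci_crosses_null(measure: str, lower: float, upper: float) -> bool:
    """Return True if the interval [lower, upper] contains the null value."""
    null = NULL_VALUES[measure.lower()]
    return lower <= null <= upper


print(ci_crosses_null("relative risk", 0.61, 0.99))   # False
print(ci_crosses_null("odds ratio", 0.65, 1.10))      # True
print(ci_crosses_null("mean difference", -0.3, 2.7))  # True
```

A `True` result only says the interval includes the null value. It says nothing about width, clinical importance or bias, which still need separate comment.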
2. Interpret p values safely
If p is small:
- the data are relatively incompatible with the null hypothesis
- this supports a real difference or association only if the study is valid and analysis appropriate
If p is large:
- the study did not show a statistically significant difference
- do not say there is definitely no effect
- consider low power, few events and imprecision
Do not say:
- “p=0.03 means there is a 97% chance the treatment works”
- “p>0.05 means there is no difference”
Say instead:
- “p=0.03 suggests the observed data would be relatively unlikely if the null hypothesis were true”
- “p>0.05 means the study did not demonstrate a statistically significant difference”
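A short numerical sketch may help show why a p value should not be read alone. It assumes a simple pooled two-proportion z test with a normal approximation; the function name and all event counts are hypothetical and chosen only for illustration. The same observed difference (10% versus 8%) is analysed in a small and a large study.

```python
import math


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p value from a pooled two-proportion z test (normal approximation)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical studies with identical event rates: 10% versus 8%
p_small = two_proportion_p_value(10, 100, 8, 100)
p_large = two_proportion_p_value(1000, 10000, 800, 10000)

print(f"small study p = {p_small:.3f}")
print(f"large study p = {p_large:.2g}")
```

Under these assumptions the small study is non-significant and the large study is highly significant, although the observed difference is identical. The p value reflects sample size as well as effect size, so it cannot replace the effect estimate and its confidence interval.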
3. Interpret confidence intervals properly
Ask three questions in order:
- Does it cross the null value?
- How wide is it?
- Are all plausible values clinically important, or do they include trivial effect, important benefit, or harm?
Examples:
- RR 0.78, 95% CI 0.61 to 0.99: statistically significant, but effect may be modest
- OR 0.84, 95% CI 0.65 to 1.10: not statistically significant; compatible with benefit or no effect
- Mean difference 1.2, 95% CI -0.3 to 2.7: not statistically significant and imprecise
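The link between a ratio's confidence interval and its p value can be sketched as follows. This is an approximation, assuming the 95% interval was calculated on the log scale with a normal approximation; the function name is hypothetical, and a paper's reported p value may differ if another method was used.

```python
import math


def p_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate two-sided p value for a ratio (RR, OR, HR) from its 95% CI.

    Assumes the interval is symmetric on the log scale.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    return math.erfc(abs(z) / math.sqrt(2))


# The two ratio examples from the list above
p_rr = p_from_ratio_ci(0.78, 0.61, 0.99)
p_or = p_from_ratio_ci(0.84, 0.65, 1.10)

print(f"RR 0.78 (0.61 to 0.99): p is about {p_rr:.2f}")
print(f"OR 0.84 (0.65 to 1.10): p is about {p_or:.2f}")
```

The interval that excludes 1 gives p below 0.05 and the interval that includes 1 gives p above 0.05, which is the matching two-sided rule in numerical form.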
4. Absolute versus relative effects
Relative measures can exaggerate apparent importance. Always look for absolute risk reduction and, where relevant, number needed to treat.
| Measure | Why it matters |
|---|---|
| Relative risk reduction | Can sound impressive but may hide a tiny absolute benefit |
| Absolute risk reduction | Shows actual difference in event rates |
| NNT / NNH | Helps judge practical value and harm |
Example:
A treatment reduces admission from 2% to 1%. Relative risk reduction is 50%, which sounds large. Absolute risk reduction is only 1 percentage point, so the NNT is 100 (1 / 0.01). That may or may not be worthwhile depending on cost, harms and context.
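The arithmetic in this example can be written out as a short sketch. The function name is hypothetical; the risks are the 2% and 1% admission rates from the example.

```python
def absolute_and_relative_effects(control_risk, treatment_risk):
    """Return (ARR, RRR, NNT) for a reduction in an unwanted outcome."""
    arr = control_risk - treatment_risk   # absolute risk reduction
    rrr = arr / control_risk              # relative risk reduction
    nnt = 1 / arr                         # number needed to treat
    return arr, rrr, nnt


arr, rrr, nnt = absolute_and_relative_effects(0.02, 0.01)
print(f"ARR {arr:.1%}, RRR {rrr:.0%}, NNT {nnt:.0f}")  # ARR 1.0%, RRR 50%, NNT 100
```

The same 50% relative reduction from a 20% baseline (20% to 10%) would give an NNT of 10, which is why the baseline risk matters when judging a relative headline.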
5. Primary outcome versus secondary outcomes
In appraisal, the primary outcome matters most.
- If the primary outcome is negative but a secondary outcome is positive, be cautious
- Secondary outcomes are more vulnerable to chance findings, especially if multiple are tested
- Subgroup analyses are hypothesis-generating unless clearly prespecified and biologically plausible
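A rough sketch of why multiple secondary outcomes invite chance findings: if each test uses a 0.05 threshold and the tests are assumed to be independent (a simplification, since real outcomes are usually correlated), the chance of at least one false positive grows quickly. The function name is hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    print(f"{n_tests:>2} tests: {familywise_error_rate(n_tests):.0%}")
```

With ten independent outcomes the chance of at least one spurious "significant" result is roughly 40%, which is why an isolated positive secondary or subgroup finding deserves caution.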
6. Adjusted versus unadjusted results
In observational studies, adjusted analyses are usually more informative than crude unadjusted comparisons, but adjustment only works for measured confounders. Residual confounding may remain.
Exam phrase:
“The adjusted analysis may reduce confounding, but it cannot fully remove bias from unmeasured or poorly measured confounders.”
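Confounding by a measured factor can be illustrated with a stratified sketch. All counts below are hypothetical and constructed so that sicker patients are more likely to be treated. The adjustment shown is a Mantel-Haenszel pooled risk ratio, used here as one simple method; it is not necessarily what any given paper used.

```python
# Hypothetical cohort: (deaths_treated, n_treated, deaths_untreated, n_untreated)
strata = {
    "low severity": (5, 100, 20, 400),
    "high severity": (120, 400, 30, 100),
}


def crude_risk_ratio(strata):
    """Risk ratio ignoring severity (all strata pooled into one table)."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Risk ratio adjusted for the stratifying variable (Mantel-Haenszel)."""
    numerator = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
    denominator = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
    return numerator / denominator


print(f"crude RR    = {crude_risk_ratio(strata):.2f}")
print(f"adjusted RR = {mantel_haenszel_risk_ratio(strata):.2f}")
```

In this constructed example the crude risk ratio suggests harm, while within each severity stratum the risks are identical and the adjusted ratio is 1. Adjustment works here only because severity was measured; an unmeasured confounder would leave the crude distortion in place.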
7. Diagnostic test studies
For a single sensitivity or specificity estimate, there is no null value equivalent to the 0 or 1 used for comparative effect measures. Focus on:
- point estimate
- confidence interval width
- whether the lower bound is clinically acceptable
- reference standard quality
- spectrum of patients studied
- whether the test is used alone or within a pathway
Likelihood ratios are often more clinically useful than sensitivity and specificity alone.
| Metric | Use |
|---|---|
| Sensitivity | How often the test is positive when disease is present |
| Specificity | How often the test is negative when disease is absent |
| Positive likelihood ratio | How much a positive result increases probability of disease |
| Negative likelihood ratio | How much a negative result reduces probability of disease |
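The likelihood ratios in the table follow directly from sensitivity and specificity, and can be applied to a pre-test probability through odds. A minimal sketch, with hypothetical function names and illustrative values (sensitivity 92%, specificity 80%, pre-test probability 10%) that are assumptions rather than figures from any study:

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


lr_pos, lr_neg = likelihood_ratios(0.92, 0.80)
after_negative = post_test_probability(0.10, lr_neg)
after_positive = post_test_probability(0.10, lr_pos)

print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
print(f"after a negative test: {after_negative:.1%}")
print(f"after a positive test: {after_positive:.1%}")
```

The same test gives a different post-test probability at a different pre-test probability, which is why prevalence and pathway position matter when applying a diagnostic paper.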
Example:
Sensitivity 92%, 95% CI 88% to 96% means the best estimate is 92%, but the true sensitivity could plausibly be as low as 88%. In a time-critical rule-out pathway, ask whether that lower bound is safe enough for the acceptable miss rate.
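To see how the width of such an interval depends on the number of diseased patients studied, here is a sketch using the Wilson score interval. Wilson is one common method for a proportion; the source does not state which method produced its interval, and the counts below (92 of 100 and 920 of 1,000 diseased patients testing positive) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


small_lower, small_upper = wilson_interval(92, 100)
large_lower, large_upper = wilson_interval(920, 1000)

print(f"92/100:   {small_lower:.1%} to {small_upper:.1%}")
print(f"920/1000: {large_lower:.1%} to {large_upper:.1%}")
```

Both hypothetical studies report 92% sensitivity, but the smaller one has a much lower bound. For a rule-out pathway, that lower bound, not the point estimate, is the figure to compare against the acceptable miss rate.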
Management in the Emergency Department
For this topic, “management” means how to manage a paper, a result, or an exam question.
Step-by-step critical appraisal approach
Step 1: Check validity before numbers
Always start with internal validity.
For randomised controlled trials, ask:
- Was randomisation truly random?
- Was allocation concealed?
- Were groups similar at baseline?
- Were patients, clinicians and outcome assessors blinded where possible?
- Was follow-up complete?
- Was analysis by intention to treat?
- Was the primary outcome prespecified?
For observational studies, ask:
- How were participants selected?
- Could selection bias explain the result?
- Were important confounders measured and adjusted for?
- Was exposure measured accurately?
- Was outcome assessment objective and complete?
For diagnostic studies, ask:
- Was there an appropriate reference standard?
- Did all or most patients receive the same reference standard?
- Was interpretation blinded?
- Was the study population representative of ED practice?
- Was there spectrum bias or verification bias?
Step 2: Identify the effect estimate
Do not focus first on the p value. Find the actual result:
- mean difference
- risk difference
- relative risk
- odds ratio
- hazard ratio
- sensitivity, specificity or likelihood ratios
Step 3: Assess precision
Use the confidence interval.
- Does it cross the null?
- Is it narrow or wide?
- Does it include clinically important benefit or harm?
Step 4: Look for bias and confounding
| Problem | Meaning | Example in EM research |
|---|---|---|
| Selection bias | Groups differ because of how patients were chosen | Convenience sampling of low-risk chest pain patients in office hours only |
| Performance bias | Groups receive different co-interventions apart from the study intervention | One sedation group gets more senior supervision |
| Detection bias | Outcome assessment differs between groups | Unblinded assessor rates pain scores |
| Attrition bias | Loss to follow-up differs between groups | More missing 30-day outcomes in one arm |
| Reporting bias | Selective reporting of favourable outcomes | Paper highlights positive secondary outcomes after a neutral primary outcome |
| Verification bias | Not all patients receive the reference standard | Only test-positive patients get CT or angiography |
| Spectrum bias | Study population is unrepresentative | Diagnostic rule tested only in obvious disease and obvious non-disease |
| Confounding | Third factor distorts association | Sicker patients preferentially receive a treatment in a cohort study |
Important distinction:
- Bias is systematic error
- Confounding is distortion by another variable
- Chance is random variation
Step 5: Decide whether it should change practice
Even a valid statistically significant result may not justify a change in ED practice unless:
- the effect is clinically important
- benefits outweigh harms
- the outcome is patient-centred
- the intervention is feasible in an NHS ED
- the population resembles your patients
- the result fits with wider evidence and guidance
Immediate versus later care
Immediate exam response:
- state significance correctly
- comment on precision
- identify obvious bias
- avoid overclaiming
Later, fuller appraisal:
- review protocol and prespecified outcomes
- check sample size calculation
- compare absolute and relative effects
- look for systematic reviews and guideline context
- consider implementation, cost and harms
Disposition, Referral and Follow-Up
For a paper rather than a patient, disposition means what you do with the evidence.
- Adopt cautiously if the study is valid, effect clinically important, precision acceptable, and findings consistent with wider evidence
- Do not change practice on the basis of a single biased or underpowered study
- Escalate to local governance, guideline or specialty discussion before implementing major pathway changes
- In exams, conclude with a cautious practice statement rather than a binary yes or no
Safe conclusion phrases:
- “This study suggests possible benefit, but limitations in precision and risk of bias mean it is insufficient on its own to change practice.”
- “The result is statistically significant and reasonably precise, but I would still consider clinical importance, harms and applicability to our ED population.”
- “This is a neutral study rather than proof of no effect, because the confidence interval still includes clinically important benefit and harm.”
Special Groups
This is not a patient-group topic in the usual sense, but applicability often differs across populations.
Paediatrics
- Adult evidence may not apply to children
- Decision rules and diagnostic thresholds often differ
- Small paediatric studies may be underpowered
Pregnancy
- Pregnant patients are often excluded from trials
- External validity may therefore be poor
- Diagnostic pathways may differ because of imaging and risk considerations
Older adults
- Frailty, multimorbidity and polypharmacy may reduce applicability of trial results
- Outcomes important to older adults may differ from trial endpoints
Immunosuppressed or complex patients
- Often under-represented in trials
- Diagnostic test performance may differ because disease spectrum differs
Exam phrase:
“External validity is limited if important ED subgroups such as older adults, pregnant patients or immunosuppressed patients were excluded or under-represented.”
Common Pitfalls
- Equating statistical significance with clinical importance
- Saying a non-significant result proves no effect
- Ignoring confidence interval width
- Forgetting the null value differs by measure
- Assuming a large sample automatically means a trustworthy study
- Ignoring absolute risk reduction and focusing only on relative effects
- Accepting subgroup findings uncritically
- Confusing association with causation in observational studies
- Failing to distinguish bias from confounding
- Ignoring missing data and loss to follow-up
- Using sensitivity and specificity without considering prevalence, likelihood ratios and pathway use
- Overlooking whether the primary outcome was actually positive
FRCEM and MRCEM Exam Tips
What the examiner wants to hear
A concise, structured interpretation. For most questions, include:
- what the p value means
- what the confidence interval means
- whether it crosses the null
- what the width says about precision
- whether the effect is clinically important
- whether bias or confounding could explain the finding
- whether the result applies to UK ED practice
Model OSCE answer stem
“The study reports a statistically significant result because the 95% confidence interval does not cross the null value. However, significance alone is not enough. I would also consider the size of effect, the width of the confidence interval, whether the primary outcome was prespecified and positive, and the risk of bias or confounding. If the interval is wide, precision is limited. Even if the result is statistically significant, the effect may be clinically trivial or not applicable to our ED population, so this would not automatically justify a change in practice.”
High-yield one-line rules
- Never interpret the p value alone
- Confidence intervals are usually more informative than p values
- The null value is 0 for differences and 1 for ratios
- Wide confidence interval means imprecision
- Non-significant does not mean no effect
- Statistical significance does not prove causation
- Bias can invalidate a precise significant result
- Primary outcome matters more than secondary outcomes
- Absolute effects matter more than relative headlines
Unsafe versus safe phrases
| Unsafe phrase | Safer phrase |
|---|---|
| There is no effect | The study did not demonstrate a statistically significant difference |
| The treatment works | The results are compatible with benefit, subject to study validity and bias |
| The null hypothesis is false | The data are relatively incompatible with the null hypothesis |
| This proves equivalence | This does not show a significant difference; equivalence requires an appropriate design and margin |
| The result is important because p<0.05 | The result is statistically significant, but clinical importance depends on effect size, precision and harms |
How This Appears in SBA Questions
Typical question stems
- “A trial reports RR 0.82, 95% CI 0.68 to 0.99. Which is the best interpretation?”
- “A study finds p=0.08 for the primary outcome. What is the most appropriate conclusion?”
- “Which statement about a 95% confidence interval is correct?”
- “Which bias is most likely if only test-positive patients receive the reference standard?”
- “A large observational study shows a statistically significant association. What is the main limitation?”
Key discriminator clues
- Crossing the null usually means not statistically significant in a matching two-sided analysis
- Wide confidence interval means uncertainty, not reassurance
- Observational association does not prove causation
- Diagnostic studies need reference standard and verification checks
- Primary outcome outweighs post-hoc subgroup excitement
Common wrong answer traps
- Interpreting p as the probability the null is true
- Calling a non-significant study “negative” without checking the confidence interval
- Ignoring absolute risk reduction
- Assuming adjustment removes all confounding
- Accepting selective secondary outcomes as definitive
Mini SBA examples
Question 1
A treatment trial reports an odds ratio for admission of 0.74 with 95% CI 0.52 to 1.05. Which is the best interpretation?
- A. The treatment definitely reduces admission
- B. The result is statistically significant
- C. The study did not show a statistically significant reduction in admission
- D. There is no possible benefit
Best answer: C.
Why: The CI crosses 1, so the result is not statistically significant in the usual two-sided interpretation. The interval still includes possible benefit.
Question 2
A diagnostic study reports sensitivity 97% with 95% CI 89% to 99%. Which is the most appropriate comment?
- A. The test is safe to rule out disease in all settings
- B. The lower confidence limit may still be too low for a time-critical rule-out pathway
- C. The result is invalid because sensitivity has no p value
- D. The test is superior because sensitivity is above 95%
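Best answer: B.
Why: The point estimate is high, but the interval is compatible with a true sensitivity as low as 89%. Whether that is acceptable depends on the condition, the acceptable miss rate and how the test is used within the pathway.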
Question 3
An observational study finds a statistically significant association between early antibiotics and lower mortality in sepsis. Which limitation most threatens causal interpretation?
- A. Type I error only
- B. Confounding by indication
- C. Confidence interval width only
- D. Lack of a p value
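Best answer: B.
Why: In an observational study, clinicians decide who receives early antibiotics, so treated and untreated patients may differ in illness severity and other prognostic factors. This confounding by indication can produce an association that is not causal.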
Key Takeaways
- A p value tells you how compatible the data are with the null hypothesis, not whether the treatment works or matters clinically.
- A 95% confidence interval shows direction, size and precision of the estimate.
- For differences, the null value is 0. For ratios, the null value is 1.
- If a confidence interval crosses the null, the result is usually not statistically significant in a matching two-sided analysis.
- Wide confidence intervals mean imprecision and uncertainty.
- Non-significant does not mean no effect, no harm, or equivalence.
- Statistical significance is not the same as clinical significance.
- Bias can make a result wrong even if p is very small.
- Confounding is especially important in observational studies.
- Primary outcomes and prespecified analyses matter more than secondary or post-hoc findings.
- Always look at absolute effects, harms and applicability to NHS ED practice.
- In the exam, use the triad: effect size, confidence interval, bias.
Further Reading
- NICE. Developing NICE guidelines: the manual.
- RCEM Learning and RCEM examination resources on critical appraisal and evidence-based medicine.
- SIGN 50. A guideline developer’s handbook.
- CEBM, University of Oxford. Critical appraisal tools and evidence resources.
- CONSORT 2010 Statement for reporting randomised trials.
- STARD 2015 Statement for reporting diagnostic accuracy studies.
- STROBE Statement for reporting observational studies.
