Quantitative Methods · Hypothesis Testing · LO 1 of 3
Why do statisticians reject hypotheses instead of accepting them?
Understand how hypothesis testing uses sample evidence to make decisions under uncertainty, and why some errors are more acceptable than others.
⏱ 8min-15min
·
3 questions
·
LOW PRIORITY · UNDERSTAND
Why this LO matters
Understand how hypothesis testing uses sample evidence to make decisions under uncertainty, and why some errors are more acceptable than others.
INSIGHT
You cannot prove a hypothesis true with sample data.
You can only gather evidence against it. Hypothesis testing works like a trial in which the null hypothesis is the defendant. We assume it is true unless the sample evidence contradicts it strongly enough. If the evidence is convincing, we reject the null. If it is not convincing enough, we fail to reject it. That is not the same as accepting it. The alternative hypothesis is what we suspect, but we can never prove it with certainty. We can only accumulate evidence in its favour by rejecting the null.
The six parts of every hypothesis test
Think of hypothesis testing like a courtroom. The defendant (the null hypothesis) is presumed innocent until proven guilty. The prosecution (the sample data) presents evidence. The judge (the significance level) sets the standard of proof required. The verdict is binary: guilty (reject) or not proven (fail to reject). "Not proven" is never the same as "innocent beyond doubt."
This courtroom has six moving parts that work together every time.
Hypothesis Testing Framework
1
Null hypothesis (H₀). The statement you assume is true unless the sample proves otherwise. It always contains an equality sign (=, ≥, or ≤). You either reject it or fail to reject it, never "accept" it.
2
Alternative hypothesis (Hₐ). The condition you suspect to be true instead. It is mutually exclusive with the null and together with the null covers all possible values. Use it to identify whether a test is one-tailed or two-tailed.
3
Test statistic. A number calculated from your sample measuring how far the sample result strays from what the null hypothesis claims. The larger this number in absolute value, the stronger the evidence against the null. Different hypotheses use different test statistics (t, z, F, chi-square).
4
Significance level (α). The probability you are willing to accept of rejecting a true null hypothesis. Typical values: 5%, 1%, or 10%. Its complement, (1 − α), is the confidence level.
5
Critical value. The boundary that separates "reject the null" from "fail to reject the null." If your calculated test statistic is more extreme than the critical value, you reject. Otherwise, you do not.
6
Type I error and Type II error. Type I (probability = α) is rejecting a true null. Type II (probability = β) is failing to reject a false null. These errors trade off: lowering α raises β. The power of a test is (1 − β), the probability of correctly rejecting a false null.
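The six parts above can be wired together in a short sketch. This is an illustrative two-tailed z-test with made-up numbers (the statistic, α, and hypotheses are assumptions for the example, not from the text):

```python
from statistics import NormalDist

# Illustrative two-tailed z-test of H0: mu = 0 vs Ha: mu != 0.
alpha = 0.05            # significance level, chosen before the test
test_statistic = 2.31   # hypothetical value calculated from a sample

# Critical value: alpha is split across both tails of the distribution.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

if abs(test_statistic) > critical_value:
    decision = "reject H0"           # Type I error risk here equals alpha
else:
    decision = "fail to reject H0"   # never "accept H0"

print(decision)  # reject H0, since 2.31 > 1.96
```

Note how the decision vocabulary mirrors the framework: the code can print "reject" or "fail to reject," but there is no branch that "accepts" the null.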
One-tailed versus two-tailed tests
The shape of the null hypothesis determines the test direction.
If the null contains only an equals sign (H₀: µ = 5), the alternative is "not equal" (Hₐ: µ ≠ 5). You conduct a two-tailed test and split the significance level between both tails of the distribution.
If the null contains ≥ or ≤ (H₀: µ ≥ 5 or H₀: µ ≤ 5), the alternative points in one direction. You conduct a one-tailed test and the entire significance level sits in one tail.
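A quick way to see the tail split is to pull critical values from the standard normal, used here purely for illustration (t critical values follow the same logic but also depend on degrees of freedom):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal, stdlib only

# Two-tailed: alpha is split between both tails -> critical values +/-1.96.
two_tailed = z.inv_cdf(1 - alpha / 2)

# One-tailed: the entire alpha sits in one tail -> critical value 1.645.
one_tailed = z.inv_cdf(1 - alpha)

print(round(two_tailed, 3), round(one_tailed, 3))  # 1.96 1.645
```

The one-tailed critical value is closer to zero: with all 5% in one tail, less extreme evidence suffices to reject, but only in the suspected direction.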
FORWARD REFERENCE
Student's t-distribution, what you need for this LO only
A bell-shaped curve, symmetrical around zero, used to find critical values when testing claims about population means. It has fatter tails than the normal distribution and depends on a parameter called degrees of freedom (usually n − 1). For this LO, you only need to recognise that critical values come from this distribution using degrees of freedom and significance level. Full treatment: Quantitative Methods, Module 2.
→ Quantitative Methods
FORWARD REFERENCE
p-value, what you need for this LO only
The smallest significance level at which you could reject the null hypothesis given your calculated test statistic. A smaller p-value means stronger evidence against the null. For this LO, you only need to know: reject H₀ when the p-value is less than α. Full treatment: Quantitative Methods, Module 3.
→ Quantitative Methods
Worked examples: putting the framework into practice
The three examples below move from the most common conceptual question (stating hypotheses) through to applying the decision rule two ways. Read through them in order. Each one builds on the last.
Worked Example 1
Stating the hypotheses correctly
Priya Nair is a junior analyst at Meridian Asset Management. She suspects the firm's emerging-markets fund has produced mean monthly excess returns below the advertised benchmark of 0.5%. Her supervisor asks her to state the null and alternative hypotheses before any data is collected.
🧠Thinking Flow — stating null and alternative hypotheses
The question asks
Which hypothesis contains the equality sign, and which direction should the alternative point?
Key concept needed
Null hypothesis structure. The suspected or hoped-for condition belongs in the alternative hypothesis, not the null.
Step 1, Recognise the wrong approach
Many candidates write H₀: µ < 0.5% because Priya suspects returns have fallen. This reverses the framework. The null is what you assume true until evidence overthrows it. The suspected condition belongs in the alternative.
Step 2, Build the correct hypotheses
The alternative is Hₐ: µ < 0.5% (what Priya suspects).
The null must cover everything else: H₀: µ ≥ 0.5%.
The null and alternative must be mutually exclusive (no overlap) and collectively exhaustive (cover all possible values). A value cannot simultaneously satisfy µ ≥ 0.5% and µ < 0.5%. Every possible value of µ falls in one or the other.
Step 3, Sanity check
Confirm the null contains an equality sign. It does: the ≥ sign includes equality. Confirm mutual exclusivity. Confirm exhaustiveness. All three checks pass.
✓ Answer: H₀: µ ≥ 0.5% versus Hₐ: µ < 0.5%. This is a one-tailed (left-tailed) test because the alternative points in a single direction.
Worked Example 2
Type I error, Type II error, and their trade-off
Carlos Mendoza operates quality control at Velanta Pharmaceuticals. A machine doses active ingredients into capsules. The target dose is exactly 50 mg. Carlos runs a hypothesis test each shift to determine whether the machine needs recalibration. He sets his significance level at 5%.
🧠Thinking Flow — identifying error types and the trade-off
The question asks
What does each type of error mean in this context, and what happens when Carlos lowers his significance level to 1%?
Key concept needed
Type I and Type II errors. Candidates frequently swap the two definitions. The anchor: Type I is a false alarm (you act when you should not have). Type II is a missed signal (you stay still when you should have acted).
Step 1, Apply the definitions to the scenario
Carlos's null hypothesis is H₀: µ = 50 mg (the machine is dosing correctly).
A Type I error means Carlos rejects H₀ when it is actually true. In practice: Carlos stops the production line and recalibrates even though the machine was fine. The probability of this error equals α. At α = 5%, there is a 5% chance of this false alarm.
A Type II error means Carlos fails to reject H₀ when it is actually false. In practice: the machine really is misdosing, but the test does not catch it. The probability of this error is β. The power of the test, (1 − β), is the probability of correctly catching the problem.
Step 2, Apply the trade-off logic
Carlos's supervisor suggests tightening the significance level to α = 1% to reduce unnecessary production stoppages.
Lowering α reduces the probability of a Type I error. The rejection region shrinks. The critical value moves further from zero. The test demands stronger evidence before rejecting H₀.
The consequence: with a tighter rejection region, some false nulls will no longer be rejected. β rises. The power of the test, (1 − β), falls. Carlos will stop the line less often when it is fine, but he will also miss more true dosing problems.
Step 3, Sanity check
The direction of every effect must run consistently in one direction: α down, rejection region smaller, harder to reject H₀, more false nulls survive, β up, power (1 − β) down. The chain holds. The only way to reduce both error probabilities simultaneously is to increase the sample size n, which provides more information without sacrificing either error rate.
✓ Answer: Lowering α from 5% to 1% decreases the probability of a Type I error and increases the probability of a Type II error (decreases the power of the test). The two error probabilities trade off in opposite directions at any fixed sample size.
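The trade-off in the answer can be checked by simulation. The sketch below assumes a simplified z-test with known σ and made-up dosing numbers (the effect size, σ, n, and trial count are all illustrative, not from the text):

```python
import random
from statistics import NormalDist

random.seed(7)  # fixed seed so the sketch is reproducible

def reject_rate(true_mu, mu0=50.0, sigma=1.0, n=25, alpha=0.05, trials=4000):
    # Fraction of simulated samples in which a two-tailed z-test rejects
    # H0: mu = mu0 (a simplified stand-in for Carlos's dosing test).
    se = sigma / n ** 0.5
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = sum(abs((random.gauss(true_mu, se) - mu0) / se) > crit
               for _ in range(trials))
    return hits / trials

# H0 true (machine doses exactly 50): rejection rate estimates alpha.
t1_05 = reject_rate(true_mu=50.0, alpha=0.05)   # close to 0.05
t1_01 = reject_rate(true_mu=50.0, alpha=0.01)   # close to 0.01

# H0 false (true dose drifted to 50.4): rejection rate estimates power.
pow_05 = reject_rate(true_mu=50.4, alpha=0.05)
pow_01 = reject_rate(true_mu=50.4, alpha=0.01)  # lower: beta has risen

print(f"Type I: {t1_05:.3f} vs {t1_01:.3f}; power: {pow_05:.3f} vs {pow_01:.3f}")
```

Tightening α from 5% to 1% pushes both rates down together: fewer false alarms, but also fewer correct detections, exactly the chain in the sanity check.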
Worked Example 3
Statistical significance and the decision rule
Tomás Reyes is a risk analyst at Solvara Capital. He tests whether the mean daily return on a new algorithmic strategy equals zero. He sets α = 5% and uses a two-tailed test. His sample of 36 trading days produces a calculated t-statistic of −2.45. The critical values at the 5% significance level with 35 degrees of freedom are ±2.030. The p-value for his test statistic is 1.97%.
🧠Thinking Flow — applying the decision rule two ways
The question asks
Should Tomás reject H₀, and how do both the critical-value method and the p-value method lead to the same conclusion?
Key concept needed
Statistical significance and the decision rule. Candidates sometimes reject using the p-value when it is larger than α, or fail to reject when the test statistic lies outside the critical values. The two methods always agree if applied correctly.
Step 1, Apply the critical-value method
The decision rule for a two-tailed test: reject H₀ if the calculated test statistic is more extreme than the critical value in either direction.
Tomás's calculated t-statistic: −2.45. Critical values: ±2.030.
Is −2.45 more extreme than −2.030? Yes. −2.45 < −2.030.
The test statistic falls in the left rejection region. Reject H₀.
Step 2, Apply the p-value method
The p-value is 1.97%. Compare it to α: 1.97% < 5%.
Because the p-value is smaller than α, reject H₀. The evidence against the null is stronger than the threshold Tomás set.
Step 3, Sanity check
Both methods must agree. Critical-value method: reject. P-value method: reject. They agree.
A p-value of 1.97% means Tomás could have set his significance level as low as 1.97% and still rejected. Since his chosen α of 5% is more generous than 1.97%, rejection holds easily. If the p-value had been 6%, it would exceed α = 5%, and both methods would have led to "fail to reject." The two methods are two expressions of the same underlying logic.
✓ Answer: Tomás rejects H₀. Both the critical-value method (|−2.45| > 2.030) and the p-value method (1.97% < 5%) lead to the same conclusion. The result is statistically significant at the 5% level.
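Both decision methods can be encoded directly with Tomás's numbers. The critical value and p-value are taken as given from the example (they come from a t-table with 35 degrees of freedom):

```python
# Numbers taken as given from the example (t-table, 35 df, alpha = 5%).
t_calc = -2.45
t_crit = 2.030
p_value = 0.0197
alpha = 0.05

# Method 1: critical value. Reject if |t| is more extreme than t_crit.
reject_by_critical = abs(t_calc) > t_crit   # 2.45 > 2.030 -> True

# Method 2: p-value. Reject if the p-value is smaller than alpha.
reject_by_p = p_value < alpha               # 0.0197 < 0.05 -> True

# Applied correctly, the two methods always agree.
print(reject_by_critical, reject_by_p)  # True True
```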
⚠️
Watch out for this
The "Accept the Null" trap.
A candidate who concludes "we accept H₀" when the test statistic does not fall in the rejection region has stated a conclusion the framework never permits, and will lose marks on any question that offers "accept the null" as an answer choice alongside "fail to reject."
The correct conclusion is always "fail to reject H₀," which acknowledges that the evidence was insufficient to overturn the default assumption, not that the null has been proven true.
Candidates make this error because they treat hypothesis testing like a binary verdict, guilty or not guilty, when the framework is asymmetric: strong evidence can convict (reject), but weak evidence does not acquit (accept); it only means the case was not proven.
Before writing any conclusion, check: did I reject, or did I merely lack enough evidence to reject? If the latter, write "fail to reject," never "accept."
🧠
Memory Aid
ACRONYM
P
Power (1 − β) — The probability of correctly rejecting a false null. Higher is better. Increasing sample size raises it.
A
Alpha (α) — The significance level. The probability of a Type I error. Set before the test, not after.
N
Null hypothesis (H₀) — Always contains an equality sign (=, ≥, or ≤). It is what you assume true until evidence says otherwise.
T
Two error types — Type I = false alarm (reject a true null). Type II = missed signal (fail to reject a false null). They trade off at fixed sample size.
Use PANT to recall the four components of any hypothesis test. When a question asks what "fail to reject" means, run PANT: you have not proven the Null true; you only lacked the evidence (Alpha was not breached) to trigger a rejection. Never write "accept." Write "fail to reject" every time.
Practice Questions · LO1
3 Questions LO1
Score: — / 3
Q 1 of 3 — REMEMBER
Which of the following correctly defines a Type I error in hypothesis testing?
CORRECT: C
CORRECT: C, A Type I error occurs when the null hypothesis is actually true, but the test statistic falls in the rejection region by chance, leading you to reject it anyway. The probability of this error is exactly α, the significance level you set before the test.
Why not A? This describes a Type II error, not Type I. A Type II error is a missed signal: the null is false (something real is happening), but your test fails to detect it. The probability of a Type II error is β. Candidates frequently swap the two definitions under pressure. The anchor: Type I is a false alarm (you acted when you should not have). Type II is a missed signal (you stayed still when you should have acted).
Why not B? Rejecting a false null hypothesis is the correct outcome, not an error at all. This is what the test is designed to do. Its probability is (1 − β), known as the power of the test. A powerful test does this frequently. Calling a correct rejection an error is the reverse of the framework's logic.
---
Q 2 of 3 — UNDERSTAND
An analyst wants to test whether a portfolio's mean return differs from a benchmark. She states H₀: µ = 4% versus Hₐ: µ ≠ 4%. Which of the following best describes why the null hypothesis must contain an equality sign?
CORRECT: B
CORRECT: B, The null hypothesis must specify a single, precise value (or boundary) so that a test statistic can be calculated measuring how far the sample deviates from it. The equality sign anchors that reference point. Without it, you cannot compute a test statistic because there is no fixed value to measure distance from.
Why not A? This reverses the requirement. Null and alternative hypotheses must be mutually exclusive, they must never overlap. The equality sign does not create overlap. It creates a boundary. The null gets one side (including the boundary value), and the alternative gets the other side. Together they are collectively exhaustive, covering all possible values of µ.
Why not C? One-tailed tests still require an equality sign in the null. For example, H₀: µ ≤ 4% contains ≤, which includes equality. A null stated as a strict inequality alone (H₀: µ < 4%) would leave the value µ = 4% uncovered, violating collective exhaustiveness. The rule holds for both test directions without exception.
---
Q 3 of 3 — APPLY
A clinical researcher at Delvance Biotech tests whether a new compound reduces average recovery time from 14 days. He sets α = 5% for a one-tailed (left-tailed) test. After collecting data, he calculates a test statistic that does not fall in the rejection region. Which conclusion is correct?
CORRECT: B
CORRECT: B, "Fail to reject" is the only correct language when a test statistic does not reach the rejection region. The null hypothesis is the default assumption. Insufficient evidence does not prove it true. It simply means the sample did not provide strong enough grounds to overturn it. The framework is asymmetric: strong evidence can lead to rejection, but weak evidence only means the case was not proven.
Why not A? This is the exact error the framework forbids. Saying "accept the null" implies the null has been proven true, which sample evidence can never do. Hypothesis testing cannot confirm the null. It can only fail to reject it. On exam questions, whenever "accept the null hypothesis" appears as an answer option, it is always wrong. The correct phrase is "fail to reject the null hypothesis," which preserves the uncertainty the framework requires.
Why not C? The framework makes no provision for accepting the alternative hypothesis when the test statistic falls outside the rejection region. If the test statistic does not exceed the critical value, neither hypothesis is "accepted." The only two possible conclusions are "reject H₀" (when the statistic is in the rejection region) or "fail to reject H₀" (when it is not). Proximity to the critical value has no formal meaning in the decision rule.
---
Glossary
Null hypothesis
The default assumption you make about a population parameter, the condition you assume is true unless sample evidence proves otherwise. It always contains an equality sign (=, ≥, or ≤). Think of it as the "defendant is innocent" baseline in a trial: the burden of proof lies with the evidence, not the assumption.
Alternative hypothesis
The condition you suspect to be true instead of the null hypothesis. It is what you hope to find support for by collecting data. Unlike the null, it contains a strict inequality (<, >, or ≠). Think of it as the prosecution's claim in a trial: "the defendant is guilty."
Test statistic
A number you calculate from your sample data that measures how far the sample result deviates from what the null hypothesis claims. The larger this number in absolute value, the stronger the evidence against the null. It is like a distance marker: the further your sample lands from the null's claim, the more suspicious the null becomes.
Significance level
The probability you are willing to accept of making a false alarm, rejecting a true null hypothesis by chance. Commonly set at 5%, 1%, or 10% before any data is collected. It is the threshold you decide on at the start, like setting a confidence bar for evidence before a trial begins.
Confidence level
The complement of the significance level, expressed as a percentage. If the significance level is 5%, the confidence level is 95%. It represents the probability that the test will not reject a true null hypothesis (no Type I error) if you repeat the test many times.
Critical value
The boundary point in the distribution of the test statistic that separates the rejection region from the non-rejection region. If your calculated test statistic is more extreme than this value, you reject the null. It is like the finish line in a race: you either cross it (reject) or you do not (fail to reject).
Type I error
Rejecting a true null hypothesis, a false alarm. The probability of a Type I error equals the significance level α. Think of a smoke detector going off when there is no fire.
Type II error
Failing to reject a false null hypothesis, a missed signal. The probability of a Type II error is β. Think of a smoke detector staying silent during a real fire.
Power of a test
The probability of correctly rejecting a false null hypothesis, equal to (1 − β). A test with high power is good at detecting real effects. Increasing sample size increases power.
Two-tailed test
A hypothesis test where the alternative hypothesis is non-directional (Hₐ: µ ≠ value). The significance level is split equally between both tails of the distribution. Used when you want to detect a difference in either direction.
One-tailed test
A hypothesis test where the alternative hypothesis points in one direction (Hₐ: µ < value or Hₐ: µ > value). The entire significance level sits in one tail. Used when you have a specific directional suspicion.
Statistical significance
The conclusion that a test result is unlikely to have occurred by chance under the null hypothesis. A result is statistically significant when the test statistic exceeds the critical value (or when the p-value is less than α). Statistical significance does not necessarily imply practical importance.
p-value
The smallest significance level at which you could reject the null hypothesis given your calculated test statistic. A p-value of 3% means you could have set α as low as 3% and still rejected H₀. Compare directly to α: reject if the p-value is less than α.
LO 1 Done ✓
Ready for the next learning objective.
🔒 PRO Feature
How analysts use this at work
Real-world applications and interview questions from top firms.
Quantitative Methods · Hypothesis Testing · LO 2 of 3
Why do some tests convict the innocent, while others let the guilty walk free?
Master the logic of hypothesis testing: when you reject and when you fail to reject, what errors you can make, and how likely each one actually is.
⏱ 8min-15min
·
6 questions
·
HIGH PRIORITY · APPLY · 🧮 Calculator
Why this LO matters
Master the logic of hypothesis testing: when you reject and when you fail to reject, what errors you can make, and how likely each one actually is.
INSIGHT
A test cannot be perfect.
You can make two completely different errors: reject something true, or fail to reject something false. The exam expects you to name which error you made, estimate its probability, and understand that you cannot minimise both errors at once. Tightening the net to catch one lets the other slip away.
The test's power measures your ability to catch a real effect. Everything in this LO hinges on that trade-off.
How hypothesis testing works: the six-step process
Think about how a criminal trial works. The jury starts with a presumption of innocence. They do not start from "probably guilty." The prosecutor must present evidence strong enough to overturn that default. If the evidence is weak, the defendant walks free even if they are actually guilty. If the evidence is strong, the jury convicts even if there is a small chance of error.
Hypothesis testing follows exactly the same logic. The null hypothesis is the presumption of innocence. The alternative hypothesis is the prosecution's claim. Your sample data is the evidence. The significance level is the threshold of doubt you are willing to tolerate.
The wrong approach is to work backwards: look at the data first, then decide what you are testing. Every step must be committed to before you calculate anything.
The Six Steps of Hypothesis Testing
1
State the hypotheses. Write two mutually exclusive statements: the null (default assumption) and the alternative (what you suspect). The alternative uses "greater than," "less than," or "not equal to," never both directions at once.
2
Identify the appropriate test statistic. Choose the formula based on what you are testing and what information you have. Is the population standard deviation known? Are the samples independent or paired?
3
Specify the level of significance. State α. This is your Type I error rate, usually 0.05 or 0.01. This number determines where you draw the rejection boundary.
4
State the decision rule. Calculate the critical value(s) from the distribution table using α and degrees of freedom. Write the exact condition for rejection before you calculate anything.
5
Calculate the test statistic. Plug your sample data into the formula from Step 2. This produces one number: your calculated test statistic.
6
Make a decision. Compare the calculated statistic (Step 5) to the critical value(s) (Step 4). If the statistic falls in the rejection region, reject the null. Otherwise, fail to reject. Write a plain-English conclusion.
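The six steps can be condensed into a sketch for a single-mean t-test. All sample figures below are hypothetical, and the critical value is passed in as an argument because in practice it comes from a t-table using α and n − 1 degrees of freedom:

```python
from math import sqrt

def one_sample_t_test(sample_mean, mu0, s, n, t_crit, tail):
    # Steps 1-4 happen before any calculation; here they arrive as
    # arguments (the hypotheses are implied by `tail`, and alpha is baked
    # into t_crit, looked up in a t-table with n - 1 degrees of freedom).
    t = (sample_mean - mu0) / (s / sqrt(n))            # Step 5
    if tail == "two":                                  # Step 6
        reject = abs(t) > t_crit
    elif tail == "right":
        reject = t > t_crit
    else:  # "left"
        reject = t < -t_crit
    return t, ("reject H0" if reject else "fail to reject H0")

# Hypothetical inputs: testing H0: mu = 10 vs Ha: mu != 10 with n = 36,
# sample mean 10.4, s = 1.2, at alpha = 5% (two-tailed t_crit = 2.030).
t, decision = one_sample_t_test(10.4, 10.0, 1.2, 36, 2.030, "two")
print(round(t, 2), decision)  # 2.0 fail to reject H0
```

The calculated statistic of 2.0 sits just inside the critical value of 2.030, so the evidence falls short of the rejection threshold.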
One-tailed vs. two-tailed: which direction are you looking?
Most people assume every test is two-sided. That is the wrong default. The direction of the test comes directly from the research question, not from personal preference.
Test Direction and Rejection Regions
1
Two-tailed test. The alternative hypothesis uses "not equal to" (≠). The rejection region splits equally across both tails. Use this when the question asks "is it different?" with no direction specified. Critical values always come in pairs: ±1.96, ±2.069, etc.
2
One-tailed test, right tail (greater than). The alternative hypothesis uses ">". The entire rejection region sits in the right tail. Use this when the question asks "is it higher?" or "does it exceed the target?" Critical value is a single positive number.
3
One-tailed test, left tail (less than). The alternative hypothesis uses "<". The entire rejection region sits in the left tail. Use this when the question asks "is it lower?" or "does it fall short?" The table gives a single positive number; negate it, and reject when the calculated statistic is below that negative critical value.
Placing the hypotheses correctly
The most common misconception is reversing these. The null is the skeptic's default. The alternative is what the researcher suspects.
Hypothesis Assignment Rules
1
The null hypothesis (H₀). States "no change" or "no difference." It is the presumption of innocence. It uses =, ≥, or ≤. Example: "The fund's mean return equals 1.4% per month."
2
The alternative hypothesis (Hₐ). States the research claim. It always contains the direction signal: ">", "<", or "≠". Example: "The fund's mean return is less than 1.4% per month."
3
Signal word rule. In the question, look for "greater," "less," "higher," "lower," "different," "changed," or "outperforming." These words describe the alternative hypothesis, never the null. If the question says "test whether the fund underperformed," then Hₐ states the return is less than the target.
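The signal-word rule can be caricatured as a lookup table. This is a toy illustration, not a real parser; the word list is a made-up sample for the example:

```python
# Toy illustration of the signal-word rule; the word list is a made-up
# sample, not an exhaustive parser.
SIGNALS = {
    "greater": ">", "higher": ">", "exceed": ">", "outperform": ">",
    "less": "<", "lower": "<", "fall short": "<", "underperform": "<",
    "different": "!=", "changed": "!=",
}

def alternative_sign(question: str) -> str:
    # The signal word always describes Ha; the null takes the complement
    # (e.g. if Ha uses "<", then H0 uses ">=").
    q = question.lower()
    for word, sign in SIGNALS.items():
        if word in q:
            return sign
    raise ValueError("no signal word found")

print(alternative_sign("Test whether the fund underperformed its target"))  # <
print(alternative_sign("Test whether the mean return has changed"))         # !=
```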
Type I error, Type II error, and power
Think of a smoke detector. A Type I error is the detector screaming when there is no fire. Annoying, costly, but you are safe. A Type II error is silence while the building burns. Catastrophic. You can make the detector more sensitive to avoid missing fires, but then it screams at toast. You can make it less sensitive to avoid false alarms, but then it might miss real smoke. That trade-off is the entire story of α, β, and power.
Test Errors and Test Power
1
Type I error (false positive). You reject the null when it is true. You conclude something changed when it did not. The probability of this is exactly α. If α = 0.05, there is a 5% chance you reject a true null by random chance.
2
Type II error (false negative). You fail to reject the null when it is false. You miss a real effect. The probability is β. You do not choose β directly. It emerges from your sample size, the true effect size, and your choice of α. Small samples produce large β.
3
Power of the test. The probability of correctly rejecting a false null. Power = 1 − β. Higher power means fewer missed effects. Power increases when sample size increases, when α increases, or when the true effect is large.
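For a z-test with known σ, the power relationships in the list can be computed directly. The formula and all inputs below are illustrative assumptions (a known-sigma simplification, not a method from the text):

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed_z(effect, sigma, n, alpha):
    # Power of a right-tailed z-test (H0: mu <= mu0) when the true mean
    # exceeds mu0 by `effect`: 1 - Phi(z_alpha - effect / (sigma / sqrt(n))).
    # Illustrative known-sigma formula.
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    shift = effect / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z_alpha - shift)

base = power_right_tailed_z(effect=0.5, sigma=2.0, n=25, alpha=0.05)
more_n = power_right_tailed_z(effect=0.5, sigma=2.0, n=100, alpha=0.05)
more_alpha = power_right_tailed_z(effect=0.5, sigma=2.0, n=25, alpha=0.10)

# Larger n and larger alpha both raise power, as the list states.
print(f"n=25: {base:.2f}  n=100: {more_n:.2f}  alpha=10%: {more_alpha:.2f}")
```

Quadrupling the sample size moves power from roughly a coin flip to over 80%, which is why increasing n is the only lever that improves both error rates at once.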
Choosing the right test statistic
Matching the test to the question is a mechanical skill. The wrong test statistic produces a completely wrong answer, even if you execute the arithmetic perfectly.
Test Selection by Scenario
1
Test of a single mean. One sample, testing whether the population mean equals a specific value. Use the t-statistic: t = (X̄ − μ₀) / (s / √n) Degrees of freedom = n − 1.
2
Test of a single variance. One sample, testing whether the population variance equals a specific value. Use the chi-square statistic: χ² = (n − 1)s² / σ₀² Degrees of freedom = n − 1. Chi-square values are always positive.
3
Test of the difference between two independent means. Two separate samples from two different populations. Use the pooled t-statistic with degrees of freedom = n₁ + n₂ − 2.
4
Test of the difference between two dependent means. The same observational units measured twice, or matched pairs. Use the paired t-statistic on the differences d̄: t = (d̄ − μ_d0) / s_d̄ where s_d̄ = s_d / √n. Degrees of freedom = n − 1.
5
Test of the equality of two variances. Two samples, testing whether their population variances are equal. Use the F-statistic: F = s₁² / s₂² (larger variance always in the numerator) Degrees of freedom = (n₁ − 1, n₂ − 1).
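Each formula in the selection table can be exercised with hypothetical figures (all numbers below are made up purely to keep the arithmetic clean):

```python
from math import sqrt

# Hypothetical figures, chosen only so each formula's arithmetic is clean.

# 1. Single mean: t = (xbar - mu0) / (s / sqrt(n)), df = n - 1
t_mean = (10.5 - 10.0) / (2.0 / sqrt(16))       # = 1.0, df = 15

# 2. Single variance: chi2 = (n - 1) * s^2 / sigma0^2, df = n - 1
chi2 = (16 - 1) * 4.0 / 5.0                     # = 12.0, df = 15

# 4. Paired (dependent) means: t = dbar / (s_d / sqrt(n)), df = n - 1
t_paired = 0.75 / (1.5 / sqrt(36))              # = 3.0, df = 35

# 5. Equality of two variances: larger sample variance in the numerator
s1_sq, s2_sq = 9.0, 4.0
f_stat = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)  # = 2.25, df = (n1-1, n2-1)

print(t_mean, chi2, t_paired, f_stat)
```

Note the F-statistic convention baked into the last line: putting the larger variance in the numerator guarantees F ≥ 1, so only the right tail of the F-table is needed.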
With the framework established, let us work through each test type using a complete six-step example.
FORWARD REFERENCE
The t-distribution, what you need for this LO only
The t-distribution is a symmetric, bell-shaped distribution used when the population standard deviation is estimated from sample data. It has fatter tails than the normal distribution, and those tails shrink as sample size grows. For this LO, you only need to: look up the critical t-value in the t-table using α and degrees of freedom (n − 1 for single-sample tests, n₁ + n₂ − 2 for two-sample tests), then compare your calculated t-statistic to that critical value. Full treatment: Quantitative Methods Learning Module 2.
→ Quantitative Methods
FORWARD REFERENCE
The chi-square distribution, what you need for this LO only
The chi-square distribution is not symmetric. It skews right and takes only positive values. Its shape depends on degrees of freedom. For this LO, you only need to: look up the critical chi-square value in the chi-square table using α and degrees of freedom (n − 1), then compare your calculated statistic to that critical value. Full treatment: Quantitative Methods Learning Module 2.
→ Quantitative Methods
FORWARD REFERENCE
The F-distribution, what you need for this LO only
The F-distribution is formed by the ratio of two chi-square variables. F-values are always positive, and two sets of degrees of freedom determine its shape: one for the numerator, one for the denominator. For this LO, you only need to: look up the critical F-value using α and both degrees of freedom, then compare your calculated F-statistic to that critical value. Full treatment: Quantitative Methods Learning Module 2.
→ Quantitative Methods
Worked Example 1
Testing Whether a Fund Meets Its Stated Objective (Single Mean, One-Tailed t-Test)
Priya Menon works as a performance analyst at Northgate Capital in Singapore. The firm's flagship fund has a stated objective of delivering a mean monthly return of at least 1.4%. Priya has gathered 28 months of actual returns. The sample mean is 1.22% and the sample standard deviation is 0.73%. Using a 5% significance level, she wants to test whether the fund is genuinely underperforming its stated objective.
🧠Thinking Flow — Single mean, one-tailed t-test
The question asks
Is the fund's true mean monthly return significantly less than 1.4%? Is the gap between 1.22% and 1.4% too large to be explained by random sampling variation across 28 months?
Key concept needed
The Six Steps of Hypothesis Testing. Many candidates jump straight to calculating the test statistic, skipping the decision rule (Step 4). Without Step 4, you have no standard to compare your calculated number to, and you cannot make a valid rejection decision.
Step 1, State the hypotheses
The fund's stated objective is ≥ 1.4%. We want to know whether the fund is falling short. The researcher's suspicion ("fund is underperforming") goes in the alternative. The skeptic's default ("fund meets its objective") goes in the null.
H₀: μ ≥ 1.4% versus Hₐ: μ < 1.4%
This is a one-tailed, left-tail test. We reject only if the fund looks significantly below 1.4%.
Step 2, Identify the test statistic
One sample, testing a mean, population standard deviation unknown. Use the t-statistic:
t = (X̄ − μ₀) / (s / √n)
Degrees of freedom = n − 1 = 28 − 1 = 27.
Step 3, Specify significance level
α = 5% (given). One-tailed test, so the entire 5% sits in the left tail.
Step 4, State the decision rule
From the t-table with 27 degrees of freedom and 5% in one tail:
Critical value = 1.703.
Because this is a left-tail test, reject H₀ if the calculated t-statistic is less than −1.703.
Step 5, Calculate the test statistic
t = (1.22 − 1.40) / (0.73 / √28) = −0.18 / 0.13796 = −1.3047.
Step 6, Make a decision
The calculated t-statistic is −1.3047. The critical value is −1.703. Because −1.3047 is not less than −1.703, the statistic does not fall in the rejection region.
Fail to reject H₀. There is insufficient evidence at the 5% level to conclude that the fund's mean monthly return is less than 1.4%. The observed shortfall of 0.18 percentage points could plausibly be due to chance across 28 months.
Sanity check
The fund's sample mean (1.22%) is below target (1.40%), so the sign of the test statistic should be negative. It is (−1.3047). Direction confirmed. The magnitude (1.30) is less than the critical value (1.703), placing the result inside the non-rejection region. This makes intuitive sense: 28 months is a small sample, and a gap of 0.18 percentage points in monthly returns is modest.
✓ Answer: Fail to reject H₀. The fund's underperformance is not statistically significant at the 5% level with 28 months of data.
🧮 BA II Plus Keystrokes
`2ND``FV`
Clear all TVM registers → 0.00
`0.73``÷``28``2ND``√x``=`
Calculates s / √n = 0.73 / 5.2915 → 0.13795
`STO``1`
Stores the standard error → 0.13795
`1.22``−``1.40``=`
Calculates X̄ − μ₀ = −0.18 → −0.18
`÷``RCL``1``=`
Divides by the stored standard error → −1.3047
⚠️ Forgetting the square root. Dividing s by n instead of √n gives 0.73 / 28 = 0.02607, producing a t-statistic of −6.904. That number falls far into the rejection region and reverses the conclusion. Always divide s by √n, not by n.
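The keystroke sequence above can be cross-checked in a few lines of Python. This is a sketch for verification only, using Priya's figures from the example; the function name is my own, not part of any curriculum library.

```python
from math import sqrt

def t_stat_one_mean(xbar, mu0, s, n):
    """t = (X̄ − μ₀) / (s / √n) for a single-mean t-test."""
    se = s / sqrt(n)            # standard error of the mean; divide by √n, not n
    return (xbar - mu0) / se

# Priya's test: X̄ = 1.22, μ₀ = 1.40, s = 0.73, n = 28
t = t_stat_one_mean(1.22, 1.40, 0.73, 28)   # ≈ −1.3047
# Left-tail test at 5% with df = 27: reject only if t < −1.703.
# −1.3047 is not below −1.703, so fail to reject H₀.
```

Keeping the standard error as an explicit intermediate (`se`) mirrors the `STO 1` step on the calculator and makes the forgot-the-square-root error impossible to commit silently.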
Worked Example 2
Testing Whether a Fund's Variance Is Below a Risk Trigger (Single Variance, Chi-Square Test)
Marcus Osei is a risk officer at Windward Asset Management in Accra. The firm's equity fund uses a risk budget that flags any portfolio with variance of monthly returns above 16 (equivalent to a standard deviation of 4%). Marcus has 24 months of return data for a new fund. The sample standard deviation is 3.60%, giving a sample variance of 12.96. He tests at the 5% significance level whether the true variance is below the trigger level of 16.
🧠Thinking Flow — Single variance, one-tailed chi-square test
The question asks
Is the fund's true variance significantly less than 16? Does the sample variance of 12.96 provide strong enough evidence to confirm the fund is operating below the risk trigger?
Key concept needed
Test of a Single Variance using the chi-square statistic. A common wrong move is to set up H₀ as σ² = 16 and use a two-tailed test. The research question asks specifically whether the variance is less than 16. That direction belongs in the alternative. The correct null is H₀: σ² ≥ 16.
Step 1, State the hypotheses
The risk trigger concern is whether variance is below 16. "Less than" signals the alternative.
H₀: σ² ≥ 16 versus Hₐ: σ² < 16
Left-tail test. Reject only if the chi-square statistic is small enough.
Step 2, Identify the test statistic
Testing a single variance. Use the chi-square statistic:
χ² = (n − 1)s² / σ₀²
Degrees of freedom = n − 1 = 24 − 1 = 23.
Step 3, Specify significance level
α = 5%, one-tailed, left side.
Step 4, State the decision rule
From the chi-square table with 23 degrees of freedom and 5% in the left tail:
Critical value = 13.091.
Reject H₀ if the calculated χ² statistic is less than 13.091.
Step 5, Calculate the test statistic
χ² = (24 − 1)(12.96) / 16 = 298.08 / 16 = 18.630
Step 6, Make a decision
The calculated χ² = 18.630. The critical value is 13.091. Because 18.630 is not less than 13.091, the statistic does not fall in the left-tail rejection region.
Fail to reject H₀. There is insufficient evidence at the 5% level to conclude that the fund's true variance is below 16.
Sanity check
Chi-square values are always positive. ✓ (18.630 > 0.)
The sample variance (12.96) is indeed below the hypothesised 16, but the test statistic is not far enough into the left tail to cross the critical value. This reflects the small sample: 24 months is not enough data to distinguish a true variance of 12.96 from a true variance of 16 with high confidence.
✓ Answer: χ² = 18.630. Fail to reject H₀. The sample does not provide sufficient evidence that the fund's true variance is below 16 at the 5% level.
🧮 BA II Plus Keystrokes
`2ND``FV`
Clear all TVM registers → 0.00
`24``−``1``=`
Calculates n − 1 = 23 → 23.00
`×``12.96``=`
Multiplies 23 × 12.96 → 298.08
`÷``16``=`
Divides by σ₀² = 16 → 18.63
⚠️ Using s (3.60) instead of s² (12.96) in the numerator. Plugging in 3.60 gives χ² = 23 × 3.60 / 16 = 5.175. That number is less than 13.091, which produces the wrong conclusion (reject H₀). Always square the standard deviation before substituting into the formula.
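The same check works for the chi-square statistic. A minimal sketch using Marcus's figures (the function name is illustrative, not from any standard library):

```python
def chi2_one_variance(s2, sigma2_0, n):
    """χ² = (n − 1)s² / σ₀² for a single-variance test. Takes the VARIANCE s², not s."""
    return (n - 1) * s2 / sigma2_0

chi2 = chi2_one_variance(12.96, 16.0, 24)   # ≈ 18.63
# Left-tail test at 5% with df = 23: reject only if χ² < 13.091.
# 18.63 is not below 13.091, so fail to reject H₀.
```

Naming the first parameter `s2` is a small guard against the warning above: passing the standard deviation 3.60 instead of the variance 12.96 is the classic slip.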
Worked Example 3
Comparing Two Analysts' Forecast Accuracy (Difference Between Two Independent Means)
Sofia Reyes is a research director at Meridian Analytics in Mexico City. She has collected quarterly EPS forecast errors for two analysts: Analyst Kenji (telecom sector, 10 forecasts, mean error 0.05, standard deviation 0.10) and Analyst Amara (automotive parts sector, 15 forecasts, mean error 0.02, standard deviation 0.09). Sofia tests at the 5% level whether Kenji's mean forecast error is significantly larger than Amara's.
🧠Thinking Flow — Difference between two independent means, one-tailed pooled t-test
The question asks
Is Kenji's mean forecast error significantly greater than Amara's? Is the observed difference (0.05 − 0.02 = 0.03) too large to be explained by sampling variation?
Key concept needed
Test of the Difference Between Two Independent Means. The wrong move here is to use a paired-comparisons test. The two analysts cover different industries, and their forecasts are not matched to the same companies or periods. The samples are independent. Use the pooled variance approach.
Step 1, State the hypotheses
Sofia suspects Kenji is more biased (higher mean error). The direction belongs in the alternative.
H₀: μ_Kenji − μ_Amara ≤ 0 versus Hₐ: μ_Kenji − μ_Amara > 0
Right-tail test. Reject only if Kenji's mean is significantly higher.
Step 2, Identify the test statistic
Two independent samples, unknown but assumed equal variances. Use the pooled t-statistic:
t = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(s²_p/n₁ + s²_p/n₂)
where s²_p = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
Degrees of freedom = n₁ + n₂ − 2 = 10 + 15 − 2 = 23.
Step 3, Specify significance level
α = 5%, one-tailed, right side.
Step 4, State the decision rule
From the t-table with 23 degrees of freedom and 5% in the right tail:
Critical value = 1.714.
Reject H₀ if the calculated t-statistic is greater than 1.714.
Step 5, Calculate the test statistic
First, calculate the pooled variance:
s²_p = [(10 − 1)(0.10²) + (15 − 1)(0.09²)] / (10 + 15 − 2)
= [(9)(0.01) + (14)(0.0081)] / 23
= [0.09 + 0.1134] / 23
= 0.2034 / 23
= 0.008843
Next, calculate the standard error of the difference:
SE = √(0.008843/10 + 0.008843/15)
= √(0.0008843 + 0.0005895)
= √0.0014739
= 0.038392
Finally, calculate the t-statistic:
t = (0.05 − 0.02 − 0) / 0.038392
= 0.03 / 0.038392
= 0.7814
Step 6, Make a decision
The calculated t = 0.7814. The critical value is 1.714. Because 0.7814 < 1.714, the statistic does not fall in the rejection region.
Fail to reject H₀. There is insufficient evidence at the 5% level to conclude that Kenji's mean forecast error is greater than Amara's.
Sanity check
Kenji's mean error (0.05) is higher than Amara's (0.02), so the numerator should be positive. It is (0.03). Direction confirmed. The test statistic (0.78) is well below the critical value (1.714). Small samples (10 and 15 forecasts) limit the test's power to detect modest differences.
✓ Answer: t = 0.7814. Fail to reject H₀. Kenji's larger average forecast error is not statistically significant at the 5% level.
🧮 BA II Plus Keystrokes
`2ND``FV`
Clear all TVM registers → 0.00
`9``×``.01``=`
Calculates (n₁−1)s₁² = 9 × 0.01 → 0.09
`STO``1`
Stores first term → 0.09
`14``×``.0081``=`
Calculates (n₂−1)s₂² = 14 × 0.0081 → 0.1134
`+``RCL``1``=`
Adds both terms → 0.2034
`÷``23``=`
Divides by df = 23 to get s²_p → 0.008843
`STO``2`
Stores the pooled variance → 0.008843
`RCL``2``÷``10``=`
Calculates s²_p / n₁ → 0.00088435
`STO``3`
Stores first fraction → 0.00088435
`RCL``2``÷``15``=`
Calculates s²_p / n₂ → 0.00058957
`+``RCL``3``=`
Adds both fractions → 0.00147392
`2ND``√x`
Takes square root for standard error → 0.038392
`STO``4`
Stores standard error → 0.038392
`0.05``−``0.02``=`
Calculates X̄₁ − X̄₂ → 0.03
`÷``RCL``4``=`
Divides by standard error to get t → 0.7814
⚠️ Using the individual sample variances directly without pooling. Using SE = √(0.01/10 + 0.0081/15) = 0.03924 gives t = 0.7645. This is close but technically wrong. It ignores the equal-variance assumption that justifies the pooled estimator. On a borderline question, this error can change the conclusion.
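The pooled calculation has enough intermediate steps that a short script is a useful cross-check. A sketch with Sofia's figures (function name is my own):

```python
from math import sqrt

def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Two-sample t-statistic with a pooled variance (equal-variance assumption)."""
    # Pooled variance combines both samples, weighted by degrees of freedom
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(sp2 / n1 + sp2 / n2)   # standard error of the difference in means
    return (xbar1 - xbar2) / se

t = pooled_t(0.05, 0.10, 10, 0.02, 0.09, 15)   # Kenji vs Amara, ≈ 0.7814
# Right-tail test at 5% with df = 10 + 15 − 2 = 23: reject only if t > 1.714.
# 0.7814 is below 1.714, so fail to reject H₀.
```

Note that the pooled variance `sp2` appears in both terms of the standard error, exactly as the `STO 2` / `RCL 2` keystrokes above reuse one stored value.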
Worked Example 4
Comparing Returns of Matched Indexes (Difference Between Two Dependent Means, Paired Comparison)
Tomás Varga is a quantitative analyst at BlueStar Funds in Budapest. He compares the daily returns of two bond indexes over the same 1,304 trading days: the BlueStar High Yield Index and the BlueStar Investment Grade Index. Because both indexes are measured on identical days, the samples are paired. The mean daily return difference (High Yield minus Investment Grade) is −0.0021%, and the standard deviation of those differences is 0.3622%. Tomás tests at the 5% significance level whether the two indexes have different mean daily returns.
🧠Thinking Flow — Difference between two dependent means, paired t-test
The question asks
Is the mean of the daily return differences significantly different from zero? If the indexes had the same expected return, day-to-day differences would average to zero. Is −0.0021% far enough from zero to be statistically significant?
Key concept needed
Test of the Mean of the Differences (Paired Comparisons Test). The wrong approach is to treat these as two independent samples. Both indexes are observed on the same days. The same market conditions affect both simultaneously. Using an independent-samples pooled test ignores this link, inflates the standard error, and reduces power. The paired test is more powerful because it removes variation caused by shared market conditions.
Step 1, State the hypotheses
No direction is specified. Tomás asks whether the returns differ at all.
H₀: μ_d = 0 versus Hₐ: μ_d ≠ 0
Two-tailed test. Reject if the mean difference is significantly far from zero in either direction.
Step 2, Identify the test statistic
Paired observations. Use the paired t-statistic:
t = (d̄ − μ_d0) / s_d̄
where s_d̄ = s_d / √n
Degrees of freedom = n − 1 = 1,304 − 1 = 1,303.
Step 3, Specify significance level
α = 5%, two-tailed, so α/2 = 2.5% in each tail.
Step 4, State the decision rule
With 1,303 degrees of freedom (very large sample), the t-distribution approaches the standard normal. Critical values at 2.5% in each tail:
Critical values = ±1.962.
Reject H₀ if the calculated t < −1.962 or t > +1.962.
Step 5, Calculate the test statistic
First, calculate the standard error of the mean difference:
s_d̄ = 0.3622 / √1304
= 0.3622 / 36.111
= 0.010030
Then calculate the t-statistic:
t = (−0.0021 − 0) / 0.010030
= −0.2094
Step 6, Make a decision
The calculated t = −0.2094. The critical values are ±1.962. Because −0.2094 falls between −1.962 and +1.962, it does not fall in either rejection region.
Fail to reject H₀. There is insufficient evidence at the 5% level to conclude that the two indexes have different mean daily returns. The average daily difference of −0.0021% is indistinguishable from zero given the variability in the data.
Sanity check
High Yield returned slightly less than Investment Grade on average (d̄ is negative), so the test statistic should be negative. It is (−0.2094). Direction confirmed. The magnitude (0.21) is far below the critical value (1.962), confirming no significant difference.
✓ Answer: t = −0.2094. Fail to reject H₀. The mean daily return difference between the two indexes is not statistically significant at the 5% level.
🧮 BA II Plus Keystrokes
`2ND``FV`
Clear all TVM registers → 0.00
`1304``2ND``√x`
Calculates √1304 → 36.1110
`STO``1`
Stores √n → 36.1110
`0.3622``÷``RCL``1``=`
Calculates s_d̄ = 0.3622 / 36.111 → 0.010030
`STO``2`
Stores the standard error → 0.010030
`0.0021``+/-``÷``RCL``2``=`
Calculates t = −0.0021 / 0.010030 → −0.2094
⚠️ Dividing by n (1,304) instead of √n (36.111). That gives s_d̄ = 0.3622 / 1304 = 0.0002778 and t = −0.0021 / 0.0002778 = −7.559. That number falls deep in the rejection region and produces the wrong conclusion (reject H₀). The square root step is easy to skip under pressure.
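The paired test reduces to a one-sample test on the differences, which a few lines of Python make explicit. A sketch with Tomás's summary statistics (function name is illustrative):

```python
from math import sqrt

def paired_t(dbar, s_d, n, mu_d0=0.0):
    """Paired-comparisons t-statistic from summary stats of the differences."""
    se = s_d / sqrt(n)          # standard error of the mean difference — √n, not n
    return (dbar - mu_d0) / se

t = paired_t(-0.0021, 0.3622, 1304)   # ≈ −0.2094
# Two-tailed test at 5% with df = 1303: reject only if |t| > 1.962.
# |−0.2094| is far inside the bounds, so fail to reject H₀.
```

Structurally this is the same function as the single-mean t-test; the only change is that the inputs describe the per-day differences rather than one set of raw returns.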
Worked Example 5
Testing Whether Two Strategy Variances Differ (F-Test for Equality of Variances)
Dmitri Solis is a risk analyst at Crestline Capital in Madrid. He compares the daily return variance of two independent equity strategies over the same evaluation period. Strategy A has 21 daily return observations and a sample standard deviation of 1.20%. Strategy B has 31 daily return observations and a sample standard deviation of 0.85%. Dmitri tests at the 5% significance level whether Strategy A's variance is greater than Strategy B's variance.
🧠Thinking Flow — Two variances, one-tailed F-test
The question asks
Is Strategy A's variance significantly greater than Strategy B's? The sample evidence shows s_A > s_B, but is the difference statistically meaningful?
Key concept needed
F-Test for Equality of Two Variances. The wrong move is to use a chi-square test. Chi-square tests involve a single variance compared to a fixed hypothesised value. The F-test compares two sample variances against each other.
Step 1, State the hypotheses
Dmitri suspects Strategy A is riskier. "Greater than" belongs in the alternative.
H₀: σ²_A / σ²_B ≤ 1 versus Hₐ: σ²_A / σ²_B > 1
One-tailed, right-tail test.
Step 2, Identify the test statistic
Two independent samples, testing variance equality. Use the F-statistic with the larger variance in the numerator:
F = s²_A / s²_B
Degrees of freedom = (n_A − 1, n_B − 1) = (20, 30).
Step 3, Specify significance level
α = 5%, one-tailed.
Step 4, State the decision rule
From the F-table with (20, 30) degrees of freedom at α = 5%:
Critical value ≈ 1.93.
Reject H₀ if the calculated F-statistic is greater than 1.93.
Step 5, Calculate the test statistic
F = s²_A / s²_B = 1.20² / 0.85² = 1.44 / 0.7225 = 1.993
Step 6, Make a decision
The calculated F = 1.993. The critical value is 1.93. Because 1.993 > 1.93, the statistic falls in the rejection region.
Reject H₀. There is sufficient evidence at the 5% level to conclude that Strategy A's variance is greater than Strategy B's variance.
Sanity check
Strategy A has the larger variance, so the F-statistic should be greater than 1. It is (1.993). Direction confirmed. The result just crosses the critical value, so the conclusion is sensitive to the exact table value used.
✓ Answer: F = 1.993. Reject H₀. Strategy A's variance is significantly greater than Strategy B's at the 5% level.
🧮 BA II Plus Keystrokes
`2ND``FV`
Clear all TVM registers → 0.00
`1.20``x²`
Calculates s²_A = 1.44 → 1.44
`STO``1`
Stores s²_A → 1.44
`0.85``x²`
Calculates s²_B = 0.7225 → 0.7225
`STO``2`
Stores s²_B → 0.7225
`RCL``1``÷``RCL``2``=`
Calculates F = 1.44 / 0.7225 → 1.993
⚠️ Using the standard deviations (1.20 and 0.85) directly instead of squaring them first. That gives F = 1.20 / 0.85 = 1.412, which falls below the critical value of 1.93 and produces the wrong conclusion (fail to reject H₀). Always square the standard deviations before forming the ratio.
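Squaring inside the function removes the temptation to pass standard deviations straight into the ratio. A sketch with Dmitri's figures (function name is my own):

```python
def f_stat(s_1, s_2):
    """F-statistic for two variances; the larger variance goes in the numerator."""
    v1, v2 = s_1**2, s_2**2              # square the standard deviations first
    return max(v1, v2) / min(v1, v2)     # convention: F ≥ 1, use right-tail tables

f = f_stat(1.20, 0.85)   # 1.44 / 0.7225 ≈ 1.993
# Right-tail test at 5% with df = (20, 30): reject H₀ because 1.993 > 1.93.
```

The `max`/`min` arrangement enforces the larger-variance-on-top convention automatically, so the function gives the same answer regardless of argument order.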
With all five test types covered, here is the one mistake that trips up the most candidates on exam day.
⚠️
Watch out for this
The Type I / Type II Reversal Trap.
A candidate who swaps the error types answers that "failing to reject a false null" is a Type I error, and selects "rejecting a true null" as the definition of Type II error.
The correct definitions: Type I error is rejecting a true null hypothesis (a false positive), and Type II error is failing to reject a false null hypothesis (a false negative).
The correct approach: before confirming your answer, check the two conditions. Type I fires when the null is true but you reject it. Type II fires when the null is false but you keep it.
Candidates make this reversal because both errors involve a mismatch between reality and the test decision, and the labels "Type I" and "Type II" carry no meaning that signals which direction the error runs.
🧠
Memory Aid
FORMULA HOOK
Reject truth, that is Type I. Miss the lie, that is Type II. Power is the chance you catch the lie.
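The "reject truth, that is Type I" hook can also be seen numerically. In this sketch (simulated data, made up for illustration) the null is true by construction, so every rejection is a Type I error, and the long-run rejection rate lands near the chosen α:

```python
import random

random.seed(7)

def sample_mean(n):
    # Draw n observations from a population whose true mean really is 0,
    # so H₀: μ = 0 is TRUE and any rejection is a Type I error.
    return sum(random.gauss(0, 1) for _ in range(n)) / n

trials, n, rejections = 10_000, 25, 0
for _ in range(trials):
    z = sample_mean(n) / (1 / n**0.5)   # known σ = 1, so z = X̄ / (σ/√n)
    if abs(z) > 1.96:                   # two-tailed test at α = 5%
        rejections += 1

print(rejections / trials)   # close to 0.05: the Type I error rate equals α
```

Running the same simulation with a false null (say, a true mean of 0.3) would show the complementary idea: the fraction of rejections becomes the power, and the misses are Type II errors.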
Practice Questions · LO2
6 Questions LO2
Q 1 of 6 — REMEMBER
Which of the following is the correct definition of a Type I error in hypothesis testing?
CORRECT: C
Correct: C, A Type I error occurs when the null hypothesis is true in reality, but the test produces a result extreme enough to trigger rejection. You concluded there is an effect when none actually exists. The probability of this happening is exactly α, the significance level you chose before the test.
Why not A? That is the definition of a Type II error. Failing to reject a false null means you missed a real effect. The null was wrong, but your test did not detect it. The probability of this error is β. Type I and Type II are mirror images: Type I fires when the null is true and you reject it. Type II fires when the null is false and you keep it.
Why not B? Correctly rejecting a false null hypothesis is not an error at all. It is the desired outcome. The probability of this correct rejection is the power of the test, equal to 1 − β. Confusing a correct decision with a Type I error means mixing up "what the null actually is" with "what the test concluded."
---
Q 2 of 6 — UNDERSTAND
A researcher wants to test whether a new portfolio construction strategy produces a mean annual return greater than the benchmark return of 8%. Which pair of hypotheses correctly sets up this one-tailed test?
CORRECT: A
Correct: A, The researcher's suspicion is that the strategy outperforms (mean return greater than 8%). That direction belongs in the alternative hypothesis. The null is the skeptic's default: the strategy does not beat the benchmark, expressed as μ ≤ 8%. This produces a one-tailed, right-tail test. The entire rejection region sits in the right tail; you reject only if the sample mean is substantially above 8%.
Why not B? This sets up a two-tailed test using "not equal to" in the alternative. That test would detect any difference from 8% in either direction. But the question specifically asks whether the strategy beats the benchmark. Using a two-tailed test with α = 5% splits the rejection region to 2.5% in each tail, making it harder to detect the upside effect the researcher cares about. The question's directional language signals a one-tailed test.
Why not C? This reverses the hypotheses. The researcher's claim ("the strategy returns more than 8%") is placed in the null, and the skeptical default is placed in the alternative. The null is always the assumed-true baseline. Placing the research claim in the null means you would need evidence to keep that claim, which reverses the logic of hypothesis testing entirely.
---
Q 3 of 6 — APPLY
Clara Mbeki is a performance analyst at Zenith Pension Fund in Nairobi. The fund's mandate requires a mean quarterly return of at least 2.5%. Clara has 36 quarters of return data. The sample mean is 2.31% and the sample standard deviation is 0.84%. She tests at the 5% significance level whether the fund is underperforming its mandate. What is the calculated t-statistic, and what is Clara's conclusion?
CORRECT: B
Correct: B, The test statistic formula is t = (X̄ − μ₀) / (s / √n). The standard error is 0.84 / √36 = 0.84 / 6 = 0.14. The t-statistic is (2.31 − 2.50) / 0.14 = −0.19 / 0.14 = −1.357. For a one-tailed left-tail test with 35 degrees of freedom at α = 5%, the critical value is approximately −1.690. Because −1.357 is not less than −1.690, it does not fall in the rejection region. Clara fails to reject H₀. The underperformance is not statistically significant with 36 quarters of data.
Why not A? The value −1.356 results from a rounding slip in the arithmetic. The correct calculation is exact: √36 = 6, 0.84/6 = 0.14000 exactly, giving t = −0.19/0.14 = −1.3571. The conclusion (fail to reject) is the same here, but on a borderline question, a rounding error can shift the result across the critical value. Always carry enough decimal places through the intermediate steps.
Why not C? A t-statistic of −8.143 results from dividing s by n (36) rather than √n (6). Using 0.84/36 = 0.02333 as the standard error gives t = −0.19/0.02333 = −8.143. That number falls far into the rejection region and produces the opposite conclusion (reject H₀). This is the most common calculator error on this type of question. Always divide the standard deviation by the square root of n, not by n itself.
---
Q 4 of 6 — APPLY+
Dmitri Solis at Crestline Capital is now examining a second pair of strategies. Strategy C has 21 daily return observations and a sample standard deviation of 1.20%. Strategy D has 31 daily return observations and a sample standard deviation of 0.85%. He tests at the 5% significance level whether Strategy C's variance is greater than Strategy D's variance, and obtains an F-statistic. What is the correct F-statistic?
CORRECT: C
Correct: C, The F-statistic places the larger variance in the numerator: F = s²_C / s²_D. Strategy C has the larger standard deviation (1.20% > 0.85%), so it goes in the numerator. s²_C = 1.20² = 1.44 and s²_D = 0.85² = 0.7225. Therefore F = 1.44 / 0.7225 = 1.993. Degrees of freedom are (n_C − 1, n_D − 1) = (20, 30). Always put the larger variance in the numerator to ensure the F-statistic is ≥ 1 and to allow use of standard right-tail F-tables.
Why not A? A value of 1.412 results from using the standard deviations directly instead of squaring them first: 1.20 / 0.85 = 1.412. This skips the squaring step. The F-statistic is defined as a ratio of variances (squared values). Using standard deviations instead produces a systematic understatement of the test statistic whenever the ratio exceeds 1.
Why not B? A value of 0.503 results from placing the smaller variance in the numerator: 0.7225 / 1.44 = 0.503. Inverting the ratio produces a number less than 1, which cannot be compared to right-tail F-tables in the standard way. Convention requires the larger variance in the numerator so that only one tail of the distribution needs to be consulted.
---
Q 5 of 6 — ANALYZE
Two analysts debate the appropriate test for a study comparing portfolio returns before and after a regulatory change. The same 50 portfolios are measured in both periods. Analyst Farida says: "We should use the pooled t-test for two independent samples. We have two sets of return observations from different time periods, so the samples are separate." Analyst Jorge says: "We should use the paired comparisons t-test. The portfolios are the same in both periods, so each before-return is linked to its after-return." Which analyst is correct, and why?
CORRECT: B
Correct: B, The key question for choosing between independent and paired tests is whether observations in the two samples are linked. When the same 50 portfolios appear before and after, each portfolio's before-return is naturally paired with its after-return. The paired comparisons test computes the difference for each portfolio, then tests whether the mean difference equals zero. This removes variation caused by portfolio-level characteristics (size, sector, risk) that affect returns in both periods the same way. Removing that shared variation increases the precision of the test and makes it more powerful.
Why not A? Different market environments are irrelevant to the choice of test. What matters is whether the same observational units (the 50 portfolios) appear in both samples. They do. Farida's pooled t-test treats the 50 before-returns and 50 after-returns as 100 independent observations. This discards the portfolio-level link, inflates the standard error, and reduces the test's power to detect a real regulatory effect.
Why not C? The two tests are not interchangeable. The paired t-test has degrees of freedom = n − 1 = 49 and uses the standard deviation of the 50 differences. The independent pooled t-test has degrees of freedom = n₁ + n₂ − 2 = 98 and uses a pooled variance that does not account for the within-portfolio link. Using the wrong test produces a different calculated statistic and can reverse the rejection decision on a borderline result.
---
Q 6 of 6 — TRAP
An analyst at Blackmoor Quantitative Research runs a hypothesis test and obtains a test statistic that falls outside the critical value boundary. The null hypothesis in the test is false in reality. A colleague reviews the report and states: "Because you rejected the null and the null is actually false, you have committed a Type I error, you acted on a false result." Is the colleague correct?
CORRECT: A
Correct: A, Rejecting a false null hypothesis is exactly what a well-designed test is supposed to do. This is a correct decision, not an error. The probability of making this correct rejection is the power of the test, equal to 1 − β. The colleague's statement is wrong.
Why not B? This is the core confusion the trap box names. A Type I error requires two simultaneous conditions: the null must be true, and you must reject it. Here the null is false, so only one condition holds (rejection). The label "Type I error" cannot apply when the null is actually false. Type I errors are false positives: you concluded there is an effect when there is none. Here there genuinely is an effect, so detecting it is not a false positive.
Why not C? A Type II error also requires two specific conditions: the null must be false, and you must fail to reject it. Here the null is false, the first condition matches. But the analyst rejected the null, so the second condition does not match. You cannot commit a Type II error by rejecting the null. Type II fires when the null is false but you keep it. Here the analyst correctly discarded it. Calling this a Type II error confuses the direction of the mistake.
---
Glossary
null hypothesis
The default assumption in a hypothesis test, the claim treated as true unless evidence contradicts it strongly enough. Like a legal presumption of innocence: the accused is innocent until proven guilty. Written as H₀.
alternative hypothesis
The statement the researcher suspects or hopes to prove. It always contains a directional signal: "greater than," "less than," or "not equal to." Written as Hₐ or H₁.
Type I error
Rejecting a null hypothesis that is actually true. A false positive. Like a smoke detector screaming when there is no fire. The probability of this error is α, the significance level you choose before the test.
Type II error
Failing to reject a null hypothesis that is actually false. A false negative. Like a smoke detector staying silent during a real fire. The probability of this error is β. You do not choose β directly; it emerges from sample size, effect size, and your choice of α.
power
The probability of correctly rejecting a false null hypothesis. Power = 1 − β. Higher power means fewer missed real effects. Power increases when sample size increases, when α increases, or when the true effect is large.
significance level
The maximum probability of making a Type I error that you are willing to tolerate, chosen before the test. Denoted α. If α = 0.05, you accept a 5% chance of a false positive. The significance level determines where the critical value boundary sits.
α
Alpha. The significance level. The probability of rejecting a true null hypothesis that you set before running the test. If α = 0.05, the critical value cuts off the most extreme 5% of the distribution.
β
Beta. The probability of failing to reject a false null hypothesis. You do not choose β directly. It depends on sample size, the true effect size, and your chosen α. A small sample produces a larger β.
critical value
The boundary on the test statistic distribution that separates the rejection region from the non-rejection region. Found in statistical tables using α and degrees of freedom. If your calculated test statistic crosses this boundary, you reject the null.
rejection region
The range of values of the test statistic that lead to rejecting the null hypothesis. It sits in the tail(s) of the distribution beyond the critical value(s). For a left-tail test, it is everything below the (negative) critical value. For a right-tail test, everything above the (positive) critical value.
degrees of freedom
The number of independent pieces of information in a sample that are free to vary. For a single-sample t-test, degrees of freedom = n − 1. For a two-sample pooled t-test, degrees of freedom = n₁ + n₂ − 2. Used to look up critical values in statistical tables.
t-statistic
The test statistic used to test hypotheses about population means when the population standard deviation is unknown. Calculated as (sample mean − hypothesised mean) / (sample standard deviation / √n). Compared to a critical value from the t-distribution.
chi-square statistic
The test statistic used to test hypotheses about a single population variance. Calculated as (n − 1) × sample variance / hypothesised variance. Chi-square values are always positive or zero, never negative.
pooled t-statistic
The test statistic used when comparing two independent sample means and assuming the two populations have equal (but unknown) variances. The pooled variance combines information from both samples into a single variance estimate.
paired t-statistic
The test statistic used when comparing two samples that are linked, such as the same subjects measured at two points in time. Computed on the differences between paired observations rather than on the two sets of observations separately.
F-statistic
The test statistic used to test whether two population variances are equal. Calculated as the ratio of two sample variances, with the larger variance always placed in the numerator. F-values are always positive.
LO 2 Done ✓
Ready for the next learning objective.
Quantitative Methods · Hypothesis Testing · LO 3 of 3
⏱ 10 min
·
6 questions
·
LOW PRIORITY
module: "Quantitative Methods · Hypothesis Testing · LO 3 of 3"
lo: "8c"
title: "Why do statisticians sometimes throw away information to get more reliable answers?"
subtitle: "Know which test to use when the data violate the assumptions that make powerful tests work."
priority: HIGH
blooms: ANALYZE
time_estimate: "8min-15min"
calculator_required: false
tags: ["parametric-tests", "nonparametric-tests", "statistical-assumptions", "hypothesis-testing"]
INSIGHT
Every test you have learned so far is a parametric test.
They all focus on population parameters, the mean, the variance, because those parameters tell you something concrete about the population. Parametric tests demand conditions in return: the data must come from a normally distributed population, the sample must be large enough or the population must genuinely be normal, and there must be no extreme outliers distorting the mean.
When those conditions are met, parametric tests are more powerful. They have a better chance of detecting a real effect if one exists.
When the conditions break, you face a choice. Run a weaker test that makes almost no demands on your data. Or run a stronger test knowing that violated assumptions make the answer unreliable.
The exam tests whether you know which choice serves the question you are actually answering.
The core distinction: what the test is actually testing
Think about a weather forecasting model. A sophisticated model that assumes clear atmospheric conditions makes precise predictions, when conditions are right. Feed it hurricane data and it breaks. A simpler model that makes no assumptions about atmospheric conditions gives rougher predictions, but it keeps working in almost any weather.
Parametric and nonparametric tests have the same relationship.
A parametric test always makes a claim about a specific, named, measurable property of the population, usually the mean or the variance. And it requires specific assumptions about what the population looks like, usually that it is normally distributed. When those assumptions hold, it is the more powerful tool.
A nonparametric test either does not concern a population parameter at all, or makes minimal assumptions about what the population looks like. It is the rougher model. More flexible. Less powerful.
The wrong move is to default to nonparametric because it feels safer. The right move is to check whether the parametric conditions are met, and only switch to nonparametric when they are not.
The parametric-nonparametric choice
1
Parametric test. Tests a specific population parameter, like the mean or variance, and requires distributional assumptions, usually normality. Use when you have a clear hypothesis about a population parameter and your data meet the distributional requirements.
2
Nonparametric test. Either does not concern a population parameter at all, or makes minimal assumptions about the population distribution. Use when distributional assumptions are violated, when outliers are present, when data are ranked, or when the hypothesis does not involve a parameter.
3
The trade-off: power versus flexibility. Parametric tests have greater power when their assumptions are met. Nonparametric tests are more flexible but typically weaker. Choose parametric if assumptions are satisfied. Otherwise, choose nonparametric.
4
Running both tests together. Applying both tests to the same dataset reveals how sensitive your conclusion is to the parametric assumptions. If both tests reach the same conclusion, your finding is robust. If they diverge, the parametric conclusion depends heavily on those assumptions being exactly right.
When to switch to a nonparametric test
Most candidates know that nonparametric tests exist for when "assumptions are violated." That answer is too vague to be useful on an exam question. There are four specific triggering conditions. Learn to identify them by name.
Four circumstances for nonparametric testing
1
Distributional assumptions violated. The data do not meet the normality or other distributional requirements of the parametric test. A small sample from a markedly skewed population is the classic case. Apply a nonparametric test to avoid unreliable results. The parametric test statistic will not follow its assumed distribution, making the p-value meaningless.
2
Outliers present. Extreme values distort the mean but not the median. When outliers are influencing the mean, a parametric test of the mean is answering the wrong question. Use a nonparametric test of the median instead.
3
Data are ranked or ordinal. The observations are given as ranks or categories, not as precise measurements. Investment manager performance rankings are a common example. Parametric tests require stronger measurement scales. Use nonparametric procedures when the data are ordinal.
4
Hypothesis does not concern a parameter. You are testing randomness, testing goodness of fit to a distribution, or asking another question unrelated to a population mean or variance. Nonparametric tests address these questions. Parametric tests structurally cannot, not because they are weaker, but because they are designed for a different question entirely.
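Triggering condition (2) is easy to see numerically. A minimal sketch using only the standard library; the P&L figures are invented for illustration:

```python
from statistics import mean, median

# Hypothetical daily P&L figures (illustrative only); the final value is
# an extreme outlier from a market disruption.
pnl = [1.2, -0.8, 0.5, 0.3, -0.4, 0.9, -0.2, 0.6, -45.0]

print(round(mean(pnl), 2))   # -4.77: one outlier drags the mean far below zero
print(median(pnl))           # 0.3: the median barely notices
```

One extreme observation moves the mean from roughly +0.26 to -4.77, while the median stays at 0.3. A parametric test of the mean would be answering a question about a number the outlier controls.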
The parallel structure: which nonparametric test replaces which parametric test
When the parametric assumptions fail, a nonparametric alternative exists for each common testing scenario. You do not need to compute these tests. You need to recognise the scenario and match it to the correct alternative.
Testing scenario | Parametric test | Nonparametric alternative
Single population mean or median | t-test or z-test | Wilcoxon signed-rank test
Difference between two independent group means | Two-sample t-test | Mann-Whitney U test (Wilcoxon rank-sum test)
Paired mean differences | Paired t-test | Wilcoxon signed-rank test or sign test
Randomness of a sequence | No parametric equivalent | Runs test
How to identify the right test in practice
Worked Example 1
Identifying a parametric test by its defining characteristics
Priya Mehta is a junior analyst at Vanguard Crest Capital, a mid-sized asset manager in Singapore. Her supervisor asks her to review a colleague's analysis. The colleague ran a two-sample t-test to compare the mean quarterly returns of two equity funds, assuming both fund return series are drawn from normally distributed populations. Priya's supervisor asks: "What makes this a parametric test rather than a nonparametric one?" Priya must identify both defining characteristics.
🧠Thinking Flow — What makes a test parametric
The question asks
Which characteristics define a parametric test, and does this scenario exhibit them?
Key concept needed
A parametric test requires two things: it must concern a population parameter, and it must require specific distributional assumptions. Many candidates answer "it uses a formula" or "it uses a sample." Both are wrong. Every statistical test uses a formula and a sample.
Step 1, Identify the parameter being tested
Ask: what quantity is the test making a claim about?
The colleague is comparing mean quarterly returns. The mean is a population parameter, a specific, named, measurable property of the population. That is the first defining characteristic.
Step 2, Identify the distributional assumption
The scenario explicitly states that returns are assumed to be drawn from normally distributed populations. The t-test is only valid under that assumption.
This is the second defining characteristic: the test requires specific assumptions about the shape of the population distribution.
Both characteristics are present. This is unambiguously a parametric test.
Step 3, Sanity check
Ask the reverse: could a nonparametric test do the same job here?
Yes, the Mann-Whitney U test could compare the two funds without requiring normality. But because the normality assumption is stated as valid, the parametric t-test is preferred. It has more power: a greater ability to detect a real difference in means if one exists. Choosing nonparametric here would sacrifice power without gaining anything.
✓ Answer: The colleague's test is parametric because it concerns a population parameter (the mean return) and its validity depends on a specific distributional assumption (normality of the population). Both conditions together define a parametric test.
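Worked Example 1's comparison can be sketched in code. The return series below are simulated, not real fund data, and the normal draws make the t-test's assumption hold by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
fund_a = rng.normal(loc=0.020, scale=0.05, size=40)  # simulated quarterly returns
fund_b = rng.normal(loc=0.012, scale=0.05, size=40)

# Parametric: compares the means; valid here because normality holds.
t_stat, t_p = stats.ttest_ind(fund_a, fund_b, equal_var=False)

# Nonparametric fallback: rank-based, needs no normality, but less power.
u_stat, u_p = stats.mannwhitneyu(fund_a, fund_b, alternative="two-sided")

print(f"two-sample t-test p = {t_p:.3f}")
print(f"Mann-Whitney U    p = {u_p:.3f}")
```

Both tests are asking whether the two funds differ, but only the t-test makes a claim about a named parameter (the mean) under a distributional assumption, which is what makes it parametric.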
Worked Example 2
Choosing between parametric and nonparametric when assumptions are violated
Tomás Reyes is a risk analyst at Nordvik Investment Bank in Oslo. He is testing whether the median daily loss for a trading desk differs significantly from zero. His sample contains 18 observations, several of which are extreme outliers following a market disruption. The underlying population is clearly not normally distributed. Tomás must decide whether to run a parametric t-test or a nonparametric alternative.
🧠Thinking Flow — Selecting the appropriate test when outliers and non-normality are present
The question asks
Given outliers and a non-normal population, which test type is more appropriate, and why?
Key concept needed
Two of the four nonparametric triggering conditions apply here: violated distributional assumptions and outliers present.
Step 1, Check the distributional assumption
Many candidates see "18 observations" and immediately think "small sample, use the t-test." That reasoning is incomplete.
The t-test is valid for small samples only when the population is approximately normally distributed, or when the sample is large enough for the Central Limit Theorem to apply. Tomás has 18 observations from a clearly non-normal population. The distributional assumption is not met. A t-test applied here would produce a test statistic that does not follow a t-distribution, making the p-value and rejection decision unreliable.
Step 2, Assess the impact of outliers
The scenario also mentions extreme outliers. Outliers distort the mean but not the median.
Tomás's hypothesis concerns the median daily loss, not the mean. When outliers are present and the hypothesis concerns central tendency, a nonparametric test of the median is more informative than a parametric test of the mean.
Two of the four triggering conditions are active here. Either one alone would justify choosing a nonparametric test.
Step 3, Sanity check
Ask: would a parametric test be tempting here?
Yes. The t-test is familiar and the data concern a central location. But the combination of small sample, non-normal population, and outliers means the t-test's assumptions are broken. Running it would produce a test that appears more powerful but whose conclusions cannot be trusted.
The Wilcoxon signed-rank test is the appropriate nonparametric equivalent for a hypothesis about a single location.
✓ Answer: Tomás should use a nonparametric test. The distributional assumption required for the t-test is violated (small sample, non-normal population), and outliers distort the mean on which the t-test is based. The Wilcoxon signed-rank test is appropriate in both respects.
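Tomás's choice can be sketched as follows. The 18 loss figures are invented to mimic the scenario; scipy's `wilcoxon` tests whether the median differs from zero:

```python
import numpy as np
from scipy import stats

# 18 hypothetical daily losses ($m): mostly small, plus extreme outliers
# from a market disruption (illustrative numbers only).
losses = np.array([0.2, -0.1, 0.3, 0.1, -0.2, 0.4, 0.2, -0.3, 0.1,
                   0.3, -0.1, 0.2, 0.1, -0.2, 9.5, 12.1, -8.7, 11.3])

# The t-test's normality assumption is violated here, so its p-value is
# not trustworthy; it is computed only for contrast.
t_res = stats.ttest_1samp(losses, popmean=0.0)

# Wilcoxon signed-rank: a median test with no normality assumption,
# far less sensitive to the outliers.
w_res = stats.wilcoxon(losses)

print(f"t-test p-value:   {t_res.pvalue:.3f}")
print(f"Wilcoxon p-value: {w_res.pvalue:.3f}")
```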
Worked Example 3
Recognising when a hypothesis is nonparametric by nature
Amara Osei is a quantitative researcher at Helion Analytics in Accra. She wants to test whether daily price changes in a West African equity index follow a random pattern, in other words, whether today's price change gives any information about tomorrow's price change. Her colleague suggests using a standard t-test to check this. Amara is not sure that is the right tool.
🧠Thinking Flow — Identifying when a hypothesis does not concern a parameter
The question asks
Is Amara's hypothesis about a population parameter? Which type of test is appropriate?
Key concept needed
Triggering condition (4): the hypothesis does not concern a population parameter at all.
Step 1, Define what the hypothesis is actually testing
Many candidates assume that any statistical test of a financial variable must concern the mean or variance. That assumption fails here.
Amara is not asking "what is the mean price change?" She is not asking "is the variance above a threshold?" She is asking "is the sequence of price changes random?"
Randomness is a property of the sequence of observations, not of any population parameter. No mean or variance test can answer a question about sequential randomness.
Step 2, Match the hypothesis to the correct test type
When a hypothesis does not concern a parameter, the appropriate tool is a nonparametric procedure.
The specific test for randomness in a sequence is a runs test, a nonparametric procedure that counts consecutive runs of positive and negative values to assess whether the sequence is random.
The t-test Amara's colleague suggests is designed to test a hypothesis about a mean. It cannot address sequential randomness, regardless of sample size or distributional properties.
Step 3, Sanity check
Ask: could any modification of the t-test answer Amara's question?
No. The t-test statistic is built from the sample mean and sample variance. It has no mechanism for capturing the order of observations, which is exactly what randomness testing requires.
The nonparametric test is not just preferable here. It is the only structurally valid option. Amara's colleague's suggestion is the wrong tool for the job.
✓ Answer: Amara's hypothesis concerns randomness, a property of the sequence of observations, not any population parameter. A parametric t-test is structurally incapable of answering this question. The appropriate tool is a nonparametric runs test. This is triggering condition (4): the hypothesis does not concern a parameter.
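scipy has no built-in runs test, but the large-sample version is short enough to sketch by hand. The function name and the normal-approximation choice are mine, not from the source:

```python
import math

def runs_test(changes):
    """Two-sided runs test for randomness of signs (normal approximation).

    Counts runs of consecutive positive/negative changes and compares the
    observed run count with its expectation under randomness.
    """
    signs = [c > 0 for c in changes if c != 0]   # zero changes carry no sign
    n1 = sum(signs)                              # positive changes
    n2 = len(signs) - n1                         # negative changes
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / math.sqrt(var)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return runs, z, p

# A perfectly alternating sequence has far too many runs to be random:
runs, z, p = runs_test([0.01, -0.01] * 15)
print(runs, round(z, 2), p)   # 30 runs out of 30 changes; p is tiny
```

Note that the statistic is built entirely from the order of the signs; the magnitudes never enter. That is exactly the information a t-test, built from the sample mean and variance, cannot see.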
Worked Example 4
Running both tests together and interpreting the comparison
Fatima Al-Hassan is a portfolio strategist at Meridian Funds in Dubai. She is testing whether the mean monthly alpha generated by an active equity strategy differs from zero. Her data consist of 24 monthly observations. The data appear roughly normal. Fatima's manager asks her to run both a parametric t-test and a nonparametric equivalent side by side, and to explain what the comparison tells them.
🧠Thinking Flow — Interpreting the result of running parametric and nonparametric tests simultaneously
The question asks
Why would an analyst run both a parametric and a nonparametric test on the same data? What does the comparison reveal?
Key concept needed
Running tests side by side is a robustness check. It reveals how sensitive the parametric conclusion is to the normality assumption when that assumption is approximately but not perfectly satisfied.
Step 1, Identify why both tests are being run
Many candidates assume running two tests is redundant. If assumptions are met, use the parametric test. If not, use the nonparametric one. That logic covers the easy cases.
When assumptions are only approximately satisfied, neither perfectly met nor obviously violated, running both tests reveals how much the parametric conclusion depends on those assumptions being exactly right. Fatima's data are "roughly normal." That phrase is the signal that a robustness check is warranted.
Step 2, Interpret the two possible outcomes
Outcome A: Both tests reject the null hypothesis, or both fail to reject. The conclusion is robust. The parametric test's assumptions are not driving the result. Fatima can report her finding with confidence.
Outcome B: The parametric test rejects the null but the nonparametric test does not, or vice versa. The conclusions diverge. This signals that the result depends heavily on the normality assumption, and since that assumption may only be approximately true, the finding is sensitive. Fatima should report the divergence and exercise caution.
Step 3, Sanity check
Ask: does running both tests give Fatima two chances to reject the null, making rejection more likely overall?
No. She is not combining p-values or picking whichever test rejects. She is using the nonparametric test as a diagnostic tool. If both tests agree, she trusts the parametric result more. If they disagree, she knows the parametric assumptions matter for this specific dataset.
✓ Answer: Running both tests simultaneously is a robustness check. Agreement between the two tests means the finding does not depend on whether normality holds exactly. Divergence means the conclusion is sensitive to the parametric assumptions, a signal to report cautiously. The parametric test remains preferred when assumptions hold, because it has more power.
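Fatima's side-by-side check can be wrapped in a small helper. The function name, structure, and data are illustrative, not from the source; a sketch using scipy:

```python
import numpy as np
from scipy import stats

def robustness_check(sample, alpha=0.05):
    """Test 'location = 0' both ways and report whether the conclusion
    survives dropping the normality assumption (illustrative helper)."""
    t_p = stats.ttest_1samp(sample, popmean=0.0).pvalue
    w_p = stats.wilcoxon(sample).pvalue
    agree = (t_p < alpha) == (w_p < alpha)   # same reject / fail-to-reject call?
    return {
        "t_pvalue": round(float(t_p), 4),
        "wilcoxon_pvalue": round(float(w_p), 4),
        "verdict": "robust" if agree else "sensitive to assumptions",
    }

rng = np.random.default_rng(7)
monthly_alpha = rng.normal(loc=0.004, scale=0.02, size=24)  # simulated, roughly normal
print(robustness_check(monthly_alpha))
```

The helper deliberately returns a verdict rather than a combined p-value: the comparison is a diagnostic, not a second chance to reject.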
⚠️
Watch out for this
The "Fewer Assumptions Means Better Test" trap.
A candidate who concludes that nonparametric tests are always preferable will argue that analysts should default to nonparametric procedures whenever possible. That is the wrong direction entirely.
When parametric assumptions are satisfied, the parametric test is preferred because it has greater statistical power, a higher probability of correctly rejecting a false null hypothesis. Choosing nonparametric by default throws away that power without gaining anything in return.
Candidates make this error because they conflate "fewer assumptions" with "more reliable," when the actual trade-off is assumptions versus power: relaxing assumptions costs you the ability to detect real effects in the data.
Before choosing nonparametric, confirm that at least one triggering condition is present: violated distributional assumptions, outliers, ranked or ordinal data, or a hypothesis that does not concern a parameter.
🧠
Memory Aid
CONTRAST ANCHOR
Parametric tests demand assumptions and reward you with power. Nonparametric tests ask little and give less.
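The power half of this contrast can be checked by simulation: when the population really is normal and the null is false, the t-test rejects more often than its nonparametric counterpart. The effect size, sample size, and trial count below are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, alpha, trials = 20, 0.5, 0.05, 1000

t_rejects = w_rejects = 0
for _ in range(trials):
    # True mean is 0.5, so the null H0: mean = 0 is false in every trial.
    x = rng.normal(loc=effect, scale=1.0, size=n)
    t_rejects += stats.ttest_1samp(x, popmean=0.0).pvalue < alpha
    w_rejects += stats.wilcoxon(x).pvalue < alpha

# Power = share of trials in which the false null was correctly rejected.
print(f"t-test power:   {t_rejects / trials:.2f}")
print(f"Wilcoxon power: {w_rejects / trials:.2f}")
```

Under normality the Wilcoxon test gives up only a little power, which is why it is such a cheap robustness check; under heavy skew or outliers, the ranking reverses.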
Practice Questions · LO3
6 Questions LO3
Q 1 of 6 — REMEMBER
Which of the following most accurately describes a defining characteristic of a parametric test?
CORRECT: B
CORRECT: B, A parametric test has two defining characteristics that must both be understood: it makes a claim about a specific population parameter (such as the mean or variance), and its validity depends on specific distributional assumptions, most commonly that the population is normally distributed. Both features are present in the definition.
Why not A? This describes a nonparametric test, not a parametric one. Parametric tests require specific distributional assumptions, usually normality. A test that can be applied to data of any distribution without restriction is precisely what nonparametric tests are designed for. Reversing the two definitions is the most common error on this type of question.
Why not C? This option correctly identifies characteristics of a nonparametric test, minimal assumptions and suitability for ranked data, but attributes them to a parametric test. Parametric tests require more assumptions and are not designed for ranked or ordinal data. If you selected C, review the defining characteristics of each test type before moving on.
---
Q 2 of 6 — UNDERSTAND
A colleague argues that analysts should always prefer nonparametric tests because they require fewer assumptions and are therefore more reliable. Which of the following best explains why this argument is flawed?
CORRECT: C
CORRECT: C, The colleague's argument confuses "fewer assumptions" with "more reliable." The trade-off is not assumptions versus reliability, it is assumptions versus power. Statistical power is the probability of correctly rejecting a false null hypothesis. When a parametric test's assumptions are met, it is more powerful than its nonparametric equivalent. Choosing nonparametric by default throws away that power without gaining anything in return. Nonparametric tests are preferred only when the parametric assumptions are violated or the hypothesis does not concern a parameter.
Why not A? This statement is factually incorrect. Nonparametric tests can be applied to small samples. In fact, small samples from non-normal populations are one of the four situations where nonparametric tests are specifically recommended. The critique of the colleague's argument is not about sample size.
Why not B? This overcorrects in the opposite direction. Parametric tests are not always superior. When distributional assumptions are violated, when outliers distort the mean, when data are ranked, or when the hypothesis does not concern a parameter, nonparametric tests are the more appropriate choice. The correct answer is conditional: parametric tests are preferred when their assumptions hold.
---
Q 3 of 6 — APPLY
Kofi Asante is a research analyst at Baobab Capital in Nairobi. He is testing whether the median weekly return of a small-cap fund differs from zero. His sample contains 15 weekly observations drawn from a population he knows to be heavily right-skewed. Which test type should Kofi use, and why?
CORRECT: A
CORRECT: A, Two of the four nonparametric triggering conditions are present here. First, the population is heavily right-skewed and the sample is small, so the distributional assumption required for a parametric t-test is violated. Second, Kofi's hypothesis concerns the median, and skewness that distorts the mean makes a test of the median more appropriate than a test of the mean. A nonparametric test such as the Wilcoxon signed-rank test is the correct choice.
Why not B? The Central Limit Theorem states that the distribution of sample means approaches normality as sample size grows large. However, 15 observations is not large enough for the CLT to reliably override heavy right-skewness. Analysts typically require samples of at least 25 to 30 before invoking the CLT as justification for using a t-test with a non-normal population.
Why not C? Whether the hypothesis concerns the median does not make both tests equally valid. The parametric t-test is built around the mean and its associated distributional properties, it cannot simply be redirected at the median. The nonparametric test is appropriate here specifically because the parametric assumptions are violated, not merely because the hypothesis uses the word "median."
---
Q 4 of 6 — APPLY+
Yuki Tanaka is a quantitative analyst at Shirogane Asset Management in Tokyo. She is running a hypothesis test on 30 monthly excess returns from a momentum strategy. The data appear roughly normal, but Yuki's supervisor asks her to run both a parametric t-test and a nonparametric equivalent on the same data. The parametric t-test rejects the null hypothesis at the 5% significance level. The nonparametric test fails to reject the null at the same level. What is the most appropriate interpretation of this outcome?
CORRECT: C
CORRECT: C, When both tests are run side by side, the goal is to assess robustness: how sensitive is the parametric conclusion to the assumptions underlying it? When both tests agree, the conclusion is robust. When they diverge, as they do here, it signals that the parametric test's rejection of the null depends meaningfully on the normality assumption being correct. Since Yuki's data are only "roughly normal," she cannot be confident the assumption fully holds. The divergence is a warning: report the finding cautiously and note that conclusions differ depending on the test chosen.
Why not A? The parametric test does have more power when assumptions are met, but that power advantage is only valid when the assumptions actually hold. When the two tests disagree, the divergence is precisely the signal that the parametric assumptions may be driving the result. Automatically accepting the parametric conclusion because it is "more powerful" ignores the diagnostic value of running both tests.
Why not B? Defaulting to the nonparametric result because it is "more conservative" confuses statistical caution with analytical rigour. The point of running both tests is not to pick the one that fails to reject, it is to assess robustness. Neither test should be automatically preferred. The divergence itself is the finding: the conclusion is unstable, and that instability must be reported.
---
Q 5 of 6 — ANALYZE
An analyst is reviewing three testing scenarios and must decide which test type is most appropriate for each. Which of the following correctly classifies the test type most appropriate for each scenario?
Scenario | Description
P | Testing whether the variance of a large, normally distributed portfolio's returns exceeds a regulatory threshold
Q | Testing whether a sequence of daily price changes in an emerging market index appears random
R | Testing whether the mean annual return of a fund differs from a benchmark, using 40 observations from a roughly normal population
CORRECT: B
CORRECT: B, Scenario P concerns the variance of a normally distributed population, a specific population parameter tested under valid distributional assumptions. Parametric is correct. Scenario Q asks whether a sequence of price changes is random. Randomness is a property of the sequence of observations, not of any population parameter. No parametric test can answer this question; a nonparametric runs test is required. Scenario R tests a mean using 40 observations from a roughly normal population, the parametric assumptions are adequately satisfied, and the parametric t-test is preferred for its greater power.
Why not A? This option misclassifies Scenario P as nonparametric. Scenario P explicitly concerns the variance, a population parameter, and the population is normally distributed. Both defining conditions for a parametric test are met. There is no triggering condition for nonparametric testing here.
Why not C? This option misclassifies Scenario R as nonparametric. With 40 observations and a roughly normal population, the assumptions of the parametric t-test are adequately satisfied. Choosing nonparametric in this case sacrifices statistical power without gaining anything, there is no violated assumption, no outlier problem, no ranked data, and the hypothesis does concern a population parameter (the mean).
---
Q 6 of 6 — TRAP
Sebastián Varga is an analyst at Fortis Research in Budapest. He tells a colleague: "Our dataset has some outliers and the population may not be perfectly normal, so I have been using nonparametric tests for all our hypothesis testing going forward. Nonparametric tests require fewer assumptions, so they are always the safer choice." Which of the following best evaluates Sebastián's reasoning?
CORRECT: C
CORRECT: C, Sebastián has fallen into the "fewer assumptions means better test" trap. The relevant trade-off is not assumptions versus reliability, it is assumptions versus power. Statistical power is the probability of correctly detecting a real effect in the data. When parametric assumptions are satisfied, even approximately, the parametric test's greater power means it is better at finding real differences. By adopting nonparametric tests as a permanent default, Sebastián sacrifices that power on every test where assumptions actually hold, without any analytical gain. The nonparametric triggering conditions should be evaluated test by test, not resolved once with a blanket policy.
Why not A? "Broader applicability" does not mean "universally preferable." A Swiss Army knife is more versatile than a chef's knife, but a chef uses the chef's knife when precision matters. Nonparametric tests are more flexible, but flexibility comes at the cost of power. When the parametric tool fits the job, use it.
Why not B? Outliers and potential non-normality are legitimate reasons to choose nonparametric tests in specific cases, but they do not justify eliminating parametric tests from the toolkit permanently. If Sebastián encounters a future dataset that is clearly normally distributed with no outliers, abandoning the parametric test costs him power for no reason. The correct response to outliers is case-by-case assessment, not a standing policy.
---
Glossary
parametric test
A statistical test that makes a specific claim about a population parameter (such as the mean or variance) and requires assumptions about the shape of the population distribution, usually that it is normally distributed. Like a recipe that works perfectly when you have exactly the right ingredients, powerful when conditions are met, unreliable when they are not.
nonparametric test
A statistical test that either does not concern a population parameter at all, or makes minimal assumptions about the population distribution. Like a recipe that works with almost any ingredients, more flexible, but often produces a less precise result than the specialised version.
power
In hypothesis testing, the probability that a test correctly rejects a false null hypothesis. A test with high power is better at detecting real effects in the data. Think of it as the sensitivity of a smoke detector: a more powerful detector catches real fires; a less powerful one misses them more often.
distributional assumption
A requirement that the data or population follow a specific statistical distribution (usually the normal distribution) for a test's results to be valid. Like the terms and conditions of a warranty: the tool works as promised only if certain conditions are met.
outlier
An observation that lies unusually far from the bulk of the data. Outliers distort the mean but have less effect on the median. Like one very tall person in a room full of average-height people: the average height shoots up, but the median height barely moves.
ordinal data
Data that can be ranked or ordered but where the differences between ranks are not necessarily equal or meaningful. Investment manager rankings are a common example: first place is better than second, but the gap between first and second is not the same as the gap between second and third.
runs test
A nonparametric procedure that tests whether a sequence of observations is random by counting the number of consecutive "runs", uninterrupted sequences of the same outcome, such as consecutive positive returns. It answers the question "does the order of these observations carry information?" without making any claim about means or variances.
Wilcoxon signed-rank test
A nonparametric test used as an alternative to the paired t-test or single-sample t-test when distributional assumptions are violated. It ranks the absolute values of differences and tests whether the population median differs from a specified value, without assuming normality.
Mann-Whitney U test
A nonparametric test used to compare two independent groups when the assumptions of the two-sample t-test are not met. Also called the Wilcoxon rank-sum test. It tests whether one group's values tend to be systematically higher or lower than the other group's values, using ranks rather than raw values.
Central Limit Theorem
A statistical principle stating that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the population's shape. This is why large samples can justify using parametric tests even when the population is not perfectly normal. "Large" is typically interpreted as 25 to 30 or more observations in practice.
robustness check
Running an alternative test (often nonparametric) alongside a primary test (often parametric) to see whether the conclusion changes. If both tests agree, the finding is robust to the parametric assumptions. If they diverge, the conclusion is sensitive to those assumptions and should be reported cautiously.
statistical power
The probability of correctly rejecting a null hypothesis that is actually false. Higher power means fewer missed real effects. Parametric tests typically have more power than nonparametric tests when their assumptions are satisfied.
LO 3 Done ✓
You have completed all learning objectives for this module.
🔒 PRO Feature
How analysts use this at work
Real-world applications and interview questions from top firms.
Statistical decision-making in quantitative research, portfolio management, and risk analysis.
Why this session exists
The exam tests whether you can define a Type I error, choose the right test statistic, and identify when a nonparametric test is appropriate. Interviewers ask whether you understand what those choices actually mean in practice. This section bridges the gap between the definition and the judgment call.
Hypothesis testing appears throughout quantitative finance. Researchers use it to validate a trading signal before committing capital. Portfolio managers use it to assess whether a fund is genuinely outperforming its benchmark or simply lucky. Risk teams use it to confirm that the distributions underlying their models behave as assumed. The mechanics are learnable. The professional judgment about which test to run, what alpha to tolerate, and when to switch to a nonparametric approach is what interviews probe.
LO 1
"Statistical decision-making: the framework analysts use before they run a single test"
How analysts use this at work
Credit analysts at firms like BlackRock and Goldman Sachs use hypothesis testing every time they validate a signal before making a trading recommendation. Before any backtest result is presented to a portfolio committee, the analyst must frame the null hypothesis, set the significance level, and decide what Type I and Type II errors actually cost the firm in this specific context. The wrong framing produces a recommendation that looks statistically sound but misleads the committee about what the data can prove. The output is a trade decision supported by a rigorous decision rule, not just a reported p-value.
Quantitative researchers at Citadel and Two Sigma apply the same framework when testing whether an equity factor has real predictive power. They do not simply run a regression and report the results. They specify the test direction upfront, decide whether a one-tailed or two-tailed test is appropriate, and document what it means if they fail to reject the null. The key professional habit here is separating the decision to reject from the decision to accept. A failed rejection is not evidence that the null is true. It means the data were insufficient to overturn it, which is a different conclusion entirely and leads to a different trading action.
Interview questions
BlackRock Quantitative Research Analyst "A manager's TWR is 11% and their MWR is 6% for the same period. What does this tell you about the timing of client cash flows relative to the strategy's performance?"
Goldman Sachs Investment Management Division "Explain the relationship between Type I error and Type II error. In the context of a trading strategy backtest, which error would you be more concerned about and why?"
Morgan Stanley Wealth Management Analyst "A colleague concludes that because a hypothesis test failed to reject the null hypothesis, the null hypothesis must be true. Walk through why this is incorrect and what the correct interpretation of a failed rejection actually is."
One-line to use in your interview
Interviewers listen for industry-specific language. It signals you understand the concept, not just the definition. Use the plain English version to adapt it in your own words.
In practice, I frame the null as the skeptic's default and the alternative as the researcher's claim before collecting any data, because hypothesis testing is asymmetric: strong evidence can reject the null, but weak evidence only means the case was not proven, not that the null is true.
In plain English
I start by assuming nothing has changed, then I check whether the data gives me enough evidence to overturn that assumption. If the data is weak, I cannot say the assumption is correct. I can only say the data was not strong enough to reject it.
LO 2
"Test construction: matching the right tool to the right question before the data speaks"
How analysts use this at work
Performance analysts at Vanguard and T. Rowe Price apply the six-step hypothesis testing process every time they assess whether a fund is genuinely outperforming its benchmark. They begin by stating the null and alternative hypotheses clearly, choosing the correct test based on whether they have one sample or two, whether the samples are paired or independent, and whether they are testing a mean or a variance. They commit to the significance level and the decision rule before calculating anything. Only then do they compute the test statistic. The output is a defensible performance conclusion that appears in client reports and GIPS-compliant disclosures. Getting the test type wrong, such as running a paired test when samples are independent or placing the wrong variance in the numerator of an F-test, produces an incorrect conclusion that misleads clients about manager skill.
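The numerator convention is where the F-test mistake bites, and it fits in a few lines. The sketch below uses invented monthly returns; 3.5 is a rounded two-tailed 5% critical value for an F-distribution with (11, 11) degrees of freedom, which in practice you would read from a table or a stats library.

```python
import statistics

# Hypothetical monthly returns (percent) for two independent strategies
a = [1.2, -0.8, 2.1, 0.4, -1.5, 0.9, 1.8, -0.3, 0.6, -2.0, 1.1, 0.2]
b = [0.5, -0.2, 0.7, 0.1, -0.4, 0.3, 0.6, -0.1, 0.2, -0.5, 0.4, 0.0]

var_a = statistics.variance(a)   # sample variance, n - 1 denominator
var_b = statistics.variance(b)

# Convention: the LARGER sample variance goes in the numerator, so the
# F-statistic is always >= 1 and is compared to an upper-tail critical value.
f_stat = max(var_a, var_b) / min(var_a, var_b)

F_CRIT = 3.5    # rounded two-tailed 5% critical value, df = (11, 11)
decision = "reject H0: variances differ" if f_stat > F_CRIT else "fail to reject H0"
print(f"F = {f_stat:.2f}, {decision}")
```

Putting the smaller variance in the numerator instead would shrink the statistic below 1 and guarantee a fail-to-reject against an upper-tail cutoff, which is exactly the kind of silent error the paragraph warns about.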
Risk analysts at JPMorgan and PIMCO use hypothesis tests to validate the assumptions inside their risk models. Before accepting that a return distribution is approximately normal, they test whether the mean and variance meet the model specifications. They choose chi-square tests for variance and t-tests for means, applying each at the appropriate degrees of freedom. The consequence of a wrong test choice here is not just a wrong number. A risk model built on incorrect distributional assumptions underestimates tail losses, leading to insufficient capital reserves. The test is not a formality. It is a gate that determines whether the model goes into production.
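The variance gate the paragraph describes can be sketched directly. Assumed setup, for illustration only: the risk model specifies a return variance of 4 (in percent squared), and 2.700 and 19.023 are the approximate two-tailed 5% chi-square critical values for 9 degrees of freedom.

```python
import statistics

# Hypothetical monthly returns (percent); the model assumes variance = 4
returns = [2.1, -1.8, 0.7, 3.2, -2.5, 1.4, -0.6, 2.8, -1.1, 0.9]
SIGMA0_SQ = 4.0

n = len(returns)
s_sq = statistics.variance(returns)   # sample variance, n - 1 denominator

# Chi-square statistic for H0: population variance equals SIGMA0_SQ
chi2 = (n - 1) * s_sq / SIGMA0_SQ

LOWER, UPPER = 2.700, 19.023   # approx. two-tailed 5% critical values, df = 9
if LOWER < chi2 < UPPER:
    decision = "fail to reject H0"   # not proof the model variance is correct
else:
    decision = "reject H0: variance inconsistent with the model"
print(f"chi-square = {chi2:.2f}, {decision}")
```

Here the gate is cleared, but as in LO 1 the fail-to-reject outcome does not certify the assumption; it only says this sample could not overturn it.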
Interview questions
JPMorgan Quantitative Analytics "An analyst wants to test whether two equity strategies have significantly different return variances. Walk through the six steps of this hypothesis test, including which test statistic to use, what the null and alternative hypotheses should be, and what the F-statistic measures."
Vanguard Quantitative Analyst "A fund has 36 months of returns with a sample mean below its benchmark. The analyst runs a t-test and fails to reject the null. A senior manager asks whether the fund is meeting its objective. What is the precise statistical answer, and what practical interpretation should the manager take away?"
PIMCO Risk Analyst "When testing whether the mean monthly return of a bond strategy exceeds zero, an analyst chooses a two-tailed test when a one-tailed test would have been more appropriate for the research question. Describe the practical consequence of this choice in terms of statistical power."
One-line to use in your interview
Interviewers listen for industry-specific language. It signals you understand the concept, not just the definition. Use the plain English version to adapt it in your own words.
Before running any test, I write down the hypotheses, the test statistic, and the exact rejection rule, because deciding after you see the data is how you fool yourself into finding patterns that are not really there.
In plain English
I lock in my decision rules before I look at the numbers. That way I cannot change my standards halfway through to make a result look better than it is.
LO 3
"Parametric vs nonparametric: choosing the right test when your data misbehaves"
How analysts use this at work
Quantitative researchers at Two Sigma and Citadel face this decision constantly when working with financial data that rarely behaves like a textbook example. Returns data from stressed markets is often heavily skewed, clustered around zero, or punctuated by extreme outliers from flash crashes and liquidity events. A parametric t-test assumes normality and tests the mean, but outliers inflate the sample mean without affecting the median, making the test answer a question nobody actually asked. The researcher must recognize that the normality assumption is violated and switch to a nonparametric equivalent such as the Wilcoxon signed-rank test, which tests the median and assumes only a symmetric distribution rather than normality. The professional habit here is assessing whether the data justifies the stronger test before defaulting to it, not switching to nonparametric because it feels safer.
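The mean-versus-median point is easy to demonstrate. The sketch below uses invented daily returns with one flash-crash outlier and, to stay within the standard library, uses the sign test, a simpler nonparametric cousin of the Wilcoxon signed-rank test, for the median question.

```python
import math
import statistics

# Hypothetical daily returns (percent); the last value is a flash-crash outlier
returns = [0.2, 0.1, -0.1, 0.3, 0.1, -0.2, 0.2, 0.1, 0.2, -8.5]

mean, median = statistics.mean(returns), statistics.median(returns)

# Sign test for H0: median = 0. Under H0, each nonzero return is positive
# with probability 1/2, so the count of positives is Binomial(n, 0.5).
diffs = [r for r in returns if r != 0]
n, k = len(diffs), sum(r > 0 for r in diffs)
tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
p_value = 2 * tail   # exact two-tailed p-value

print(f"mean = {mean:.2f} (dragged negative by a single outlier)")
print(f"median = {median:.2f} (barely moved)")
print(f"sign-test p-value = {p_value:.3f}")
```

A t-test on this sample would be testing a mean that one observation has distorted; the nonparametric test asks about the median, which is often the question the researcher actually cares about.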
Consultants at Mercer and State Street working with institutional clients use robustness checks to validate performance conclusions. When a pension fund's actuary tests whether a new liability-driven investment strategy is outperforming a benchmark, the consultant runs both the parametric t-test and its nonparametric equivalent on the same dataset. If both tests agree, the conclusion holds regardless of whether the return distribution is perfectly normal or merely approximately so. If the tests diverge, the consultant flags that the conclusion is sensitive to the distributional assumption and reports it cautiously. This side-by-side approach is not redundant. It is the diagnostic step that tells clients how much confidence to place in the result.
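A minimal version of that side-by-side check can be sketched with the same standard-library pieces: a t-test decision against an approximate critical value, with the sign test standing in for a nonparametric equivalent. The excess returns and the 2.262 cutoff (two-tailed 5%, df = 9) are illustrative assumptions.

```python
import math
import statistics

def t_rejects(sample, mu0=0.0, t_crit=2.262):
    """Parametric check: two-tailed one-sample t-test decision at approx. 5%."""
    n = len(sample)
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    return abs(t) > t_crit

def sign_rejects(sample, med0=0.0, alpha=0.05):
    """Nonparametric check: exact two-tailed sign test for the median."""
    diffs = [x - med0 for x in sample if x != med0]
    n, k = len(diffs), sum(d > 0 for d in diffs)
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return 2 * tail < alpha

# Hypothetical monthly excess returns of the LDI strategy over its benchmark
excess = [1.9, 2.4, 1.1, 2.8, 1.5, 2.2, 1.7, 2.6, 2.0, 1.3]

parametric, nonparametric = t_rejects(excess), sign_rejects(excess)
verdict = ("conclusion is robust" if parametric == nonparametric
           else "conclusion is sensitive to the distributional assumption")
print(f"t-test rejects: {parametric}, sign test rejects: {nonparametric} -> {verdict}")
```

When the two decisions agree, the conclusion does not hinge on normality; when they diverge, that divergence itself is the finding the consultant reports.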
Interview questions
Two Sigma Quantitative Researcher "A researcher is testing whether a trading signal has genuine predictive power. They have 200 daily return observations that appear roughly normal, but one market dislocation produced extreme outliers. Under what circumstances would you recommend a nonparametric test instead of a parametric t-test, and what would the researcher gain or lose by making each choice?"
Vanguard Investment Analyst "An analyst has 30 monthly fund return observations. The data appear roughly normal. Their supervisor asks them to run both a parametric t-test and a nonparametric equivalent, and to explain why. The parametric test rejects the null, but the nonparametric test does not. What does this divergence tell you, and how should the analyst proceed?"
T. Rowe Price Performance Analyst "Explain the difference between parametric and nonparametric tests, and describe a real scenario in financial analysis where a nonparametric test would be the more appropriate choice even if the parametric assumptions were approximately satisfied."
One-line to use in your interview
Interviewers listen for industry-specific language. It signals you understand the concept, not just the definition. Use the plain English version to adapt it in your own words.
I default to parametric tests when the data supports them because they detect real effects more often, but I check whether the normality assumption actually holds first, and I switch to nonparametric when it does not, because the weaker test is the honest choice when the conditions for the stronger one are not met.
In plain English
The stronger test works better when your data is clean. When it is not, the weaker test is not a compromise. It is the correct tool for a messier situation, and pretending the data is cleaner than it is leads to overconfident conclusions.