Statistics Exam Study Guide

Chapter 1: Descriptive Statistics

Key Concepts

  1. Measures of Location
    • Mean: Sum of all values divided by the number of values.
      • Sensitive to outliers. Use for symmetric data.
    • Median: Middle value of ordered data.
      • Better for skewed data.
  2. Measures of Spread
    • Standard Deviation (SD): Square root of variance.
      • Shows how far data points are from the mean.
    • Interquartile Range (IQR): Difference between Q3 and Q1.
      • Resistant to outliers.
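As a quick check on these definitions, here is a minimal Python sketch using the standard `statistics` module on a small made-up data set (the values are illustrative only); note how the outlier pulls the mean but not the median or IQR.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 40]        # made-up sample with one outlier (40)

mean = statistics.mean(data)                   # pulled upward by the outlier
median = statistics.median(data)               # resistant to the outlier
sd = statistics.stdev(data)                    # sample standard deviation (divides by n-1)

q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1                                  # interquartile range

print(f"mean={mean:.2f}, median={median}, sd={sd:.2f}, IQR={iqr:.2f}")
```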

Graphs

  1. Histograms: Used for frequency distributions of continuous data.
  2. Boxplots: Show the 5-number summary (Min, Q1, Median, Q3, Max) and outliers.

Chapter 2: Probability

Key Concepts

  1. Axioms of Probability:
    • \(P(A) \geq 0\) (Non-negativity)
    • \(P(S) = 1\) (Sample space normalization)
    • For pairwise disjoint events \(A_1, A_2, \dots\),
      \[ P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i) \]
  2. Conditional Probability:
    • \(P(A|B) = \frac{P(A \cap B)}{P(B)}\) (when \(P(B) > 0\))
  3. Multiplication Rule:
    • For independent events \(A\) and \(B\),
      \[ P(A \cap B) = P(A) \cdot P(B) \]
  4. Bayes’ Theorem:
    \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
    Example: cancer-test problems that combine sensitivity, specificity, and the base rate (see Practice Problem 1).

Chapter 3: Random Variables

Key Concepts

  1. Discrete Random Variables
    • Defined by a pmf (Probability Mass Function).
    • Cumulative Distribution Function (CDF):
      \[ F(x) = P(X \leq x) = \sum_{k \leq x} P(X = k) \]
  2. Continuous Random Variables
    • Defined by a pdf (Probability Density Function).
    • To find probabilities: Integrate the pdf over the range.
      \[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]
  3. Expectation (Mean)
    • Discrete: \(E(X) = \sum x \cdot P(X=x)\)
    • Continuous: \(E(X) = \int x \cdot f(x) \, dx\)
  4. Variance
    • \(\text{Var}(X) = E\left((X - \mu)^2\right) = E(X^2) - [E(X)]^2\)
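The expectation and variance formulas translate directly into code. Below is a minimal Python sketch: the discrete part sums over a small made-up pmf, and the continuous part integrates an assumed Exponential(2) pdf with `scipy.integrate.quad` (both the pmf and the pdf are illustrative choices, not from the text).

```python
import math
from scipy.integrate import quad

# Discrete: pmf of a made-up random variable X
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

E_X = sum(x * p for x, p in pmf.items())        # E(X) = sum of x * P(X=x)
E_X2 = sum(x**2 * p for x, p in pmf.items())    # E(X^2)
var_X = E_X2 - E_X**2                            # Var(X) = E(X^2) - [E(X)]^2
print(f"Discrete: E(X)={E_X:.2f}, Var(X)={var_X:.2f}")

# Continuous: Exponential(lambda = 2) pdf as an illustrative example
lam = 2.0
pdf = lambda x: lam * math.exp(-lam * x)

prob, _ = quad(pdf, 0.5, 1.5)                    # P(0.5 <= X <= 1.5) = integral of f over [0.5, 1.5]
mean, _ = quad(lambda x: x * pdf(x), 0, math.inf)  # E(X) = integral of x f(x) dx
print(f"Continuous: P(0.5<=X<=1.5)={prob:.4f}, E(X)={mean:.4f} (theory: 1/lambda = 0.5)")
```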

Chapter 4: Mathematical Expectation

Key Concepts

  1. Linear Combinations:
    • For \(Y = a + bX\):
      • \(E(Y) = a + bE(X)\)
      • \(\text{Var}(Y) = b^2 \text{Var}(X)\)
  2. Functions of Random Variables
    • \(E(g(X))\) for discrete and continuous cases:
      • Discrete: \(E(g(X)) = \sum g(x) \cdot P(X=x)\)
      • Continuous: \(E(g(X)) = \int g(x) \cdot f(x) \, dx\)
  3. Joint Expectation: For two variables \(X\) and \(Y\),
    \[ E(g(X, Y)) = \sum \sum g(x, y) \cdot P(X=x, Y=y) \]

Chapter 5: Discrete Probability Distributions

| Distribution | PMF | Mean \(E(X)\) | Variance \(\text{Var}(X)\) |
|---|---|---|---|
| Bernoulli(\(p\)) | \(P(X=1) = p,\ P(X=0) = 1-p\) | \(p\) | \(p(1-p)\) |
| Binomial(\(n, p\)) | \(P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\) | \(np\) | \(np(1-p)\) |
| Geometric(\(p\)) | \(P(X=k) = (1-p)^{k-1} p\) | \(\frac{1}{p}\) | \(\frac{1-p}{p^2}\) |
| Poisson(\(\lambda\)) | \(P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\) | \(\lambda\) | \(\lambda\) |
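If SciPy is available, `scipy.stats` implements each of these distributions, which is handy for checking table values; in particular, `scipy.stats.geom` uses the same support \(k = 1, 2, \dots\) as the table. A small sketch with arbitrary parameter values:

```python
from scipy import stats

n, p, lam = 10, 0.3, 4.0          # arbitrary illustrative parameters

binom = stats.binom(n, p)
geom = stats.geom(p)              # support k = 1, 2, ... (same convention as the table)
pois = stats.poisson(lam)

print(binom.pmf(3))                          # P(X=3) = C(10,3) 0.3^3 0.7^7
print(binom.mean(), binom.var())             # np = 3.0, np(1-p) = 2.1
print(geom.mean(), geom.var())               # 1/p ≈ 3.33, (1-p)/p^2 ≈ 7.78
print(pois.pmf(2), pois.mean(), pois.var())  # lambda^2 e^{-lambda}/2!, 4.0, 4.0
```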

Chapter 6: Continuous Probability Distributions

| Distribution | PDF | Mean \(E(X)\) | Variance \(\text{Var}(X)\) |
|---|---|---|---|
| Uniform(\(a, b\)) | \(f(x) = \frac{1}{b-a},\ a \leq x \leq b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
| Normal(\(\mu, \sigma^2\)) | \(f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) | \(\mu\) | \(\sigma^2\) |
| Exponential(\(\lambda\)) | \(f(x) = \lambda e^{-\lambda x},\ x \geq 0\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) |
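SciPy's versions of these distributions use a `loc`/`scale` parametrization, which is a common source of errors: `uniform(loc=a, scale=b-a)`, `norm(loc=μ, scale=σ)`, and `expon(scale=1/λ)`. A minimal sketch (arbitrary parameter values) that checks the table's means and variances:

```python
from scipy import stats

a, b = 2.0, 6.0
mu, sigma = 10.0, 3.0
lam = 0.5

U = stats.uniform(loc=a, scale=b - a)   # Uniform(a, b)
N = stats.norm(loc=mu, scale=sigma)     # Normal(mu, sigma^2); scale is sigma, not sigma^2
E = stats.expon(scale=1 / lam)          # Exponential(lambda); scale is 1/lambda

print(U.mean(), U.var())   # (a+b)/2 = 4.0, (b-a)^2/12 ≈ 1.33
print(N.mean(), N.var())   # mu = 10.0, sigma^2 = 9.0
print(E.mean(), E.var())   # 1/lambda = 2.0, 1/lambda^2 = 4.0
```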

Practice Problem 1: Bayes’ Theorem

Problem: A test for a disease is 99% accurate when the disease is present and has a 1% false positive rate. If the disease occurs in 2% of the population, what is the probability a person has the disease given they test positive?

Solution:
- Let \(D\) = Disease, \(D^c\) = No Disease, \(T^+\) = Positive Test.
\[ P(D | T^+) = \frac{P(T^+ | D)P(D)}{P(T^+)} \]
Where \(P(T^+) = P(T^+ | D)P(D) + P(T^+ | D^c)P(D^c)\):
\[ P(T^+) = (0.99)(0.02) + (0.01)(0.98) = 0.0198 + 0.0098 = 0.0296 \]
\[ P(D | T^+) = \frac{(0.99)(0.02)}{0.0296} = \frac{0.0198}{0.0296} \approx 0.669 \]
Answer: There is about a 66.9% chance the person has the disease given a positive test.
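A short Python sketch that mirrors this calculation, useful for re-checking the arithmetic with different sensitivities or base rates (the function name is my own):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)  # law of total probability
    return sensitivity * prior / p_pos

print(posterior(prior=0.02, sensitivity=0.99, false_positive_rate=0.01))  # ≈ 0.669
```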

Chapter 8: Sampling Distributions

Key Concepts

  1. Central Limit Theorem (CLT)
    • For a sample mean \(\bar{X}\) based on a large sample of size \(n\) from a population with mean \(\mu\) and variance \(\sigma^2\), approximately:
      \[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
    • Use this to approximate probabilities when sample sizes are large, regardless of the underlying distribution.
  2. Sampling Distribution of the Sample Mean
    • Mean: \(\mu_{\bar{X}} = \mu\)
    • Standard Error (SE): \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)
  3. t-Distribution
    • Used when \(\sigma\) is unknown and the sample standard deviation \(s\) is used in its place.
    • Characteristics: Heavier tails than the Normal distribution.
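A small simulation makes the CLT concrete: draw many samples from a skewed (exponential) population and look at the mean and standard deviation of the resulting sample means. This is an illustrative sketch with made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 50, 10_000      # exponential population mean, sample size, replications

# Each row is one sample of size n; take the mean of each row
sample_means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

print(sample_means.mean())         # ≈ mu = 2.0
print(sample_means.std(ddof=1))    # ≈ sigma / sqrt(n) = 2.0 / sqrt(50) ≈ 0.283
```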

Practice Problem 2: Sampling Distributions

Problem: A company’s fleet of cars has NOx emissions \(N(0.03, 0.02^2)\). If a sample of 25 cars is taken:
1. What is the standard error of the sample mean?
2. What is the probability that the sample mean exceeds 0.045?

Solution:
1. Standard Error:
\[ \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{0.02}{\sqrt{25}} = 0.004 \]

  2. Find \(P(\bar{X} > 0.045)\):
  • Convert to a Z-score:
    \[ Z = \frac{\bar{X} - \mu}{\text{SE}} = \frac{0.045 - 0.03}{0.004} = 3.75 \]
  • From the Z-table, \(P(Z > 3.75) \approx 0.00009\).

Answer:
- Standard error = 0.004
- Probability ≈ 0.00009 (very unlikely).
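The same two steps in Python, using `scipy.stats.norm.sf` for the upper-tail probability as a numerical check on the Z-table lookup:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 0.03, 0.02, 25
se = sigma / sqrt(n)               # standard error = 0.004
z = (0.045 - mu) / se              # z = 3.75
print(se, z, norm.sf(z))           # sf = upper-tail P(Z > z) ≈ 9e-05
```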


Chapter 9: Confidence Intervals and Estimation

Key Concepts

  1. Confidence Interval for Population Mean
    • If \(\sigma\) is known:
      \[ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \]
    • If \(\sigma\) is unknown:
      \[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \]
  2. Interpretation of Confidence Level
    • Example: “We are 95% confident that the true mean lies between the bounds of the confidence interval.”
  3. Sample Size Determination
    • To achieve a margin of error \(E\):
      \[ n = \left( \frac{z^* \cdot \sigma}{E} \right)^2 \]
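These formulas are easy to wrap as small helper functions. The sketch below uses `scipy.stats.norm.ppf` to look up \(z^*\) from the confidence level instead of a table; the function names and the example numbers at the end are my own.

```python
from math import ceil, sqrt
from scipy.stats import norm

def z_star(conf_level):
    """Two-sided critical value, e.g. 0.95 -> 1.96."""
    return norm.ppf(1 - (1 - conf_level) / 2)

def mean_ci(xbar, sigma, n, conf_level=0.95):
    """Confidence interval for the mean when sigma is known."""
    margin = z_star(conf_level) * sigma / sqrt(n)
    return xbar - margin, xbar + margin

def sample_size(sigma, margin, conf_level=0.95):
    """Smallest n whose margin of error is at most `margin` (round up)."""
    return ceil((z_star(conf_level) * sigma / margin) ** 2)

print(mean_ci(100, 15, 40, 0.95))   # hypothetical numbers: ≈ (95.35, 104.65)
print(sample_size(15, 3, 0.95))     # (1.96 * 15 / 3)^2 ≈ 96.04 -> 97
```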

Practice Problem 3: Confidence Intervals

Problem: A light bulb has a lifetime with \(\sigma = 40\). A sample of 30 bulbs has an average lifetime of 780 hours.
1. Construct a 96% confidence interval.
2. What sample size is needed to ensure the margin of error is 9 hours?

Solution:
1. Use the formula \(\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}\):
- \(z^*\) for 96%: \(z^* = 2.05\) (from Z-table).
- Standard error:
\[ \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{40}{\sqrt{30}} \approx 7.30 \]
- CI:
\[ 780 \pm 2.05 \cdot 7.30 = 780 \pm 15.0 \Rightarrow (765.0, 795.0) \]

  2. Solve for \(n\):
    \[ n = \left( \frac{z^* \cdot \sigma}{E} \right)^2 = \left( \frac{2.05 \cdot 40}{9} \right)^2 \approx 83.0 \Rightarrow \text{round up to } n = 84 \]

Answer:
1. CI: (765, 795)
2. Required sample size: 84
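A quick numerical check of both parts with SciPy, confirming the \(z^*\) lookup and the round-up to \(n = 84\):

```python
from math import ceil, sqrt
from scipy.stats import norm

z = norm.ppf(0.98)                   # 96% two-sided critical value ≈ 2.054
se = 40 / sqrt(30)                   # ≈ 7.30
print(780 - z * se, 780 + z * se)    # ≈ (765.0, 795.0)
print(ceil((z * 40 / 9) ** 2))       # ≈ 83.3 -> 84 after rounding up
```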


Chapter 10: Hypothesis Testing

Key Concepts

  1. Steps for a Significance Test
    1. State the hypotheses:
      • Null (\(H_0\)): No effect or difference.
      • Alternative (\(H_a\)): Claim or difference.
    2. Choose significance level \(\alpha\) (e.g., 0.05).
    3. Compute the test statistic:
      \[ Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}, \quad t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} \]
    4. Find the p-value and compare with \(\alpha\).
    5. Make a conclusion.
  2. Errors in Hypothesis Testing
    • Type I Error: Reject \(H_0\) when \(H_0\) is true.
      • Probability = \(\alpha\).
    • Type II Error: Fail to reject \(H_0\) when \(H_0\) is false.
  3. Power of a Test
    • \(\text{Power} = 1 - P(\text{Type II Error})\).
    • Increasing power: Larger \(n\), larger \(\alpha\), or smaller variability.
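The test-statistic and p-value steps can be packaged as a small helper; this is a sketch, and the function name, `alternative` argument, and example numbers are my own rather than a library API.

```python
from math import sqrt
from scipy.stats import norm

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z-test. Returns (z statistic, p-value)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    if alternative == "greater":
        p = norm.sf(z)               # upper-tail test
    elif alternative == "less":
        p = norm.cdf(z)              # lower-tail test
    else:
        p = 2 * norm.sf(abs(z))      # two-sided test
    return z, p

print(z_test(52, 50, 8, 36))         # hypothetical example: z = 1.5, two-sided p ≈ 0.134
```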

Practice Problem 4: Hypothesis Testing

Problem: A plant reports an average commute time of 32.6 minutes. A sample of 60 workers has \(\bar{x} = 34.5\), \(\sigma = 6.1\). Test at \(\alpha = 0.05\) whether the true mean exceeds 32.6.

Solution:
1. Hypotheses:
- \(H_0: \mu = 32.6\), \(H_a: \mu > 32.6\)

  2. Test statistic:
    \[ Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}} = \frac{34.5 - 32.6}{\frac{6.1}{\sqrt{60}}} = \frac{1.9}{0.787} \approx 2.41 \]

  3. Find p-value: From Z-table, \(P(Z > 2.41) \approx 0.0080\).

  4. Conclusion: \(p = 0.008 < 0.05\), so reject \(H_0\).

Answer: There is sufficient evidence to conclude the true mean exceeds 32.6 minutes.
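Verifying the arithmetic in Python:

```python
from math import sqrt
from scipy.stats import norm

z = (34.5 - 32.6) / (6.1 / sqrt(60))   # test statistic
print(z, norm.sf(z))                   # ≈ 2.41, p ≈ 0.008 < 0.05, so reject H0
```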


Chapter 11-12: Linear Regression

Key Concepts

  1. Least Squares Regression Line
    • Equation: \(\hat{y} = a + bx\), where:
      \[ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad a = \bar{y} - b\bar{x} \]
  2. Inference for Slope
    • Hypotheses: \(H_0: \beta = 0\), \(H_a: \beta \neq 0\).
    • Test statistic:
      \[ t = \frac{b}{\text{SE}_b}, \quad \text{SE}_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \]
  3. Residual Plots
    • Conditions for regression inference:
      1. Linearity
      2. Constant variance
      3. Normality of residuals
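A minimal NumPy sketch of the least-squares formulas above, using made-up data (`x` and `y` are illustrative arrays only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up response values

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                 # intercept
print(f"y-hat = {a:.3f} + {b:.3f} x")

residuals = y - (a + b * x)   # inspect these against the conditions listed above
```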

Chapter 13: ANOVA (Analysis of Variance)

Key Concepts

  1. Purpose of ANOVA
    • Compare means across multiple groups to determine if at least one group differs significantly.
  2. Hypotheses for One-Factor ANOVA
    • Null (\(H_0\)): \(\mu_1 = \mu_2 = \dots = \mu_k\) (all group means are equal).
    • Alternative (\(H_a\)): At least one \(\mu_i\) differs.
  3. F-Statistic
    • Measures variation between groups compared to variation within groups:
      \[ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} \]
    • Mean Squares:
      • \(\text{MS}_{\text{between}} = \frac{\text{SS}_{\text{between}}}{k-1}\)
      • \(\text{MS}_{\text{within}} = \frac{\text{SS}_{\text{within}}}{n-k}\)
  4. Interpreting ANOVA Output
    • Large \(F\)-value and small \(p\)-value (\(p < \alpha\)) → Reject \(H_0\).
    • Post hoc tests (e.g., Tukey’s HSD) help identify which means differ.
  5. Residual Plots
    • Use to check ANOVA assumptions:
      1. Residuals are normally distributed.
      2. Variance of residuals is constant across groups.
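In practice the F-statistic and p-value come from software. `scipy.stats.f_oneway` runs a one-way ANOVA directly; the three groups below are made-up numbers for illustration.

```python
from scipy import stats

group_a = [20, 21, 19, 22, 20]
group_b = [25, 24, 26, 25, 27]
group_c = [30, 29, 31, 28, 30]   # made-up data for three groups

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")   # small p -> reject H0 that all means are equal
```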

Practice Problem 5: ANOVA

Problem: A researcher measures the performance of students under 3 teaching methods (A, B, C). The mean performance scores and ANOVA results are summarized below:

| Source | SS | df | MS | F | p-value |
|---|---|---|---|---|---|
| Between | 45.2 | 2 | 22.6 | 8.3 | 0.002 |
| Within | 54.6 | 27 | 2.02 | | |
| Total | 99.8 | 29 | | | |

  1. State the hypotheses.
  2. Interpret the \(p\)-value and conclusion.

Solution:
1. Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\) (all group means are equal).
- \(H_a\): At least one mean differs.

  2. \(F = 8.3\), \(p = 0.002\):
    • Since \(p < 0.05\), reject \(H_0\).

Conclusion: There is significant evidence that at least one teaching method’s mean performance differs from the others.


Final Cheat Sheet: Key Formulas

Probability

  1. \(P(A \cap B) = P(A) \cdot P(B|A)\)
  2. Bayes’ Theorem:
    \[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]

Confidence Intervals

  1. Mean (Known \(\sigma\)):
    \[ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \]
  2. Mean (Unknown \(\sigma\)):
    \[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \]

Hypothesis Testing

  1. Z-Test:
    \[ Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}} \]
  2. t-Test:
    \[ t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} \]

Linear Regression

  1. Slope:
    \[ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]
  2. Y-Intercept:
    \[ a = \bar{y} - b\bar{x} \]

ANOVA

  1. F-Statistic:
    \[ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} \]

Test-Taking Tips

  1. Read carefully: Underline important numbers, hypotheses, and conditions.
  2. Sketch diagrams: Venn diagrams (probability), boxplots, or residual plots.
  3. Show all work: Clearly write formulas and steps to reduce errors.
  4. Time management: Don’t get stuck on one problem; skip and return if necessary.

Chapter 9: One- & Two-Sample Estimation Problems

Key Concepts

  1. Confidence Intervals (CIs)
    • A confidence interval gives a range of plausible values for the population parameter (e.g., mean \(\mu\)).
    • General Formula:
      \[ \text{CI} = \bar{x} \pm \text{Margin of Error} \]
    • For population means:
      • When \(\sigma\) (population standard deviation) is known (use Z):
        \[ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \]
      • When \(\sigma\) is unknown (use t):
        \[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \]
    • Conditions for inference:
      1. Random sample.
      2. Normality of sampling distribution:
        • \(n \geq 30\) (CLT applies), or
        • Population is Normal.
  2. Confidence Level
    • The confidence level (e.g., 95%) is the long-run percentage of confidence intervals that capture the true parameter.
    • Example: “We are 95% confident that the true mean lies between X and Y.”
  3. Determining Sample Size
    • To ensure a margin of error \(E\) for a confidence level \(z^*\):
      \[ n = \left( \frac{z^* \cdot \sigma}{E} \right)^2 \]
  4. Two-Sample Settings
    • Compare means of two independent populations.
    • Three cases:
      1. Known variances → Use Z procedures.
      2. Unknown, assumed equal variances → Pooled \(t\)-test.
      3. Unknown, unequal variances → Unpooled \(t\)-test (Welch’s \(t\)-test).
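For the two-sample cases, `scipy.stats.ttest_ind` covers both the pooled and the Welch (unpooled) procedures via its `equal_var` flag; the samples below are made up.

```python
from scipy import stats

sample1 = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
sample2 = [11.2, 11.5, 11.0, 11.7, 11.3, 11.4]   # made-up independent samples

pooled = stats.ttest_ind(sample1, sample2, equal_var=True)    # pooled t-test (equal variances)
welch = stats.ttest_ind(sample1, sample2, equal_var=False)    # Welch's t-test (unequal variances)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```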

Practice Problem 6: Confidence Intervals

Problem: A sample of 30 bulbs has \(\bar{x} = 780\), \(\sigma = 40\). Construct a 95% confidence interval for the true mean.

Solution:
- \(n = 30\), \(\sigma = 40\), \(\bar{x} = 780\), \(z^* = 1.96\) (for 95%).
- Standard Error:
\[ \text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{40}{\sqrt{30}} \approx 7.30 \]
- CI:
\[ 780 \pm 1.96 \cdot 7.30 \]
\[ 780 \pm 14.31 \Rightarrow (765.69, 794.31) \]

Answer: \((765.7, 794.3)\).


Chapter 10: One- & Two-Sample Hypothesis Tests

Key Concepts

  1. Hypothesis Testing Steps
    1. State the null (\(H_0\)) and alternative (\(H_a\)) hypotheses.
    2. Choose significance level \(\alpha\) (usually 0.05).
    3. Compute the test statistic:
      • Z-test (if \(\sigma\) known):
        \[ Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}} \]
      • t-test (if \(\sigma\) unknown):
        \[ t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} \]
    4. Find the p-value and compare it to \(\alpha\).
    5. Make a conclusion:
      • If \(p < \alpha\), reject \(H_0\).
      • If \(p \geq \alpha\), fail to reject \(H_0\).
  2. Types of Errors
    • Type I Error: Rejecting \(H_0\) when it is true.
      • Probability = \(\alpha\).
    • Type II Error: Failing to reject \(H_0\) when it is false.
      • Probability = \(\beta\).
    • Power: \(1 - \beta\) (probability of correctly rejecting \(H_0\)).
  3. Matched Pairs vs. Independent Samples
    • Matched Pairs: Paired data (e.g., pre-test/post-test). Analyze differences.
    • Independent Samples: Two unrelated groups.
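Power depends on the true alternative mean, so it must be computed for a specific \(\mu_a\). The sketch below computes the power of a one-sided z-test under an assumed alternative; the function name and all numbers are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def power_one_sided_z(mu0, mu_a, sigma, n, alpha=0.05):
    """Power of the test H0: mu = mu0 vs Ha: mu > mu0 when the true mean is mu_a."""
    se = sigma / sqrt(n)
    x_crit = mu0 + norm.ppf(1 - alpha) * se     # reject H0 when x-bar exceeds this cutoff
    return norm.sf((x_crit - mu_a) / se)        # P(x-bar > cutoff | true mean = mu_a)

print(power_one_sided_z(mu0=500, mu_a=505, sigma=10, n=36))   # larger n or alpha -> more power
```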

Practice Problem 7: Hypothesis Testing

Problem: A company claims their machines fill bottles with \(\mu = 500\) ml. A sample of 36 bottles has \(\bar{x} = 495\) and \(\sigma = 10\). Test \(H_0: \mu = 500\) vs \(H_a: \mu < 500\) at \(\alpha = 0.05\).

Solution:
1. Hypotheses:
- \(H_0: \mu = 500\), \(H_a: \mu < 500\).

  2. Test statistic:
    \[ Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}} = \frac{495 - 500}{\frac{10}{\sqrt{36}}} = \frac{-5}{1.67} \approx -3.00 \]

  3. Find p-value: From Z-table, \(P(Z < -3.00) \approx 0.0013\).

  4. Compare to \(\alpha = 0.05\):
    • \(p = 0.0013 < 0.05\), so reject \(H_0\).

Conclusion: There is significant evidence that the mean bottle fill is less than 500 ml.


Chapter 11-12: Simple Linear Regression

Key Concepts

  1. Least Squares Regression Line
    • Equation: \(\hat{y} = a + bx\), where:
      \[ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad a = \bar{y} - b\bar{x} \]
  2. Inference for the Slope
    • Hypotheses: \(H_0: \beta = 0\), \(H_a: \beta \neq 0\).
    • Test statistic for slope:
      \[ t = \frac{b}{\text{SE}_b}, \quad \text{SE}_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \]
  3. Conditions for Regression Inference
    • Linearity: Residuals show no pattern.
    • Constant variance: Spread of residuals is consistent.
    • Normality: Residuals follow a Normal distribution.

Practice Problem 8: Regression

Problem: Given the following data:

| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 4 | 4 |
| 5 | 5 |

  1. Calculate the slope \(b\).
  2. Find the equation of the regression line.

Solution:
1. First, compute means:
\[ \bar{x} = \frac{1+2+3+4+5}{5} = 3, \quad \bar{y} = \frac{2+4+5+4+5}{5} = 4 \]
Compute \(b\):
\[ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{(1-3)(2-4) + (2-3)(4-4) + \dots}{(1-3)^2 + (2-3)^2 + \dots} = \frac{6}{10} = 0.6 \]

  2. Solve for \(a\):
    \[ a = \bar{y} - b\bar{x} = 4 - 0.6(3) = 2.2 \]

Answer: Regression line: \(\hat{y} = 2.2 + 0.6x\).
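`numpy.polyfit` with degree 1 fits the same least-squares line, which makes for a quick check of the hand calculation:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

b, a = np.polyfit(x, y, 1)   # degree-1 fit returns (slope, intercept)
print(a, b)                  # 2.2, 0.6  ->  y-hat = 2.2 + 0.6x
```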


Chapter 13: ANOVA

  1. F-Statistic:
    \[ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} \]
    • Large \(F\)-values suggest differences between group means.
  2. Hypotheses:
    • \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\).
    • \(H_a\): At least one mean is different.
  3. Conditions:
    • Independent samples.
    • Equal variance across groups.
    • Normality of data.

Chapter 11-12: Simple Linear Regression

1. Least Squares Regression Line

The goal of linear regression is to model the relationship between a predictor variable \(x\) (independent variable) and a response variable \(y\) (dependent variable).

The least squares regression line minimizes the sum of the squared residuals (vertical distances between observed and predicted values).
The line is given by:
\[ \hat{y} = a + bx \]
Where:
- \(b\) is the slope:
\[ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]
- \(a\) is the y-intercept:
\[ a = \bar{y} - b\bar{x} \]

Example Calculation:
Suppose you have the following data:
| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |

  1. Compute means:
    \[ \bar{x} = \frac{1 + 2 + 3}{3} = 2, \quad \bar{y} = \frac{2 + 3 + 5}{3} = 3.33 \]

  2. Compute \(b\):
    \[ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

    • Numerator: \((1-2)(2-3.33) + (2-2)(3-3.33) + (3-2)(5-3.33) = 1.33 + 0 + 1.67 = 3.00\)
    • Denominator: \((1-2)^2 + (2-2)^2 + (3-2)^2 = 1 + 0 + 1 = 2\)
      \[ b = \frac{3.00}{2} = 1.5 \]
  3. Compute \(a\):
    \[ a = \bar{y} - b\bar{x} = 3.33 - (1.5)(2) = 0.33 \]

Final Regression Line:
\[ \hat{y} = 0.33 + 1.5x \]
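Checking the three-point example numerically (the exact values are \(b = 1.5\) and \(a = 1/3 \approx 0.33\)):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 3, 5])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
a = y.mean() - b * x.mean()                                                 # intercept
print(b, a)   # 1.5, 0.333...
```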


2. Interpreting the Regression Coefficients

  1. Slope (\(b\)): The change in \(y\) for each 1-unit increase in \(x\).
    • Positive slope: \(y\) increases as \(x\) increases.
    • Negative slope: \(y\) decreases as \(x\) increases.
  2. Intercept (\(a\)): The predicted value of \(y\) when \(x = 0\).

Example: From the regression line \(\hat{y} = 0.33 + 1.5x\):
- Slope \(1.5\): For each 1-unit increase in \(x\), the predicted \(y\) increases by 1.5.
- Intercept \(0.33\): When \(x = 0\), \(y\) is predicted to be 0.33.


3. Residuals and Residual Plots

  • Residual: The difference between the observed \(y\) and the predicted \(\hat{y}\):
    \[ \text{Residual} = y - \hat{y} \]
  • Residual plots are used to check the conditions for inference:
    1. Linearity: Residuals should show no clear pattern.
    2. Constant variance: Spread of residuals should be similar across \(x\).
    3. Normality: Residuals should be approximately Normally distributed.

4. Inference for the Regression Slope

To test the significance of the regression slope (\(\beta\)):

  1. Hypotheses:
    • Null: \(H_0: \beta = 0\) (no relationship between \(x\) and \(y\)).
    • Alternative: \(H_a: \beta \neq 0\) (significant linear relationship).
  2. Test Statistic:
    \[ t = \frac{b}{\text{SE}_b}, \quad \text{SE}_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \]
    • \(b\): Sample slope.
    • \(s\): Residual standard deviation.
  3. Conditions for Inference:
    • Linear relationship (checked with residual plots).
    • Independent observations.
    • Constant variance of residuals.
    • Normality of residuals.
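`scipy.stats.linregress` returns the slope, its standard error \(\text{SE}_b\), and the two-sided p-value for \(H_0: \beta = 0\) in one call; the data below are made up.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7])   # made-up response values

res = stats.linregress(x, y)
print(res.slope, res.intercept)   # least-squares b and a
print(res.stderr)                 # SE_b, the standard error of the slope
print(res.pvalue)                 # two-sided p-value for H0: beta = 0

t = res.slope / res.stderr        # matches the chapter's t = b / SE_b
print(t)
```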

Chapter 13: ANOVA (Analysis of Variance)

1. One-Way ANOVA Overview

ANOVA compares the means of three or more groups to determine if at least one group mean is significantly different.

  • Hypotheses:
    • Null: \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\) (all group means are equal).
    • Alternative: \(H_a\): At least one mean differs.

2. F-Statistic

The F-statistic compares the variation between groups to the variation within groups:
\[ F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} \]

Where:
- Mean Square Between (MSB): Measures variation between group means.
\[ \text{MSB} = \frac{\text{SS}_{\text{between}}}{k-1} \]
- Mean Square Within (MSW): Measures variation within groups.
\[ \text{MSW} = \frac{\text{SS}_{\text{within}}}{n-k} \]
- Degrees of freedom (with \(k\) groups and \(n\) total observations):
  - Between groups: \(df_{\text{between}} = k - 1\)
  - Within groups: \(df_{\text{within}} = n - k\)
  - Total: \(df_{\text{total}} = n - 1\)


3. Assumptions of ANOVA

  1. Independent Samples: Groups are independent of each other.
  2. Equal Variance: Variance within each group is approximately equal.
  3. Normality: The data in each group is approximately Normally distributed.

4. Post-Hoc Tests

If the ANOVA test rejects \(H_0\), post-hoc tests (e.g., Tukey’s HSD) identify which pairs of group means differ significantly.


Practice Problem 9: One-Way ANOVA

Problem: Three fertilizers are tested for crop yield (in bushels per acre). The data is summarized as follows:

| Group | Sample Size (\(n_i\)) | Mean (\(\bar{x}_i\)) | Variance (\(s_i^2\)) |
|---|---|---|---|
| Fertilizer A | 10 | 20 | 4 |
| Fertilizer B | 10 | 25 | 5 |
| Fertilizer C | 10 | 30 | 6 |

  1. State the hypotheses.
  2. Explain the steps to calculate the F-statistic.

Solution:
1. Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\)
- \(H_a:\) At least one mean differs.

  2. Steps to calculate \(F\):
    • Compute SSB (Sum of Squares Between) and SSW (Sum of Squares Within).
    • Calculate MSB and MSW:
      \[ \text{MSB} = \frac{\text{SSB}}{k-1}, \quad \text{MSW} = \frac{\text{SSW}}{n-k} \]
    • Compute \(F\):
      \[ F = \frac{\text{MSB}}{\text{MSW}} \]
    • Compare \(F\)-value to critical value (from F-table) or calculate \(p\)-value.
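With only summary statistics (group sizes, means, and variances), SSB and SSW can still be computed directly. The sketch below does this for the fertilizer table, following the steps listed above.

```python
from scipy.stats import f

n = [10, 10, 10]                 # group sizes
means = [20.0, 25.0, 30.0]       # group means
variances = [4.0, 5.0, 6.0]      # group sample variances
k, N = len(n), sum(n)

grand_mean = sum(ni * m for ni, m in zip(n, means)) / N           # weighted grand mean = 25

ssb = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))  # SS between = 500
ssw = sum((ni - 1) * v for ni, v in zip(n, variances))            # SS within = 135

msb = ssb / (k - 1)              # 250
msw = ssw / (N - k)              # 5
F = msb / msw                    # 50
p_value = f.sf(F, k - 1, N - k)  # upper-tail p-value; essentially 0 here, so reject H0
print(ssb, ssw, F, p_value)
```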