| Distribution | PMF | Mean \(E(X)\) | Variance \(\text{Var}(X)\) |
|---|---|---|---|
| Bernoulli(p) | \(P(X=1) = p, P(X=0) = 1-p\) | \(p\) | \(p(1-p)\) |
| Binomial(n, p) | \(P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\) | \(np\) | \(np(1-p)\) |
| Geometric(p) | \(P(X=k) = (1-p)^{k-1} p\) | \(\frac{1}{p}\) | \(\frac{1-p}{p^2}\) |
| Poisson(λ) | \(P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\) | \(\lambda\) | \(\lambda\) |

| Distribution | PDF | Mean \(E(X)\) | Variance \(\text{Var}(X)\) |
|---|---|---|---|
| Uniform(a, b) | \(f(x) = \frac{1}{b-a}, a \leq x \leq b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
| Normal(μ, σ²) | \(f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) | \(\mu\) | \(\sigma^2\) |
| Exponential(λ) | \(f(x) = \lambda e^{-\lambda x}, x \geq 0\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) |
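
The mean and variance columns can be checked numerically against a PMF. The following is a minimal sketch for the Binomial row only; the helper names `binomial_pmf` and `moments_from_pmf` and the values \(n = 10\), \(p = 0.3\) are illustrative assumptions, not part of the tables.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def moments_from_pmf(pmf, support):
    """Mean and variance computed directly from a PMF over a finite support."""
    mean = sum(k * pmf(k) for k in support)
    var = sum((k - mean) ** 2 * pmf(k) for k in support)
    return mean, var


# Illustrative (assumed) parameters
n, p = 10, 0.3
mean, var = moments_from_pmf(lambda k: binomial_pmf(k, n, p), range(n + 1))

print(f"{mean:.2f} {var:.2f}")  # 3.00 2.10, matching np and np(1-p)
```
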
Problem: A test for a disease is 99% accurate when the disease is present and has a 1% false positive rate. If the disease occurs in 2% of the population, what is the probability a person has the disease given they test positive?
Solution:
- Let \(D\) = Disease, \(D^c\) = No Disease, \(T^+\) = Positive Test.
\[
P(D | T^+) = \frac{P(T^+ | D)P(D)}{P(T^+)}
\]
where \(P(T^+) = P(T^+ | D)P(D) + P(T^+ | D^c)P(D^c)\):
\[
P(T^+) = (0.99)(0.02) + (0.01)(0.98) = 0.0198 + 0.0098 = 0.0296
\]
\[
P(D | T^+) = \frac{(0.99)(0.02)}{0.0296} = \frac{0.0198}{0.0296} \approx 0.669
\]
Answer: About a 66.9% chance they have the disease given a positive test.
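The same Bayes' theorem calculation can be sketched in a few lines of Python. The function name `posterior_given_positive` and its parameter names are assumptions chosen for illustration.

```python
def posterior_given_positive(sensitivity, false_positive_rate, prevalence):
    """P(D | T+) using Bayes' theorem and the law of total probability."""
    true_pos = sensitivity * prevalence                 # P(T+ | D) P(D)
    false_pos = false_positive_rate * (1 - prevalence)  # P(T+ | D^c) P(D^c)
    return true_pos / (true_pos + false_pos)


# Values taken from the problem statement
posterior = posterior_given_positive(0.99, 0.01, 0.02)
print(round(posterior, 3))  # 0.669
```

Even with a very accurate test, the low prevalence keeps the posterior well below 99%, because false positives from the large healthy group are not negligible.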
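Problem: A company’s fleet of cars has NOx emissions distributed as \(N(0.03, 0.02^2)\). A sample of 25 cars is taken:
1. What is the standard error of the sample mean?
2. What is the probability that the sample mean exceeds 0.045?
Solution:
1. Standard Error:
\[
\text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{0.02}{\sqrt{25}} = 0.004
\]
2. Probability: standardize the sample mean and use the upper tail of the standard normal distribution:
\[
Z = \frac{0.045 - 0.03}{0.004} = 3.75, \quad P(\bar{X} > 0.045) = P(Z > 3.75) \approx 0.00009
\]
Answer:
- Standard error = 0.004
- Probability ≈ 0.00009 (very unlikely).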
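As a rough check, the standard error and the tail probability can be computed with Python's standard library. This sketch uses `statistics.NormalDist` in place of a Z-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the problem statement
mu, sigma, n = 0.03, 0.02, 25
threshold = 0.045

se = sigma / sqrt(n)                # standard error of the sample mean
z = (threshold - mu) / se           # standardized threshold
p_upper = 1 - NormalDist().cdf(z)   # P(Z > z), upper-tail probability

print(f"SE = {se:.3f}, z = {z:.2f}")
```
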
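Problem: Light bulb lifetimes have a known standard deviation of \(\sigma = 40\) hours. A sample of 30 bulbs has an average lifetime of 780 hours.
1. Construct a 96% confidence interval for the mean lifetime.
2. What sample size is needed to ensure the margin of error is at most 9 hours?
Solution:
1. Use the formula \(\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}\):
- \(z^*\) for 96%: \(z^* = 2.05\) (from Z-table).
- Standard error:
\[
\text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{40}{\sqrt{30}} \approx 7.30
\]
- CI:
\[
780 \pm 2.05 \cdot 7.30 = 780 \pm 14.97 \Rightarrow (765, 795)
\]
2. Require the margin of error \(E = z^* \frac{\sigma}{\sqrt{n}}\) to be at most 9 and solve for \(n\), rounding up:
\[
n \geq \left(\frac{z^* \sigma}{E}\right)^2 = \left(\frac{2.05 \cdot 40}{9}\right)^2 \approx 83.01 \Rightarrow n = 84
\]
Answer:
1. CI: (765, 795)
2. Required sample size: 84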
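Both parts can be reproduced with a short sketch, assuming a known \(\sigma\) and a normal critical value. The function names `z_confidence_interval` and `required_sample_size` are hypothetical helpers. The code uses the exact critical value from `NormalDist().inv_cdf` (about 2.054) rather than the rounded table value 2.05, so intermediate numbers differ slightly from the hand calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence):
    """Two-sided z interval for a mean when sigma is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin


def required_sample_size(sigma, margin, confidence):
    """Smallest n whose margin of error does not exceed `margin`."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star * sigma / margin) ** 2)


# Values from the problem statement
low, high = z_confidence_interval(780, 40, 30, 0.96)
n_needed = required_sample_size(40, 9, 0.96)

print(f"({low:.0f}, {high:.0f}), n = {n_needed}")  # (765, 795), n = 84
```
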
Problem: A plant reports average commute time is 32.6 minutes. A sample of 60 workers has \(\bar{x} = 34.5\), \(\sigma = 6.1\). Test at \(\alpha = 0.05\) whether the true mean exceeds 32.6.
Solution:
1. Hypotheses:
- \(H_0: \mu = 32.6\), \(H_a: \mu > 32.6\)
2. Test statistic:
\[
Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}} = \frac{34.5 - 32.6}{\frac{6.1}{\sqrt{60}}} = \frac{1.9}{0.788} \approx 2.41
\]
3. Find p-value: From Z-table, \(P(Z > 2.41) \approx 0.0080\).
4. Conclusion: \(p = 0.008 < 0.05\), so reject \(H_0\).
Answer: There is sufficient evidence to conclude the true mean exceeds 32.6 minutes.
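The steps of this one-sample z-test can be sketched as a small function. The name `one_sample_z_test` and its `alternative` argument are assumptions made for illustration; the p-value comes from `statistics.NormalDist` instead of a Z-table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(xbar, mu0, sigma, n, alternative="greater"):
    """Return (z, p_value) for a one-sample z-test with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)
    if alternative == "greater":      # H_a: mu > mu0
        p_value = 1 - cdf
    elif alternative == "less":       # H_a: mu < mu0
        p_value = cdf
    elif alternative == "two-sided":  # H_a: mu != mu0
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p_value


# Values from the commute-time problem
z, p_value = one_sample_z_test(34.5, 32.6, 6.1, 60, alternative="greater")
reject_h0 = p_value < 0.05

print(f"z = {z:.2f}, reject H0: {reject_h0}")  # z = 2.41, reject H0: True
```

The `alternative="less"` branch covers lower-tailed tests, where the p-value is the area to the left of the test statistic.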
Problem: A researcher measures the performance of students under 3 teaching methods (A, B, C). The mean performance scores and ANOVA results are summarized below:
| Source | SS | df | MS | F | p-value |
|---|---|---|---|---|---|
| Between | 45.2 | 2 | 22.6 | 11.2 | 0.0003 |
| Within | 54.6 | 27 | 2.02 | | |
| Total | 99.8 | 29 | | | |
Solution:
1. Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\) (all group means are equal).
- \(H_a\): At least one mean differs.
2. Decision: The p-value in the table is well below the conventional \(\alpha = 0.05\), so reject \(H_0\).
Conclusion: There is significant evidence that at least one teaching method’s mean performance differs from the others.
Problem: A sample of 30 bulbs has \(\bar{x} = 780\), \(\sigma = 40\). Construct a 95% confidence interval for the true mean.
Solution:
- \(n = 30\), \(\sigma = 40\), \(\bar{x} = 780\), \(z^* = 1.96\) (for 95%).
- Standard Error:
\[
\text{SE} = \frac{\sigma}{\sqrt{n}} = \frac{40}{\sqrt{30}} \approx
7.30
\]
- CI:
\[
780 \pm 1.96 \cdot 7.30
\]
\[
780 \pm 14.31 \Rightarrow (765.69, 794.31)
\]
Answer: \((765.7, 794.3)\).
Problem: A company claims their machines fill bottles with \(\mu = 500\) ml. A sample of 36 bottles has \(\bar{x} = 495\) and \(\sigma = 10\). Test \(H_0: \mu = 500\) vs \(H_a: \mu < 500\) at \(\alpha = 0.05\).
Solution:
1. Hypotheses:
- \(H_0: \mu = 500\), \(H_a: \mu < 500\).
2. Test statistic:
\[
Z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}} = \frac{495 - 500}{\frac{10}{\sqrt{36}}} = \frac{-5}{1.67} \approx -3.00
\]
3. Find p-value: From Z-table, \(P(Z < -3.00) \approx 0.0013\).
4. Compare to \(\alpha = 0.05\): \(p = 0.0013 < 0.05\), so reject \(H_0\).
Conclusion: There is significant evidence that the mean bottle fill is less than 500 ml.
Problem: Given the following data, find the least squares regression line \(\hat{y} = a + bx\):
| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 4 | 4 |
| 5 | 5 |
Solution:
1. Compute means:
\[
\bar{x} = \frac{1+2+3+4+5}{5} = 3, \quad \bar{y} = \frac{2+4+5+4+5}{5} = 4
\]
2. Compute \(b\):
\[
b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{(1-3)(2-4) + (2-3)(4-4) + \dots}{(1-3)^2 + (2-3)^2 + \dots} = \frac{6}{10} = 0.6
\]
3. Compute \(a\):
\[
a = \bar{y} - b\bar{x} = 4 - (0.6)(3) = 2.2
\]
Answer: Regression line: \(\hat{y} = 2.2 + 0.6x\).
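The slope and intercept calculation can be sketched directly from the sums of cross-products and squares. The function name `least_squares_line` is an illustrative assumption; the data are the five points from the problem.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squared deviations in x
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"y_hat = {a:.2f} + {b:.2f}x")  # y_hat = 2.20 + 0.60x
```
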
The goal of linear regression is to model the relationship between a predictor variable \(x\) (independent variable) and a response variable \(y\) (dependent variable).
The least squares regression line minimizes the sum of the squared residuals (vertical distances between observed and predicted values).
The line is given by:
\[
\hat{y} = a + bx
\]
Where:
- \(b\) is the slope:
\[
b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\]
- \(a\) is the y-intercept:
\[
a = \bar{y} - b\bar{x}
\]
Example Calculation:
Suppose you have the following data:
| \(x\) | \(y\) |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
Compute means:
\[
\bar{x} = \frac{1 + 2 + 3}{3} = 2, \quad \bar{y} = \frac{2 + 3 + 5}{3} \approx 3.33
\]
Compute \(b\):
\[
b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{(1-2)(2-3.33) + (2-2)(3-3.33) + (3-2)(5-3.33)}{(1-2)^2 + (2-2)^2 + (3-2)^2} = \frac{3}{2} = 1.5
\]
Compute \(a\):
\[
a = \bar{y} - b\bar{x} = 3.33 - (1.5)(2) = 0.33
\]
Final Regression Line:
\[
\hat{y} = 0.33 + 1.5x
\]
Example: From the regression line \(\hat{y} = 0.33 + 1.5x\):
- Slope \(1.5\): For each 1-unit increase in \(x\), the predicted \(y\) increases by 1.5.
- Intercept \(0.33\): When \(x = 0\), \(y\) is predicted to be 0.33.
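To test the significance of the regression slope (\(\beta\)), test \(H_0: \beta = 0\) against \(H_a: \beta \neq 0\) using the statistic \(t = \frac{b}{\text{SE}(b)}\) with \(n - 2\) degrees of freedom.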
ANOVA compares the means of three or more groups to determine if at least one group mean is significantly different.
The F-statistic compares the variation between groups to the variation within groups:
\[
F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}
\]
Where \(k\) is the number of groups and \(n\) is the total number of observations:
- Mean Square Between (MSB): Measures variation between group means.
\[
\text{MSB} = \frac{\text{SS}_{\text{between}}}{k-1}
\]
- Mean Square Within (MSW): Measures variation within groups.
\[
\text{MSW} = \frac{\text{SS}_{\text{within}}}{n-k}
\]
- Degrees of Freedom:
  - Between groups: \(df_{\text{between}} = k - 1\)
  - Within groups: \(df_{\text{within}} = n - k\)
  - Total: \(df_{\text{total}} = n - 1\)
If the ANOVA test rejects \(H_0\), post-hoc tests (e.g., Tukey’s HSD) identify which pairs of group means differ significantly.
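The F-statistic can be sketched from group summary statistics alone. This sketch assumes the standard identities \(\text{SS}_{\text{between}} = \sum n_i (\bar{x}_i - \bar{x})^2\) and \(\text{SS}_{\text{within}} = \sum (n_i - 1) s_i^2\), which are not stated above. The function name `anova_f_from_summary` and the input values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def anova_f_from_summary(sizes, means, variances):
    """One-way ANOVA F-statistic from per-group n_i, mean, and sample variance.

    Returns (F, df_between, df_within).
    """
    k = len(sizes)  # number of groups
    n = sum(sizes)  # total number of observations
    grand_mean = sum(n_i * m for n_i, m in zip(sizes, means)) / n
    ss_between = sum(n_i * (m - grand_mean) ** 2 for n_i, m in zip(sizes, means))
    ss_within = sum((n_i - 1) * v for n_i, v in zip(sizes, variances))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k


# Hypothetical summary data: three groups of 5 observations each
f_stat, df_between, df_within = anova_f_from_summary(
    sizes=[5, 5, 5], means=[10, 12, 14], variances=[2, 2, 2]
)
print(f_stat, df_between, df_within)  # 10.0 2 12
```

Here \(\text{SS}_{\text{between}} = 40\) and \(\text{SS}_{\text{within}} = 24\), so \(\text{MSB} = 20\), \(\text{MSW} = 2\), and \(F = 10\). Turning \(F\) into a p-value requires the F distribution, which is left to a table or a statistics library.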
Problem: Three fertilizers are tested for crop yield (in bushels per acre). The data is summarized as follows:
| Group | Sample Size (\(n_i\)) | Mean (\(\bar{x}_i\)) | Variance (\(s_i^2\)) |
|---|---|---|---|
| Fertilizer A | 10 | 20 | 4 |
| Fertilizer B | 10 | 25 | 5 |
| Fertilizer C | 10 | 30 | 6 |
Solution:
1. Hypotheses:
- \(H_0: \mu_A = \mu_B = \mu_C\)
- \(H_a\): At least one mean differs.