Chapter 2 Statistics and Probability

2.1 IMPORTANT Concepts to review

Probability Basics and Random Variables
- probability foundations: sample spaces, basic counting and combinatorial principles (not necessary to know all the ins and outs, but helpful to understand the basics for simplifying problems)
- random variables
  - expectation
  - variance
  - covariance
Probability Distributions
- discrete & continuous
- uniform, normal, Poisson, binomial, geometric
Hypothesis Testing
- central limit theorem
- sampling distributions
- p-values
- confidence intervals
- type I and type II errors
Modeling
- maximum likelihood estimation
- Bayesian statistics

2.1.1 Key terms for Variability Metrics

Variability (also dispersion) measures whether the data values are tightly clustered or spread out.

Deviations / Errors / Residuals: the differences between the observed values and the estimate of location (e.g. mean or median).

Variance / mean-squared-error: the sum of squared deviations from the mean divided by n-1, where n is the number of data values (this is the sample variance; dividing by n instead gives the population variance).

Standard deviation: the square root of the variance.

Mean absolute deviation: the mean of the absolute values of the deviations from the mean.

Percentile / Quantile: the value such that P percent of the values take on this value or less and (100-P) percent take on this value or more.

Interquartile range / IQR: the difference between the 75th percentile and the 25th percentile.
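
The definitions above can be checked on a small data set. The following is a minimal sketch using only the Python standard library; the data values and the name `variability_summary` are made up for illustration, and the quartiles use the "inclusive" interpolation rule, which is only one of several common percentile conventions.

```python
import statistics

def variability_summary(values):
    """Compute the variability metrics defined above for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    deviations = [x - mean for x in values]               # deviations from the mean
    variance = sum(d * d for d in deviations) / (n - 1)   # sample variance (n-1)
    std_dev = variance ** 0.5                             # square root of variance
    mean_abs_dev = sum(abs(d) for d in deviations) / n    # mean absolute deviation
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "variance": variance,
        "std_dev": std_dev,
        "mean_abs_dev": mean_abs_dev,
        "iqr": q3 - q1,
    }

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = variability_summary(data)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The hand-written variance should agree with `statistics.variance(data)`, which also divides by n-1.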

2.2 Designing Studies

Population: The complete collection of individuals or objects whose properties we want to analyze.

Sample: A subset of the population chosen to be analyzed, ideally a representative one (a well-chosen sample contains most of the information about a particular population parameter).

2.2.1 Identifying variable types

Numerical vs. Categorical
- numerical -> continuous or discrete? (continuous variables can take any value within a range; discrete variables take only countable values, typically non-negative whole numbers)
- categorical -> ordinal? (whether or not the levels have a natural ordering)
Explanatory (independent) vs. Response (dependent) -> does the variable show a relationship with other vars? Two variables that show a relationship are associated (dependent); otherwise they are independent.
Confounding variables

2.2.2 Classify study type as observational or experimental

Observational studies: researcher collects data by observing but not directly interfering with how data arise —> correlation only
- Retrospective study: when an observational study uses data from the past
- Prospective study: when an observational study collects data as the study unfolds
Experiments: when researchers randomly assign subjects to treatments (can be causal)

2.2.3 Sampling Techniques

Probability Sampling
- Random sampling: choosing the sample purely at random -> each member of the population has an equal chance of being selected in the sample.
- Stratified sampling: first divide the population into homogeneous strata (subjects within each stratum are similar to each other but differ across strata), then randomly sample from within each stratum.
  - e.g. to make sure both genders are equally represented in a study, we might divide the population into males and females and then randomly sample from within each gender group
- Cluster sampling: divide the population into heterogeneous clusters (subjects within a cluster differ from each other, but the clusters are similar to one another) -> randomly sample a few clusters and include every observation in them
  - Multistage sampling adds one more step to cluster sampling: randomly sample observations from WITHIN each selected cluster
- Systematic sampling
Non-Probability Sampling
- Snowball
- Quota
- Judgement
- Convenience sample bias occurs when individuals who are easily accessible are more likely to be included in the sample.
- Non-response bias happens when only a non-random fraction of the randomly sampled people respond to a survey -> the sample is no longer representative (the initial sample is random but the final valid sample is not)
  - e.g. when we take a random sample of individuals from Stanford, but certain groups, such as people of lower socioeconomic status, are much less likely to respond to the survey -> our sample is not representative of the entire Stanford community
- Volunteer response bias occurs when the sample consists only of people who volunteer to respond because they have strong opinions on the issue (no initial random sample)
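
The probability sampling schemes listed above (random, stratified, cluster, multistage) can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the population of 40 people, the field names `gender` and `dorm`, and all function names are hypothetical, and each stratum or cluster is assumed to be large enough for the requested sample size.

```python
import random
from collections import defaultdict

def group_by(population, key):
    """Group units into lists keyed by key(unit)."""
    groups = defaultdict(list)
    for unit in population:
        groups[key(unit)].append(unit)
    return groups

def simple_random_sample(population, k, rng):
    # Every unit has the same chance of being selected.
    return rng.sample(population, k)

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    # Randomly sample the same number of units from within each stratum.
    sample = []
    for members in group_by(population, stratum_of).values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample

def cluster_sample(population, cluster_of, n_clusters, rng):
    # Randomly pick whole clusters and keep every unit in them.
    clusters = group_by(population, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def multistage_sample(population, cluster_of, n_clusters, k_per_cluster, rng):
    # Pick clusters at random, then sample units WITHIN each chosen cluster.
    clusters = group_by(population, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], k_per_cluster))
    return sample

rng = random.Random(0)  # seeded so the sketch is reproducible
# Hypothetical population: 40 people spread over 4 dorms of 10 each.
population = [
    {"id": i, "gender": "F" if i % 2 == 0 else "M", "dorm": f"D{i // 10}"}
    for i in range(40)
]
by_gender = stratified_sample(population, lambda u: u["gender"], 5, rng)
two_dorms = cluster_sample(population, lambda u: u["dorm"], 2, rng)
two_stage = multistage_sample(population, lambda u: u["dorm"], 2, 3, rng)
print(len(by_gender), len(two_dorms), len(two_stage))
```

Stratifying on `gender` guarantees equal representation of both groups, whereas cluster sampling on `dorm` trades that guarantee for the convenience of visiting only a few dorms.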

2.2.4 Principles of Experimental Design—Control, Randomize, Replicate, and Block—and their purposes

Control — compare treatment of interest to a control group
Randomize — randomly assign subjects to treatments
Replicate — collect a sufficiently large sample, or replicate the entire study
Block — block for variables known or suspected to affect outcome
- if there are variables known or suspected to affect the response variable, first group the subjects into blocks based on these variables -> then randomize cases within each block to treatment groups
- e.g. design an experiment to investigate whether energy gels make you run faster: the treatment group gets the energy gel, the control group does not. It is suspected that energy gels might affect pro and amateur athletes differently, so we block for pro status: we divide our sample into pro and amateur athletes, then randomly assign the athletes within each block to treatment and control groups so that pro and amateur athletes are equally represented in both groups.
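
The energy gel example can be sketched as a blocked random assignment. This is a minimal sketch under stated assumptions: the athlete names, the block sizes (8 pros, 12 amateurs), and the function name `blocked_assignment` are hypothetical, and each block is simply split in half after shuffling.

```python
import random
from collections import defaultdict

def blocked_assignment(subjects, block_of, rng):
    """Randomly split each block in half between treatment and control."""
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]          # copy so the input order is untouched
        rng.shuffle(shuffled)          # randomize WITHIN the block
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject] = "treatment"   # gets the energy gel
        for subject in shuffled[half:]:
            assignment[subject] = "control"     # does not
    return assignment

rng = random.Random(0)  # seeded so the sketch is reproducible
# Hypothetical athletes as (name, status) tuples.
athletes = [(f"pro_{i}", "pro") for i in range(8)] + [
    (f"amateur_{i}", "amateur") for i in range(12)
]
assignment = blocked_assignment(athletes, lambda a: a[1], rng)

for status in ("pro", "amateur"):
    treated = sum(
        1 for a, group in assignment.items()
        if a[1] == status and group == "treatment"
    )
    print(status, "in treatment:", treated)
```

Because the shuffle happens inside each block, pro status cannot end up unevenly distributed between the two groups by chance.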

Blocking variable vs. Explanatory variable

Explanatory variables (factors) are conditions we impose on our experimental units.
Blocking variables are characteristics that the experimental units come with (which may cause experimental units to respond to the treatment differently).

Other terminologies

Placebo: a fake treatment, often used as the control group in medical studies
Placebo effect: when experimental units show improvement just because they believe they’re receiving a special treatment
Blinding: when experimental units DO NOT know whether they are in the control or treatment group.
Double-blind study: when BOTH the experimental units and the researchers DO NOT know who is in the control or treatment group.

Experimental Design Workflow:

Control any possible confounders / confounding variables (non-explanatory factors that may influence different responses)
Randomize into treatment and control groups
Replicate by using a sufficiently large sample or repeating the experiment
Block any variables that might influence the response

* Stratified sampling allows for controlling for possible confounders in the sampling stage, while blocking allows for controlling for such variables during random assignment.

2.2.5 Random Sampling vs. Random Assignment

If random sampling has been employed in data collection, the results should be generalizable to the target population. (but still NOT causal)
- WHY? —> if subjects are randomly selected from the population, then each subject in the population is equally likely to be selected so that the resulting sample is likely representative of the population.
If random assignment has been employed in study design, the results suggest causality.
- WHY? -> in our sample, subjects usually exhibit slightly different characteristics from one another. Through random assignment, we ensure that these different characteristics are represented about equally in the treatment and control groups -> this allows us to attribute any observed difference between the treatment and control groups to the treatment itself, since otherwise these groups are essentially THE SAME.
A study that relies on volunteers but employs random assignment (an experiment without random sampling) can be used to make causal conclusions, but these ONLY apply to the sample (so results cannot be generalized).
A study that uses NO random assignment, but DOES use random sampling, is a typical observational study. Results can ONLY be used to make correlational statements, but they CAN be generalized to the population at large.
A study that uses NEITHER random assignment NOR random sampling can ONLY be used to make correlational statements, and these conclusions are NOT generalizable. This is a non-ideal observational study.

2.2.6 Hypothesis Tests and Resampling

2.2.7 Statistical Significance and p-values

p-value: Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results at least as unusual or extreme as the observed results.

Alpha: The probability threshold of "unusualness" that chance results must surpass for actual outcomes to be deemed statistically significant.
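
One way to make the "chance model" concrete is a permutation (resampling) test: if the null hypothesis is true, group labels are interchangeable, so we can shuffle them and see how often the shuffled difference is at least as extreme as the observed one. The sketch below is illustrative only; the run times, the choice of a two-sided difference in means, and the function name `permutation_p_value` are assumptions, not something prescribed by these notes.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations, rng):
    """Estimate a two-sided p-value for the difference in group means."""
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # chance model: labels carry no information
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

rng = random.Random(0)  # seeded so the sketch is reproducible
# Hypothetical 5 km run times in minutes.
with_gel = [24.1, 23.8, 25.0, 22.9, 23.5, 24.4]
without_gel = [25.2, 24.9, 26.1, 25.7, 24.6, 25.9]

alpha = 0.05
p_value = permutation_p_value(with_gel, without_gel, 5000, rng)
print("p-value:", p_value, "significant at alpha:", p_value < alpha)
```

The result is deemed statistically significant only when the estimated p-value falls below the chosen alpha.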

Type 1 error (false-positive): Mistakenly concluding an effect is real (when it is due to chance). (i.e. reject H0 when it is actually true)

Type 2 error (false-negative): Mistakenly concluding an effect is due to chance (when it is real). (i.e. fail to reject H0 when it is actually false)
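
Both error types can be estimated by simulation. The sketch below assumes a simple setting that is not taken from these notes: normally distributed data with known standard deviation 1, a two-sided z-test of H0: mean = 0, a sample size of 30, and a hypothetical true effect of 0.5 for the Type 2 case. When H0 is true, the rejection rate should land close to alpha.

```python
import random
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, with known sigma."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_mean, n, alpha, trials, rng):
    """Fraction of simulated studies in which H0: mean == 0 is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials

rng = random.Random(0)  # seeded so the sketch is reproducible
alpha = 0.05

# H0 is true (no effect): every rejection is a Type 1 error.
type_1_rate = rejection_rate(true_mean=0.0, n=30, alpha=alpha, trials=2000, rng=rng)

# H0 is false (assumed effect of 0.5): every non-rejection is a Type 2 error.
type_2_rate = 1 - rejection_rate(true_mean=0.5, n=30, alpha=alpha, trials=2000, rng=rng)

print("estimated Type 1 error rate:", type_1_rate)
print("estimated Type 2 error rate:", type_2_rate)
```

Lowering alpha reduces the Type 1 error rate but, for a fixed sample size, raises the Type 2 error rate.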