Chapter 2 Statistics and Probability
2.1 IMPORTANT Concepts to review
- Probability Basics and Random Variables
beginnings of probability: sample spaces, basic counting and combinatorial principles (not necessary to know all the ins and outs, but helpful to understand the basics for simplifying problems)
random variables
expectation
variance
covariance
- Probability Distributions
discrete & continuous
uniform, normal, poisson, binomial, geometric
- Hypothesis Testing
central limit theorem
sampling distributions
p-values
confidence intervals
type I and type II errors
- Modeling
maximum likelihood estimation
bayesian statistics
2.1.1 Key terms for Variability Metrics
Variability (also dispersion):
measures whether the data values are tightly clustered or spread out.
Deviations / Errors / Residuals:
the difference between the observed values and the estimate of location (i.e. mean / median, etc.)
Variance / mean-squared-error:
the sum of squared deviations from the mean divided by n-1, where n is the number of data values (this is the sample variance; dividing by n instead gives the population variance).
Standard deviation:
the square root of variance
Mean absolute deviation:
the mean of the absolute values of the deviations from the mean.
Percentile / Quantile:
the value such that P percent of the values take on this value or less and (100-P) percent take on this value or more.
Interquartile range / IQR:
the difference between the 75th percentile and the 25th percentile.
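The definitions above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `variability_metrics` and the five data values are hypothetical, and the quartiles use the "inclusive" interpolation method, which is only one of several common conventions.

```python
import statistics


def variability_metrics(values):
    """Compute the variability metrics defined above for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Deviations: observed value minus the estimate of location (here, the mean).
    deviations = [x - mean for x in values]
    # Sample variance: sum of squared deviations divided by n - 1.
    variance = sum(d * d for d in deviations) / (n - 1)
    # Standard deviation: square root of the variance.
    std_dev = variance ** 0.5
    # Mean absolute deviation: mean of the absolute deviations from the mean.
    mean_abs_dev = sum(abs(d) for d in deviations) / n
    # Quartiles: the 25th, 50th, and 75th percentiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "variance": variance,
        "std_dev": std_dev,
        "mean_abs_dev": mean_abs_dev,
        "iqr": q3 - q1,
    }


data = [1, 4, 4, 7, 9]  # hypothetical data values
metrics = variability_metrics(data)
print(metrics)
```

The hand-rolled variance agrees with `statistics.variance`, which also divides by n-1.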
2.2 Designing Studies
Population:
A collection of individuals or objects whose properties we want to analyze.
Sample:
A subset of the population chosen for analysis, ideally representative of it (a well-chosen sample contains most of the information about a particular population parameter).
2.2.1 Identifying variable types
Numerical vs. Categorical
numerical -> continuous or discrete? (continuous variables can take on infinitely many values within a range; discrete variables take only countable values, typically non-negative whole numbers)
categorical -> ordinal? (whether or not the levels have a natural ordering)
Associated (dependent) vs. Independent -> do the variables show a relationship with one another? When one variable is suspected of affecting another, the former is the explanatory variable and the latter is the response variable.
Confounding variables
2.2.2 Classify study type as observational or experimental
Observational studies: researchers collect data by observing, without directly interfering with how the data arise -> correlation only
Retrospective study:
when an observational study uses data from the past
Prospective study:
…. data are collected throughout the study
Experiments:
when researchers randomly assign subjects to treatments (can be causal)
2.2.3 Sampling Techniques
Probability Sampling
Random sampling:
choosing the sample purely at random, with no selection rule -> each member of the population has an equal chance of being selected.
Stratified sampling:
First divide the population into homogeneous strata (subjects within each stratum are similar to each other but differ across strata), then randomly sample from within each stratum.
- e.g. to make sure both genders are equally represented in a study, we might divide the population into males and females and then randomly sample from within each gender group
Cluster sampling:
Divide the population into heterogeneous clusters (subjects within a cluster differ from one another, but the clusters are similar to each other) -> randomly sample a few clusters and include all subjects in the sampled clusters
Multistage sampling:
adds one step to cluster sampling: randomly sample observations from WITHIN each selected cluster
Systematic sampling
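The probability sampling schemes listed above can be sketched in a few lines of Python. This is an illustrative sketch, not a survey-grade implementation: all function names and the toy population are hypothetical. Since the notes do not define systematic sampling, the sketch assumes the usual definition (take every k-th unit after a random start).

```python
import random


def simple_random_sample(population, k, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(population, k)


def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Group units into strata, then sample randomly within each stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick a few clusters at random and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def multistage_sample(clusters, n_clusters, k_per_cluster, rng):
    """Pick a few clusters at random, then sample randomly within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for name in chosen for u in rng.sample(clusters[name], k_per_cluster)]


def systematic_sample(population, step, rng):
    """Assumed definition: every `step`-th unit, starting from a random offset."""
    start = rng.randrange(step)
    return population[start::step]


# Hypothetical data: 20 people tagged with a group label, split into 4 clusters of 5.
rng = random.Random(0)
people = [(i, "F" if i % 2 == 0 else "M") for i in range(20)]
clusters = {name: people[i * 5:(i + 1) * 5] for i, name in enumerate("ABCD")}

srs = simple_random_sample(people, 6, rng)
strat = stratified_sample(people, lambda p: p[1], 3, rng)
clus = cluster_sample(clusters, 2, rng)
multi = multistage_sample(clusters, 2, 2, rng)
syst = systematic_sample(people, 5, rng)
```

Note the difference in what is randomized: stratified sampling draws from every stratum, while cluster and multistage sampling draw from only a few randomly chosen clusters.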
Non-Probability Sampling
Snowball
Quota
Judgement
Convenience sample bias
occurs when individuals who are easily accessible are more likely to be included in the sample.
Non-response bias
happens when only a non-random fraction of the randomly sampled people respond to a survey -> the sample is no longer representative (the initial sample is random, but the final valid sample is not)
- e.g. when we take a random sample of individuals from Stanford, but certain groups, such as people of lower socioeconomic status, are much less likely to respond to the survey -> our sample does not represent the entire Stanford community
Volunteer response bias
occurs when the sample consists only of people who volunteer to respond because they have strong opinions on the issue (no initial random sample)
2.2.4 Principles of Experimental Design—Control, Randomize, Replicate, and Block—and their purposes
Control: compare the treatment of interest to a control group
Randomize: randomly assign subjects to treatments
Replicate: collect a sufficiently large sample, or replicate the entire study
Block: block for variables known or suspected to affect the outcome
if there are variables known or suspected to affect the response variable, first group the subjects into blocks based on these variables -> then randomize cases within each block to treatment groups
e.g. design an experiment to investigate whether energy gels make you run faster: the treatment group gets the energy gel, the control group does not. It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status: we divide our sample into pro and amateur athletes, then randomly assign the athletes within each block to the treatment and control groups, so that pro and amateur athletes are equally represented in both groups.
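The energy-gel example can be sketched as a blocked random assignment. This is a minimal illustration: the function `blocked_assignment`, the athlete identifiers, and the group sizes are all hypothetical.

```python
import random


def blocked_assignment(units, block_of, rng):
    """Randomly split each block in half between treatment and control."""
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit] = "treatment"  # gets the energy gel
        for unit in shuffled[half:]:
            assignment[unit] = "control"  # does not get the energy gel
    return assignment


# Hypothetical sample: 6 pro athletes and 10 amateurs, identified by (status, id).
athletes = [("pro", i) for i in range(6)] + [("amateur", i) for i in range(10)]
rng = random.Random(0)
groups = blocked_assignment(athletes, lambda athlete: athlete[0], rng)

treated_pros = sum(1 for a, g in groups.items() if a[0] == "pro" and g == "treatment")
treated_amateurs = sum(1 for a, g in groups.items() if a[0] == "amateur" and g == "treatment")
print(treated_pros, treated_amateurs)
```

Because the shuffle happens inside each block, half of the pros and half of the amateurs end up in each group regardless of the random seed.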
Blocking variable vs. Explanatory variable
Explanatory variables (factors) are conditions we impose on our experimental units.
Blocking variables are characteristics that the experimental units come with (and that may cause units to respond differently to the treatment).
Other terminologies
Placebo:
a fake treatment, often given to the control group in medical studies
Placebo effect:
when experimental units show improvement just because they believe they’re receiving a special treatment
Blinding:
when experimental units DO NOT know whether they are in the control or treatment group.
Double-blind study:
when BOTH the experimental units and the researchers DO NOT know who is in the control or treatment group.
Experimental Design Workflow:
Control any possible confounders / confounding variables (non-explanatory factors that may influence the response)
Randomize into treatment and control groups
Replicate by using a sufficiently large sample or repeating the experiment
Block any variables that might influence the response
* Stratified sampling allows for controlling for possible confounders in the sampling stage, while blocking allows for controlling for such variables during random assignment.
2.2.5 Random Sampling vs. Random Assignment
If random sampling has been employed in data collection, the results should be generalizable to the target population (but still NOT causal).
- WHY? -> if subjects are randomly selected from the population, then each subject in the population is equally likely to be selected, so the resulting sample is likely representative of the population.
If random assignment has been employed in study design, the results suggest causality.
- WHY? -> in our sample, subjects usually exhibit slightly different characteristics from one another. Through random assignment, we ensure that these different characteristics are represented roughly equally in the treatment and control groups -> this allows us to attribute any observed difference between the treatment and control groups to the treatment, since the groups are otherwise essentially THE SAME.
A study that relies on volunteers and employs random assignment (an experiment), but NOT random sampling, can be used to make causal conclusions, but these ONLY apply to the sample (so results cannot be generalized).
A study that uses NO random assignment, but DOES use random sampling, is a typical observational study. Results can ONLY be used to make correlational statements, but they CAN be generalized to the population at large.
A study that uses NEITHER random assignment NOR random sampling can ONLY be used to make correlational statements, and these conclusions are NOT generalizable. This is a non-ideal observational study.
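The argument for random assignment in this section can be illustrated with a small simulation. This is a sketch under stated assumptions, not a result from the notes: the subject characteristic (age), its distribution, and the self-selection rule are all hypothetical. It compares how well a characteristic is balanced between groups when subjects are assigned by coin flip versus when older subjects tend to opt into the treatment themselves.

```python
import random
import statistics


def mean_gap(ages, in_treatment):
    """Difference in mean age between the treatment and control groups."""
    treated = [a for a, t in zip(ages, in_treatment) if t]
    control = [a for a, t in zip(ages, in_treatment) if not t]
    return statistics.fmean(treated) - statistics.fmean(control)


rng = random.Random(42)
# Hypothetical sample: 10,000 subjects whose ages vary around 40.
ages = [rng.gauss(40, 10) for _ in range(10_000)]

# Random assignment: a coin flip decides each subject's group.
coin_flips = [rng.random() < 0.5 for _ in ages]
random_gap = mean_gap(ages, coin_flips)

# Self-selection (assumed rule): older subjects are more likely to opt in.
self_selected = [age + rng.gauss(0, 5) > 40 for age in ages]
selected_gap = mean_gap(ages, self_selected)

print(f"age gap with random assignment: {random_gap:.2f}")
print(f"age gap with self-selection:    {selected_gap:.2f}")
```

Under random assignment the two groups have nearly the same mean age, so age cannot explain a difference in outcomes. Under self-selection the groups differ systematically in age, which makes age a potential confounder.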
2.2.7 Statistical Significance and p-values
p-value:
Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results at least as unusual or extreme as the observed results.
Alpha:
The probability threshold of "unusualness" that chance results must surpass for actual outcomes to be deemed statistically significant (results are called significant when the p-value falls below alpha).
Type 1 error (false-positive):
Mistakenly concluding an effect is real (when it is due to chance), i.e. rejecting H0 when it is actually true.
Type 2 error (false-negative):
Mistakenly concluding an effect is due to chance (when it is real), i.e. failing to reject H0 when it is actually false.
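One concrete "chance model that embodies the null hypothesis" is a permutation test: if the treatment has no effect, the group labels are arbitrary, so shuffling them shows how large a difference chance alone produces. The sketch below is one possible illustration, not the only way to compute a p-value; the function name, the measurements, and the choice of alpha = 0.05 are assumptions.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations, rng):
    """Estimate a two-sided p-value for the difference in group means."""
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels carry no information, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical measurements for a treatment group and a control group.
treatment = [12.1, 11.8, 12.5, 13.0, 12.7, 12.9]
control = [10.2, 10.9, 11.1, 10.5, 10.8, 11.0]

ALPHA = 0.05  # assumed significance threshold
rng = random.Random(0)
p_value = permutation_p_value(treatment, control, 5000, rng)
significant = p_value < ALPHA
print(f"p-value ~ {p_value:.4f}, significant at alpha={ALPHA}: {significant}")
```

If H0 were actually true here, declaring significance would be a Type 1 error. Alpha is the rate at which such errors are tolerated when the null hypothesis holds.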