Data Science Prep Notes
2022-09-29
Chapter 1 Summary resources for DS learning
1.1 Statistics, Probability, and A/B Testing
1.1.1 Stanford Courses (NON-EDUC):
- STATS 160-Introduction to Statistical Methods
- COVERS estimation | confidence intervals | tests of hypotheses | t-tests | correlation & regression | analysis of variance and chi-square tests
- STATS 116-Theory of Probability or CS109-Introduction to Probability for Computer Scientists
- COVERS probability spaces | discrete spaces (binomial, hypergeometric, Poisson) | continuous spaces (normal, exponential) and densities | random variables | expectation | independence | conditional probability | the laws of large numbers and the central limit theorem (CLT)
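To make the central limit theorem from the topic list above concrete, here is a minimal simulation sketch using only the Python standard library. The function name `sample_means`, the choice of an exponential distribution, and the sample sizes are illustrative assumptions, not material from any of the courses listed.

```python
import random
import statistics


def sample_means(n, trials, rate=1.0, seed=0):
    """Return `trials` sample means, each taken over `n` exponential draws."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(trials)
    ]


means = sample_means(n=50, trials=2000)

# Exponential(rate=1) has mean 1 and standard deviation 1, so the CLT
# predicts sample means centred near 1 with spread near 1 / sqrt(50).
print("mean of sample means:", round(statistics.fmean(means), 3))
print("std of sample means: ", round(statistics.stdev(means), 3))
print("CLT prediction:      ", round(1 / 50 ** 0.5, 3))
```

Even though a single exponential draw is heavily skewed, the averages cluster around the true mean, and their spread shrinks roughly like 1 / sqrt(n). Try changing `n` to see this.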
1.1.2 Online Resources
Coursera:
Introduction to Probability and Data with R by Duke University (if you are already familiar with R, skip the week 3-5 sections; weeks 6 & 8 are the most useful for reviewing important probability concepts)
Coursera:
Six Sigma Advanced Define and Measure Phases: weeks 3-5 cover probability & statistics and statistical distributions
Book:
Practical Statistics for Data Scientists (available on O'Reilly; Stanford accounts have free access)
Useful website for reviewing / practicing prob & stats questions:
Brilliant
Udacity:
Introduction to A/B Testing by Google
Additional Tips: A/B Testing is NOT required by all DS/DA jobs, but if you are interested in applying for a Product Data Scientist role then it is REQUIRED. Be sure to browse DS/DA intern job descriptions (JDs) so that you know which skills are needed.
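For a feel of the kind of calculation an A/B testing course builds toward, here is a rough sketch of a two-sided two-proportion z-test for comparing conversion rates. The function name `two_proportion_ztest` and the conversion counts are made up for illustration. The pooled-variance z-test is one common textbook choice, not necessarily the method taught in the Udacity course.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 200/2000 conversions in control, 250/2000 in treatment.
z, p = two_proportion_ztest(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely under the null hypothesis. Real experiments also need a sample size fixed in advance and a guard against checking results too early.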
1.2 SQL / R / Python / Visualization Tools
SQL classes:
Stanford class:
CS145 Database Management and Data Systems
Udemy:
SQL-MySQL for Data Analytics and Business Intelligence + The Ultimate MySQL Bootcamp (these two courses should take less than a week to complete)
Other resources:
18 BEST SQL online learning resources
SQL Practices:
Hackerrank (easy), Summary of SQL in LeetCode (go for medium and hard!!)
MySQL instructions on Window functions & Frame Clause, WITH common table expressions (very useful)
Udemy Python courses:
Data Analysis with Pandas and Python + Python for Data Science and Machine Learning Bootcamp (tip: if you are already familiar with Python, I recommend going directly to Andrew Ng's ML courses, practicing your Python DS coding through hands-on projects, and regularly checking the Complete Python data science cheatsheets)
R:
Datacamp (for learning how to program in R. Most tech companies prefer Python, so there is no need to be an R expert, but R is good to learn, especially if you are also interested in doing academic research)
Tableau / Power BI:
Learning Path: Your Guide to become a Tableau Expert | Tableau Tutorial | Power BI Tutorial (online tutorial guides are sufficient for learning data viz tools because the tools are easy, but if you prefer online video lessons then the Coursera course Data Visualization and Communication with Tableau by Duke is a good choice).
Other helpful resources:
The R Gallery (website that contains useful R visualization examples and code)
R Markdown Guide (Rmd is worth learning too because it makes everything look pretty)
Collection of RStudio Cheatsheets (a collection of cheat sheets for all commonly used data manipulation and visualization packages, e.g. ggplot2, dplyr, tidyr, stringr, lubridate, shiny, etc.)
Python Data Science Cheat Sheet (Beginner)
Python Data Science Cheat Sheet (PDF) (collection of cheat sheets for popular Python libraries, including NumPy, Pandas, Seaborn, Matplotlib, Scikit-learn, and SciPy for linear algebra)
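The SQL practice notes in this section flag window functions, frame clauses, and WITH common table expressions as very useful. Here is a small sketch of all three together. It is run through Python's built-in `sqlite3` module rather than MySQL so that it needs no server. That swap is an assumption made for convenience, and window functions need SQLite 3.25 or newer. The `orders` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL);
    INSERT INTO orders VALUES
        ('ann', 1, 20.0), ('ann', 2, 35.0), ('ann', 3, 10.0),
        ('bob', 1, 50.0), ('bob', 4, 5.0);
""")

query = """
WITH ranked AS (                          -- common table expression
    SELECT customer,
           order_day,
           amount,
           SUM(amount) OVER (             -- window function
               PARTITION BY customer
               ORDER BY order_day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- frame clause
           ) AS running_total,
           ROW_NUMBER() OVER (
               PARTITION BY customer
               ORDER BY amount DESC
           ) AS amount_rank
    FROM orders
)
SELECT customer, order_day, amount, running_total
FROM ranked
WHERE amount_rank = 1                     -- keep each customer's largest order
ORDER BY customer;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With this toy data the query returns one row per customer. For ann it is her largest order (day 2, 35.0) with a running total of 55.0. For bob it is day 1, 50.0, with a running total of 50.0. The same query shape (CTE plus `ROW_NUMBER()` filter) shows up often in medium and hard LeetCode SQL problems.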
1.3 Machine Learning resources
1.3.1 Stanford Classes:
- STATS 202-Data Mining and Analysis (Terms: Aut, Sum) -> course website
- STATS 216-Introduction to Statistical Learning (Terms: Win) -> course syllabus: this is a math-light version of STATS 202
- CS 129-Applied Machine Learning (Terms: Win) -> course website: similar to the ML course by Andrew Ng on Coursera
- CS 229-Machine Learning (Terms: Aut, Win, Spr, Sum) -> course website: this is a very MATHY ML class, so if you are not comfortable doing mathematical proofs of ML theory, do not take it for credit.
USEFUL RESOURCE:
A collection/guide of Stanford AI courses under the CS/STATS departments | Stanford Grades Distribution 2020
1.3.2 Online Resources
Topics covered:
Supervised Learning:
multiple linear regression, logistic regression, neural networks, & decision trees
Unsupervised Learning:
clustering, dimensionality reduction, recommender systems
Some AI & ML innovation:
evaluating and tuning models, taking a data-centric approach to improving performance
Applied Learning Project:
Build ML models using NumPy & scikit-learn
Build & train a neural network with TensorFlow to perform multi-class classification
Build & use decision trees and tree ensemble methods, including random forests and boosted trees
Build recommender systems with a collaborative filtering approach and a content-based deep learning method
《An Introduction to Statistical Learning》(This book is also used for Stanford’s course STATS 202: Data Mining and Analysis)
Machine Learning 101 on Towards Data Science and many other articles
Other useful resources (notes):
Data Science Specialization Course Notes (notes for all 9 courses in the Coursera Data Science Specialization from JHU, taken by Xing; topics include Experimental Design / EDA / Statistical Inference / Regression Models / Practical Machine Learning, etc.)
Notes on《Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow》 (GitHub repo of notes & code [.ipynb] for the book of the same name). This book is also available on O'Reilly.
Data Science with R: A Resource Compendium (a very complete collection of data science resources with R by Martin Monkman, with topics ranging from data wrangling and Bayesian methods to time series modeling & ML methods)
Natural Language Processing Notes – Python (chapters from Python Notes for Linguistics by Alvin Chen).
Additional Tips: All resources listed under other useful resources are mainly for your reference when you need to implement certain methods, conduct a project, or review certain syntax or concepts. I personally DO NOT recommend that beginners start their learning journey with these resources, because it is much more important to first build a SOLID foundation in all the fields mentioned above through SYSTEMATIC learning.
1.4 Interview Questions & Resume
VMock Dashboard (A smart platform that rates & analyzes your resume)
FAANGPath Resume Template (a FREE tech resume template built with LaTeX on Overleaf)
DS Interview related GitHub repositories:
Data-Science-Interview-Resources | 120-Data-Science-Interview-Questions
Cracking-the-data-science-interview (this one contains almost ALL related resources for DS job preparation, but it might be overwhelming if you are just starting your DS-prep journey; it has a lot of cheat sheets on KEY topics/concepts, so I recommend using its resources selectively)