The field of Statistics aims to interpret large data sets that contain random variation. Baseball is a simple game that contains a high degree of randomness, and because professional baseball has been played since the 19th century, a large amount of data has been collected about players’ performance. In this class we examine key concepts in Statistics and Data Science using baseball as a motivating example. We will also discuss how newer statistics, created by sabermetric researchers, have led to additional insights, and we will learn how to use the R programming language to analyze data. Assignments will consist of weekly problem sets and a short final project. By taking this class, students will develop an understanding of key statistical concepts that will be useful for interpreting data from many fields.
Resources
Class resources: syllabus, final project guidelines, class Piazza site
Textbooks: Teaching Statistics Using Baseball, Big Data Baseball, Analyzing Baseball Data with R (optional)
R resources: R tutorial, R Markdown cheat sheet, article on using R to analyze baseball data, Learning R videos: Intro, common functions, vectors, descriptive statistics, Visualizing Univariate Data, scatter plots
Baseball resources: Basic and more detailed rules of baseball, NYT article: What Umpires Get Wrong, NYT article: Baseball’s borders
Shiny Apps: Regression app, Big League Baseball app, Single proportion app
R Markdown worksheets: Worksheet 1, Worksheet 2, Worksheet 3, Worksheet 4, Worksheet 5, Worksheet 6, Worksheet 7, Worksheet 8, Worksheet 9, Worksheet 10, Worksheet 11
Schedule
Class 1: Introduction
Class 2: Baseball statistics and an introduction to R
Class 3: Summary statistics and plots for a single batch of data
Worksheet 1
Class 4: Exploring categorical and quantitative data
Class 5: Quantifying variability
Worksheet 2
Class 6: More descriptive statistics: Percentiles, boxplots, and z-scores
Class 7: Relationships between variables
Worksheet 3
Class 8: Simple linear regression
Class 9: Linear regression continued
Worksheet 4
Class 10: Multiple linear regression
Regression Shiny App
Class 11: Data manipulation (with dplyr)
Worksheet 5
Class 12: Understanding probability using games
Big League Baseball App
Class 13: Understanding probability using games continued
Worksheet 6
NYT article: Digital Strat-O-Matic
Class 14: Tree diagrams and the binomial distribution
Class 15: Binomial and normal distributions
Worksheet 7
Class 16: Introduction to statistical inference
Class 17: Hypothesis tests on a single proportion
Single Proportion Shiny App
Class 18: Hypothesis tests for two proportions
Worksheet 8
Class 19: Hypothesis tests for two proportions and two means
Class 20: Randomization tests for two or more means
Worksheet 9
Class 21: Parametric tests for two or more means
Class 22: Hypothesis tests for two or more means and confidence intervals
Worksheet 10
Class 23: Confidence intervals
Class 24: Final project presentations
Class 25: Class presentations and review
Worksheet 11