Grit, genes or privilege – which matters most?

What you will learn

How correlation between variables is measured and interpreted

How simple and multiple regression equations are estimated and interpreted

How and why repeated trials of a sum of random variables tend toward a normal distribution

How probability distributions for random variables are constructed from regression equations

How probability distributions are used to predict outcomes for random variables

Description

The 10th grade math class at Middletown High School is using data science tools to explain the results of last summer’s Middletown bike rally. To construct a dataset, they use the miles covered by each of the 30 riders in the rally as the dependent variable. The independent explanatory variables they choose are motivation (“grit”), aptitude (“genes”) and the amounts riders’ parents spend on their kids’ equipment and training for the rally (“privilege”). Based on answers to a questionnaire, each rider is given a grit score, a gene score and a privilege score.

The class uses several data science tools to analyse the dataset, both to determine how much of the variation in rider performance is explained by grit, genes and privilege, and to predict rider performance in next summer’s rally.

They begin by looking at the correlations between rider performance and the explanatory variables. They learn how correlation is calculated, and how to interpret strong, weak, positive and negative correlation.
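As a rough illustration of the calculation, the sketch below computes a Pearson correlation coefficient by hand. The grit scores and mileages are made-up values for five hypothetical riders, not data from the course, and the function name pearson_r is chosen here for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for five riders (assumed for illustration only).
grit = [3, 5, 6, 8, 9]
miles = [20, 26, 27, 34, 38]

r = pearson_r(grit, miles)
print(f"correlation between grit and miles: {r:.3f}")
```

A value near +1 indicates a strong positive correlation, a value near -1 a strong negative one, and a value near 0 a weak one.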

The class then performs simple regressions on rider performance using each of the explanatory variables in turn. Each regression produces an equation whose coefficient and constant describe the relationship between rider performance and grit, genes or privilege.
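A simple regression of this kind can be sketched with the ordinary least-squares formulas. The data below are made-up values for five hypothetical riders, and fit_line is a name assumed here, not one used in the course.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = constant + coefficient * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    coefficient = sxy / sxx
    # The fitted line always passes through the point of means.
    constant = mean_y - coefficient * mean_x
    return coefficient, constant

# Hypothetical values for five riders (assumed for illustration only).
grit = [3, 5, 6, 8, 9]
miles = [20, 26, 27, 34, 38]

coefficient, constant = fit_line(grit, miles)
print(f"miles = {constant:.2f} + {coefficient:.2f} * grit")
```

The coefficient is the estimated change in miles for a one-point change in the grit score, and the constant is the estimated mileage at a grit score of zero.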

The class then looks at how the R-squared value reported in each regression is calculated. R-squared measures the percentage of variation in the dependent variable that is explained by variation in the explanatory variable.
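One common way to compute R-squared is one minus the ratio of unexplained variation to total variation. The sketch below assumes a hypothetical fitted line (miles = 10.78 + 2.94 * grit) and made-up data for five riders; neither comes from the course.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions explain."""
    mean_y = sum(actual) / len(actual)
    # Total variation of the dependent variable around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    # Variation left unexplained by the regression line.
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical values for five riders (assumed for illustration only).
grit = [3, 5, 6, 8, 9]
miles = [20, 26, 27, 34, 38]

# Hypothetical regression line, assumed already estimated.
predicted = [10.78 + 2.94 * g for g in grit]

r2 = r_squared(miles, predicted)
print(f"R-squared: {r2:.3f}")
```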

To understand the combined explanatory effect of grit, genes and privilege on rider performance, the class proceeds to use multiple regression on the dataset. Multiple regression estimates coefficients and a constant for a single equation that includes all three explanatory variables.
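A multiple regression can be sketched by solving the least-squares normal equations directly. Everything below is assumed for illustration: the six riders and their scores are made up, and the mileages are generated from an invented equation (5 + 2 * grit + 1.5 * genes + 0.5 * privilege) so that the recovered constant and coefficients can be checked by eye. Statistical software would normally do this step.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_multiple(rows, ys):
    """Least-squares fit; returns [constant, coef_1, coef_2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the constant
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical (grit, genes, privilege) scores for six riders.
scores = [(3, 4, 2), (5, 6, 8), (6, 5, 4), (8, 7, 6), (9, 8, 3), (4, 9, 7)]
# Mileages generated from the invented equation described above.
miles = [5 + 2 * g + 1.5 * a + 0.5 * p for g, a, p in scores]

constant, b_grit, b_genes, b_privilege = fit_multiple(scores, miles)
print(f"miles = {constant:.2f} + {b_grit:.2f}*grit "
      f"+ {b_genes:.2f}*genes + {b_privilege:.2f}*privilege")
```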


Having estimated the equation that best explains rider performance last summer, the class then learns how the regression equation can be used to predict rider performance in next summer’s rally.

The starting point here is to understand that rider performance next summer can be seen as a random variable, because it is the sum of random variables, each represented by one of the terms of the regression equation.

The class then looks at frequency distributions that result after multiple trials of a random variable that is the sum of random variables. They see that as the number of trials increases, the distribution takes on the bell shape of the so-called normal distribution.
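The bell shape can be seen in a small simulation. The sketch below uses the sum of three dice as a stand-in for a random variable that is a sum of random variables; the dice, the seed and the number of trials are assumptions made here, not the course's own example.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so repeated runs give the same histogram
TRIALS = 100_000

def one_trial():
    """One trial: the sum of three independent dice rolls."""
    return sum(random.randint(1, 6) for _ in range(3))

# Frequency distribution: how often each total occurred.
counts = Counter(one_trial() for _ in range(TRIALS))

# Text histogram, one '#' per 500 occurrences.
for total in range(3, 19):
    bar = "#" * (counts[total] // 500)
    print(f"{total:2d} {bar}")
```

Totals near the middle of the range occur far more often than totals at either extreme, which is what gives the histogram its bell-like outline.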

Moving to the next step, the class considers how a frequency distribution can also be thought of as a probability distribution. The class learns how to build a normal probability distribution for a random variable by using the mean or expected value of the variable together with the variable’s standard error, which measures how widely multiple trials of the variable are spread around the mean value.
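As a minimal sketch, Python's standard library can build a normal distribution from just these two numbers. The mean of 30 miles and spread of 4 miles are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical mean and spread, in miles (assumed for illustration only).
MEAN_MILES = 30.0
SPREAD_MILES = 4.0

distribution = NormalDist(mu=MEAN_MILES, sigma=SPREAD_MILES)

# Probability of an outcome at or below the mean.
p_below_mean = distribution.cdf(MEAN_MILES)

# Probability of an outcome within one spread either side of the mean.
p_within_one = (distribution.cdf(MEAN_MILES + SPREAD_MILES)
                - distribution.cdf(MEAN_MILES - SPREAD_MILES))

print(f"P(miles <= {MEAN_MILES:.0f}) = {p_below_mean:.3f}")
print(f"P({MEAN_MILES - SPREAD_MILES:.0f} <= miles <= "
      f"{MEAN_MILES + SPREAD_MILES:.0f}) = {p_within_one:.3f}")
```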

The class is now ready to use the multiple regression equation to build the probability distribution for a rider’s performance next summer. For any given rider, the equation calculates the expected number of miles the rider will cover based on his or her scores. The regression also calculates the standard error of the estimate.
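The two ingredients can be sketched as follows. The constant, coefficients, Gina's scores and the sum of squared residuals are all hypothetical values, and the standard error of the estimate is computed with the usual formula: the square root of the sum of squared residuals divided by (riders - predictors - 1).

```python
from math import sqrt

# Hypothetical constant and coefficients (not estimated from real data).
CONSTANT = 5.0
COEFFICIENTS = {"grit": 2.0, "genes": 1.5, "privilege": 0.5}

def expected_miles(scores):
    """Expected miles for one rider from the regression equation."""
    return CONSTANT + sum(COEFFICIENTS[name] * scores[name]
                          for name in COEFFICIENTS)

def standard_error_of_estimate(sum_squared_residuals, n_riders, n_predictors):
    """Typical size of the gap between actual and predicted miles."""
    degrees_of_freedom = n_riders - n_predictors - 1
    return sqrt(sum_squared_residuals / degrees_of_freedom)

# Hypothetical scores for one rider.
gina = {"grit": 7, "genes": 6, "privilege": 4}
mean = expected_miles(gina)

# Hypothetical sum of squared residuals for a 30-rider, 3-variable fit.
se = standard_error_of_estimate(416.0, n_riders=30, n_predictors=3)

print(f"expected miles: {mean:.1f}, standard error of estimate: {se:.1f}")
```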

In the final stage of the analysis, the class uses probability distributions to calculate the odds of various outcomes in next summer’s rally — for example, the odds that Gina will ride more than 35 miles, or the odds that Gina will ride further than her brother Joey.
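Both kinds of question can be sketched with normal distributions. The expected mileages (30 for Gina, 28 for Joey) and the spread of 4 miles are hypothetical, and the comparison assumes that the two riders' results are independent, which may not hold for siblings.

```python
from statistics import NormalDist

# Hypothetical distributions for next summer, in miles.
gina = NormalDist(mu=30.0, sigma=4.0)
joey = NormalDist(mu=28.0, sigma=4.0)

# Chance that Gina rides more than 35 miles: the area to the right of 35.
p_gina_over_35 = 1 - gina.cdf(35.0)

# Chance that Gina rides further than Joey: the difference of two
# independent normal variables is itself normal (independence is assumed).
difference = gina - joey
p_gina_beats_joey = 1 - difference.cdf(0.0)

print(f"P(Gina > 35 miles) = {p_gina_over_35:.3f}")
print(f"P(Gina > Joey)     = {p_gina_beats_joey:.3f}")
```

Subtracting the two distributions gives a mean equal to the gap in expected miles and a spread larger than either rider's own, which is why even a rider with the higher expected mileage is far from certain to finish ahead.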

Language: English

Content

Introduction

Introduction

The Middletown Bike Rally

Day 1: The Middletown Bike Rally
Day 2: Building the rally dataset

Explaining rider performance

Day 3: Correlation
Day 4: Simple regression analysis
Day 5: Multiple regression analysis

Predicting rider performance

Day 6: Random variables and the normal distribution
Day 7: Probability and prediction
Day 8: What are the odds?
Day 9: Beating the odds