Linear Regression Eighth

2 minute read

Residuals

Below is the scatterplot with the linear regression line that we generated on the last screen with ggplot2. Titles have been added for clarity. In this example, we see that a linear model is a reasonable choice to explain the relationship between distance and cost. How does this compare to the trend line you imagined?

Uber linear model

Recall that intercept refers to the value on the y-axis where X=0. Slope is the rate of change in Y for every single unit change in X. In this example, the intercept is estimated at 5.85. In other words, if we call an Uber but don't go anywhere with the driver (a distance of 0 miles), we can expect to pay about $5.85. A word of caution here that a trip distance of 0 falls outside the range of our observed data. In general we should refrain from using model coefficients to describe values outside of the observed range.

The slope is estimated at 1.55, meaning that we can expect our Uber trip cost to increase by about $1.55 for every additional mile we travel. We have estimates for these two coefficients because we performed a linear regression of cost onto distance.
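A fit like this can be sketched in R with `lm()`. The block below uses simulated stand-in data (not the course's actual `uber_trips` dataset), generated so that the true intercept and slope match the estimates discussed above; the column names `distance` and `cost` are assumptions.

```r
# Simulated stand-in for the uber_trips dataset: 50 trips whose true
# relationship is cost = 5.85 + 1.55 * distance plus random noise.
set.seed(1)
uber_trips <- data.frame(distance = runif(50, 0.5, 20))
uber_trips$cost <- 5.85 + 1.55 * uber_trips$distance + rnorm(50, sd = 1.5)

# Regress cost onto distance and inspect the two coefficient estimates.
fit <- lm(cost ~ distance, data = uber_trips)
coef(fit)  # "(Intercept)" and "distance" (the slope)
```

With real data the estimates will differ, but `coef(fit)` is where the intercept and slope reported above come from.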

To build a linear model from our uber_trips dataset, we pass the dataset to R's linear modeling function. The algorithm is "trained" on the 50 observations and produces a prediction of Y for every input X. We can then check these predicted outputs against the actual values we observed for cost. The predicted values represent our model's best estimate of Y using X. The difference between an observed value and the model's prediction is called a residual. We can visualize the residuals like this:
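The observed-minus-predicted definition can be checked directly in R: subtracting `predict(fit)` from the observed costs reproduces what `residuals(fit)` returns. As before, the data here is a simulated stand-in, not the course dataset.

```r
# Simulated stand-in data with a known linear relationship plus noise.
set.seed(1)
uber_trips <- data.frame(distance = runif(50, 0.5, 20))
uber_trips$cost <- 5.85 + 1.55 * uber_trips$distance + rnorm(50, sd = 1.5)
fit <- lm(cost ~ distance, data = uber_trips)

# Residual = observed cost minus the model's predicted cost.
manual <- uber_trips$cost - predict(fit)

# residuals() stores exactly these values on the fitted model object.
isTRUE(all.equal(unname(manual), unname(residuals(fit))))
```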

Residuals

In the scatterplot above, the residuals are represented by the blue lines that connect the observations to the fit line. These blue lines represent the distance on the y-axis that the observed value differs from the predicted value. We can calculate the residual for every point in our dataset and use these values to assess the accuracy of our model. If our model does a good job of predicting trip cost for every trip distance traveled, then our residuals will be relatively small. On the other hand, if our model does not predict trip cost well, then our model is a poor estimator and the residuals will be relatively large.
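A plot like the one described above can be sketched with ggplot2 by drawing a blue segment from each observation down (or up) to its predicted value. The data is again a simulated stand-in; the title and axis labels are placeholders.

```r
library(ggplot2)

# Simulated stand-in data and fitted model, as in the earlier sketches.
set.seed(1)
uber_trips <- data.frame(distance = runif(50, 0.5, 20))
uber_trips$cost <- 5.85 + 1.55 * uber_trips$distance + rnorm(50, sd = 1.5)
fit <- lm(cost ~ distance, data = uber_trips)
uber_trips$predicted <- predict(fit)

# Blue segments connect each observed point to the fit line: the residuals.
p <- ggplot(uber_trips, aes(x = distance, y = cost)) +
  geom_segment(aes(xend = distance, yend = predicted), colour = "blue") +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  geom_point() +
  labs(title = "Residuals of the Uber cost model",
       x = "Distance (miles)", y = "Cost ($)")
p
```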

It can be useful to visualize the residuals to see where our model performs poorly or well. For example, looking at the plot above, we see that our model generally underestimates the true cost of Uber trips that are one mile or less.

But visualizing the residuals does not allow us to quantify the quality of the fit. We need a summary measure that quantifies the extent to which the predicted trip cost matches the true trip cost for a given trip. Fortunately, statisticians have developed various summary measures that take the residuals from our model and transform them into a single value representing the predictive ability of our model. We will work in-depth with a few of these methods later in the course, but for now, let's focus on the simplest regression error metric: mean absolute error (MAE).
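MAE is simply the average of the absolute residuals, so it is a one-liner once the model is fit. The block below computes it on the same simulated stand-in data used in the earlier sketches; the resulting value is in the units of the response, here dollars.

```r
# Simulated stand-in data and fitted model.
set.seed(1)
uber_trips <- data.frame(distance = runif(50, 0.5, 20))
uber_trips$cost <- 5.85 + 1.55 * uber_trips$distance + rnorm(50, sd = 1.5)
fit <- lm(cost ~ distance, data = uber_trips)

# Mean absolute error: the average size of the residuals, in dollars.
mae <- mean(abs(residuals(fit)))
mae
```

Because MAE is in dollars, it has a direct reading: on average, the model's predicted trip cost misses the observed cost by about `mae` dollars.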
