Linear Regression Sixth


Error

As mentioned briefly when we learned about prediction, it is nearly impossible for a model to be 100% accurate because of error. Error is the deviation of an observed value from the unobservable true value of the quantity of interest. Specifically, the accuracy of our prediction for cost depends on two types of error: reducible error and irreducible error.

In this example, we can minimize reducible error in our linear regression model by choosing the predictor (e.g. distance) that provides the best estimate of cost. With the data available to us here, we can't do any better than choosing the distance variable. Another way we could potentially reduce error is to use a different statistical model that provides a better estimate of f, the true relationship between the predictors and cost. But first we need to master linear regression, so we'll stick to that approach for now!
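To make "choosing the predictor that provides the best estimate" concrete, here is a minimal sketch that fits a one-predictor least-squares line to each candidate and compares the mean squared error. The data is synthetic and purely an assumption for illustration (it is not the lesson's dataset), and the names `fit_line`, `mean_squared_error`, `day`, and `best_predictor` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between each observed y and the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption): cost depends on distance, not on the day of the ride.
rng = random.Random(0)
distance = [rng.uniform(1, 20) for _ in range(200)]
day = [rng.uniform(0, 365) for _ in range(200)]
cost = [3.0 + 1.5 * d + rng.gauss(0, 2.0) for d in distance]

# Fit one line per candidate predictor and record its error.
mse_by_predictor = {}
for name, xs in {"distance": distance, "day": day}.items():
    slope, intercept = fit_line(xs, cost)
    mse_by_predictor[name] = mean_squared_error(xs, cost, slope, intercept)

# The predictor with the smallest error gives the best linear estimate of cost.
best_predictor = min(mse_by_predictor, key=mse_by_predictor.get)
print(mse_by_predictor, best_predictor)
```

On real data the gap between predictors may be far less clear-cut than in this constructed example, so the comparison is a starting point rather than a verdict.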

Sources of irreducible error include variables that are not measured but contain useful information for predicting Y (here, cost). In our case, these unmeasured variables could be driver habits, traffic levels, weather conditions, time of day, surge pricing, or road construction. If we don't measure an input that is useful for predicting Y, our model can't use it! Irreducible error is independent of X and cannot be predicted using X.
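One common way to write the two kinds of error down is sketched here. The symbols are assumptions added for illustration: ε stands for the irreducible error, and the hats mark our estimates, with the prediction given by Ŷ = f̂(X). Holding X and f̂ fixed and averaging over ε:

```latex
Y = f(X) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0

\mathbb{E}\big[(Y - \hat{Y})^2\big]
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}
```

The first term shrinks as our estimate of f improves. The second term does not depend on our estimate at all.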

Unmeasurable variation also contributes to error. Another term for this is random noise. In our case, examples might include the driver's mood on a particular day, or the driver's ability to find a suitable place to stop at the dropoff location. So even if we are able to bring our reducible error to zero, the accuracy of any model's predictions will always be bounded by the amount of irreducible error present. Irreducible error is out of our control. To build an accurate model, our goal is to estimate f in a manner that minimizes reducible error for the particular statistical technique that we choose.
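A small simulation can illustrate this bound. In the sketch below, everything is assumed for illustration: `true_f` is a made-up relationship between distance and cost, and `NOISE_SD` is a made-up noise level. Even when we predict with the true f itself, the mean squared error stays near the noise variance instead of dropping to zero.

```python
import random

def true_f(distance):
    """Hypothetical true relationship between distance and cost (assumed)."""
    return 3.0 + 1.5 * distance

NOISE_SD = 2.0  # assumed standard deviation of the irreducible error
rng = random.Random(42)

# Simulated rides: cost is the true relationship plus random noise.
distances = [rng.uniform(1, 20) for _ in range(20000)]
costs = [true_f(d) + rng.gauss(0, NOISE_SD) for d in distances]

def mse(predict):
    """Mean squared error of a prediction function over the simulated rides."""
    return sum((c - predict(d)) ** 2 for d, c in zip(distances, costs)) / len(costs)

# Predicting with the true f: reducible error is zero, only noise remains.
mse_perfect = mse(true_f)

# Predicting with a worse estimate of f: reducible error is added on top.
mse_biased = mse(lambda d: 2.0 + 1.2 * d)

print(mse_perfect, NOISE_SD ** 2, mse_biased)
```

Here `mse_perfect` should land close to `NOISE_SD ** 2`, while `mse_biased` is larger because a poorer estimate of f adds reducible error on top of the noise.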

Which x-axis variable (date or distance) from the two plots that we generated earlier will likely produce the greater amount of error if we apply a linear regression model to estimate cost?

Figure: cost vs. date and cost vs. distance

Instructions

  1. Examine the two plots we generated earlier and select the plot that shows the x-axis variable (date or distance) that will likely result in the greater amount of error if we perform a linear regression to estimate cost.
    • Assign either the value 'scatter_plot' or 'line_chart' to the variable greater_error.
