Linear Regression Fourth
Prediction
Looking at the two plots we generated, we observe that there does not appear to be any obvious relationship between cost and date, but there is a relationship between cost and distance. We've updated the titles, but these plots are otherwise identical to what we generated in the last exercise:
Let's break down each plot individually. With regard to date, suppose we take an Uber trip from our home to the same destination every day. We would expect this trip to cost roughly the same each time. Sure, there could be some day-to-day variation because of traffic levels, weather, or driver habits, for example, but overall we would expect the costs to be relatively consistent. The line chart we generated does not demonstrate any sort of stability or trend, so something else must be influencing trip cost.
On the other hand, looking at the scatterplot of cost and distance, we see that, in general, longer trips cost more than shorter ones. This is an example of a situation where two variables appear to have some sort of relationship. We may be able to build a model that provides a reasonable estimate of cost based on distance. As stated previously, our goal is to build an accurate model that predicts trip cost on the basis of trip distance. Is prediction the only reason that we would want to build a model?
Modeling is generally performed for one of two purposes: prediction or inference. Let's begin with prediction.
If our primary purpose in building a model is to generate accurate predictions, we aren't too concerned if the functional form of cost explained by distance is unknown. In other words, rather than understanding the intricacies of $f$, the function that relates distance to cost, our primary concern is that our model gives us accurate predictions of cost for each input $X$ (here, distance). This is the main point to convey about prediction!
In the real world, we will often encounter situations where we have inputs $X$ available, but we do not have information available for the output $Y$. Take, for example, our hypothetical situation of living in Brooklyn and frequently using Uber to get around.
Suppose, in an effort to reduce trip cost, we are planning our routes for the week to minimize the distance traveled from location to location. We can estimate the cost $Y$ of each planned trip of distance $X$ based on the data we have previously collected. In this situation, we predict $Y$ with:
$$\hat{Y} = \hat{f}(X)$$
We can omit the error term because it averages to zero. Recall from earlier courses that the "hat" symbol indicates an estimate. We can formulate an estimate of $Y$ with an estimate of $f$ using our inputs $X$. Technically, we are also estimating $X$ (our drivers could take a different route than the online map service suggested), but let's assume that our routes are traveled exactly as planned. So, if we know $X$, does that mean our estimates of $Y$ will be exact? That is rarely the case. Why? Because of error! We'll discuss error in a moment, but first, let's learn about inference.
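Before moving on, here is a minimal sketch of what predicting cost from a planned distance could look like in Python. It rests on several assumptions that are not part of this lesson: the data are simulated rather than taken from our Uber trips, the estimate of f is a straight line fitted with NumPy's `polyfit`, and the names `distance`, `cost`, and `predict_cost` are hypothetical.

```python
import numpy as np

# Hypothetical data, NOT the course's Uber dataset: trip distances (miles)
# and costs (dollars) simulated from an assumed linear rule plus random error.
rng = np.random.default_rng(seed=0)
distance = rng.uniform(1.0, 10.0, size=200)
noise = rng.normal(loc=0.0, scale=1.5, size=200)  # error term with mean zero
cost = 3.0 + 2.0 * distance + noise

# Estimate f with a straight line. This is an assumption made for the sketch;
# the true relationship could take some other form.
slope, intercept = np.polyfit(distance, cost, deg=1)


def predict_cost(planned_distance):
    """Return the estimated cost for one or more planned trip distances."""
    return intercept + slope * np.asarray(planned_distance, dtype=float)


# Predict the cost of three planned trips from their distances alone.
planned = [2.5, 4.0, 7.5]
predicted = predict_cost(planned)

# Residuals: how far the observed costs fall from the fitted line.
residuals = cost - predict_cost(distance)

print("predicted costs:", np.round(predicted, 2))
print("mean residual:", round(float(residuals.mean()), 6))
print("residual spread:", round(float(residuals.std()), 2))
```

The residuals average out to essentially zero, which is why the error term can be left out of the prediction, yet their spread is not zero. Even with the distance known exactly, an individual trip's cost will typically differ from its predicted value.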