Linear Regression, Part Seven


Estimating f with parametric models

What steps do we take to estimate f? Regardless of the type of modeling approach we choose, there are some common steps we take to estimate the function that uses X to describe Y. With any model, we observe n different data points, where n refers to the number of observations in the dataset. In our case, we have information about the 50 Uber rides we recorded data for. This dataset is called the training data, because we will use these observations to train our model to estimate f.

To provide a more specific example, let's explore how we estimate f using linear regression, which is a type of parametric model. The term parametric refers to parameters. In the case of linear regression, we can formulate an estimate of f by estimating two parameters: the intercept and the slope. You may have heard these parameters referred to as coefficients. The bivariate linear model is represented mathematically as:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

Here $\hat{y}$ indicates a prediction of $Y$ given that $X = x$.

The intercept ($\hat{\beta}_0$) is the value where the line crosses the y-axis; in other words, it is the expected value of $\hat{y}$ when $x = 0$.

The slope ($\hat{\beta}_1$) is the rate of change in $\hat{y}$ for every one-unit change in $x$. We'll visualize an example of this in a moment.
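To make these definitions concrete, suppose (purely for illustration; these are made-up numbers, not our fitted values) that $\hat{\beta}_0 = 2$ and $\hat{\beta}_1 = 1.5$, with cost measured in dollars and distance in miles. A 4-mile ride would then be predicted to cost

$$\hat{y} = 2 + 1.5 \times 4 = 8$$

dollars: the intercept contributes a baseline of \$2, and each additional mile adds \$1.50 to the predicted cost.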

We are able to estimate f with linear regression because we have made the assumption that f should take a linear form - a straight line with no curves. But what's important here is that by choosing linear regression, we have simplified the process of estimating f — our model can focus on estimating only two parameters: intercept and slope. The trade-off is that our model may not fit the data well if the true form of f is not linear.

In the case of linear regression, f is estimated using what is known as the least squares estimate (sometimes abbreviated LSE) to fit the model to the training data. We'll learn how the least squares estimate works in a later mission, but the main thing to understand here is that the result of fitting the least squares method to our data is a value for the intercept and the slope that provides the "closest" or "best" fit to the 50 data points using a straight line. We'll examine a measure of fit in the next screen that describes what the "best" fit is, but for now let's think about best fit while we look at the scatterplot we generated:
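As a preview of what that fitting step looks like in practice, here is a minimal sketch using R's built-in lm() function, which computes the least squares estimates for us. The data frame name uber_trips is an assumption for illustration; the column names distance and cost match the variables we've been working with.

```r
# Fit a bivariate linear model predicting cost from distance.
# `uber_trips` is a hypothetical data frame name; the columns
# `distance` and `cost` match the variables from our scatterplot.
fit <- lm(cost ~ distance, data = uber_trips)

# The two estimated parameters: the intercept and the slope
coef(fit)
```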

Figure: scatterplot of cost versus distance for the 50 rides.

Take a look at the pattern of the points in the scatterplot above. We observe a general pattern: with an increase in distance comes an increase in cost. With this pattern in mind, imagine drawing a straight line through the points in such a way that the line is as close as possible to all 50 points at once. With the line that fits best, some points will be above the line, some will fall below the line, and a few points might fall on the line, or very close to it. Maybe something like this?

Figure: scatterplot of cost versus distance with a rough straight line drawn through the points.

To be clear, the line drawn above is not the true regression line. It's a rough guess based on the spread of the points. If this seems difficult to imagine, don't worry, it is! Fortunately, ggplot2 can do this for us, and we don't even need to build a linear model first.

Using the geom_smooth() function from ggplot2, we can visualize a linear model on the scatterplot we previously built. We don't need to be concerned with the details of fitting this linear model just yet; we'll get to that in a later mission. For now, let's continue to build our intuition around predictive modeling.

Instructions

In this exercise, we will add a linear regression fit line to the scatterplot we built earlier in the mission. To do this, we will add the geom_smooth() layer to our plot. Include distance on the x-axis and cost on the y-axis as before. We've included the code from your original scatterplot in the display code.

  1. Generate a scatterplot that visualizes a linear regression model.
    • Add geom_smooth() to the scatterplot you built previously.
    • Within geom_smooth() enter the arguments: method = "lm", se = FALSE.

The argument se = FALSE specifies that we do not want to display confidence intervals in our plot. Once you've generated the plot, see how the fitted line compares to what you imagined.
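For reference, a minimal sketch of the full plotting code is below. As before, the data frame name uber_trips is an assumption for illustration; the aesthetics match the distance and cost variables from the original scatterplot.

```r
library(ggplot2)

# Scatterplot of cost vs. distance with a linear regression
# fit line layered on top.
# `uber_trips` is a hypothetical data frame name with columns
# `distance` and `cost`.
ggplot(data = uber_trips, aes(x = distance, y = cost)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # se = FALSE hides the confidence band
```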
