Linear Regression Third

General Model Form

Did you see any variables in the uber_trips data that might be useful for predicting Uber trip cost? Recall that our goal is to build an accurate model that predicts the cost of a future trip with Uber. To build a model we provide an input variable to explain, or predict, an output variable.

An input — or input variable — is also sometimes referred to as a predictor, independent variable, feature, attribute, descriptor, or simply variable. Throughout these missions on linear modeling, we will generally use the terms input variable, predictor variable, or independent variable.

An output, or output variable, is also known as a dependent variable, outcome, response variable, response, target, or class. Throughout these missions on linear modeling, we will generally use the terms output variable, response variable, or dependent variable.

The general form of a model that performs such a prediction can be represented as:

Y=f(X)+ϵ

In this context:

  • X represents a set of inputs.
  • Y represents a set of outputs.
  • ϵ represents the error term.

The random error term ϵ is assumed to be independent of X and to have a mean of zero. We'll learn more about the error term in a later screen.

Returning to the formula above, f is a fixed but unknown function that represents the systematic information X provides about Y. In practice we rarely know f, so we estimate it from observed values of X and Y, and then use that estimate to make predictions about Y. Estimates of f may use more than one predictor variable, in which case the X variables are represented as X1, X2, ..., Xp, where p is the total number of predictors. For learning purposes we will focus on a single predictor variable for now, so we only need to be concerned with the single predictor X.
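
To make the formula concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the linear shape of f, its coefficients, the noise level, and the variable names are invented and do not come from the uber_trips data. Because the data are simulated, we know the "true" f and can check that an estimate built from the observed X and Y lands close to it.

```r
# Simulate Y = f(X) + epsilon with a made-up f (illustration only).
set.seed(42)
n <- 500

# Hypothetical predictor X: trip distance in arbitrary units
distance <- runif(n, min = 1, max = 20)

# Hypothetical "true" f; with real data this function is unknown
f <- function(x) 2.5 + 1.8 * x

# Random error term: drawn independently of X, with mean zero
epsilon <- rnorm(n, mean = 0, sd = 2)

# Observed response Y
cost <- f(distance) + epsilon

# Estimate f from the observed X and Y using a straight line
fit <- lm(cost ~ distance)

coef(fit)      # estimated intercept and slope, to compare with 2.5 and 1.8
mean(epsilon)  # sample mean of the simulated errors, close to zero
```

The estimated coefficients will not match the invented ones exactly, because each observed cost includes its own random error.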

In our example above, Y represents the cost of an Uber trip, which is a quantitative response to an input variable. Bivariate regression can be performed with pairs of variables measured on a ratio or interval scale. Which variables from our uber_trips data are either ratio or interval? The only variable not measured on an interval or ratio scale is destination.
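
One rough way to screen for candidate variables is to look at how R stores each column. The sketch below uses a tiny invented data frame, uber_toy, whose column names follow the ones mentioned in this mission; the values are made up, and the real uber_trips data may be structured differently.

```r
# Toy stand-in for uber_trips (all values are invented for illustration)
uber_toy <- data.frame(
  date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),
  destination = c("airport", "office", "home"),
  distance = c(12.4, 3.1, 5.8),
  cost = c(31.5, 9.2, 14.7),
  stringsAsFactors = FALSE
)

# First class of each column, e.g. "Date", "character", "numeric"
col_classes <- sapply(uber_toy, function(col) class(col)[1])
col_classes

# Columns stored as text or factors are unlikely to be interval or ratio data
candidates <- names(col_classes)[!col_classes %in% c("character", "factor")]
candidates
```

Storage class is only a hint. A numeric column can still hold codes or identifiers, so the measurement scale of each variable should be confirmed from what the variable actually records.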

Let's plot our data to see if any of our variables might be suitable to predict the cost of future Uber trips.

# Load ggplot2 before running either example
library(ggplot2)

# Line plot example (df and the variable names are placeholders)
ggplot(data = df,
       aes(x = independent_variable, y = dependent_variable)) +
  geom_line()

# Scatterplot example
ggplot(data = df,
       aes(x = independent_variable, y = dependent_variable)) +
  geom_point()

Instructions

Generate two plots with ggplot2. Include cost on the y-axis in each plot, because we are treating this as the dependent variable. In each case, a simple exploratory plot is fine. There is no need to customize axis labels or chart titles.

  1. Generate a line plot with cost on the y-axis and date on the x-axis.
  2. Generate a scatter plot with cost on the y-axis and distance on the x-axis.

Take a look at each plot. Do you notice a relationship between cost and date, or cost and distance?
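
As a sketch of what the two plots might look like in code, the example below builds them from a small invented data frame, uber_toy. The values are made up purely so the code runs on its own; with the real data you would pass uber_trips to ggplot() instead, assuming its columns are named cost, date, and distance as described above.

```r
library(ggplot2)

# Toy stand-in for uber_trips (all values are invented for illustration)
uber_toy <- data.frame(
  date = as.Date("2019-01-01") + 0:5,
  distance = c(2.1, 5.4, 3.3, 8.0, 1.2, 6.7),
  cost = c(7.5, 15.2, 10.1, 22.4, 5.9, 18.3)
)

# Line plot: cost on the y-axis, date on the x-axis
p_line <- ggplot(data = uber_toy,
                 aes(x = date, y = cost)) +
  geom_line()

# Scatter plot: cost on the y-axis, distance on the x-axis
p_scatter <- ggplot(data = uber_toy,
                    aes(x = distance, y = cost)) +
  geom_point()

# Display the plots
p_line
p_scatter
```

Storing each plot in a variable is optional; it simply makes it easy to display or compare the two plots later.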
