Linear Regression Third
General Model form
Did you see any variables in the uber_trips data that might be useful for predicting Uber trip cost? Recall that our goal is to build an accurate model that predicts the cost of a future trip with Uber. To build a model, we provide an input variable to explain, or predict, an output variable.
An input — or input variable — is also sometimes referred to as a predictor, independent variable, feature, attribute, descriptor, or simply variable. Throughout these missions on linear modeling, we will generally use the terms input variable, predictor variable, or independent variable.
An output, or output variable, is also known as a dependent variable, outcome, response variable, response, target, or class. Throughout these missions on linear modeling, we will generally use the terms output variable, response variable, or dependent variable.
The general form of a model that performs such a prediction can be represented as:
$$Y = f(X) + \epsilon$$
In this context:
- $X$ represents a set of inputs.
- $Y$ represents a set of outputs.
- $\epsilon$ represents the error term.
The random error term $\epsilon$ is independent of $X$ and has a mean of approximately zero. We'll learn more about the error term in a later screen. For now, let's discuss terminology about inputs and outputs.
Returning to the formula above, $f$ is a precise function that represents the information $X$ provides about $Y$. In the process of modeling we rarely know $f$, so we observe values of $X$ to make predictions about $Y$. Estimates of $f$ may use more than one predictor variable, in which case the variables are represented as $X_1, X_2, \ldots, X_p$, where $p$ refers to the total number of predictors. For learning purposes we will focus on a single predictor variable for now, so we only need to be concerned with the single predictor variable $X$.
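To make the roles of the input, the function, and the error term concrete, here is a minimal simulation sketch in base R of the form $Y = f(X) + \epsilon$. Everything in it is an assumption made for illustration: the linear shape of `f`, its coefficients, the range of `x`, and the error standard deviation are made up and do not come from the uber_trips data. In real modeling, $f$ is not known in advance.

```r
# Toy simulation of Y = f(X) + epsilon (all values here are made up).
set.seed(42)                      # make the random draws reproducible

n <- 1000
x <- runif(n, min = 1, max = 20)  # hypothetical input, e.g. a trip distance

# Hypothetical "true" f; in real modeling this function is unknown.
f <- function(x) 3 + 1.5 * x

epsilon <- rnorm(n, mean = 0, sd = 2)  # random error, drawn independently of x
y <- f(x) + epsilon                    # observed output

mean(epsilon)    # close to zero, but not exactly zero in a finite sample
cor(x, epsilon)  # close to zero because epsilon was drawn independently of x
```

Because the error is generated separately from `x`, its sample mean and its correlation with `x` should both be small, which is what the two properties of the error term describe.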
In our example above, $Y$ represents the cost of an Uber trip, which is considered a quantitative response to an input variable. Bivariate regression can be performed with pairs of variables measured on a ratio or interval scale. Which variables from our uber_trips data are either ratio or interval? The only variable that is not on an interval or ratio scale is destination.
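One rough way to screen for interval or ratio variables is to look at column types. The sketch below uses a small made-up data frame, `toy_trips`, as a stand-in. Its values are invented, and the column types of the real uber_trips data are an assumption here and may differ.

```r
# Made-up stand-in for uber_trips; the real data's columns and types may differ.
toy_trips <- data.frame(
  date        = as.Date(c("2019-01-05", "2019-01-12", "2019-01-19")),
  distance    = c(3.2, 7.5, 1.8),
  cost        = c(9.50, 18.25, 6.75),
  destination = c("airport", "office", "home"),
  stringsAsFactors = FALSE
)

# Numeric and Date columns are candidates for interval or ratio scale;
# character columns such as destination hold nominal labels.
is_quantitative <- sapply(
  toy_trips,
  function(col) is.numeric(col) || inherits(col, "Date")
)

is_quantitative                      # one TRUE/FALSE flag per column
names(toy_trips)[!is_quantitative]   # columns that are not interval or ratio
```

Column type is only a heuristic: a numeric ID column, for example, would pass this check even though it is a nominal label, so the result still needs a human look.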
Let's plot our data to see if any of our variables might be suitable to predict the cost of future Uber trips.
# Load ggplot2; df and the variable names below are placeholders
library(ggplot2)

# Line plot example
ggplot(data = df,
       aes(x = independent_variable, y = dependent_variable)) +
  geom_line()

# Scatterplot example
ggplot(data = df,
       aes(x = independent_variable, y = dependent_variable)) +
  geom_point()
Generate two plots with ggplot2. Include cost on the y-axis in each plot, because we are treating this as the dependent variable. In each case, a simple exploratory plot is fine. There is no need to customize axis labels or chart titles.
- Generate a line plot with cost on the y-axis and date on the x-axis.
- Generate a scatter plot with cost on the y-axis and distance on the x-axis.
Take a look at each plot. Do you notice a relationship between cost and date, or cost and distance?