Linear Regression
Intro to Modeling - Example
To build our intuition around modeling, let's use a motivating example. In this series of missions, imagine that we live in the Brooklyn borough of New York City. We are interested in the real estate market and are using Dataquest to learn data science so that we can analyze home sales data.
This diagram shows the location of Brooklyn and the other boroughs of New York City:
We don't own a car, so one of the main expenses we incur living in Brooklyn is the cost of using Uber to get around. We've recorded data for 50 recent Uber rides and wonder whether we can use this data to build a model that predicts the cost of future rides. If we can predict the cost of an Uber ride, maybe we can figure out how to reduce our travel costs in the future.
In this series of missions we will use the terms model and predictive model interchangeably. So what, exactly, do we mean by predictive model? The book "Applied Predictive Modeling" by Max Kuhn and Kjell Johnson defines predictive modeling as the process of developing a mathematical tool or model that generates an accurate prediction. We find this definition of predictive modeling useful here.
This course focuses on linear regression modeling specifically, which is a type of statistical model for predicting, or estimating, an output based on one or more inputs. In the next few missions we will attempt to estimate the cost of an Uber trip using only a single input. This is known as bivariate regression, or simple regression.
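To preview the idea with this mission's variable names (the general form gets a proper treatment in the next screen), a bivariate linear model describes the output as a straight-line function of a single input plus some error. For example, using distance as the input:

$$
\text{cost} = \beta_0 + \beta_1 \cdot \text{distance} + \epsilon
$$

Here $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ represents the variation that the straight line cannot explain.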
Why would we choose a linear model that uses a straight line when we know there is variation in our data? Because flexibility comes at a price: the more flexible a model is, the greater the risk that it will follow the error, or noise, in the dataset too closely. If the model "works too hard" to find patterns in the data when fitting, it may not perform well when evaluating unseen data. This is known as overfitting.
We will not encounter an overfit model in this course because bivariate linear regression is, by definition, simple and therefore not prone to overfitting. However, as we build our foundational knowledge of predictive modeling, it is essential to understand overfitting and why it can sometimes be advantageous to choose a simpler model over a more complex one that risks following patterns that will not hold in unseen data.
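To make the idea concrete, here is a small self-contained R sketch with simulated data (hypothetical, not this mission's Uber dataset). It fits a straight line and a very flexible 10th-degree polynomial to the same noisy linear relationship, then compares their errors on held-out data:

```r
set.seed(1)

# Simulated data: a straight-line relationship plus noise
x <- runif(60, min = 0, max = 5)
y <- 7 + 1.5 * x + rnorm(60, sd = 1.5)

train <- data.frame(x = x[1:40],  y = y[1:40])
test  <- data.frame(x = x[41:60], y = y[41:60])

# A simple straight-line model vs. a very flexible polynomial model
fit_line <- lm(y ~ x, data = train)
fit_poly <- lm(y ~ poly(x, 10), data = train)

# Root mean squared error on the held-out test data
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, data))^2))
rmse(fit_line, test)  # stays near the noise level of the simulation
rmse(fit_poly, test)  # usually worse: the flexible model chased the noise
```

The flexible polynomial typically fits the training data more closely but predicts the unseen test data less accurately, which is overfitting in miniature.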
In the next screen we will learn more about the general form of a predictive model. But first, let's load our Uber trip data and have a look. The data contains four variables:

- `date`: Day that the Uber trip was taken
- `destination`: Specific neighborhood in Brooklyn that we traveled to
- `distance`: Total trip distance (in miles)
- `cost`: Total cost of the trip (in dollars)
This dataset was built using real Uber trip cost data from this dataset on Kaggle. The `date` and `destination` data in our dataset are fictitious, but all `distance` and associated `cost` observations are sampled from actual data. To build this dataset, we used simple random sampling to extract information for 50 "UberX" trips out of over 55,000 observations. Simple random sampling was also used to randomly generate the `date` and `destination` data. The specific neighborhood names for Brooklyn were pulled from this data on property sales in New York City, which we will work with in later missions of this course.
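For reference, simple random sampling like this takes only a line of base R. The sketch below uses a hypothetical stand-in data frame (`full_trips` and its columns are illustrative, not the actual Kaggle file):

```r
set.seed(2019)  # make the draw reproducible

# Hypothetical stand-in for the full 55,000-row Kaggle file
full_trips <- data.frame(distance = runif(55000, 0.3, 5),
                         cost     = round(runif(55000, 7, 16), 1))

# Simple random sampling: 50 rows drawn uniformly, without replacement
uber_sample <- full_trips[sample(nrow(full_trips), size = 50), ]
nrow(uber_sample)  # 50
```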
- Load the csv file `uber_trips.csv` and assign this dataframe to an object named `uber_trips`. Be sure to load the `readr` package to read in the data (see the sketch after this list).
- Once you have loaded the data, take a look at the type of information available. Do you think any of this information can be used to predict the cost of a future trip with Uber?
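Here is a minimal sketch of that first step, assuming `uber_trips.csv` is in your working directory:

```r
library(readr)

# Read the trip data into a dataframe (tibble) named uber_trips
uber_trips <- read_csv("uber_trips.csv")

# A first look at the available variables and their types
head(uber_trips)
str(uber_trips)
```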
| | date | destination | distance (miles) | cost ($) |
|---|---|---|---|---|
1 | 2019-09-02 | Bath Beach | 0.39 | 7 |
2 | 2019-09-07 | Bensonhurst | 2.79 | 8.5 |
3 | 2019-09-10 | Borough Park | 4.48 | 14 |
4 | 2019-09-13 | Brighton Beach | 2.98 | 10.5 |
5 | 2019-09-17 | Bush Terminal | 1.34 | 7.5 |
6 | 2019-09-20 | Brownsville | 0.55 | 7 |
7 | 2019-09-22 | Bushwick | 0.73 | 7 |
8 | 2019-09-24 | Bushwick | 2.8 | 9.5 |
9 | 2019-09-27 | Cobble Hill | 1 | 9.5 |
10 | 2019-10-02 | Cypress Hills | 1.22 | 7.5 |
11 | 2019-10-03 | Cobble Hill | 2.14 | 8.5 |
12 | 2019-10-05 | East New York | 1.9 | 8 |
13 | 2019-10-06 | Flatbush-North | 3.07 | 9.5 |
14 | 2019-10-07 | Downtown-Metrotech | 2.17 | 8.5 |
15 | 2019-10-08 | Downtown-Metrotech | 0.55 | 7 |
16 | 2019-10-10 | Downtown-Metrotech | 1.56 | 8 |
17 | 2019-10-12 | Downtown-Metrotech | 1.56 | 8 |
18 | 2019-10-14 | Gowanus | 2.14 | 7.5 |
19 | 2019-10-15 | Flatbush-North | 0.56 | 7 |
20 | 2019-10-16 | Downtown-Fulton Mall | 2.17 | 9.5 |
21 | 2019-10-21 | Navy Yard | 2.98 | 9.5 |
22 | 2019-10-23 | Marine Park | 3.53 | 10.5 |
23 | 2019-10-24 | Flatlands | 3.45 | 10.5 |
24 | 2019-10-25 | Kensington | 1.12 | 7.5 |
25 | 2019-10-29 | Midwood | 1.22 | 7.5 |
26 | 2019-10-31 | Ocean Parkway-North | 1 | 7.5 |
27 | 2019-11-02 | Park Slope | 2.14 | 9.5 |
28 | 2019-11-05 | Marine Park | 4.48 | 12.5 |
29 | 2019-11-07 | Red Hook | 3.04 | 9.5 |
30 | 2019-11-08 | Gravesend | 3.08 | 10.5 |
31 | 2019-11-11 | Manhattan Beach | 2.66 | 9.5 |
32 | 2019-11-12 | Red Hook | 2.84 | 10.5 |
33 | 2019-11-15 | Midwood | 2.79 | 10.5 |
34 | 2019-11-16 | Ocean Hill | 2.32 | 9 |
35 | 2019-11-20 | Park Slope | 2.17 | 8.5 |
36 | 2019-11-21 | Spring Creek | 1.34 | 8.5 |
37 | 2019-11-23 | Old Mill Basin | 3.39 | 16 |
38 | 2019-11-30 | Ocean Parkway-North | 3.34 | 13 |
39 | 2019-12-01 | Flatlands | 2.98 | 11.5 |
40 | 2019-12-05 | Williamsburg-North | 1.44 | 7.5 |
41 | 2019-12-11 | Bay Ridge | 0.55 | 7 |
42 | 2019-12-14 | Wyckoff Heights | 2.86 | 9.5 |
43 | 2019-12-15 | Sunset Park | 1.62 | 9.5 |
44 | 2019-12-17 | Park Slope South | 2.66 | 9.5 |
45 | 2019-12-18 | Williamsburg-Central | 3.22 | 11.5 |
46 | 2019-12-19 | Williamsburg-North | 1.9 | 9.5 |
47 | 2019-12-24 | Mill Basin | 2.56 | 9 |
48 | 2019-12-27 | Bushwick | 2.73 | 10.5 |
49 | 2019-12-28 | Gerritsen Beach | 2.17 | 8.5 |
50 | 2019-12-29 | Park Slope South | 2.8 | 10.5 |
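One quick way to start answering the question above (assuming `uber_trips` was loaded as shown earlier) is to plot `cost` against `distance` and ask whether a straight line is a plausible summary of the points:

```r
# Scatterplot of trip cost versus trip distance using base R graphics;
# a roughly linear cloud of points suggests distance is a promising input
plot(uber_trips$distance, uber_trips$cost,
     xlab = "Trip distance (miles)",
     ylab = "Trip cost ($)",
     main = "Uber trip cost vs. distance")
```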