Can I use "x = data[['SAT']]" instead of "x = data['SAT']"?
When I try "x = data[['SAT']]" (as we did in multiple regression) instead of "x = data['SAT']", it seems to work fine without having to reshape the data; and I get the same results (e.g., intercept and coeffs). Is there any particular advantage of reshaping the data here?
Thank you for the great question!
Both your solution, x = data[['SAT']] followed by x.shape, and the solution in the lecture, x = data['SAT'] followed by x_matrix = x.values.reshape(-1,1) and x_matrix.shape, report a shape of (84, 1). This means that both approaches will work without an error when performing the regression, but we need to be careful here. Let me demonstrate this with an example.
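As a quick check of the shapes and types mentioned above, here is a minimal sketch. The course dataset is not reproduced here, so the DataFrame below is a hypothetical stand-in with 84 rows of made-up SAT and GPA values; only the column names are taken from the discussion.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course dataset: 84 rows of made-up values.
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=84)
data = pd.DataFrame({"SAT": sat, "GPA": 0.0017 * sat + rng.normal(0, 0.2, size=84)})

# Double brackets select a one-column DataFrame, which is already 2D.
x_df = data[["SAT"]]

# Single brackets select a 1D Series; its values must be reshaped to 2D.
x_series = data["SAT"]
x_matrix = x_series.values.reshape(-1, 1)

print(x_series.shape, type(x_series).__name__)
print(x_df.shape, type(x_df).__name__)
print(x_matrix.shape, type(x_matrix).__name__)
```

The Series on its own is one-dimensional, which is why the lecture reshapes it before fitting.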
Let's first look at the version of the code where we use x = data['SAT'] and apply the reshape() method. x_matrix is of shape (84, 1) and of type numpy.ndarray (that is, an n-dimensional array). The simple linear regression, therefore, runs without any errors. The subtlety comes next, when we want to predict the GPA, given an SAT score. We need to make sure that the variable new_data has the same number of columns and is of the same type as the variable we put in as a first argument in the fit() method - in our case, x_matrix. Here, new_data has 1 column, representing the SAT, and is of type numpy.ndarray, so the predict() method runs flawlessly.
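The ndarray route can be sketched roughly as follows. The data is again a hypothetical stand-in for the course file, and the way new_data is built here (a 2D NumPy array with one row and one column) is an assumption about how it was defined in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the course dataset: 84 rows of made-up values.
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=84)
data = pd.DataFrame({"SAT": sat, "GPA": 0.0017 * sat + rng.normal(0, 0.2, size=84)})

y = data["GPA"]
x_matrix = data["SAT"].values.reshape(-1, 1)  # ndarray of shape (84, 1)

reg = LinearRegression()
reg.fit(x_matrix, y)

# Same type (ndarray) and same number of columns (1) as x_matrix.
new_data = np.array([[1740]])
prediction = reg.predict(new_data)
print(prediction)
```

Because the model was fitted on a plain array, it stores no feature names, so there is nothing for predict() to compare against.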
Let's now study your suggestion:
We see that the shape of the x variable is indeed (84, 1) but the type is different. It is a DataFrame, rather than an ndarray (try printing it out to see what the x variable stores). The fit() method works fine, as expected. Let's now try to predict the GPA for an SAT score of 1740, defining new_data in the known way. We do get a result, but also a warning. What it tells us is that new_data doesn't have valid feature names (as it is an ndarray) but the variable x, which we used to fit the model, does (as it is a DataFrame object). What we need to do is define new_data such that it is also a DataFrame, containing a single column called SAT. The warning is now gone.
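Here is a minimal sketch of the DataFrame route, with and without the warning. The data is a hypothetical stand-in for the course file, and the warning itself is only raised by scikit-learn 1.0 and later, so on an older installation the first list printed below may be empty.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the course dataset: 84 rows of made-up values.
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=84)
data = pd.DataFrame({"SAT": sat, "GPA": 0.0017 * sat + rng.normal(0, 0.2, size=84)})

x = data[["SAT"]]  # DataFrame, so the model remembers the column name
y = data["GPA"]
reg = LinearRegression().fit(x, y)

# Predicting from an ndarray: we get a result, plus a feature-name warning.
with warnings.catch_warnings(record=True) as caught_array:
    warnings.simplefilter("always")
    pred_array = reg.predict(np.array([[1740]]))
print([str(w.message) for w in caught_array])

# Predicting from a DataFrame with the same column name: no such warning.
new_data = pd.DataFrame({"SAT": [1740]})
with warnings.catch_warnings(record=True) as caught_frame:
    warnings.simplefilter("always")
    pred_frame = reg.predict(new_data)
print([str(w.message) for w in caught_frame])
```

Both calls return the same predicted value; the only difference is whether the input carries the column name the model was fitted with.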
To sum this up, you always need to make sure that the variable you use to make predictions is of the same type and has the same number of columns (arranged in the same way) as the feature matrix you use to fit the model.
To answer your question, there is no particular advantage to reshaping a matrix over directly using the DataFrame object, so long as you are consistent throughout your code.
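If you would like to confirm that the two approaches give the same model, a rough check along these lines should do. As before, the data is a hypothetical stand-in for the course file, and the variable names reg_array and reg_frame are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the course dataset: 84 rows of made-up values.
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=84)
data = pd.DataFrame({"SAT": sat, "GPA": 0.0017 * sat + rng.normal(0, 0.2, size=84)})
y = data["GPA"]

# Fit once on the reshaped ndarray and once on the one-column DataFrame.
reg_array = LinearRegression().fit(data["SAT"].values.reshape(-1, 1), y)
reg_frame = LinearRegression().fit(data[["SAT"]], y)

same_coef = np.allclose(reg_array.coef_, reg_frame.coef_)
same_intercept = np.isclose(reg_array.intercept_, reg_frame.intercept_)
print(same_coef, same_intercept)
```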
Hope this helps and answers your question!