Can I use "x = data[['SAT']]" instead of "x = data['SAT']"?
When I try "x = data[['SAT']]" (as we did in multiple regression) instead of "x = data['SAT']", it seems to work fine without having to reshape the data; and I get the same results (e.g., intercept and coeffs). Is there any particular advantage of reshaping the data here?
Thank you.
Hey,
Thank you for the great question!
Both your solution
x = data[['SAT']]
x.shape
and the solution in the lecture
x = data['SAT']
x_matrix = x.values.reshape(-1,1)
x_matrix.shape
will produce a variable of shape (84, 1). This means that both approaches will work without an error when performing the regression, but we need to take some caution here. Let me demonstrate this with an example.
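As a quick check of the shapes mentioned above, here is a minimal sketch. The data frame is a hypothetical stand-in for the course dataset: it has 84 rows, but the values are randomly generated and are not the real SAT/GPA data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data: 84 rows of made-up values.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "SAT": rng.integers(1600, 2100, size=84),
    "GPA": rng.uniform(2.4, 3.8, size=84),
})

x_series = data["SAT"]                      # single brackets: 1-D Series
x_frame = data[["SAT"]]                     # double brackets: 2-D DataFrame
x_matrix = x_series.values.reshape(-1, 1)   # reshaped values: 2-D ndarray

print(x_series.shape)   # (84,)
print(x_frame.shape)    # (84, 1)
print(x_matrix.shape)   # (84, 1)
print(type(x_frame).__name__, type(x_matrix).__name__)  # DataFrame ndarray
```

The two 2-D objects have the same shape and differ only in type, which is what matters in the rest of this answer.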
Let's first look at the version of the code where we use x = data['SAT'] and apply the reshape() method.
The variable x_matrix is of shape (84, 1) and of type numpy.ndarray (that is, an n-dimensional array). The simple linear regression, therefore, runs without any errors. The subtlety comes next, when we want to predict the GPA, given an SAT score. We need to make sure that the variable new_data has the same number of columns and is of the same type as the variable we put in as a first argument in the fit() method, which in our case is x_matrix. Indeed, new_data has 1 column, representing the SAT, and is of type numpy.ndarray. The predict() method runs flawlessly.
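A minimal sketch of this first version, assuming scikit-learn's LinearRegression as the model. The data is a synthetic 84-row stand-in (invented values, so the fitted numbers will not match the lecture), and the way new_data is built here is my own assumption about "a single-column ndarray", not necessarily the exact lecture code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the course data (values are made up).
rng = np.random.default_rng(0)
sat = rng.integers(1600, 2100, size=84)
gpa = 0.3 + 0.0017 * sat + rng.normal(0, 0.2, size=84)
data = pd.DataFrame({"SAT": sat, "GPA": gpa})

x = data["SAT"]
y = data["GPA"]
x_matrix = x.values.reshape(-1, 1)   # ndarray of shape (84, 1)

reg = LinearRegression()
reg.fit(x_matrix, y)                 # fitted on a plain ndarray

# new_data has the same type (ndarray) and the same number of columns (1).
new_data = np.array([[1740]])
prediction = reg.predict(new_data)

print(type(x_matrix).__name__, x_matrix.shape)
print(type(new_data).__name__, new_data.shape)
print(prediction.shape)              # one prediction for one row
```

Because both the training input and new_data are ndarrays, the model has no column names to compare, so nothing is flagged.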
Let's now study your suggestion:
We see that the shape of the x variable is indeed (84, 1) but the type is different. It is a DataFrame, rather than an ndarray (try printing it out to see what the x variable stores). The fit() method works fine, as expected. Let's now try and predict the GPA for an SAT score of 1740, defining new_data in the known way. We do get a result, but also a warning. What it tells us is that new_data doesn't have valid feature names (as it is an ndarray) but the variable x, which we used to fit the model, does (as it is a DataFrame object). What we need to do is define new_data such that it also is a DataFrame, containing a single column called SAT.
The warning is now gone.
To sum this up, you always need to make sure that the variable you use to make predictions is of the same type and has the same number of columns (arranged in the same way) as the feature matrix you use to fit the model.
To answer your question, there is no particular advantage of reshaping a matrix over directly using the DataFrame object, so long as you are consistent throughout your code.
Hope this helps and answers your question!
Kind regards,
365 Hristina