Transformations of the data. Why do we need them?
I was curious about log and exp transformations. As far as I understand, we use them to transform a non-linear relationship into a linear one. But doesn't that defeat the whole purpose of linear regression, if we are transforming the data to make it linear? In other words, if the relationship is not linear, why should we transform the data and explore it with linear regression at all, instead of using some other relationship or "algorithm"?
Additionally, there are different types of log transformations. Is there a guideline on which one to use, or is it a matter of preference of the user?
I would just like to build on the answer given by The 365 Team regarding why you would choose to transform data and use linear regression instead of a more complicated model such as non-linear regression.
Say you have population growth data where Y is the population size and X is the number of decades. The graph of the data looks like an exponential curve that follows the simple formula Y = e^(X + C), where Y (population) is the target variable and X (number of decades) is the predictor variable. The graph is not a straight line, so we cannot model the data directly with linear regression.
We want to model this problem according to the linear regression formula Y = B0 + B1X1 + B2X2 + ... + e (where this last e is the error term, not Euler's number), so we apply a logarithmic transformation (log base e) to the target variable, which turns the curve into a straight line. Our new target variable is ln(population) instead of population, and its relationship with X can be modelled linearly:
Y = e^(X + C)
ln(Y) = ln(e^(X + C))
ln(Y) = X + C
Now we will be using ln(population) instead of population as the target variable for our OLS regression, which determines the coefficient and the intercept term for our linear regression model.
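To make the population example concrete, here is a minimal sketch in Python with NumPy. It assumes the population follows Y = e^(X + C) exactly, so that ln(Y) is a straight line in X. The data are invented and noise-free, and the value of C and the variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical, noise-free data following Y = e^(X + C); C is an assumed constant.
C = 2.0
decades = np.arange(0, 10, dtype=float)   # X: number of decades
population = np.exp(decades + C)          # Y: population size

# Transform the target, then fit the straight line ln(Y) = slope * X + intercept
# by ordinary least squares (np.polyfit with deg=1 returns [slope, intercept]).
log_population = np.log(population)
slope, intercept = np.polyfit(decades, log_population, deg=1)

# Predictions on the original scale come from undoing the log transformation.
predicted_population = np.exp(slope * decades + intercept)
```

With this made-up data the fitted slope comes out close to 1 and the intercept close to C. With real, noisy data the fit would only be approximate.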
You may ask, "why not just use an exponential regression model instead?" Fair question. The answer is that this is a very simple problem where the underlying exponential equation is easy to derive. In real problems you will work with many more predictor variables when performing regression modelling, so simply changing the equation to fit the data will not work, because you usually have no way of knowing in advance what the relationship between each predictor variable and the target variable is. It is more efficient and feasible to transform the variables individually so that each predictor has a linear relationship with the target variable, and then fit the data with the linear regression model.
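As a rough illustration of transforming predictors individually, the sketch below fits an ordinary linear regression after applying a different transformation to each of two predictors. Everything in it is an assumption made for the example: the data are synthetic, and the choice of ln for x1 and square root for x2 is taken as known, whereas in practice you would pick transformations by looking at scatter plots or residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical predictors, each related to the target in a different non-linear way.
x1 = rng.uniform(1.0, 50.0, n)
x2 = rng.uniform(0.0, 5.0, n)
y = 3.0 + 2.0 * np.log(x1) + 0.5 * np.sqrt(x2)   # invented, noise-free target

# Transform each predictor on its own so that it is linear in the target,
# then add a column of ones for the intercept term.
design = np.column_stack([np.ones(n), np.log(x1), np.sqrt(x2)])

# Ordinary least squares on the transformed predictors.
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Here `coefficients` holds the intercept followed by one coefficient per transformed predictor. The model is still linear in its coefficients, which is why plain OLS can be used.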