Transformations of the data. Why do we need it?

Question

I was curious about log and exp transformations. As far as I understand, we use them to turn a non-linear relationship into a linear one. But doesn't that defeat the purpose of linear regression, if we are transforming the data to make it linear? In other words, if the relationship is not linear, why should we transform the data and explore it with linear regression at all, instead of using some other relationship or "algorithm"?

Additionally, there are different types of log transformations. Is there a guideline on which one to use, or is it a matter of preference of the user?

Answer 1

Hi Emin, It is usually better to use simpler models, as they are easier to interpret and troubleshoot. On top of that, if you are using a linear regression, chances are many people can collaborate with you on it, give feedback or take it over from you. There are also a lot of resources online on that topic. The more complicated the model, the more niche it becomes. This makes it hard to troubleshoot, hard to monitor, hard to maintain and very hard to hand over to other colleagues. That aside, you can always use a neural network or gradient-boosted decision trees (for example XGBoost), which will often give you better predictive accuracy. However, such a model will take you much more time to build properly. Best, The 365 Team

Answer 2

I would just like to build on the answer given by The 365 Team regarding why you would choose to transform the data and use linear regression instead of a more complicated model, such as non-linear regression or one of the other more complex regression models out there.

Say you have population growth data where Y is the population size and X is the number of decades. The graph of the data looks like an exponential curve that follows the simple formula Y = e^(X + C), where Y (population) is the target variable and X (number of decades) is the predictor variable, our input. The graph is not a straight line, so we cannot model the data with linear regression as it stands and seem to be stuck.
We want to model this problem according to the linear regression formula Y = B0 + B1*X1 + B2*X2 + ... + error, so we apply a logarithmic transformation (log base e) to the target variable, which turns the curve into a straight line. Our new target variable is now ln(population) instead of population, and its relationship with X can be modelled linearly:

Y = e^(X + C)
ln(Y) = ln(e^(X + C))
ln(Y) = X + C

Now we will be using ln(population) instead of population as the target variable for our OLS regression algorithm to determine the coefficient and the intercept term for our linear regression model. To get a prediction back in the original units, apply exp to the model's output.
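To make the idea concrete, here is a minimal sketch in plain Python. It assumes synthetic data generated as population = e^(decades + C) with C = 2; the data and the helper names fit_line and predict_population are made up for illustration and are not part of any library. It log-transforms the population, fits a straight line with ordinary least squares, and converts predictions back with exp.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data (an assumption for this sketch): population = e^(decades + C)
C = 2.0
decades = [0, 1, 2, 3, 4, 5]
population = [math.exp(x + C) for x in decades]

# Transform the target, then fit a straight line to (decades, ln(population))
log_population = [math.log(y) for y in population]
intercept, slope = fit_line(decades, log_population)

def predict_population(x):
    """Predict on the log scale, then undo the transform with exp."""
    return math.exp(intercept + slope * x)
```

On this noise-free data the fitted slope and intercept should come out very close to 1 and C. With real, noisy data the fit will only be approximate, and the exp back-transform of a log-scale prediction is not the same thing as a least-squares fit on the original scale.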
You may ask, "why not just use an exponential regression model instead?". Fair question. The answer is that this is a very simple problem where the underlying equation y=e^x is easy to derive. In real problems you will work with many more predictor variables when performing regression modelling, so simply changing the equation to fit the data will not work, because you have no way of knowing in advance what the relationship between the predictor variables and the target variable is. It is more efficient and feasible to transform the variables individually so that each predictor has a linear relationship with the target variable, and then fit the data to the linear regression model.
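One rough way to judge whether transforming an individual predictor helps is to compare how well a straight line fits before and after the transformation. The sketch below assumes a made-up predictor x and a target generated as y = 3 + 2*ln(x); the r_squared helper is hypothetical and written only for this example. It compares the R^2 of a straight-line fit on the raw predictor with the R^2 on the log-transformed predictor.

```python
import math

def r_squared(xs, ys):
    """R^2 of a one-predictor least-squares fit of ys on xs.

    For a single predictor this equals the squared correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Synthetic data (an assumption for this sketch): y depends on ln(x), not on x
x = [1, 2, 4, 8, 16, 32]
y = [3 + 2 * math.log(v) for v in x]

r2_raw = r_squared(x, y)                         # straight line on raw x
r2_log = r_squared([math.log(v) for v in x], y)  # straight line on ln(x)
```

Here r2_log should be essentially 1 while r2_raw is noticeably lower, which suggests that ln(x) is the better scale for this predictor. In practice you would also look at scatter plots and residuals instead of relying on R^2 alone.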
