Could you please name some techniques used to classify categorical values, especially when there are many of them and it isn't easy to find a simple logical way to do that? Nationalities, for example.
Thank you for your question!
If I understand correctly, you are wondering whether there are other ways to map a categorical feature to numerical values. If so, one possibility is to use pandas' get_dummies() function, as demonstrated in the course. Another is sklearn's OneHotEncoder(). You can read more about them on this page under 6.3.4. Encoding categorical features. Note that categorical target variables are encoded differently, for example with LabelEncoder() (you can find more information on the topic in the documentation).
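To make the difference between the three encoders concrete, here is a minimal sketch. The toy DataFrame and its column names ("nationality", "outcome") are made up for illustration and are not taken from the course.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "nationality": ["French", "German", "French", "Italian"],
    "outcome": ["yes", "no", "no", "yes"],
})

# pandas: one indicator column per category of the feature.
dummies = pd.get_dummies(df["nationality"], prefix="nat")

# scikit-learn: the same idea as a fitted transformer that can be
# reused on new data; unseen categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(df[["nationality"]]).toarray()

# LabelEncoder is intended for the target: one integer per class.
target = LabelEncoder().fit_transform(df["outcome"])
```

With hundreds of nationalities, both one-hot approaches would produce hundreds of columns, which is the problem the rest of this thread discusses.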
Hope that answers your question! :)
Thank you, Hristina, for your quick response. Actually, I have realized that I posted my question in the wrong spot :(. I was referring to the classification techniques mentioned by the instructor in the "Grouping the Various Reasons for Absence" video, Course Name: "SQL + Tableau + Python". He grouped the illnesses logically into a few categories, according to their similarities, to reduce the number of dummy variables generated. My question aimed to understand whether there are other grouping techniques for when the categorical variable has many values and no business logic or similarities help in grouping them. For example, suppose one of the variables in the regression analysis is 'Nationalities', which could contain hundreds of values. Is there any technique that helps group the values in such fields to avoid generating a lot of dummy variables?
I really appreciate any help you can provide.
Hi Ranim and Hristina!
I hope you don't mind if I join the conversation.
@Ranim: This is a great question, although, speaking from my experience, I'd say that we should never neglect economic reasoning and intuition in the middle of an analytical process.
Applying that to your question: on the one hand, one can create a dummy for every nationality, yes. However, that would create too many variables and might not truly help your research. In that case, you may want to ask yourself - "What is the purpose of my research?", "What am I trying to prove?".
Based on the answer you find, you might start thinking of a specific way to group the values. In your example, what comes to my mind is grouping the countries geographically (i.e. by continent), by historical background (e.g. the countries in the Anglosphere, Latin-language-speaking countries, Slavic countries etc.), or by similar current political or economic conditions (e.g. countries forming the G8, economically poor and non-industrialised countries etc.).
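A grouping like the geographic one above can be sketched with a simple lookup table. The mapping and the sample data below are hypothetical; a real table would need to cover every country in the dataset.

```python
import pandas as pd

# Hypothetical lookup table, for illustration only.
CONTINENT = {
    "France": "Europe",
    "Germany": "Europe",
    "Brazil": "South America",
    "Japan": "Asia",
    "India": "Asia",
}

df = pd.DataFrame(
    {"country": ["France", "Japan", "Brazil", "Germany", "India", "Fiji"]}
)

# Map each country to its group; anything missing from the table
# falls into a catch-all "Other" bucket.
df["region"] = df["country"].map(CONTINENT).fillna("Other")

# Dummies are now created per group rather than per country.
region_dummies = pd.get_dummies(df["region"], prefix="region")
```

Here six distinct countries collapse into four dummy columns; with hundreds of nationalities the reduction would be far larger.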
More often than not, there is a sensible way to group the values; if you really cannot find one, you might want to reorganise your independent variables differently.
If none of these techniques helps, I suggest you look into cluster analysis and factor analysis, and also consider the tools Hristina described in her answer.
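As a rough sketch of how cluster analysis could produce the groups, assume you have a few numeric indicators per country (the figures, indicator names, and country labels below are invented). K-means is used here as one possible clustering method, and the number of clusters is a choice you would have to justify.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up country-level indicators, for illustration only.
countries = pd.DataFrame(
    {
        "gdp_per_capita": [1.0, 1.2, 0.9, 40.0, 42.0, 45.0],
        "median_age": [19.0, 20.0, 18.5, 42.0, 44.0, 43.0],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Standardise so that no single indicator dominates the distances.
scaled = StandardScaler().fit_transform(countries)

# Assign every country to one of two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
countries["cluster"] = kmeans.fit_predict(scaled)

# The cluster label, not the country, becomes the categorical regressor.
cluster_dummies = pd.get_dummies(countries["cluster"], prefix="cluster")
```

The resulting clusters still need an interpretation before they go into a regression, which is where the intuition mentioned above comes back in.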
Of course, let's not forget that you can sometimes just take a dataset, apply ML techniques, and try to obtain that economic/social/analytical/business intuition from there. But in any case, I think one should always look for it and keep it in the back of one's mind.
Hope this helps.
Thank you, Martin, for your swift response; I really appreciate that you took the time to answer my question. I will try to find my way, taking into account the advice you gave. On another note, thank you for your course "SQL + Tableau + Python". Indeed, it is the first time I have seen such a clear, straightforward, and detailed explanation. If you were ever to consider adding another section connecting Python to Tableau, as this is the actual situation most of the time, your course would be outstanding (the first and the best :) ).
Many thanks again,
Thank you very much for your kind words!
Note taken - to an extent, integrating Python and Tableau works differently on different operating systems but this doesn't make the subject any less interesting. We will consider the topic.
Good luck and please feel free to post another question should you encounter any difficulties. Thank you.