Art Of DS Applications
16th March 2023
Anish Roychowdhury
A spice is defined as a seed, fruit, root, bark, or other plant substance primarily used for flavouring, colouring, or preserving food. In other words, a spice is an external addition to food, be it meat or vegetables, that greatly enhances the taste and, in many cases, the nutritional value of the food. Salt and pepper, everyday table accessories in every household, were not always so commonplace! In fact, the humble salt that we take for granted today was once so valuable that Roman soldiers are said to have been paid in salt, hence the origin of the word "salary".
In the world of data the situation is no different: data is both the food and the spice, and a good data professional improves the information value of raw data by cleverly choosing the right way to look at it, at times merging it with external inputs. In our daily life as data science professionals, we often analyse and model data that comes directly from a client or user and may seem to have limited value on its own. Take external data inputs, merge them with the feature at hand, and lo and behold, a striking new feature emerges with a much higher information value.
A certain OTT platform had provided some data for a CAC (customer acquisition cost) reduction analysis, which warranted a thorough exploratory analysis of click bait data. One of the first preliminary checks I performed was on the cardinality of the categorical variables. Two apparently relevant features, device and app_version, turned out to have high cardinality, and my initial thought was to drop them and go ahead with the other fields. A little pondering over a cup of black coffee and some fresh air gave rise to another thought: can we not extract some information value from these fields?
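The cardinality check described above can be sketched in a few lines of pandas. This is a minimal illustration rather than the analysis actually run for the client: the column names device and app_version come from the text, while the sample rows, the platform column, and the helper name cardinality_report are made up for the example.

```python
import pandas as pd

# Hypothetical sample; only the column names "device" and "app_version"
# come from the article, every value below is invented for illustration.
clicks = pd.DataFrame({
    "device": ["Galaxy S21", "iPhone 13", "Redmi Note 10",
               "Galaxy S21", "Pixel 6", "Moto G"],
    "app_version": ["4.1.0", "4.1.2", "3.9.8", "4.1.0", "4.2.0", "3.9.8"],
    "platform": ["android", "ios", "android", "android", "android", "android"],
})


def cardinality_report(df, columns):
    """Count distinct values per column and relate them to the row count."""
    n_unique = df[columns].nunique()
    report = pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(df),
    })
    # Highest-cardinality columns first.
    return report.sort_values("n_unique", ascending=False)


report = cardinality_report(clicks, ["device", "app_version", "platform"])
print(report)
```

A column whose unique_ratio is close to 1 on real data is a candidate for grouping or enrichment rather than direct encoding.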
Read here to learn how we tackled one such case of high cardinality.
In the data I saw that users were visiting the platform from 500+ mobile brands and devices. This resulted in a high-cardinality feature that may not be of much direct use in machine learning algorithms or in data analysis by itself.
We started with a hypothesis:
I went ahead and researched the prices of 500-plus model names from various sites. Some of the key challenges faced were:
After some cleaning, I plotted the extracted prices as a histogram (see Fig. 1).
Figure 1: Price distribution for different cell phone models from various brands
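For readers without access to the figure, the same kind of price distribution can be approximated as binned counts. The sketch below assumes a small, made-up list of prices and arbitrary bin edges of 20,000 currency units; neither reflects the data actually collected.

```python
import pandas as pd

# Hypothetical prices (currency units are an assumption, not from the article).
prices = pd.Series([6000, 7500, 9000, 12000, 15000,
                    18000, 24000, 32000, 55000, 90000])

# Assumed fixed-width bin edges: 0, 20000, ..., 100000.
edges = list(range(0, 100001, 20000))

# Count how many models fall into each price interval, keeping bin order.
counts = pd.cut(prices, bins=edges).value_counts(sort=False)

# Print a crude text histogram, one row per price interval.
for interval, n in counts.items():
    print(f"{str(interval):>20} {'#' * n}")
```

A long right tail like the one in this toy data is what suggests separating a small premium segment from a large entry-level one.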
This nicely depicted the distribution, based on which I came up with price buckets catering to segments such as premium, entry level, and so on. In short, I used externally sourced pricing data to convert a high-cardinality feature into a more usable "affluence segment" feature, which can then be easily one-hot encoded to capture the information value carried by the price category of the phone or device.
This is called enriching data: enhancing the value obtained from the same raw data.
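The enrichment steps above (merge external prices, bucket them, one-hot encode the bucket) can be sketched as follows. Everything concrete here is an assumption made for illustration: the device names, the prices, the bucket edges, and the segment labels, including the extra "unknown" bucket for devices whose price could not be found.

```python
import pandas as pd

# Hypothetical user-level data with the high-cardinality "device" field.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "device": ["Phone A", "Phone B", "Phone C", "Phone A", "Phone X"],
})

# Hypothetical externally sourced price lookup (one row per device model).
prices = pd.DataFrame({
    "device": ["Phone A", "Phone B", "Phone C"],
    "price": [8000, 22000, 65000],
})

# Assumed bucket edges and labels; in practice these would be read off
# the price histogram.
bins = [0, 10000, 30000, float("inf")]
labels = ["entry_level", "mid_range", "premium"]

# 1. Merge the external prices onto the user data (left join keeps all users).
enriched = users.merge(prices, on="device", how="left")

# 2. Convert the price into a low-cardinality affluence segment.
enriched["affluence_segment"] = pd.cut(enriched["price"], bins=bins, labels=labels)

# 3. Devices with no price found get an explicit "unknown" segment.
enriched["affluence_segment"] = (
    enriched["affluence_segment"].cat.add_categories("unknown").fillna("unknown")
)

# 4. One-hot encode the segment for use in a model.
encoded = pd.get_dummies(enriched["affluence_segment"], prefix="segment")
print(pd.concat([enriched, encoded], axis=1))
```

The design choice worth noting is the left join: users whose device is missing from the price table are kept and flagged rather than silently dropped.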
There are various ways we can enhance the value of the data.
To learn more such tricks and techniques in the 'Art of Data Science' from seasoned industry practitioners, sign up with us to receive more blogs like this one with useful data science tips.
You may write to us and share your ideas, questions and thoughts at mitra@setuschool.com