
PGD Cohort 2 Enrolment is live. Call 8100551189 or Request a Call Back.


The spicy secrets of making your data tastier and giving you more juice!

Art Of DS Applications


16th March 2023

Anish Roychowdhury


A spice is defined as a seed, fruit, root, bark, or other plant substance used primarily for flavouring, colouring, or preserving food. In other words, a spice is an external addition to food, whether meat or vegetables, that greatly enhances its taste and, in many cases, its nutritional value. Salt and pepper, everyday table accessories in every household, were not always so commonplace! In fact, the humble salt that we take for granted today was once so valuable that Roman soldiers are said to have been paid in salt, which is the commonly cited origin of the word salary.

 

In the world of data the situation is no different: data is both the food and the spice. A good data professional improves the information value of raw data by cleverly choosing the right way to look at it, at times merging it with external inputs. In our daily work as data science professionals, we often analyse and model data that comes directly from the client or user and may seem to have limited value on its own. Now take external data, merge it with the feature at hand and, lo and behold, a striking new feature emerges with much higher information value.

 

A certain OTT platform had provided data for an analysis aimed at reducing CAC (the Cost of Acquiring each Customer), which warranted a thorough exploratory analysis of click bait data. One of the first preliminary checks I performed was on the cardinality of the categorical variables. Two apparently relevant features, device and app_version, turned out to have high cardinality, and my initial thought was to drop them and go ahead with the other fields. A little pondering over a cup of black coffee and some fresh air gave rise to a question: can we not extract some information value from these fields?
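A cardinality check of this kind takes only a few lines. The following is a minimal sketch, assuming the data sits in a pandas DataFrame; the rows and the `plan` column are made up for illustration, and only `device` and `app_version` come from the actual case.

```python
import pandas as pd

# Toy stand-in for the platform's data. The values and the `plan`
# column are hypothetical; only `device` and `app_version` are
# feature names from the case described above.
df = pd.DataFrame({
    "device": ["Brand A X1", "Brand B Z9", "Brand A X2",
               "Brand C M5", "Brand B Z9", "Brand D Q7"],
    "app_version": ["4.1.0", "4.1.2", "3.9.8", "4.0.1", "4.1.2", "4.1.0"],
    "plan": ["free", "paid", "free", "free", "paid", "free"],
})

categorical_cols = ["device", "app_version", "plan"]

# Distinct values per categorical column, plus that count as a share
# of the number of rows (a ratio near 1 means almost every row is unique).
n_unique = df[categorical_cols].nunique()
cardinality = pd.DataFrame({
    "n_unique": n_unique,
    "unique_ratio": n_unique / len(df),
}).sort_values("n_unique", ascending=False)

print(cardinality)
```

Columns at the top of this table are the ones that would explode into hundreds of dummy columns if encoded directly, which is what makes them candidates for enrichment rather than outright removal.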

Read on to learn how we tackled one such case of high cardinality.

 

 

In the data I saw 500+ mobile brands and devices from which users were visiting the platform. This resulted in a high-cardinality feature that may not be of much direct use, either in machine learning algorithms or in data analysis.

We started with a hypothesis:

 

  • Can mobile brands and devices talk about people’s affluence level?

 

I went ahead and researched the prices of 500-plus models from various sites. Some of the key challenges were:

  1. Keeping model names consistent while scraping prices from various sites was a big challenge, because an exact name match is not possible in most cases. (How we overcame this is a topic for another blog.)
  2. I scraped the data from multiple sites, and different sites give different prices for the same brand and model. So, where there were multiple sources, the median price was used.
  3. Another challenge was the decline in the price of some models over time. A newly launched mobile is priced much higher than the same mobile once the model is at least a year old. This could bias our analysis, a risk that had to be kept in mind.
  4. Also, given the second-hand mobile market, I could not know whether a device had been bought first-hand or second-hand.
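The median step from point 2 can be sketched as below. The site names, model names and prices are all hypothetical, and the `normalise` helper is only a crude stand-in (lower-casing and collapsing whitespace) for the real name-matching work, which is considerably harder and is not shown here.

```python
import pandas as pd

# Hypothetical scraped rows: source site, model name as listed, price.
scraped = pd.DataFrame({
    "source": ["site_a", "site_b", "site_c", "site_a", "site_b"],
    "model": ["Brand A X1", "brand a  x1", "BRAND A X1",
              "Brand B Z9", "Brand B Z9 "],
    "price": [19999, 21499, 20499, 54999, 52999],
})


def normalise(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace.

    A deliberately simple placeholder, not the matching logic
    actually used in the project.
    """
    return " ".join(name.lower().split())


scraped["model_key"] = scraped["model"].map(normalise)

# One price per model: the median across all sources that list it.
median_price = scraped.groupby("model_key")["price"].median()

print(median_price)
```

The median is a sensible choice here because a single site with a stale or promotional price pulls the mean around far more than it pulls the median.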

 

After some cleaning, I plotted the extracted prices as a histogram (see Fig. 1).

 

 

Figure 1: Price distribution of cell phone models from various brands
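For readers who want to reproduce this kind of view, here is a rough sketch that bins prices with NumPy and prints a text histogram. The prices and the bin width of 10,000 are assumptions chosen for illustration and are not the values behind Fig. 1.

```python
import numpy as np

# Hypothetical per-model median prices, made up for illustration.
prices = np.array([6999, 8499, 9999, 11999, 13999, 14999,
                   18999, 24999, 34999, 54999, 79999])

# Fixed-width bins of 10,000 price units: 0, 10000, ..., 80000.
bin_edges = np.arange(0, 90_000, 10_000)
counts, edges = np.histogram(prices, bins=bin_edges)

# One row per bin, with one '#' per model falling in that bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left):>6}-{int(right):<6} {'#' * int(count)}")
```

A long right tail like the one in this toy data is typical of prices, and it is the shape of the distribution that suggests where to place the bucket boundaries.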

 

This nicely depicted the distribution, based on which I came up with price buckets corresponding to segments such as premium, entry level, etc. In essence, I used externally sourced pricing data to convert a high-cardinality feature into a more usable “affluence segment” feature, which can then be easily one-hot encoded to capture the information value in the price category of the phone or device.

This is called data enrichment: enhancing the value obtained from the same raw data.
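The whole enrichment step can be sketched as follows: merge the external prices onto the user data, bucket them into segments, then one-hot encode the segment. The bucket edges, segment labels, device names and prices are all assumptions made for this example, not the values used in the project.

```python
import pandas as pd

# Hypothetical external lookup: one median price per device model.
price_lookup = pd.DataFrame({
    "device": ["Brand A X1", "Brand B Z9", "Brand C M5", "Brand D Q7"],
    "median_price": [8999, 54999, 17999, 29999],
})

# Hypothetical user-level data carrying the raw high-cardinality feature.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "device": ["Brand A X1", "Brand B Z9", "Brand A X1",
               "Brand C M5", "Brand D Q7"],
})

# Assumed bucket edges and labels; in practice these would be read
# off the price histogram.
edges = [0, 10_000, 20_000, 40_000, float("inf")]
labels = ["entry", "mid", "upper_mid", "premium"]

# 1. Enrich: attach the external price to each user's device.
enriched = users.merge(price_lookup, on="device", how="left")

# 2. Bucket: turn the price into a low-cardinality segment.
enriched["affluence_segment"] = pd.cut(
    enriched["median_price"], bins=edges, labels=labels
)

# 3. Encode: one indicator column per segment instead of one per device.
encoded = pd.get_dummies(enriched["affluence_segment"], prefix="segment")

print(enriched)
print(encoded)
```

With 500+ devices, direct one-hot encoding would add 500+ sparse columns. Here the same information is compressed into four columns that carry an interpretable meaning.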

 

 

 

 

There are various ways to enhance the value we get from data.

 

 



To learn more such tricks and techniques in the ‘Art of Data Science’ from seasoned industry practitioners, sign up with us to receive more blogs with useful data science tips.

You may write to us and share your ideas, questions and thoughts at mitra@setuschool.com


