Fundamentals
16th March 2023
Satadru Bhattacharya
I was talking to a fintech client who told me that their loan default rate was higher than they had expected. The conversation went like this:
Mr X (Client): Can you build some ML models to help us with this?
Me: Yes, of course. We can build a loan default model that assigns each borrower a probability of defaulting on repayment. Based on your risk tolerance, you can decide the cut-off and then disburse the loan accordingly.
Mr X: That’s interesting. So what do you need from us?
Me: Could you share your data with us?
Mr X: What data are you looking for?
Me: …. (I am keeping this for another blog: how to list the data we need for modelling.)
Mr X: Okay, my team will share that with you tomorrow. When can we see some initial understanding from the data?
Me: Allow us a day’s time….
After 3 days, I was ready with a visual presentation to walk the client team through our initial understanding. The client team said: we only have 20 minutes, so we cannot go through the whole deck. Can you just give us a brief on whether the data is useful, or tell us if you have questions?
Hmmmm... I was confused! There were more than 30 data columns and they wanted a gist? Have you faced a similar situation?
That’s when I learnt how to present it simply in a tabular format, which is easy to understand, easy to communicate and time efficient. Top consulting firms typically present the crux of the story to their clients in a simple, concise manner that is persuasive too. Today I am sharing my learning with you in this blog.
EDA is when we deep dive into the data to understand its statistical behaviour, the relationship of each variable with the target variable (if any), and the correlations among the variables. Data understanding, in contrast, is only a top-level understanding, which can be done quickly. Ideally, as soon as you get the data, you should do the data understanding and go back to the client with any basic doubts about the data, before spending time on a deep dive.
So, we do these analyses and present our learning in a simple tabular form on a single Excel sheet.
Most of the headers in the table above are self-explanatory. Here is a short explanation of a few of them:
% of NULL values = # of NULL values / total # of rows
NULL imputation: Since we may not get time with the client team repeatedly, we try to anticipate the possible options and get them clarified in the same call. So we keep ourselves ready with possible NULL imputations and get the client's buy-in on them.
High cardinality management: In the same way, we keep a few options pre-planned. Either we learn from the client, or we share our options and get a buy-in on a few of them. Once again, the early buy-in helps build trust with the client or senior management and shows your efficiency as a data scientist.
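The column-level figures described above can be sketched in a few lines of plain Python. This is only an illustration, not the exact table from the call: the function name `summarize_columns`, the `high_cardinality_threshold` parameter, and the sample rows are all assumptions made for the example, and empty strings are treated as NULL here, which may not match your data.

```python
def summarize_columns(rows, high_cardinality_threshold=50):
    """Build one summary record per column from a list of dict rows."""
    total_rows = len(rows)
    columns = list(rows[0].keys()) if rows else []
    summary = []
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumption: None and empty strings both count as NULL.
        non_null = [v for v in values if v is not None and v != ""]
        null_count = total_rows - len(non_null)
        distinct = len(set(non_null))
        summary.append({
            "column": col,
            # % of NULL values = # of NULL values / total # of rows
            "pct_null": round(100 * null_count / total_rows, 1),
            "distinct_values": distinct,
            "high_cardinality": distinct > high_cardinality_threshold,
        })
    return summary


# Hypothetical sample; real data would come from the client's file.
loans = [
    {"loan_id": 1, "cibil_score": 742, "zipcode": "560001"},
    {"loan_id": 2, "cibil_score": None, "zipcode": "560034"},
    {"loan_id": 3, "cibil_score": 688, "zipcode": "560001"},
    {"loan_id": 4, "cibil_score": None, "zipcode": ""},
]

for record in summarize_columns(loans, high_cardinality_threshold=2):
    print(record)
```

Each record maps to one row of the summary sheet. The proposed NULL imputation and the cardinality handling option are judgment calls, so those columns are still filled in by hand before the call.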
Now, in this specific use case, where I had loan default data, I faced challenges with a few specific fields.
Our guess was right for the CIBIL score: 000-1 was used for borrowers with no previous credit history and therefore no credit score. Keeping them NULL is the best way, as zero imputation would totally change the meaning of the field value. (Keeping this for another blog: how dangerous zero imputation can be for NULL fields.)
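A small sketch of why the 000-1 code is better kept as NULL than turned into zero. It assumes the score arrives as a string and that "000-1" is the exact code stored in the file; the helper `parse_cibil` and the sample values are hypothetical.

```python
from statistics import mean

NO_HISTORY_CODE = "000-1"  # assumed literal form of the "no credit history" code


def parse_cibil(raw):
    """Return the score as an int, or None when the borrower has no score."""
    if raw is None or raw == NO_HISTORY_CODE:
        return None
    return int(raw)


raw_scores = ["750", "000-1", "690", "000-1"]  # hypothetical values
scores = [parse_cibil(s) for s in raw_scores]

# Average over borrowers who actually have a score.
mean_keeping_null = mean(s for s in scores if s is not None)
# Average if the missing scores were replaced with zero.
mean_zero_imputed = mean(0 if s is None else s for s in scores)

print(scores, mean_keeping_null, mean_zero_imputed)
```

With zero imputation the average score of this toy sample halves, and a borrower with no history starts to look like the worst possible borrower, which is not what the field means.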
For the Zipcode, we learnt that the client had observed high fraudulent activity from a few specific zip codes. Any application coming from those zip codes makes them extra cautious: they do further due diligence, or do not give loans without a guarantor or loan insurance. That was an interesting learning from the client, and it helped us tackle this high-cardinality variable during the model-building stage.
With one such simple yet effective tabular summary of data understanding, the call was wrapped up within 15 minutes, because I had all the information in one place. I also had my doubts well drafted, along with some possible solutions/options, which nudged the discussion quickly in the right direction.
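One possible way to use that input, shown here only as a sketch, is to collapse the zip code into a single risk flag instead of keeping hundreds of distinct values. The post does not say which encoding was finally used, so this is an assumption, and the zip codes in `HIGH_RISK_ZIPCODES` are made up; the real list would come from the client.

```python
# Hypothetical list; the real one would be supplied by the client.
HIGH_RISK_ZIPCODES = {"110001", "400072"}


def zipcode_risk_flag(zipcode, high_risk=HIGH_RISK_ZIPCODES):
    """Map a raw zip code to 1 (client-flagged) or 0; missing values stay None."""
    if zipcode is None or zipcode == "":
        return None
    return 1 if str(zipcode).strip() in high_risk else 0


applications = ["110001", "560034", " 400072", None]  # hypothetical values
flags = [zipcode_risk_flag(z) for z in applications]
print(flags)
```

The flag keeps the business knowledge (a handful of zip codes need extra diligence) while dropping the cardinality from one value per zip code to two.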
I hope this will help you present your data understanding to senior members or top management (internal and external) in a well-structured way. Sign up with us for more such useful tips and blogs. Feel free to share your thoughts with us. Write to us at mitra@setuschool.com.