
Presenting Data Understanding like top consulting firms do

Fundamentals


16th March 2023

Satadru Bhattacharya


I was talking to a Fintech client, and they mentioned that their loan default rate was higher than they expected. The conversation went like this:

 

Mr X (Client): Can you build some ML models to help us with this?

Me: Yes, of course. We can build a loan default model that assigns each borrower a probability of defaulting on repayment. Based on your risk tolerance, you can decide the cut-off and then disburse the loan.

Mr X: That’s interesting. So what do you need from us?

Me: Could you share your data with us?

Mr X: What data are you looking for?

Me: …. (I am keeping this for another blog: how to list down exactly what data we need for the modelling.)

Mr X: Okay, my team will share that with you tomorrow. When can we see some initial understanding from the data?

Me: Allow us a day’s time….

 

After three days, I was ready with a visual presentation to showcase the initial understanding to the client team. The client team said: "We only have 20 minutes, we cannot go through the whole deck. Can you just give us a brief: is the data useful? Do you have questions?"

 

Hmmm… I was confused! There were more than 30 data columns and they wanted a gist? Have you faced a similar situation?

 

That’s when I learnt how to present it simply, in a tabular format that is easy to understand, easy to communicate, and time efficient. Top consulting firms present the crux of the story to their clients in a simple, concise and persuasive manner. Today I am sharing that learning with you in this blog. Read on.

 

 

 

Please note that Data Understanding is different from Exploratory Data Analysis (EDA).

EDA is when we dive deep into the data to understand its statistical behaviour, its relationship with the target variable (if any), and correlations among the variables. Data understanding, in contrast, is only a top-level view, which we can do quickly. Ideally, as soon as you get the data, you should do data understanding and go back to the client with any basic doubts about the data, without wasting time on a deep dive.

What is included in Data Understanding?

  1. Data identity - what each row in the data represents
  2. Data definition - what each column represents
  3. Data shape/dimensions (rows and columns) - is this a sample of the data? If so, how large is the complete data? How often does this data get updated?
  4. Data types - for each column, what is the data type, and is it the right one or is a conversion required?
  5. Data duplications
  6. Data NULL values
  7. Data outliers
  8. Data cardinality/ordinality (for categorical fields)

 

So, we run these analyses and present our learnings in a simple tabular form on a single Excel sheet.
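
To make this concrete, here is a minimal sketch in Python (assuming pandas and a DataFrame df already loaded from the client's file; the high-cardinality threshold of 50 is an illustrative assumption) that computes most of the per-column checks above into one summary table:

```python
import pandas as pd

def data_understanding_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Build a one-row-per-column summary table for a first client call."""
    summary = pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.astype(str).values,          # check 4: data types
        "pct_null": (df.isna().mean() * 100).round(2).values,  # check 6: NULLs
        "n_unique": df.nunique(dropna=True).values,     # check 8: cardinality
    })
    # Flag likely high-cardinality categorical fields (threshold is illustrative)
    summary["high_cardinality"] = (
        (summary["dtype"] == "object") & (summary["n_unique"] > 50)
    )
    return summary

# Table-level checks (shape and duplicates), plus export to Excel:
# df = pd.read_csv("loan_data.csv")        # hypothetical file name
# print(df.shape, df.duplicated().sum())   # checks 3 and 5
# data_understanding_summary(df).to_excel("data_understanding.xlsx", index=False)
```

One function call gives you the single sheet you can walk the client through in minutes.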

 

Most of the headers in such a summary table are self-explanatory. Just to explain a few of them:

% of NULL values = # of NULL values / total # of rows

NULL imputation: Since we may not get time with the client team repeatedly, we try to anticipate possible options and get them clarified on the same call. So, keep possible NULL imputations ready and get a buy-in from the client on the ones that are meaningful.

High cardinality management: In the same way, we keep a few options pre-planned. Either we learn from the client, or we share our options and get a buy-in on a few of them. Once again, the early buy-in helps build trust with the client or senior management and shows your efficiency as a data scientist.
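
As a sketch of what those pre-planned options might look like in code (the column names monthly_income and employment_type are hypothetical, and the 1% rarity threshold is an assumption to validate with the client):

```python
import pandas as pd

def group_rare_categories(s: pd.Series, min_share: float = 0.01) -> pd.Series:
    """High-cardinality option: lump categories below min_share into 'OTHER'."""
    share = s.value_counts(normalize=True)
    rare = share[share < min_share].index
    return s.where(~s.isin(rare), "OTHER")

# Candidate NULL imputations to pre-draft for the call (hypothetical columns):
# df["monthly_income"] = df["monthly_income"].fillna(df["monthly_income"].median())
# df["employment_type"] = df["employment_type"].fillna(df["employment_type"].mode()[0])
```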

 

Now, in this specific use case with loan default data, I faced challenges with a few specific fields:

  1. CIBIL Score came in as categorical, although it is supposed to be a numerical field. Upon investigation, I found rows marked as 000-1. There were two possibilities, which we needed to validate with the client.
    1. Either these borrowers are first-time borrowers, and therefore there is no credit history for them, which means these are ideally NULL values.
    2. Or the data has not come through correctly in this file, and it is an error.
  2. There were a few fields whose definitions we could not guess, so we included those for clarification as well.
  3. Columns like ZIPCODE come with very high cardinality. We wanted to understand whether the client currently uses zip code for any internal decision making.

 

Our guess was right for the CIBIL score: 000-1 marked borrowers with no previous credit history, and therefore no credit score. Keeping them NULL is the best option, as zero imputation would completely change the meaning of the field. (Keeping this for another blog: how dangerous zero imputation can be for NULL fields.)
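
In code, the agreed fix is essentially a one-liner; a minimal sketch, assuming the field is named cibil_score in the DataFrame df:

```python
import numpy as np
import pandas as pd

# '000-1' means "no credit history" -> keep as NULL; never impute zero
df["cibil_score"] = pd.to_numeric(
    df["cibil_score"].replace("000-1", np.nan), errors="coerce"
)
# df["cibil_score"].isna().mean()  # share of first-time borrowers, as a sanity check
```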

 

Now, for the zip code, we learnt that they had observed high fraudulent activity from a few specific zip codes. Any application coming from those zip codes makes them extra cautious: they do further due diligence, or they do not give the loan without a guarantor or loan insurance. That was an interesting learning from the client, and it helped us tackle this high-cardinality variable during the model-building stage.
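
One way to encode that insight (a sketch; HIGH_FRAUD_ZIPS is a hypothetical list the client would supply) is to replace the raw zip code with a binary flag:

```python
# Hypothetical set of zip codes the client already watches for fraud
HIGH_FRAUD_ZIPS = {"700001", "110011"}  # illustrative values only

df["is_high_fraud_zip"] = (
    df["ZIPCODE"].astype(str).isin(HIGH_FRAUD_ZIPS).astype(int)
)
# The model then sees one informative binary feature instead of
# thousands of sparse zip-code categories.
```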

 

With one such simple yet effective tabular summary of data understanding, the call wrapped up within 15 minutes, because I had all the information in one place. I also had my doubts well drafted, along with some possible solutions/options pre-drafted, which nudged the discussion in the right direction, and fast.

 

I hope this helps you present your data understanding to senior members or top management (internal and external) in a well-structured way. Sign up with us for more such useful tips and blogs. Feel free to share your thoughts with us: write to us at mitra@setuschool.com.

 
