Detailed examples of using Python to calculate KS

In the financial field, the label y and the predicted default probability are simply two samples drawn from unknown distributions. A good credit risk model is generally evaluated in terms of accuracy, stability, and interpretability.

Generally speaking, the distribution of scores for good samples should differ markedly from the distribution for bad samples. KS is precisely the discrimination metric among the effectiveness indicators: **KS evaluates the model's ability to separate risk by measuring the maximum difference between the cumulative distributions of good and bad samples.**

The larger the cumulative difference between good and bad samples, the larger the KS statistic and the stronger the model's ability to distinguish risk.
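
Concretely, if F_bad(s) and F_good(s) denote the empirical cumulative distributions of the scores of bad and good samples, then KS = max over s of |F_bad(s) - F_good(s)|. Below is a minimal sketch of this definition on hypothetical toy scores (not the example data used later):

```python
# Minimal sketch of the KS definition on hypothetical toy scores:
# KS = max_s |F_bad(s) - F_good(s)| over the empirical CDFs.
import numpy as np

bad_scores = np.array([0.9, 0.8, 0.7, 0.6])   # scores of bad samples
good_scores = np.array([0.1, 0.2, 0.3, 0.6])  # scores of good samples

thresholds = np.sort(np.concatenate([bad_scores, good_scores]))
cdf_bad = np.searchsorted(np.sort(bad_scores), thresholds, side='right') / len(bad_scores)
cdf_good = np.searchsorted(np.sort(good_scores), thresholds, side='right') / len(good_scores)
print('KS =', np.max(np.abs(cdf_bad - cdf_good)))  # 0.75 for these toy scores
```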

1. Crosstab implementation: the core of calculating KS is the cumulative probability distributions of good and bad samples, which we obtain with the pandas.crosstab function.

2. roc_curve implementation: when sklearn's roc_curve computes the ROC and AUC, the cumulative distributions of good and bad samples are already produced along the way (tpr and fpr are exactly the cumulative proportions of bad and good samples above each threshold, so KS = max(tpr - fpr)). We use sklearn.metrics.roc_curve to calculate the KS value.

3. ks_2samp implementation: call scipy.stats.ks_2samp() to calculate KS. The third function below re-implements the core logic of ks_2samp() so the detailed calculation process is visible.

4. Call scipy.stats.ks_2samp() directly to calculate KS.

```python
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve
from scipy.stats import ks_2samp


def ks_calc_cross(data, pred, y_label):
    '''
    Calculate the KS value via pd.crosstab and output the corresponding split
    point together with the cumulative distribution functions.
    Inputs:
        data:    DataFrame containing the model scores and the true labels
        pred:    list holding the name of the score column (usually the
                 predicted probability of the positive class)
        y_label: list holding the name of the true label column ({0,1})
    Outputs:
        ks:        row(s) of crossdens where the gap is maximal (the KS value)
        crossdens: cumulative distributions of good and bad customers and
                   their difference (gap)
    '''
    crossfreq = pd.crosstab(data[pred[0]], data[y_label[0]])
    # normalise the cumulative frequencies into cumulative distributions
    crossdens = crossfreq.cumsum(axis=0) / crossfreq.sum()
    crossdens['gap'] = abs(crossdens[0] - crossdens[1])
    # KS is reached where the gap between the two distributions is largest
    ks = crossdens[crossdens['gap'] == crossdens['gap'].max()]
    return ks, crossdens


def ks_calc_auc(data, pred, y_label):
    '''
    Calculate the KS value from the ROC curve.
    Inputs:  same as ks_calc_cross
    Output:  ks: KS value
    '''
    fpr, tpr, thresholds = roc_curve(data[y_label[0]], data[pred[0]])
    ks = max(tpr - fpr)
    return ks


def ks_calc_2samp(data, pred, y_label):
    '''
    Calculate the KS value following the logic of scipy.stats.ks_2samp.
    Inputs:  same as ks_calc_cross
    Outputs:
        ks:     KS value
        cdf_df: cumulative distributions of good and bad customers and
                their difference (gap)
    '''
    Bad = data.loc[data[y_label[0]] == 1, pred[0]]
    Good = data.loc[data[y_label[0]] == 0, pred[0]]
    data1 = Bad.values
    data2 = Good.values
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    data_all = np.concatenate([data1, data2])
    # empirical CDFs of both samples evaluated at every observed score
    cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
    cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
    ks = np.max(np.absolute(cdf1 - cdf2))
    cdf1_df = pd.DataFrame(cdf1)
    cdf2_df = pd.DataFrame(cdf2)
    cdf_df = pd.concat([cdf1_df, cdf2_df], axis=1)
    cdf_df.columns = ['cdf_Bad', 'cdf_Good']
    cdf_df['gap'] = cdf_df['cdf_Bad'] - cdf_df['cdf_Good']
    return ks, cdf_df


data = {'y_label': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        'pred': [0.5, 0.6, 0.7, 0.6, 0.6, 0.8, 0.4, 0.2, 0.1, 0.4, 0.3, 0.9]}
data = pd.DataFrame(data)

ks1, crossdens = ks_calc_cross(data, ['pred'], ['y_label'])
ks2 = ks_calc_auc(data, ['pred'], ['y_label'])
ks3 = ks_calc_2samp(data, ['pred'], ['y_label'])

# Method 4: call scipy.stats.ks_2samp directly on the bad and good score samples
get_ks = lambda y_pred, y_true: ks_2samp(y_pred[y_true == 1], y_pred[y_true != 1]).statistic
ks4 = get_ks(data['pred'], data['y_label'])

print('KS1:', ks1['gap'].values)
print('KS2:', ks2)
print('KS3:', ks3[0])
print('KS4:', ks4)
```

Output result:

KS1: [0.83333333]
KS2: 0.833333333333
KS3: 0.833333333333
KS4: 0.833333333333

When the data contain NaN values, there are some issues that require attention.

For example, suppose we add a record with y_label=0 and pred=np.nan to the original data:

```python
data = {'y_label': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        'pred': [0.5, 0.6, 0.7, 0.6, 0.6, 0.8, 0.4, 0.2, 0.1, 0.4, 0.3, 0.9, np.nan]}
data = pd.DataFrame(data)  # convert to a DataFrame, as before
```

Executing

```python
ks1, crossdens = ks_calc_cross(data, ['pred'], ['y_label'])
```

outputs:

KS1: [0.83333333]

Executing

```python
ks2 = ks_calc_auc(data, ['pred'], ['y_label'])
```

raises the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Executing

```python
ks3 = ks_calc_2samp(data, ['pred'], ['y_label'])
```

outputs:

KS3: 0.714285714286

Executing

```python
ks4 = get_ks(data['pred'], data['y_label'])
```

outputs:

KS4: 0.714285714286

From the above results we can see that the methods no longer agree on the KS value.

ks_calc_cross ignores the NaN record: pd.crosstab drops it, the cumulative distributions are computed on the valid data only, and the resulting KS matches the hand-calculated value.
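
A quick way to check this behaviour (a minimal sketch, assuming the imports and the 13-row DataFrame with one NaN score defined above):

```python
# pd.crosstab silently drops rows whose score is NaN, so only the
# 12 valid observations enter the cumulative distributions.
crossfreq = pd.crosstab(data['pred'], data['y_label'])
print(crossfreq.sum())  # 6 good (label 0) and 6 bad (label 1) samples remain
```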

Because roc_curve, which ks_calc_auc relies on internally, cannot handle NaN values, it fails immediately with the error above; if you want to calculate KS with ks_calc_auc, you need to remove the NaN values in advance.

The KS computed by ks_calc_2samp is off because of the behaviour of searchsorted() (interested readers can simulate some data and inspect this function themselves): NaN values are sorted to the end of the array by default, which changes the original cumulative distribution of the data, so the computed KS deviates from the true KS.
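
A minimal sketch of this effect, assuming the 13-row example data above, where the NaN score inflates the denominator of the good-sample CDF:

```python
# np.sort places NaN at the end of the array, so the NaN record is never
# dropped and still counts towards n2 when ks_calc_2samp divides by the
# sample size.
import numpy as np

good = np.array([0.4, 0.2, 0.1, 0.4, 0.3, 0.9, np.nan])  # good-sample scores incl. NaN
good_sorted = np.sort(good)                               # NaN ends up last

# CDF of the good samples evaluated at 0.5: 5 of 7 values fall below it ...
print(np.searchsorted(good_sorted, 0.5, side='right') / len(good_sorted))  # 0.714...
# ... whereas with the NaN removed it would be 5 of 6:
print(np.searchsorted(good_sorted[:-1], 0.5, side='right') / 6)            # 0.833...
```

This is exactly the difference between KS3 = 0.714285714286 above and the true KS of 0.833.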

To sum up:

In practice we usually calculate the KS of the predicted default probability, which contains no NaN values, so all of the methods above work. But when we calculate the KS of a single raw variable, data quality is sometimes poor; if NaN values are present, ks_calc_auc and ks_calc_2samp will run into the problems described above.

There are two solutions (a sketch of the first follows the list):

  1. Remove the NaN values from the data in advance.

  2. Calculate directly with ks_calc_cross.
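
A minimal sketch of the first solution, assuming the 13-row DataFrame and the functions defined above:

```python
# Drop rows with a missing score before computing KS; after this,
# all four methods agree with the hand-calculated value again.
data_clean = data.dropna(subset=['pred'])

ks1, _ = ks_calc_cross(data_clean, ['pred'], ['y_label'])
ks2 = ks_calc_auc(data_clean, ['pred'], ['y_label'])
ks3, _ = ks_calc_2samp(data_clean, ['pred'], ['y_label'])
ks4 = get_ks(data_clean['pred'], data_clean['y_label'])
print(ks1['gap'].values, ks2, ks3, ks4)  # all approximately 0.8333
```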

This concludes the detailed examples of using Python to calculate KS; I hope it gives you a useful reference.
