In credit risk modeling, the true labels y and the predicted default probabilities are just two samples drawn from unknown distributions. A good credit risk model is generally evaluated in terms of accuracy, stability, and interpretability.
Generally speaking, the score distribution of good samples should differ markedly from that of bad samples. KS is exactly the discrimination metric among the effectiveness indicators: **KS evaluates the model's ability to separate risk; it measures the maximum difference between the cumulative score distributions of good and bad samples.**
The larger the cumulative difference between good and bad samples, and hence the larger the KS statistic, the stronger the model's ability to distinguish risk.
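As a minimal illustration of this definition (the scores and labels below are made up for the example), KS is the maximum gap between the two empirical CDFs of the scores:

```python
import numpy as np

# Illustrative scores: bad samples (label 1) score high, good samples (label 0) score low
bad = np.array([0.8, 0.7, 0.6])
good = np.array([0.2, 0.3, 0.4])

# Evaluate both empirical CDFs on the pooled score grid
grid = np.sort(np.concatenate([bad, good]))
cdf_bad = np.searchsorted(np.sort(bad), grid, side='right') / bad.size
cdf_good = np.searchsorted(np.sort(good), grid, side='right') / good.size

ks = np.max(np.abs(cdf_bad - cdf_good))
print(ks)  # 1.0 -> the two groups are perfectly separated
```

Here the two groups do not overlap at all, so KS reaches its maximum of 1.0; in practice values are much lower.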
1. Crosstab implementation: the core of calculating KS is the cumulative probability distributions of good and bad customers. We use pandas.crosstab to compute them.
2. roc_curve implementation: when sklearn's roc_curve computes the ROC curve and AUC, the cumulative distributions of good and bad customers are already obtained along the way, so we can use sklearn.metrics.roc_curve to compute the KS value.
3. ks_2samp implementation: reimplement in detail the computation that scipy.stats.ks_2samp performs internally; the steps below follow its source code.
4. Call scipy.stats.ks_2samp() directly to compute KS.
```python
import pandas as pd
import numpy as np
from sklearn.metrics import roc_curve
from scipy.stats import ks_2samp

def ks_calc_cross(data, pred, y_label):
    '''
    Function: calculate the KS value and output the corresponding split point
             and the cumulative distribution table.
    Inputs:
        data: DataFrame containing the model scores and true labels
        pred: list holding the name of the score column
              (usually the predicted probability of the positive class)
        y_label: list holding the name of the true-label column ({0,1} or {-1,1})
    Outputs:
        ks: row(s) of the table where the gap is maximal (the KS value)
        crossdens: cumulative distributions of good and bad customers and their gap
    '''
    crossfreq = pd.crosstab(data[pred[0]], data[y_label[0]])
    crossdens = crossfreq.cumsum(axis=0) / crossfreq.sum()
    crossdens['gap'] = abs(crossdens[0] - crossdens[1])
    ks = crossdens[crossdens['gap'] == crossdens['gap'].max()]
    return ks, crossdens

def ks_calc_auc(data, pred, y_label):
    '''
    Function: calculate the KS value via sklearn's roc_curve.
    Inputs: same as ks_calc_cross.
    Output:
        ks: KS value
    '''
    fpr, tpr, thresholds = roc_curve(data[y_label[0]], data[pred[0]])
    ks = max(tpr - fpr)
    return ks

def ks_calc_2samp(data, pred, y_label):
    '''
    Function: calculate the KS value from the two empirical CDFs,
             following the ks_2samp source code.
    Inputs: same as ks_calc_cross.
    Outputs:
        ks: KS value
        cdf_df: cumulative distributions of good and bad customers and their gap
    '''
    Bad = data.loc[data[y_label[0]] == 1, pred[0]]
    Good = data.loc[data[y_label[0]] == 0, pred[0]]
    data1 = Bad.values
    data2 = Good.values
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    data_all = np.concatenate([data1, data2])
    # Empirical CDFs of both samples evaluated on the pooled scores
    cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
    cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
    ks = np.max(np.absolute(cdf1 - cdf2))
    cdf1_df = pd.DataFrame(cdf1)
    cdf2_df = pd.DataFrame(cdf2)
    cdf_df = pd.concat([cdf1_df, cdf2_df], axis=1)
    cdf_df.columns = ['cdf_Bad', 'cdf_Good']
    cdf_df['gap'] = cdf_df['cdf_Bad'] - cdf_df['cdf_Good']
    return ks, cdf_df

data = {'y_label': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        'pred': [0.5, 0.6, 0.7, 0.6, 0.6, 0.8, 0.4, 0.2, 0.1, 0.4, 0.3, 0.9]}
data = pd.DataFrame(data)
ks1, crossdens = ks_calc_cross(data, ['pred'], ['y_label'])
ks2 = ks_calc_auc(data, ['pred'], ['y_label'])
ks3 = ks_calc_2samp(data, ['pred'], ['y_label'])
get_ks = lambda y_pred, y_true: ks_2samp(y_pred[y_true == 1], y_pred[y_true != 1]).statistic
ks4 = get_ks(data['pred'], data['y_label'])
print('KS1:', ks1['gap'].values)
print('KS2:', ks2)
print('KS3:', ks3[0])
print('KS4:', ks4)
```
Output:
```
KS1: [0.83333333]
KS2: 0.833333333333
KS3: 0.833333333333
KS4: 0.833333333333
```
When the data contain NaN values, there are some pitfalls to be aware of. For example, append a row with y_label=0 and pred=np.nan to the original data:

```python
data = pd.DataFrame({'y_label': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                     'pred': [0.5, 0.6, 0.7, 0.6, 0.6, 0.8, 0.4, 0.2, 0.1, 0.4, 0.3, 0.9, np.nan]})
```

Now running

```python
ks1, crossdens = ks_calc_cross(data, ['pred'], ['y_label'])
```

outputs:

```
KS1: [0.83333333]
```

Running

```python
ks2 = ks_calc_auc(data, ['pred'], ['y_label'])
```

raises the following error:

```
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
```

Running

```python
ks3 = ks_calc_2samp(data, ['pred'], ['y_label'])
```

outputs:

```
KS3: 0.714285714286
```

Running

```python
ks4 = get_ks(data['pred'], data['y_label'])
```

outputs:

```
KS4: 0.714285714286
```
From these results we can see that the KS values calculated by the methods now differ:
ks_calc_cross ignores the NaN entries (pd.crosstab drops them), computes the distributions on the valid data only, and its KS matches the hand-calculated value.
ks_calc_auc reports an error because the underlying roc_curve function cannot handle NaN values; if you want to use ks_calc_auc, you must remove NaN values in advance.
ks_calc_2samp deviates because of how sorting and searchsorted() treat NaN: NaN is sorted to the end and effectively ranked as the largest value, which distorts the empirical cumulative distributions, so the computed KS differs from the true KS.
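The effect is easy to reproduce in isolation (a minimal sketch with a made-up three-element array):

```python
import numpy as np

# np.sort places NaN at the end of the array...
a = np.sort(np.array([0.4, np.nan, 0.2]))
print(a)  # [0.2 0.4 nan]

# ...and searchsorted then ranks NaN above every finite value,
# so a NaN score silently inflates the tail of the empirical CDF.
print(np.searchsorted(a, np.nan, side='right'))  # 3
```

This is why ks_calc_2samp produces a plausible-looking but wrong number instead of failing loudly.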
To sum up:
In practice, we usually compute the KS value on the predicted default probability, which contains no NaN values, so all three methods above work. But when computing KS for a single raw variable, the data quality is sometimes poor; if NaN values are present, ks_calc_auc and ks_calc_2samp will run into the problems described above.
There are two solutions:
1. Remove the NaN values from the data in advance.
2. Compute KS directly with ks_calc_cross.
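For the first approach, a minimal sketch (assuming the same DataFrame layout as above, with the score in a `pred` column):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'y_label': [1, 1, 0, 0, 0],
                     'pred': [0.7, 0.6, 0.2, 0.4, np.nan]})

# Drop rows whose score is NaN before computing KS with any of the methods
clean = data.dropna(subset=['pred'])
print(len(data), len(clean))  # 5 4
```

After this step, all four methods agree again, since they see the same NaN-free sample.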
This concludes the detailed example of calculating KS in Python; I hope it serves as a useful reference.