A Study of Personal Credit Assessment Models: Valuing Non-Performing Personal Loans (格拉迪沃's blog)

Tags: data competitions, data mining, personal credit, machine learning in practice

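Initial Data Exploration and Visual Analysis

This part is mainly visual analysis of the data: using common sense and expert experience to find the rough relationship between key features and the quantity to be predicted. The things to learn here are the core pandas operations, visualization with seaborn and matplotlib, and the overall approach to data analysis.

Introduction

The data come from the Lending Club platform, and the main goal is to assess each customer's credit status. The possible statuses are listed in the following table:
[Image: table of loan statuses]
The seven statuses are manually regrouped into two, good and bad. The main tools are pandas, sklearn, keras, seaborn and matplotlib: pandas for data cleaning and wrangling, sklearn for feature engineering, keras for classification, and seaborn and matplotlib for visualization. The required packages are listed below.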

# Import our libraries we are going to use for our data analysis.
import keras 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Plotly visualizations
from plotly import tools
# plotly.plotly moved to the separate chart_studio package in plotly 4; it is not used below
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)


# For oversampling Library (Dealing with Imbalanced Datasets)
from imblearn.over_sampling import SMOTE
from collections import Counter

# Other Libraries
import time

General Statistics

Read the data with pandas and inspect its basic information.

%matplotlib inline

df = pd.read_csv('../input/loan.csv', low_memory=False)

# Copy of the dataframe
original_df = df.copy()
# Look at the first few rows and the overall summary
df.head()
df.info()

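Next, rename columns to taste and drop information that is not useful (identifier columns such as the member ID can be dropped the same way).

# Rename to the column names used in the rest of the analysis
df = df.rename(columns={
    "loan_amnt": "loan_amount", "funded_amnt": "funded_amount",
    "funded_amnt_inv": "investor_funds", "int_rate": "interest_rate",
    "annual_inc": "annual_income"})
df.drop(['emp_title', 'zip_code', 'title'], axis=1, inplace=True)  # inplace=True modifies df directly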

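Data Distribution

Histograms
Look at the histograms of a few variables, drawn here with seaborn's distplot function (deprecated since seaborn 0.11 in favor of histplot and displot).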

fig, ax = plt.subplots(1, 3, figsize=(16,5))

loan_amount = df["loan_amount"].values
funded_amount = df["funded_amount"].values
investor_funds = df["investor_funds"].values

sns.distplot(loan_amount, ax=ax[0], color="#F7522F")
ax[0].set_title("Loan Applied by the Borrower", fontsize=14)
sns.distplot(funded_amount, ax=ax[1], color="#2F8FF7")
ax[1].set_title("Amount Funded by the Lender", fontsize=14)
sns.distplot(investor_funds, ax=ax[2], color="#2EAD46")
ax[2].set_title("Total committed by Investors", fontsize=14)

Pie chart
First regroup the loan_status feature into two classes.

bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period", 
           "Late (16-30 days)", "Late (31-120 days)"]


df['loan_condition'] = np.nan

def loan_condition(status):
   if status in bad_loan:
       return 'Bad Loan'
   else:
       return 'Good Loan'
   
   
df['loan_condition'] = df['loan_status'].apply(loan_condition)

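Draw the pie chart with the pandas plot interface.

colors = ["#3791D7", "#D72626"]
labels = "Good Loans", "Bad Loans"
# x       : the wedge sizes (here the value counts), drawn as fractions of their total
# labels  : text shown outside each wedge
# explode : offset of each wedge from the center
# shadow  : draw a shadow beneath the pie (default False)
# autopct : format of the percentage shown inside each wedge (a format string or a function);
#           '%1.2f%%' prints two decimal places followed by a percent sign
df["loan_condition"].value_counts().plot.pie(explode=[0, 0.25],
                                            autopct='%1.2f%%',
                                            shadow=True,
                                            colors=colors,
                                            labels=labels,
                                            fontsize=12, startangle=70)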

Bar chart
Convert the issue date into a datetime variable.

# Lets' transform the issue dates by year.
df['issue_d'].head()
dt_series = pd.to_datetime(df['issue_d'])
df['year'] = dt_series.dt.year

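Plot the loan amount by year, using seaborn's barplot.

plt.figure(figsize=(12,8))
# A convenient calling style: plot two columns of the DataFrame directly. An optional "hue" argument adds another dimension that splits each year into groups.
sns.barplot(x='year', y='loan_amount', data=df, palette='tab10')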
plt.title('Issuance of Loans', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Average loan amount issued', fontsize=14)

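Good Loans vs. Bad Loans

Loan types

Use pandas to count the values in a column.

df['loan_status'].value_counts()

Loans issued by region

Group the states into regions. This is a good place to get a feel for the pandas apply function.

df['addr_state'].unique()  # list the distinct values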

# Make a list with each of the regions by state.

west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']

df['region'] = np.nan

def finding_regions(state):
    if state in west:
        return 'West'
    elif state in south_west:
        return 'SouthWest'
    elif state in south_east:
        return 'SouthEast'
    elif state in mid_west:
        return 'MidWest'
    elif state in north_east:
        return 'NorthEast'
    
df['region'] = df['addr_state'].apply(finding_regions)
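
As an alternative to the if/elif chain, the same lookup can be sketched with a dictionary and Series.map. The data below is a hypothetical subset of the state lists above, and state_to_region is a name introduced only for this illustration.

```python
import pandas as pd

# A subset of the state lists above, for illustration only
regions = {
    "West": ["CA", "OR", "WA"],
    "SouthWest": ["AZ", "TX"],
    "NorthEast": ["NY", "NJ"],
}
# Invert the mapping into a state -> region lookup
state_to_region = {state: region for region, states in regions.items() for state in states}

toy = pd.DataFrame({"addr_state": ["CA", "TX", "NY", "ZZ"]})
# map() looks each state up in the dictionary; unknown codes become NaN
toy["region"] = toy["addr_state"].map(state_to_region)
print(toy)
```

A dictionary lookup avoids calling a Python function per row and makes unmapped states visible as missing values.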

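A Closer Look at Bad Loans

Count the bad loans in each region, broken down by loan status.
First select the bad loans, then group them by region.
Key point 1: pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
pd.crosstab() builds a cross-tabulation: the first argument supplies the row index and the second the column index. The trailing apply runs with the default axis=0, so x is one column of the table at a time (one loan status across all regions), not the whole DataFrame; each column is therefore rescaled to sum to 100%.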
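
To see what the apply step does, here is a minimal sketch on made-up data (the toy frame is hypothetical, not taken from the Lending Club file). It also shows that crosstab's own normalize="columns" option gives the same result.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "region": ["West", "West", "MidWest", "MidWest", "MidWest", "West"],
    "loan_status": ["Default", "Charged Off", "Default", "Default", "Charged Off", "Charged Off"],
})

counts = pd.crosstab(toy["region"], toy["loan_status"])
# DataFrame.apply defaults to axis=0, so the lambda receives one column at a time
col_pct = counts.apply(lambda x: x / x.sum() * 100)
# The same column-wise normalization is built into crosstab
col_pct_builtin = pd.crosstab(toy["region"], toy["loan_status"], normalize="columns") * 100

print(col_pct.round(2))
```

Each column (loan status) sums to 100, so a cell reads as "the share of all loans with this status that come from this region".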

badloans_df = df.loc[df["loan_condition"] == "Bad Loan"]

# loan_status cross
loan_status_cross = pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
number_of_loanstatus = pd.crosstab(badloans_df['region'], badloans_df['loan_status'])


# Round our values
loan_status_cross['Charged Off'] = loan_status_cross['Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['Default'] = loan_status_cross['Default'].apply(lambda x: round(x, 2))
# loan_status_cross['Does not meet the credit policy. Status:Charged Off'] = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].apply(lambda x: round(x, 2))
loan_status_cross['In Grace Period'] = loan_status_cross['In Grace Period'].apply(lambda x: round(x, 2))
loan_status_cross['Late (16-30 days)'] = loan_status_cross['Late (16-30 days)'].apply(lambda x: round(x, 2))
loan_status_cross['Late (31-120 days)'] = loan_status_cross['Late (31-120 days)'].apply(lambda x: round(x, 2))

# Row-wise sum
number_of_loanstatus['Total'] = number_of_loanstatus.sum(axis=1) 
# number_of_badloans
number_of_loanstatus

Then visualize.
First convert each Series to a list, which is handy for plotting.

charged_off = loan_status_cross['Charged Off'].values.tolist()
default = loan_status_cross['Default'].values.tolist()
# not_meet_credit = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].values.tolist()
grace_period = loan_status_cross['In Grace Period'].values.tolist()
short_pay = loan_status_cross['Late (16-30 days)'] .values.tolist()
long_pay = loan_status_cross['Late (31-120 days)'].values.tolist()



charged = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= charged_off,
    name='Charged Off',
    marker=dict(
        color='rgb(192, 148, 246)'
    ),
    text = '%'
)

defaults = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y=default,
    name='Defaults',
    marker=dict(
        color='rgb(176, 26, 26)'
    ),
    text = '%'
)

# credit_policy = go.Bar(
#     x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
#     y= not_meet_credit,
#     name='Does not meet Credit Policy',
#     marker = dict(
#         color='rgb(229, 121, 36)'
#     ),
#     text = '%'
# )

grace = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= grace_period,
    name='Grace Period',
    marker = dict(
        color='rgb(147, 147, 147)'
    ),
    text = '%'
)

short_pays = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= short_pay,
    name='Late Payment (16-30 days)', 
    marker = dict(
        color='rgb(246, 157, 135)'
    ),
    text = '%'
)

long_pays = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= long_pay,
    name='Late Payment (31-120 days)',
    marker = dict(
        color = 'rgb(238, 76, 73)'
        ),
    text = '%'
)




data = [charged, defaults,  grace, short_pays, long_pays]
layout = go.Layout(
    barmode='stack',  # one of 'stack', 'group', 'overlay', 'relative'
    title = '% of Bad Loan Status by Region',
    xaxis=dict(title='US Regions')
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='stacked-bar')

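Business Perspective

Understanding the operational side of the business

We focus on three key metrics: the total amount of loans issued in each state, the average interest rate charged to customers, and the average annual income of all customers in each state.

# Plot by state

# Group by our metrics
# First Plotly graph (evaluating the operational side of the business)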
by_loan_amount = df.groupby(['region','addr_state'], as_index=False).loan_amount.sum()
by_interest_rate = df.groupby(['region', 'addr_state'], as_index=False).interest_rate.mean()
by_income = df.groupby(['region', 'addr_state'], as_index=False).annual_income.mean()



# Take the values to a list for visualization purposes.
states = by_loan_amount['addr_state'].values.tolist()
average_loan_amounts = by_loan_amount['loan_amount'].values.tolist()
average_interest_rates = by_interest_rate['interest_rate'].values.tolist()
average_annual_income = by_income['annual_income'].values.tolist()


from collections import OrderedDict

# Build an ordered dictionary
metrics_data = OrderedDict([('state_codes', states),
                            ('issued_loans', average_loan_amounts),
                            ('interest_rate', average_interest_rates),
                            ('annual_income', average_annual_income)])
                     

metrics_df = pd.DataFrame.from_dict(metrics_data)
metrics_df = metrics_df.round(decimals=2)
metrics_df.head()

Visualize on a map. This is a template for drawing a US map that can be reused.

# Now it comes the part where we plot out plotly United States map
import plotly.graph_objs as go
# metrics_df has one row per state
for col in metrics_df.columns:
    metrics_df[col] = metrics_df[col].astype(str)
    
scl = [[0.0, 'rgb(210, 241, 198)'],[0.2, 'rgb(188, 236, 169)'],[0.4, 'rgb(171, 235, 145)'],\
            [0.6, 'rgb(140, 227, 105)'],[0.8, 'rgb(105, 201, 67)'],[1.0, 'rgb(59, 159, 19)']]

metrics_df['text'] = metrics_df['state_codes'] + '<br>' +\
'Average loan interest rate: ' + metrics_df['interest_rate'] + '<br>'+\
'Average annual income: ' + metrics_df['annual_income'] 


data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = metrics_df['state_codes'],
        z = metrics_df['issued_loans'], 
        locationmode = 'USA-states',
        text = metrics_df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "$s USD")
        ) ]


layout = dict(
    title = 'Lending Clubs Issued Loans <br> (A Perspective for the Business Operations)',
    geo = dict(
        scope = 'usa',
        projection=dict(type='albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)')
)

fig = dict(data=data, layout=layout)
iplot(fig, filename='d3-cloropleth-map')

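Analysis by Income Category

We create income categories to detect important patterns and dig deeper in the analysis.
Low income category: borrowers with an annual income of 100,000 USD or less.
Medium income category: borrowers with an annual income above 100,000 USD and up to 200,000 USD.
High income category: borrowers with an annual income above 200,000 USD.
Borrowers in the high income category took out larger loans than those in the low and medium categories; people with higher annual incomes are, unsurprisingly, more likely to be able to repay larger loans. (first row, left subplot)
Loans taken out by the low income category were slightly more likely to become bad loans. (first row, right subplot)
Borrowers with high and medium annual incomes had longer employment lengths than those with lower incomes. (second row, left subplot)
Low income borrowers paid higher interest rates on average, while those with higher annual incomes paid lower rates. (second row, right subplot)

# Create categories for annual_income, since most bad loans fall below 100k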

df['income_category'] = np.nan
lst = [df]
# Looping over a list of DataFrames applies the same mask-based .loc assignment to each one; apply works as well
for col in lst:
    col.loc[col['annual_income'] <= 100000, 'income_category'] = 'Low'
    col.loc[(col['annual_income'] > 100000) & (col['annual_income'] <= 200000), 'income_category'] = 'Medium'
    col.loc[col['annual_income'] > 200000, 'income_category'] = 'High'

The same logic written with apply:

df['income_category'] = np.nan

def income_category(income):
    if income <= 100000:
        return 'Low'
    elif 100000 < income <= 200000:
        return 'Medium'
    elif income > 200000:
        return 'High'
    # missing incomes fall through and return None

df['income_category'] = df['annual_income'].apply(income_category)
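
A third option, not used in the original post, is pd.cut, which assigns the three categories in one vectorized call. A minimal sketch on hypothetical incomes:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including the two boundary values and a missing one
toy = pd.DataFrame({"annual_income": [45000, 100000, 150000, 200000, 250000, np.nan]})

# Bins are right-closed by default: (-inf, 100000], (100000, 200000], (200000, inf)
toy["income_category"] = pd.cut(
    toy["annual_income"],
    bins=[-np.inf, 100000, 200000, np.inf],
    labels=["Low", "Medium", "High"],
)
print(toy)
```

The boundaries match the definitions above (100,000 is Low, 200,000 is Medium), and a missing income stays missing.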

The approach above works for binary variables such as good/bad. When a categorical column has only a few distinct values, an ordered feature mapping can be used instead; here we keep the loop style.

# Let's transform the column loan_condition into integers.

lst = [df]
df['loan_condition_int'] = np.nan

for col in lst:
    col.loc[df['loan_condition'] == 'Good Loan', 'loan_condition_int'] = 0 # Negative class (Good Loan)
    col.loc[df['loan_condition'] == 'Bad Loan', 'loan_condition_int'] = 1 # Positive class (Bad Loan)
    
# Convert from float to int the column (This is our label)  
df['loan_condition_int'] = df['loan_condition_int'].astype(int)

employment_length = ['10+ years', '< 1 year', '1 year', '3 years', '8 years', '9 years',
                    '4 years', '5 years', '6 years', '2 years', '7 years', 'n/a']

# Create a new column and convert emp_length to integers.

lst = [df]
df['emp_length_int'] = np.nan

for col in lst:
    col.loc[col['emp_length'] == '10+ years', "emp_length_int"] = 10
    col.loc[col['emp_length'] == '9 years', "emp_length_int"] = 9
    col.loc[col['emp_length'] == '8 years', "emp_length_int"] = 8
    col.loc[col['emp_length'] == '7 years', "emp_length_int"] = 7
    col.loc[col['emp_length'] == '6 years', "emp_length_int"] = 6
    col.loc[col['emp_length'] == '5 years', "emp_length_int"] = 5
    col.loc[col['emp_length'] == '4 years', "emp_length_int"] = 4
    col.loc[col['emp_length'] == '3 years', "emp_length_int"] = 3
    col.loc[col['emp_length'] == '2 years', "emp_length_int"] = 2
    col.loc[col['emp_length'] == '1 year', "emp_length_int"] = 1
    col.loc[col['emp_length'] == '< 1 year', "emp_length_int"] = 0.5
    col.loc[col['emp_length'] == 'n/a', "emp_length_int"] = 0

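Visualize income category against loan amount, loan condition, average employment length and interest rate, using violinplot for the first two and boxplot for the last two.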

fig, ((ax1, ax2), (ax3, ax4))= plt.subplots(nrows=2, ncols=2, figsize=(14,6))

# Change the Palette types tomorrow!

sns.violinplot(x="income_category", y="loan_amount", data=df, palette="Set2", ax=ax1 )
sns.violinplot(x="income_category", y="loan_condition_int", data=df, palette="Set2", ax=ax2)
sns.boxplot(x="income_category", y="emp_length_int", data=df, palette="Set2", ax=ax3)
sns.boxplot(x="income_category", y="interest_rate", data=df, palette="Set2", ax=ax4)
plt.savefig('plot2.png', format='png')

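Assessing Risk

Understanding the risk side of the business

The operational side of the business matters, but we must also analyze the level of risk in each state. The credit score is an important indicator of an individual customer's risk, yet other metrics are needed to estimate the risk level of each state.
Look at the relationship between defaults and geography.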

by_condition = df.groupby('addr_state')['loan_condition'].value_counts()/ df.groupby('addr_state')['loan_condition'].count()
by_emp_length = df.groupby(['region', 'addr_state'], as_index=False).emp_length_int.mean().sort_values(by="addr_state")

loan_condition_bystate = pd.crosstab(df['addr_state'], df['loan_condition'] )

cross_condition = pd.crosstab(df["addr_state"], df["loan_condition"])
# Percentage of condition of loan
percentage_loan_contributor = pd.crosstab(df['addr_state'], df['loan_condition']).apply(lambda x: x/x.sum() * 100)
condition_ratio = cross_condition["Bad Loan"]/cross_condition["Good Loan"]
by_dti = df.groupby(['region', 'addr_state'], as_index=False).dti.mean().sort_values(by="addr_state")  # sort by state so the rows line up with state_codes
state_codes = sorted(states)


# Take to a list
default_ratio = condition_ratio.values.tolist()
average_dti = by_dti['dti'].values.tolist()
average_emp_length = by_emp_length["emp_length_int"].values.tolist()
number_of_badloans = loan_condition_bystate['Bad Loan'].values.tolist()
percentage_ofall_badloans = percentage_loan_contributor['Bad Loan'].values.tolist()


# Figure Number 2
risk_data = OrderedDict([('state_codes', state_codes),
                         ('default_ratio', default_ratio),
                         ('badloans_amount', number_of_badloans),
                         ('percentage_of_badloans', percentage_ofall_badloans),
                         ('average_dti', average_dti),
                         ('average_emp_length', average_emp_length)])


# Figure 2 Dataframe 
risk_df = pd.DataFrame.from_dict(risk_data)
risk_df = risk_df.round(decimals=3)
risk_df.head()

Then visualize 'default_ratio' (the ratio of bad loans to good loans in each state).

# Now it comes the part where we plot out plotly United States map
import plotly.graph_objs as go


for col in risk_df.columns:
    risk_df[col] = risk_df[col].astype(str)
    
scl = [[0.0, 'rgb(202, 202, 202)'],[0.2, 'rgb(253, 205, 200)'],[0.4, 'rgb(252, 169, 161)'],\
            [0.6, 'rgb(247, 121, 108  )'],[0.8, 'rgb(232, 70, 54)'],[1.0, 'rgb(212, 31, 13)']]

risk_df['text'] = risk_df['state_codes'] + '<br>' +\
'Number of Bad Loans: ' + risk_df['badloans_amount'] + '<br>' + \
'Percentage of all Bad Loans: ' + risk_df['percentage_of_badloans'] + '%' +  '<br>' + \
'Average Debt-to-Income Ratio: ' + risk_df['average_dti'] + '<br>'+\
'Average Length of Employment: ' + risk_df['average_emp_length'] 


data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = risk_df['state_codes'],
        z = risk_df['default_ratio'], 
        locationmode = 'USA-states',
        text = risk_df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "%")
        ) ]


layout = dict(
    title = 'Lending Clubs Default Rates <br> (Analyzing Risks)',
    geo = dict(
        scope = 'usa',
        projection=dict(type='albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)')
)

fig = dict(data=data, layout=layout)
iplot(fig, filename='d3-cloropleth-map')

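The Importance of Credit Scores

The credit score is an important indicator of the overall level of risk. In this section we analyze the overall risk level and the number of bad loans by the grade each customer received. The plots below show, for each credit grade over time, the average loan amount and the average interest rate. Here unstack pivots the inner level of the row index produced by groupby (grade) into columns, so that each grade is drawn as its own line.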
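
The effect of unstack is easiest to see on a small example. The sketch below uses hypothetical numbers and the same column names as the analysis:

```python
import pandas as pd

# Hypothetical loans
toy = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015],
    "grade": ["A", "B", "A", "B"],
    "loan_amount": [10000, 12000, 11000, 15000],
})

# groupby on two keys gives a Series with a (year, grade) MultiIndex
by_grade = toy.groupby(["year", "grade"]).loan_amount.mean()
# unstack moves the inner level (grade) into the columns: rows are years, columns are grades
wide = by_grade.unstack()
print(wide)
```

Calling .plot() on the wide table then draws one line per column, that is, one line per grade.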

# Let's visualize how many loans were issued by creditscore
f, ((ax1, ax2)) = plt.subplots(1, 2)
cmap = plt.cm.coolwarm

by_credit_score = df.groupby(['year', 'grade']).loan_amount.mean()
by_credit_score.unstack().plot(legend=False, ax=ax1, figsize=(14, 4), colormap=cmap)
ax1.set_title('Loans issued by Credit Score', fontsize=14)
    
by_inc = df.groupby(['year', 'grade']).interest_rate.mean()
by_inc.unstack().plot(ax=ax2, figsize=(14, 4), colormap=cmap)
ax2.set_title('Interest Rates by Credit Score', fontsize=14)

ax2.legend(bbox_to_anchor=(-1.0, -0.3, 1.7, 0.1), loc=5, prop={
    'size':12},
           ncol=7, mode="expand", borderaxespad=0.)

Next, compare the numbers of good and bad loans across grades and sub-grades.

fig = plt.figure(figsize=(16,12))

ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(212)

cmap = plt.cm.coolwarm_r

loans_by_region = df.groupby(['grade', 'loan_condition']).size()
# stacked=True stacks the two series in one bar, which makes them easier to compare
loans_by_region.unstack().plot(kind='bar', stacked=True, colormap=cmap, ax=ax1, grid=False)
ax1.set_title('Type of Loans by Grade', fontsize=14)

loans_by_grade = df.groupby(['sub_grade', 'loan_condition']).size()
loans_by_grade.unstack().plot(kind='bar', stacked=True, colormap=cmap, ax=ax2, grid=False)
ax2.set_title('Type of Loans by Sub-Grade', fontsize=14)

by_interest = df.groupby(['year', 'loan_condition']).interest_rate.mean()
by_interest.unstack().plot(ax=ax3, colormap=cmap)
ax3.set_title('Average Interest rate by Loan Condition', fontsize=14)
ax3.set_ylabel('Interest Rate (%)', fontsize=12)

Determinants of Bad Loans

Plot the relationship between home ownership and the loan amount of bad loans.

import seaborn as sns

plt.figure(figsize=(18,18))

# Create a dataframe for bad loans
bad_df = df.loc[df['loan_condition'] == 'Bad Loan']

plt.subplot(211)
g = sns.boxplot(x='home_ownership', y='loan_amount', hue='loan_condition',
               data=bad_df, color='r')

g.set_xticklabels(g.get_xticklabels(),rotation=45)
g.set_xlabel("Type of Home Ownership", fontsize=12)
g.set_ylabel("Loan Amount", fontsize=12)
g.set_title("Distribution of Amount Borrowed \n by Home Ownership", fontsize=16)


plt.subplot(212)
g1 = sns.boxplot(x='year', y='loan_amount', hue='home_ownership',
               data=bad_df, palette="Set3")
g1.set_xticklabels(g1.get_xticklabels(),rotation=45)
g1.set_xlabel("Year", fontsize=12)
g1.set_ylabel("Loan Amount", fontsize=12)
g1.set_title("Distribution of Amount Borrowed \n through the years", fontsize=16)


plt.subplots_adjust(hspace = 0.6, top = 0.8)

plt.show()

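Plot the relationship between the interest rate level and loan condition, and between the interest rate level and loan term. The interest rate is binned into two classes, Low and High (the interest_payments column).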
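
The plotting code that follows reads an interest_payments column, but the post does not show how it is built. A minimal sketch of one way to create it, assuming the mean interest rate as the cut-off (the threshold is an assumption, not a value from the post):

```python
import numpy as np
import pandas as pd

# Hypothetical interest rates, in percent
toy = pd.DataFrame({"interest_rate": [6.5, 9.9, 13.5, 18.2, 24.0]})

# Assumed cut-off: the mean rate. The post does not state which threshold was used.
threshold = toy["interest_rate"].mean()
toy["interest_payments"] = np.where(toy["interest_rate"] <= threshold, "Low", "High")
print(toy)
```

On the real data the same two lines would be applied to df before calling countplot.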

from scipy.stats import norm

plt.figure(figsize=(20,10))

palette = ['#009393', '#930000']
plt.subplot(221)
ax = sns.countplot(x='interest_payments', data=df, 
                  palette=palette, hue='loan_condition')

ax.set_title('The impact of interest rate \n on the condition of the loan', fontsize=14)
ax.set_xlabel('Level of Interest Payments', fontsize=12)
ax.set_ylabel('Count')

plt.subplot(222)
ax1 = sns.countplot(x='interest_payments', data=df, 
                   palette=palette, hue='term')

ax1.set_title('The impact of maturity date \n on interest rates', fontsize=14)
ax1.set_xlabel('Level of Interest Payments', fontsize=12)
ax1.set_ylabel('Count')


plt.subplot(212)
low = df['loan_amount'].loc[df['interest_payments'] == 'Low'].values
high = df['loan_amount'].loc[df['interest_payments'] == 'High'].values

# fit=norm adds a fitted normal curve to each histogram, which is where the extra lines come from
ax2= sns.distplot(low, color='#009393', label='Low Interest Payments', fit=norm, fit_kws={
    "color":"#483d8b"}, kde=False) # Dark Blue Norm Color
ax3 = sns.distplot(high, color='#930000', label='High Interest Payments', fit=norm, fit_kws={
    "color":"#c71585"},kde=False) #  Red Norm Color
plt.axis([0, 36000, 0, 0.00016])
plt.legend()


plt.savefig('plot5.png', format='png')  # save before show(), otherwise the saved figure may be blank
plt.show()

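Risk by Loan Purpose

Explore the relationship between loan purpose and loan status.

Data Cleaning

Data filtering

Filtering features with many missing values

First check how much of each attribute is missing, using pandas as follows:

# Read the file
df = pd.read_csv('Train_set.csv', low_memory=False)
# Fraction of missing values per feature, sorted in descending order
check_null = df.isnull().sum().sort_values(ascending=False)/float(len(df))
# Show the features with more than 20% missing values
print(check_null[check_null > 0.2])

Then drop the features whose share of missing values is too high.

# Set the threshold: the minimum number of non-missing values a column must have
thresh_count = len(df)*0.4
# Columns with fewer than thresh_count non-missing values (more than 60% missing) are dropped
df = df.dropna(thresh=thresh_count, axis=1)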
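
The thresh argument is easy to misread, because it counts non-missing values rather than missing ones. A small sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],                      # 0% missing
    "b": [1, np.nan, np.nan, 4, 5],            # 40% missing
    "c": [np.nan, np.nan, np.nan, np.nan, 5],  # 80% missing
})

# Keep a column only if at least 40% of its values are present
min_non_null = int(len(toy) * 0.4)
kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

Column c has only one non-missing value, fewer than the required two, so it is the only column dropped.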

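Filtering constant features

If nearly all observations of a variable share the same value, that feature or input variable cannot help separate the target classes.
We remove such variables; the code below drops only the columns that contain a single distinct value.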

loans = df.loc[:,df.apply(pd.Series.nunique) != 1]
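
The nunique filter removes only strictly constant columns. To also catch columns where one value dominates, a sketch like the following could be used; the 0.95 cut-off and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    "constant": ["x"] * 10,
    "nearly_constant": ["y"] * 9 + ["z"],
    "varied": list("abcdefghij"),
})

# Share of rows taken by each column's most frequent value
top_share = toy.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
cutoff = 0.95  # hypothetical threshold
filtered = toy.loc[:, top_share < cutoff]
print(top_share)
```

Lowering the cut-off to 0.9 or below would drop nearly_constant as well.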

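Filtering by domain knowledge

Use common sense to drop data unrelated to the classification task. When the task is personal credit scoring, for example, fields such as phone number, birthday and zip code are unlikely to help.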

drop_list = ['sub_grade', 'emp_title',  'title', 'zip_code', 'addr_state', 'earliest_cr_line',
       'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d', 'disbursement_method','debt_settlement_flag','pymnt_plan',
             'revol_util', 'initial_list_status', 'hardship_flag']

loans.drop(drop_list,axis=1,inplace=True)

Changing data types

Percentages, dates and similar fields need to be converted to numeric types, for example 5% becomes 0.05.

loans['int_rate']=loans['int_rate'].astype(str).str.strip("%").astype(float)/100

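Handling Missing Values

Missing values: categorical variables

Categorical variables are filled with 'Unknown'. First visualize the missing values in the categorical columns.

import missingno as msno

objectColumns = loans.select_dtypes(include=["object"]).columns
msno.matrix(loans[objectColumns])  # visualize missing values

Then fill with 'Unknown'.

loans[objectColumns] = loans[objectColumns].fillna("Unknown")

Missing values: numeric variables

Here we use SimpleImputer from sklearn's impute module (the older preprocessing.Imputer was removed in scikit-learn 0.22) with strategy set to most_frequent, so that missing values are filled with the mode of each column.

numColumns = loans.select_dtypes(include=[np.number]).columns
# Show all numeric columns when printing
pd.set_option('display.max_columns', len(numColumns))
# Fill missing values with the mode of each column
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='most_frequent')  # works column by column
loans[numColumns] = imr.fit_transform(loans[numColumns])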

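Feature Engineering

Feature derivation

"installment" is the monthly installment of the loan. Dividing 'annual_inc' by 12 gives the applicant's monthly income, and dividing "installment" (monthly debt payment) by 'annual_inc'/12 (monthly income) yields a new feature, 'installment_feat', the share of monthly income spent on repayment. The larger 'installment_feat' is, the heavier the borrower's repayment burden and the more likely a default. The code adds 1 to 'annual_inc' to avoid division by zero.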

loans['installment_feat'] = loans['installment'] / ((loans['annual_inc']+1) / 12)

Feature abstraction

Recode loan_status into two classes, normal and default (there were originally seven).

def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)

    return colCoded

# Encode loan_status as default = 1, normal = 0:

loans["loan_status"] = coding(loans["loan_status"], {
    'Current':0,'Fully Paid':0,'In Grace Period':1,'Late (31-120 days)':1,'Late (16-30 days)':1,'Default':1,'Charged Off':1})

print( '\nAfter Coding:')

loans["loan_status"].value_counts()

Then abstract "emp_length" (employment length) and "grade" (credit grade) into numbers.

# Mapping for ordinal features
mapping_dict = {
    
    "emp_length": {
    
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "Unknown": 0
    },
    "grade":{
    
        "A": 1,
        "B": 2,
        "C": 3,
        "D": 4,
        "E": 5,
        "F": 6,
        "G": 7
    }
}
 
loans = loans.replace(mapping_dict) 
loans[['emp_length','grade']].head() 

Then one-hot encode the remaining categorical features.

n_columns = ["home_ownership", "verification_status", "application_type","purpose", "term"]
dummy_df = pd.get_dummies(loans[n_columns])  # one-hot encode with get_dummies
loans = pd.concat([loans, dummy_df], axis=1)  # with axis=1, concat aligns on the row index and places the two tables side by side
# Drop the original columns
loans = loans.drop(n_columns, axis=1)

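Binning

Discretizing continuous variables, or merging a many-valued discrete variable into fewer states, has several benefits. It keeps meaningless fluctuations in a feature from carrying over into the score, which makes the score more stable; it limits the influence of extreme values; missing values can be given a bin of their own; and all variables end up on a similar scale. Common methods are either supervised, such as best-KS and chi-square binning, or unsupervised, such as equal-frequency, equal-width and clustering-based binning. An unsupervised method is used here.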
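
The post gives no code for this step. As a minimal sketch of the two simplest unsupervised methods, pandas offers pd.cut for equal-width bins and pd.qcut for equal-frequency bins. The amounts below are hypothetical and include one extreme value on purpose.

```python
import pandas as pd

# Hypothetical loan amounts with one outlier
amounts = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 40000], name="loan_amount")

# Equal-width: split the value range into 4 intervals of the same width
equal_width = pd.cut(amounts, bins=4, labels=False)
# Equal-frequency: split at the quartiles so each bin holds the same number of rows
equal_freq = pd.qcut(amounts, q=4, labels=False)

print(pd.DataFrame({"amount": amounts, "equal_width": equal_width, "equal_freq": equal_freq}))
```

With equal-width bins the outlier pushes every other value into the first bin, while equal-frequency bins stay balanced, which is one reason equal-frequency binning is often preferred for skewed amounts.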

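Feature Scaling

We standardize the features with StandardScaler from scikit-learn's preprocessing module.

from sklearn.preprocessing import StandardScaler
# loans_ml_df is the model-ready DataFrame and col the numeric feature columns to scale; the post does not show how either is built
sc = StandardScaler()  # initialize the scaler
loans_ml_df[col] = sc.fit_transform(loans_ml_df[col])  # standardize the data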
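
Since loans_ml_df and col are not defined in the post, here is a self-contained sketch of the same step on a hypothetical frame; feature_cols is a name chosen for this example, and the label column is deliberately left out of the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "loan_amount": [5000.0, 10000.0, 15000.0, 20000.0],
    "int_rate": [0.06, 0.10, 0.14, 0.18],
    "loan_status": [0, 0, 1, 1],  # label, left unscaled
})

feature_cols = ["loan_amount", "int_rate"]
scaler = StandardScaler()
# Each feature column is shifted to mean 0 and scaled to unit (population) standard deviation
toy[feature_cols] = scaler.fit_transform(toy[feature_cols])
print(toy)
```

When the data is split into training and test sets, the scaler is normally fitted on the training part only and then applied to the test part with transform.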

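Feature Selection

There are three main families of feature selection methods, and the sklearn library implements them. Recursive feature elimination (RFE) is used here. RFE wraps an underlying model, so there are variants based on SVMs, logistic regression, gradient-boosted trees and so on; this one is based on logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
# Build the logistic regression classifier
model = LogisticRegression()
# Build the recursive feature elimination selector
rfe = RFE(model, n_features_to_select=40)  # recursively select 40 features
rfe = rfe.fit(x_val, y_val)
# Print the selection results
print(rfe.n_features_)
print(rfe.estimator_ )
print(rfe.support_)
print(rfe.ranking_)  # a ranking of 1 means the feature was selected; any other value means it was not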
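
Because x_val and y_val are not defined in the post, the following sketch runs the same selector on synthetic data from make_classification. The sample sizes and the choice of 4 features are arbitrary assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the loan features: 10 columns, 4 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# RFE repeatedly fits the model and discards the weakest feature until 4 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected column indices:", selected)
print("ranking:", selector.ranking_)
```

On a DataFrame, the boolean mask in support_ can be used to index the column names directly.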

Choosing a Classifier

Classifier choice is routine: a neural network or xgboost both work.

Validating the Model

Mainly the confusion matrix and the ROC curve.
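
The post stops at naming the two tools. As a rough sketch of how they are computed with sklearn, the example below trains a logistic regression on synthetic, imbalanced data; the classifier and the data are stand-ins chosen for illustration, not the model used in the post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 80% good loans (0) and 20% bad loans (1)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
# ROC curve: false and true positive rates over all probability thresholds
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

print(cm)
print("AUC:", round(auc, 3))
```

Plotting fpr against tpr with matplotlib gives the ROC curve itself; the AUC summarizes it in one number.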

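Reference: Kaggle

Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_32796253/article/details/89249670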
