Analysis for Credit Risk
In Task 1 of Assignment 4 you are required to follow the six-step CRISP-DM process and use the data mining tool RapidMiner to analyse and report on the creditrisk_train.csv and creditrisk_score.csv data sets provided for Assignment 4. You should refer to the data dictionary for creditrisk_train.csv (see Table 1 below). In Tasks 1 and 2 of Assignment 4 you are required to consider all of the business understanding, data understanding, data preparation, modelling, evaluation and deployment phases of the CRISP-DM process.
a) Research the concepts of credit risk and credit scoring in determining whether a financial institution should lend at an appropriate level of risk or not lend to a loan application. This will provide you with a business understanding of the dataset you will be analysing in Assignment 4. Identify which (variables) attributes can be omitted from your credit risk data mining model and why. Comment on your findings in relation to determining the credit risk of loan applicants.
b) Conduct an exploratory analysis of the creditrisk_train.csv data set. Are there any missing values or variables with unusual patterns? How consistent are the characteristics of the creditrisk_train.csv and creditrisk_score.csv data sets? Are there any interesting relationships between the potential predictor variables and your target variable credit risk? (Hint: identify the variables that will allow you to split the data set into subgroups.) Comment on which variables in the data set creditrisk_train.csv might influence differences in credit scores and credit risk ratings and the possible approval or rejection of loan applications.
c) Run a decision tree analysis using RapidMiner. Consider what variables you will want to include in this analysis and report on the results. (Hint: Identify what your target variable and predictor variables are.). Comment on the results of your final model.
d) Run a neural network analysis using RapidMiner. Again consider what variables you will want to include in this analysis and report on the results. (Hint: Identify what your target variable and predictor variables are.) Comment on the results of your final model.
e) Based on the results of the Decision Tree analysis and Neural Network analysis: what are the key variables and rules for predicting either good credit risk or bad credit risk? (Hint: with RapidMiner you will need to validate your models on the creditrisk_train.csv data using a number of validation processes for the two models you have generated previously using decision trees and neural network models.) Comment on your two predictive models for credit risk scoring in relation to a false/positive matrix, lift chart and ROC chart. (Note: for the evaluation operator reports, i.e. the Lift and ROC charts, you will need to convert the target variable credit.risk to a nominal variable with two values, Good and Bad.) Comment on the results of your final model.
Overall for Task 1 you need to report on the output of each analysis in sub task activity a to f and briefly comment on the important aspects of each analysis and relevance to credit risk scoring in determining whether to approve a loan with an appropriate credit risk rating or to not lend to a loan application.
Note: the final outputs from your statistical analyses in RapidMiner (graphs, decision trees, neural network, statistical analysis results tables) should be included as appendices in your report to provide support for your conclusions regarding each analysis and are not included in the word count.
Explore, Modify, Model, Assessment) and CRISP-DM (Cross Industry Standard Process for Data Mining) are the three major attempts to standardize the data mining process (Azevedo, 2008). Even though they have similar processes, CRISP-DM is the popular methodology in the field of data mining. In the previous assignment, we discussed CRISP-DM through an analysis of the survival rate of passengers on the Titanic.
In this assignment, we will pay more attention to the evaluation and visualization of the analysis. The importance of evaluation and visualization, as well as the validation of modeling and deployment, will be discussed in Task 2 and Task 3. However, we will still use CRISP-DM to analyze the credit risk.
2. Task 1
2.1 Subtask a
2.1.1 Business Understanding
Credit risk refers to the risk that a borrower will default on any type of debt by failing to make the payments it is obligated to make. The risk is primarily that of the lender and includes lost principal and interest, disruption to cash flows, and increased collection costs. The loss may be complete or partial and can arise in a number of circumstances (Wikipedia.org). To reduce a financial institution’s credit risk, the lender may perform a credit check on potential borrowers to determine whether it should lend at an appropriate level of risk or decline the loan application.
2.1.2 Data Understanding
We have two data sets. One is creditrisk_train.csv, a training data set containing each borrower’s previous credit history and financial information together with the target variable (Credit.Risk). The other is creditrisk_score.csv, the data set whose credit risk is to be predicted. Both data sets include 10 variables. The data dictionary for the two data sets is shown in Table 1.
Attribute              Data Type     Description
Row.No                 Integer       Unique identifier for each row
Application.ID         Integer       Unique identifier for the loan application
Credit.Score           Integer       Credit score given to the loan application; a measure of the creditworthiness of the applicant
Late.Payments          Integer       History of late payments with existing loans
Months.In.Job          Integer       Months in current job
Debt.To.Income.Ratio   Real          The percentage of the borrower’s gross income that goes toward paying debts
Loan.Amount            Integer       Loan amount requested
Liquid.Assets          Integer       Liquid assets
Num.Credit.Lines       Integer       Number of credit lines
Credit.Risk            Polynominal   Credit risk rating (Very Low, Low, Moderate, High, Do not lend)
Table 1 Data Dictionary for credit risk data sets
With the two data sets and an understanding of what they contain, we can proceed to the data preparation process.
2.2 Subtask b
2.2.1 Data preparation
We need to consider data consolidation, cleaning and transformation to make sure that the data sets are consistent. Firstly, the data sets contain two unique identifiers. We need only one of them, because they serve the same purpose. Using Select Attributes in RapidMiner, the Row.No attribute has been excluded from the analysis (Figure 2-1).
Figure 2-1. Omitting an unnecessary attribute
Secondly, we need to check whether there are any missing values in the data sets. Fortunately, there are none (Figure 2-2), so we do not need to replace or impute missing values. Are there any variables with unusual patterns? As noted in the data understanding stage, all attributes have valid types and ranges. For example, the Months_In_job (months in current job) attribute has a plausible range of 2 to 102 months, with an overall value of about 27 months. How consistent are creditrisk_train and creditrisk_score? All values in the scoring data set fall within the ranges of the training data set. For instance, the Liquid_Assets attribute ranges from 834 to 24297 in the scoring data set, which lies inside the training data set's range of 830 to 24699 (Figure 2-2 and Figure 2-3). As a result, we do not need any data cleansing.
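The two checks described above (missing values, and whether the scoring ranges sit inside the training ranges) can also be sketched outside RapidMiner. The following is a minimal illustration in plain Python, not the process used in this report. The rows are hypothetical stand-ins for the two CSV files; only the minimum and maximum values quoted in the text are taken from the data, and the helper names are assumptions.

```python
# Hypothetical rows standing in for the two CSV files; None would mark a missing value.
# Only the extremes (2, 102, 830, 24699, 834, 24297) come from the report itself.
train_rows = [
    {"Months.In.Job": 2, "Liquid.Assets": 830},
    {"Months.In.Job": 27, "Liquid.Assets": 12000},
    {"Months.In.Job": 102, "Liquid.Assets": 24699},
]
score_rows = [
    {"Months.In.Job": 5, "Liquid.Assets": 834},
    {"Months.In.Job": 60, "Liquid.Assets": 24297},
]


def count_missing(rows):
    """Count missing (None) values per attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def value_ranges(rows):
    """Return {attribute: (min, max)}, ignoring missing values."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def out_of_range_attributes(train, score):
    """List attributes whose scoring range is not inside the training range."""
    train_ranges = value_ranges(train)
    problems = []
    for name, (low, high) in value_ranges(score).items():
        train_low, train_high = train_ranges[name]
        if low < train_low or high > train_high:
            problems.append(name)
    return problems


missing = count_missing(train_rows)
problems = out_of_range_attributes(train_rows, score_rows)
print("missing values per attribute:", missing)
print("attributes outside the training range:", problems)
```

An empty list of problem attributes corresponds to the conclusion above that no data cleansing is needed.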
Lastly, as a data transformation, the Application.ID attribute has been assigned the ‘id’ role using Set Role in RapidMiner. One of the nice side-effects of setting an attribute’s role to ‘id’ rather than removing it with Select Attributes is that it makes each record easier to match back to individual people later, when viewing predictions in the results perspective (Matthew, 2012). Before applying the models used in this assignment (decision tree and neural network), the target variable Credit.Risk must be given the ‘label’ role, because most predictive model operators expect the training stream to supply a ‘label’ attribute. The label attribute has five values: Very Low, Low, Moderate, High and DO NOT LEND. These are the values to be predicted in the scoring data set, which is why all values of the Credit.Risk attribute are missing there. Figure 2-2 and Figure 2-3 show the meta data for the two data sets, respectively.
Figure 2-2. Meta data for the training data set
Figure 2-3. Meta data for the scoring data set
The next step is to add predictive model operators to the training data set. In this assignment, we will use only two models: a decision tree and a neural network. One of the main reasons to choose a decision tree is that its appeal lies in its relative power, ease of use, robustness with a variety of data and levels of measurement, and ease of interpretability (Barry, 2006). Decision trees are a simple but powerful form of multiple variable analysis. Artificial neural networks, in turn, have been shown to be very promising computational systems in many forecasting and business classification applications due to their ability to learn from the data, their nonparametric nature (i.e., no rigid assumptions), and their ability to generalize (Haykin, 2009).
Firstly, we added the basic decision tree to the main process (Figure 2-4). In RapidMiner, there are four criteria by which attributes can be selected for splitting: gain_ratio, information_gain, gini_index and accuracy. In this step, we will use the accuracy criterion. The other criteria will be applied during the evaluation process.
Figure 2-4. The Decision Tree operators added to the model
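To make the four splitting criteria more concrete, the sketch below computes textbook versions of each one for a single candidate split. It is an illustration only: the function names and the toy labels are assumptions, and RapidMiner's internal implementation may differ in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by the entropy of the split sizes."""
    total = len(parent)
    split_info = -sum(len(c) / total * math.log2(len(c) / total) for c in children if c)
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


def gini_reduction(parent, children):
    """Gini impurity of the parent minus the size-weighted impurity of the children."""
    total = len(parent)
    return gini(parent) - sum(len(c) / total * gini(c) for c in children)


def split_accuracy(children):
    """Accuracy obtained if every child node predicts its majority class."""
    total = sum(len(c) for c in children)
    correct = sum(Counter(c).most_common(1)[0][1] for c in children if c)
    return correct / total


# Toy example (hypothetical labels): a split that separates the classes perfectly.
parent = ["High"] * 4 + ["Low"] * 4
children = [["High"] * 4, ["Low"] * 4]
scores = {
    "information_gain": information_gain(parent, children),
    "gain_ratio": gain_ratio(parent, children),
    "gini_index": gini_reduction(parent, children),
    "accuracy": split_accuracy(children),
}
print(scores)
```

Each criterion scores the same split on a different scale, which is one reason why trees grown with different criteria can end up using different attributes.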
Figure 2-5 shows the preliminary tree built with the accuracy criterion. As we can see, Credit_Score is the best predictor of the Credit_Risk class a borrower belongs to. When the credit score is less than or equal to 518, the next best predictor is the Debt_Income_Ratio attribute. If Debt_Income_Ratio is greater than about 10%, the borrower is predicted to fall into the ‘DO NOT LEND’ class.
Figure 2-5. Decision tree results using accuracy
In Figure 2-6, the prediction for the class ‘DO NOT LEND’ is 100% certain, because no other class occurs in that leaf. By contrast, although the tree predicts that borrowers whose Debt_Income_Ratio is less than 10% belong to the ‘High’ credit risk class, the model is not 100% confident in that prediction, because the leaf contains 163 ‘High’ observations and one ‘DO NOT LEND’ observation (Figure 2-6). When we get to the evaluation process, we will discuss how this uncertainty translates into confidence percentages and how to validate these confidences.
Figure 2-6. Class frequencies for DO NOT LEND and High
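One common way to turn leaf class frequencies into confidences is to use each class's share of the observations in the leaf. The sketch below applies that idea to the counts quoted above (163 High, 1 DO NOT LEND). It assumes this simple proportion rule for illustration; the helper name is hypothetical and RapidMiner's exact calculation is not confirmed here.

```python
def leaf_confidences(class_counts):
    """Convert class frequencies in a leaf into proportions (assumed confidence rule)."""
    total = sum(class_counts.values())
    return {label: count / total for label, count in class_counts.items()}


# Class frequencies of the leaf discussed above.
leaf = {"High": 163, "DO NOT LEND": 1}
confidences = leaf_confidences(leaf)

# The predicted class is the one with the highest confidence.
prediction = max(confidences, key=confidences.get)
print(prediction, round(confidences["High"], 3), round(confidences["DO NOT LEND"], 3))
```

Under this rule the leaf predicts High with a confidence just below 1, which matches the observation that the model is not fully certain about this branch.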
In the same way, we can predict the other credit risk classes by following the nodes and leaves of the decision tree. Interestingly, in the decision tree built with the accuracy criterion only four variables influence the prediction of credit risk: Credit_Score, Debt_Income_Ratio, Late_Payments and Months_In_Job. The three attributes Loan_Amt, Liquid_Assets and Num_Credit_Lines are not used in the prediction. Of course, if we change the accuracy criterion to another criterion such as gain_ratio, information_gain or gini_index, these three attributes are used for the prediction.
Secondly, as with the decision tree, we prepared the two data sets and applied the Set Role operator as well as the Select Attributes operator. Then we added the neural network operator to the main process (Figure 2-7).
Figure 2-7. The Neural network operators added to the model
2.3 Subtask c
2.3.1 Modeling – Decision Tree
While preparing the data, we decided to use only four predictor variables. Using the Select Attributes operator, the four predictor variables, the id variable and the target variable are selected. The next step is to apply the decision tree model to the scoring data. In Figure 2-8, the CreditRisk Scoring data set is linked to the unlabelled data port (unl). To show the results, the label predictions (lab) port and the decision tree model (mod) port are connected to res ports.
Figure 2-8. Applying the decision tree model to the scoring data, and outputting label predictions (lab) and a decision tree model (mod).
When we apply the model, the decision tree itself looks familiar, but this time it has been applied to the scoring data.
Figure 2-9 Meta data for scoring data set predictions.
RapidMiner has created confidence attributes along with a prediction attribute. We can also see the four predictor attributes: Credit_Score, Late_Payments, Months_In_Job and Debt_Income_Ratio (Figure 2-9). Interestingly, the maximum value of confidence(Moderate) is 0.972 rather than 1, which means that some Moderate predictions may be false positives. In the evaluation process, we will validate the decision trees.
Figure 2-10 Predictions and their associated confidence percentages using the decision tree
RapidMiner is completely convinced that Applicant ID 88858 is going to be Very Low (100%), while applicant 628458 is predicted to be Low with 98.2% confidence. Although applicant 628458 also has a 1.8% confidence for Very Low, the applicant is assigned to the Low credit risk class because that class has the highest confidence. The confidences change depending on the splitting criterion, which we will discuss in the evaluation stage.
2.4 Subtask d
2.4.1 Modeling – Neural Network
Figure 2-11 shows the graphical view of the neural network model. The circles in the neural network graph are nodes, and the lines between nodes are called neurons. Each input node corresponds to a predictor attribute, while each output node corresponds to a target value: Moderate, High, Low, DO NOT LEND and Very Low. The thicker and darker the neuron between two nodes, the stronger the affinity between those nodes (Matthew, 2012).
Figure 2-11 A graphical view of the improved neural network
As with the decision tree, we can see similar meta data for the scoring data set predictions. However, the predictions and confidences differ slightly. Only the Very Low value reaches 100% confidence. For DO NOT LEND, the maximum confidence is 0.498 (49.8%), so no applicant is predicted as DO NOT LEND.
Figure 2-12 Meta data for scoring data set predictions using the neural network.
When we select the four attributes instead of all the predictor variables, we see similar results. Thus, we can be reasonably confident that the four predictor variables are enough to predict the credit risk. However, we should not ignore the other attributes: if their ranges change in the real world, they could become useful predictor variables.
Figure 2-13 Meta data with four predictor variables
In Figure 2-14, as with the decision tree, applicants 888858 and 628458 have similar confidences and are predicted to be Very Low and Low, respectively. For applicant 863682, however, there is a large difference. Even though the prediction for Credit Risk is Moderate in both models, the confidence of the neural network is only 0.622 (62.2%), while that of the decision tree is 0.972 (97.2%).
Figure 2-14 Predictions and their associated confidence percentages using the neural network
2.5 Subtask e
2.5.1 Evaluation (Confusion Matrix)
Model Evaluation is an integral part of the model development process. It helps to find the best model that represents our data and how well the chosen model will work in the future.
This step assesses the degree to which the selected model meets the business objectives and, if so, to what extent (Efraim et al, p.174). In this assignment, we used Cross-Validation, which is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model (Payam, 2008).
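The splitting idea behind cross-validation can be sketched in a few lines of plain Python. This is an illustration under assumptions: the number of folds `k`, the seed and the function name are hypothetical, since the report does not state the validation settings that were used in RapidMiner.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each row is used for validation exactly once and for training k - 1 times.
splits = list(k_fold_indices(n_rows=10, k=5))
for train, test in splits:
    print(len(train), "training rows,", len(test), "validation rows")
```

A model would be trained on each training segment and scored on the matching validation segment, and the per-fold results would then be averaged.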
In Figure 2-15, we can see two Validation operators. One is for the decision tree and the other is for the neural network. We used the Multiply operator, which copies its input object to all connected output ports. It does not modify the input object.
Figure 2-15 Validating the decision tree and the neural network.
RapidMiner calculates a 95.41% accuracy rate for the neural network model (Figure 2-16). This overall accuracy summarises performance across all values of the Credit_Risk attribute, while the class precision rates describe each predicted value separately. For example, the class precision of pred.Moderate is 94.54%, which means that 5.46% of the applicants predicted as Moderate actually belong to another class (false positives). Surprisingly, the class precision of pred.DO NOT LEND is 0%, which reflects the fact that the neural network makes no DO NOT LEND predictions.
Figure 2-16 Evaluating the predictive quality of the neural network
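The relationship between a confusion matrix, overall accuracy and the per-class rates can be sketched as follows. The counts are hypothetical and are not the values from the figures; the layout (rows are predicted classes, columns are true classes) is assumed to mirror the RapidMiner performance table. One class is deliberately never predicted, to show why its class precision is reported as 0.

```python
# Hypothetical confusion matrix: matrix[predicted][actual] = number of applicants.
matrix = {
    "Low":      {"Low": 45, "Moderate": 2,  "High": 0},
    "Moderate": {"Low": 5,  "Moderate": 40, "High": 3},
    "High":     {"Low": 0,  "Moderate": 0,  "High": 0},  # class never predicted
}


def accuracy(m):
    """Share of all applicants whose predicted class equals the true class."""
    correct = sum(m[label][label] for label in m)
    total = sum(sum(row.values()) for row in m.values())
    return correct / total


def class_precision(m, label):
    """Of the applicants predicted as `label`, the share that truly are `label`."""
    predicted_total = sum(m[label].values())
    return m[label][label] / predicted_total if predicted_total else 0.0


def class_recall(m, label):
    """Of the applicants that truly are `label`, the share predicted as `label`."""
    actual_total = sum(m[predicted][label] for predicted in m)
    return m[label][label] / actual_total if actual_total else 0.0


print("accuracy:", round(accuracy(matrix), 4))
for label in matrix:
    print(label, round(class_precision(matrix, label), 4), round(class_recall(matrix, label), 4))
```

In this toy matrix the overall accuracy stays high even though one class is never predicted, which is why the per-class rates need to be read alongside the overall figure.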
For the decision tree using the gain_ratio criterion, the overall accuracy is 97.19%, which is higher than that of the neural network. Even the class precision rate for pred.DO NOT LEND is 87.50%, which means that 12.50% of the DO NOT LEND predictions are false positives.
Figure 2-17 Evaluating the predictive quality of the decision tree using gain_ratio criterion
Now we can look at another confusion matrix, derived from the decision tree using the accuracy criterion (Figure 2-18). The model’s predictive ability improves further. Because only 2.07% of the predictions are wrong, we can trust the predictions of this decision tree.
Figure 2-18 Evaluating the predictive quality of the decision tree using accuracy criterion
2.5.2 Evaluation (Lift Charts)
To use lift charts and ROC curves for evaluating models, we need to convert the target variable credit.risk to a nominal variable with two values (Good and Bad). To do this, the Map operator is used. Figure 2-19 shows how the old values are changed to the new values. Which values belong to Good or Bad depends on the decisions of the real business. Here we treat DO NOT LEND and High as Bad credit risk, and Moderate, Low and Very Low as Good.
Figure 2-19 Converting the target variable to a nominal variable with two values (Good and Bad)
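The same five-to-two value mapping can be written as a small lookup table. The sketch below is only an illustration of the conversion described above, not the RapidMiner Map operator itself, and the names `RISK_TO_BINARY` and `to_binary` are hypothetical.

```python
# Mapping described in the text: High and DO NOT LEND are Bad, the rest are Good.
RISK_TO_BINARY = {
    "Very Low": "Good",
    "Low": "Good",
    "Moderate": "Good",
    "High": "Bad",
    "DO NOT LEND": "Bad",
}


def to_binary(labels):
    """Convert five-level credit risk labels to Good/Bad (KeyError on unknown labels)."""
    return [RISK_TO_BINARY[label] for label in labels]


example = ["Very Low", "High", "Moderate", "DO NOT LEND", "Low"]
binary = to_binary(example)
print(binary)
```

Raising an error on an unknown label is a deliberate choice in this sketch, so that a misspelt rating is not silently assigned to either group.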
Figure 2-20 is the final main process for both Compare ROCs and Create Lift Chart operators. We have applied Create Lift Chart to two models; the neural network and the decision tree using accuracy criterion. The target class is Bad, because lenders do not want to lose their money.
Figure 2-20 Compare ROCs and Create Lift Chart for evaluating two models
Firstly, let’s consider the neural network. For confidences for Bad from 0.87 (87%) to 1 (100%), RapidMiner predicts Bad credit risk with 100% accuracy. When the confidence goes down to 0.01 (1%), the false positive rate is only about 16%. However, this does not influence the overall true positive rate. As we saw in Figure 2-16, the accuracy rate for this model is 95.41%.
Figure 2-21 Lift Chart for Credit_Risk = Bad of the neural network
The decision tree with the accuracy criterion is simpler to analyze. When the confidence for Bad is 1 (100%), 209 out of 209 applicants are predicted accurately. Considering that the overall accuracy is 97.93%, this is not surprising. Comparing the two lift charts, the decision tree is the better choice for our prediction of credit risk.
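The numbers behind a lift chart can be sketched by ranking applicants by their confidence for the target class and counting the true target cases in each bin. The confidences and labels below are hypothetical, the bin count is arbitrary, and the function is an illustrative assumption rather than a reproduction of the Create Lift Chart operator.

```python
def lift_table(confidences, actuals, target="Bad", n_bins=4):
    """Rank by confidence for `target`, split into bins, and report hits and lift per bin."""
    ranked = sorted(zip(confidences, actuals), key=lambda pair: pair[0], reverse=True)
    total_targets = sum(1 for _, actual in ranked if actual == target)
    base_rate = total_targets / len(ranked)
    bin_size = -(-len(ranked) // n_bins)  # ceiling division
    rows, cumulative = [], 0
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for _, actual in chunk if actual == target)
        cumulative += hits
        rows.append({
            "size": len(chunk),
            "hits": hits,
            "cumulative_share": cumulative / total_targets,
            "lift": (hits / len(chunk)) / base_rate,
        })
    return rows


# Hypothetical confidences for Bad and the corresponding true classes.
conf_bad = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10]
actual = ["Bad", "Bad", "Bad", "Good", "Bad", "Good", "Good", "Good"]
table = lift_table(conf_bad, actual)
for row in table:
    print(row)
```

A model that ranks well concentrates the Bad cases in the first bins, so the lift starts above 1 and the cumulative share reaches 1 early.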
The ROC chart is similar to the gain or lift charts in that it provides a means of comparison between classification models. The ROC chart plots the false positive rate (1-specificity) on the X-axis, the probability of target=1 when its true value is 0, against the true positive rate (sensitivity) on the Y-axis, the probability of target=1 when its true value is 1. Ideally, the curve will climb quickly toward the top-left, meaning the model correctly predicted the cases (Sayad, 2011). The AUC (Area Under the Curve) is almost 1 for both models, which is consistent with the overall accuracy of 97.93% and 95.41% for the decision tree and the neural network, respectively. The graph shows that the decision tree is more accurate than the neural network.
In the previous evaluation stage, we showed that the decision tree performs better. In particular, the tree built with the accuracy criterion achieves 97.93% overall accuracy. Moreover, it reduces the number of predictor variables from seven to four. We now have a decision tree that shows credit institutions which attributes matter most in predicting credit risk. However, we need to keep in mind that the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. Whenever business needs change, or predictor variables are added or modified, we have to repeat the CRISP-DM process.
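The ROC curve and its AUC can be computed directly from confidences and true classes, as in the sketch below. The data are hypothetical and the function names are assumptions; this illustrates the definitions given above and is not the Compare ROCs operator.

```python
def roc_points(confidences, actuals, target="Bad"):
    """Return (false positive rate, true positive rate) points, best-ranked cases first."""
    ranked = sorted(zip(confidences, actuals), key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, actual in ranked if actual == target)
    negatives = len(ranked) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (conf, actual) in enumerate(ranked):
        if actual == target:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case of a tied confidence value.
        if i + 1 == len(ranked) or ranked[i + 1][0] != conf:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical confidences for Bad and the corresponding true classes.
conf_bad = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10]
actual = ["Bad", "Bad", "Bad", "Good", "Bad", "Good", "Good", "Good"]
points = roc_points(conf_bad, actual)
area = auc(points)
print(points)
print("AUC:", area)
```

The AUC measures how well the confidences rank Bad cases above Good ones, so it is related to accuracy but not determined by it.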