Bala Deshpande on Thu, May 12, 2011 @ 08:35 AM
In part 1 we gave a brief introduction to logistic regression and indicated when it might be appropriate to use it in business analytics settings. Probably the best definition of logistic regression is this: a mathematical modeling approach in which the best-fitting, yet least-restrictive, model is desired to describe the relationship between several independent explanatory variables and a dependent dichotomous response variable.
In this article we get into the details of how the model equation is developed and then show how to set up a simple analysis using RapidMiner.
How does logistic regression find the sigmoid curve?
A straight line can be depicted by only two parameters: the slope (m) and the intercept (c). The way in which Xs and Ys are related to each other can be simply specified by m and c. However, an S-shaped curve is a much more complex shape, and representing it parametrically is not as easy. So how does one find a mathematical means to relate the Xs to the Ys?
It turns out that if we transform the Ys to the logarithm of the odds of Y, then the transformed target variable is linearly related to the Xs. In most cases where we need to use logistic regression, the Y is usually a YES-NO type of response. This is usually interpreted as the probability of an event happening (Y=1) or not happening (Y=0).
If Y is an event (response, pass, etc.),
and p is the probability of the event happening (Y=1),
then (1-p) is the probability of the event not happening (Y=0).
It turns out that log(p/(1-p)), the logarithm of the odds, is linear in the predictors, X.
From the data given, we know the X and can compute the p for each value of X. After this, of course, the problem is essentially similar to linear regression. (To see the sigmoid curve, the variables need to be transformed from the p-space to the Y-space.)
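To make the transformation concrete, here is a minimal sketch in plain Python. The intercept and slope are hypothetical values picked only for illustration; they do not come from any fitted model in this article. The point is that the log-odds form a straight line in X, while the corresponding probabilities trace out the S-shaped curve.

```python
import math

def logit(p):
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -3.0, 0.5

xs = [0, 2, 4, 6, 8, 10, 12]
log_odds = [intercept + slope * x for x in xs]  # a straight line in x
probs = [sigmoid(z) for z in log_odds]          # an S-shaped curve in x

for x, z, p in zip(xs, log_odds, probs):
    print(f"x={x:>2}  log-odds={z:+.2f}  p={p:.3f}")
```

Applying logit to each value in probs recovers the straight line again, which is why the fitting problem can be posed in terms of a linear function of X.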
The logistic regression model from Eq. 1 ultimately delivers the probability of Y happening (i.e. Y=1), given specific value(s) of X.
7 steps to a simple logistic regression model in RapidMiner
The data we used comes from an example here for a credit scoring exercise. The objective is to predict DEFAULT (Y or N) based on two predictors: Loan age (business usage) and number of days of delinquency. There are 100 samples.
Step 1: Load the spreadsheet into RapidMiner. Use the process described here. Remember to set the DEFAULT column as Label.
Step 2: Split data into train and test samples using the Split Validation operator as shown here.
Step 3: Add the Logistic Regression operator in the training window of the Split Validation operator.
Step 4: Add the Apply Model operator in the testing window of the Split Validation operator in a similar manner as discussed here. Just use default parameter values.
Step 5: Add the Performance evaluation operator in the testing window of the Split Validation operator as discussed here.
Step 6: Connect all ports as shown below.
Step 7: Run the model and view results. In particular, check for the Kernel Model, which shows the coefficients for the two predictors and the intercept. Also check the confusion matrix for Accuracy, Sensitivity, and Specificity, and finally view the ROC curves and check AUC.
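For readers who want to see what the measures in Step 7 are made of, the sketch below computes them by hand in plain Python. The labels and scores are made up for illustration and are not the credit data used in this article; the function names are our own, not RapidMiner's. AUC is computed here as the share of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as half.

```python
def confusion_counts(actual, predicted, positive="Y"):
    """Return (tp, fp, tn, fn), treating `positive` as the event of interest."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def auc(actual, scores, positive="Y"):
    """Chance that a random positive case outscores a random negative one."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and model scores, for illustration only.
actual = list("YYYYNNNNNN")
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.45, 0.6]
predicted = ["Y" if s >= 0.5 else "N" for s in scores]  # 0.5 cutoff is an assumption

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)
sensitivity = tp / (tp + fn)   # share of actual Y cases caught
specificity = tn / (tn + fp)   # share of actual N cases correctly cleared

print("accuracy:", accuracy)
print("sensitivity:", sensitivity)
print("specificity:", round(specificity, 3))
print("AUC:", round(auc(actual, scores), 3))
```

Accuracy, sensitivity, and specificity all depend on the cutoff used to turn a score into Y or N, whereas AUC summarizes the ranking quality across all cutoffs.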
The accuracy of the model based on the 30% testing sample is 83%. The ROC curve has an AUC of 0.863, which is quite acceptable. The next step would be to review the kernel model and prepare for deploying this model.
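The same split, fit, apply, and evaluate workflow can also be sketched outside RapidMiner. The plain-Python example below is only an illustration under stated assumptions: the 100 rows are synthetic (generated from a made-up model, not the credit data above), both predictors are assumed to be rescaled to the 0 to 1 range, and the fit uses simple batch gradient ascent on the log-likelihood rather than whatever solver RapidMiner uses internally. Its numbers will therefore not match the 83% and 0.863 reported above.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, x):
    """P(Y=1 | x); weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=0.5, epochs=3000):
    """Maximize the log-likelihood by plain batch gradient ascent."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = y - predict_proba(weights, x)
            grad[0] += err
            for j, v in enumerate(x, start=1):
                grad[j] += err * v
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights

# Synthetic stand-in for the 100-row credit data (NOT the article's data).
rng = random.Random(42)
rows, labels = [], []
for _ in range(100):
    loan_age, delinquency = rng.random(), rng.random()
    # Made-up generating model, used only to produce example labels.
    true_log_odds = -2.5 - 1.0 * loan_age + 6.0 * delinquency
    rows.append((loan_age, delinquency))
    labels.append(1 if rng.random() < sigmoid(true_log_odds) else 0)

# Step 2: 70/30 train/test split (rows were generated in random order).
train_x, test_x = rows[:70], rows[70:]
train_y, test_y = labels[:70], labels[70:]

# Steps 3 to 5: fit on the training part, apply to the test part, measure.
weights = fit_logistic(train_x, train_y)
predicted = [1 if predict_proba(weights, x) >= 0.5 else 0 for x in test_x]
accuracy = sum(p == y for p, y in zip(predicted, test_y)) / len(test_y)

# Step 7: inspect the intercept and coefficients, then the test accuracy.
print("intercept, loan_age, delinquency:", [round(w, 3) for w in weights])
print("test accuracy:", round(accuracy, 3))
```

The fitted weights play the same role as the intercept and the two predictor coefficients reported by the model in Step 7: plugging a new applicant's predictor values into predict_proba gives the estimated probability of default.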