
Risk prediction in life insurance industry using supervised learning algorithms

Presented By

Bharat Sharman, Dylan Li, Leonie Lu, Mingdao Li

Introduction


Risk assessment lies at the core of the life insurance industry. It is extremely important for a life insurance company to assess the risk of an application accurately, so that genuinely low-risk applications are accepted and genuinely high-risk ones are rejected. Otherwise, individuals with an unacceptably high risk profile will be issued policies, and when they pass away the company will face large losses due to high insurance payouts. This situation is known as 'adverse selection': individuals who are most likely to suffer losses buy insurance, while those who are unlikely to suffer losses do not, and the company loses money as a result.

Traditionally, the process of underwriting (deciding whether or not to insure the life of an individual) has been done using actuarial calculations. Actuaries group customers according to their estimated levels of risk, determined from historical data (Cummins J, 2013). However, these conventional techniques are time-consuming, and it is not uncommon for a policy to take a month to issue. They are also expensive, as many manual processes need to be executed.

Predictive analytics has emerged as a useful technique for streamlining the underwriting process, reducing the time to policy issuance and improving the accuracy of risk prediction. In this paper, the authors use data from the Prudential Life Insurance company and investigate the most appropriate dimensionality reduction method and supervised learning algorithm for assessing risk.

Literature Review



Before a life insurance company issues a policy, it must execute a series of underwriting-related tasks (Mishr, 2016). These tasks involve gathering extensive information about the applicant: the insurer has to analyze the applicant's employment, medical, family, and insurance histories and factor all of them into a complicated series of calculations to determine the applicant's risk rating. Premiums are then calculated on the basis of this risk rating (Prince, 2016).

In a competitive marketplace, customers need policies to be issued quickly, and long wait times can lead them to switch to other providers (Chen, 2016). In addition, data gathering and analysis can be expensive: the insurance company bears the cost of medical examinations, and if a policy lapses, the insurer has to absorb all of these costs as losses (J Carson, 2017). If the underwriting process uses predictive analytics, then the costs and time associated with many of these processes can be reduced through streamlining.

Methods and Techniques



Figure 1 depicts the process flow of the analytics approach. Its stages are described in the following sections.

Description of the Dataset

The data is obtained from the Kaggle competition hosted by the Prudential Life Insurance company. It has 59,381 applications with 128 attributes, which include continuous, discrete, and categorical variables. The data attributes, their types, and their descriptions are shown in Table 1.
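As a quick, hedged illustration (not part of the paper), the attribute types and missing values can be inspected directly from the Kaggle file. The file name train.csv and the use of pandas are assumptions.

# A rough sketch (not from the paper): load the Kaggle Prudential training file
# and summarize attribute types and missingness. The file name "train.csv" is an assumption.
import pandas as pd

df = pd.read_csv("train.csv")   # hypothetical path to the Kaggle training data
print(df.shape)                 # the paper reports 59,381 applications and 128 attributes

# Rough split of columns by dtype, as a stand-in for the type column of Table 1.
categorical = df.select_dtypes(include="object").columns
numeric = df.select_dtypes(include="number").columns
print(f"{len(categorical)} categorical columns, {len(numeric)} numeric columns")

# Attributes with the largest fraction of missing values.
print(df.isna().mean().sort_values(ascending=False).head(10))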

Data Pre-Processing

In the data preprocessing step, missing values are either imputed or the corresponding entries are dropped, and some of the attributes are transformed into a different form to make subsequent processing of the data easier, as sketched below.
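The paper does not spell out its exact preprocessing recipe, so the following is only a minimal sketch of the ideas described above, assuming the DataFrame df from the loading step and a "Response" target column; the 50% missingness threshold and the median/one-hot choices are illustrative assumptions.

# A minimal preprocessing sketch, not the paper's exact recipe.
import pandas as pd
from sklearn.impute import SimpleImputer

# Drop attributes that are mostly missing; keep the rest for imputation.
mostly_missing = df.columns[df.isna().mean() > 0.5]
df = df.drop(columns=mostly_missing)

# Impute remaining missing numeric values with the column median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Transform categorical attributes into a numeric form for the later models.
df = pd.get_dummies(df, drop_first=True)

# Separate features and the (assumed) "Response" risk rating target.
X, y = df.drop(columns="Response"), df["Response"]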

Dimensionality Reduction

Two methods are used for dimensionality reduction in this paper:

1. Correlation-based Feature Selection (CFS): This is a feature selection method in which a subset of the original features is selected. The algorithm selects features from the dataset that are highly correlated with the output but not correlated with each other, and the user does not need to specify the number of features to be selected. The correlation values are calculated using measures such as Pearson's coefficient, minimum description length, symmetrical uncertainty, and relief.

2. Principal Components Analysis (PCA): PCA is a feature extraction method that transforms the existing features into a new set of features such that the correlation between them is zero and the transformed features explain the maximum variability in the data. A short code sketch of both methods follows this list.
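The sketch below illustrates both ideas under stated assumptions: a greedy correlation filter as a simplified stand-in for CFS (not the paper's exact CFS implementation) and scikit-learn's PCA. X is assumed to be the numeric feature DataFrame and y the risk rating from the preprocessing step.

# A sketch of both reduction approaches; the greedy filter is only CFS-like.
import pandas as pd
from sklearn.decomposition import PCA

def correlation_filter(X: pd.DataFrame, y: pd.Series, redundancy_cutoff: float = 0.8):
    """Keep features highly correlated with y but weakly correlated with each other."""
    relevance = X.corrwith(y).abs().sort_values(ascending=False)
    selected = []
    for feature in relevance.index:
        if all(abs(X[feature].corr(X[kept])) < redundancy_cutoff for kept in selected):
            selected.append(feature)
    return selected

cfs_like_features = correlation_filter(X, y)

# PCA: uncorrelated components ordered by explained variance; keeping enough
# components to explain 95% of the variance is an illustrative choice.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_.cumsum())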


Supervised Learning Algorithms

The four algorithms used in this paper are the following:

1. Multiple Linear Regression: In MLR, the relationship between the dependent variable and two or more independent variables is modeled by fitting a linear equation. The model parameters are calculated by minimizing the sum of squared errors. The significance of the variables is determined by tests such as the F-test and by p-values.

2. REPT: REPT stands for Reduced Error Pruning Tree. It uses regression tree logic and creates many trees across several iterations, developing them based on the principles of information gain and variance reduction. When pruning, the algorithm uses the lowest mean squared error to select the best tree.

3. Random Tree: A random tree considers a random subset of the attributes at each node of the decision tree and builds the tree based on a random selection of both data and attributes. Random Tree does not perform pruning; instead, it estimates class probabilities based on a hold-out set.

4. Artificial Neural Network: In a neural network, the inputs are transformed into outputs via a series of layered units, where each unit applies a function to the input it receives and passes the result on to the units in the next layer. The weights applied to the inputs are improved after each iteration via a method called backpropagation, in which errors are propagated backward through the network and used to update the weights so that the computed output moves closer to the actual output. A sketch comparing all four models follows this list.
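The following is a hedged comparison sketch, not the authors' setup: it approximates the four model families with scikit-learn analogues, using a pruned DecisionTreeRegressor as a rough stand-in for REPT, ExtraTreeRegressor for Random Tree, and MLPRegressor for the neural network. X_reduced (the feature matrix after CFS or PCA) and y (the risk rating) are assumed to exist from the earlier steps.

# Approximate the four models with scikit-learn analogues and compare them
# with cross-validated mean absolute error (an illustrative metric choice).
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

models = {
    "Multiple Linear Regression": LinearRegression(),
    "REPT (approx.)": DecisionTreeRegressor(ccp_alpha=1e-3),        # pruned regression tree
    "Random Tree (approx.)": ExtraTreeRegressor(max_features="sqrt"),
    "Artificial Neural Network": MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
}

for name, model in models.items():
    scores = cross_val_score(model, X_reduced, y, cv=5,
                             scoring="neg_mean_absolute_error")
    print(f"{name}: cross-validated MAE = {-scores.mean():.3f}")

Cross-validated mean absolute error is only one convenient way to compare the models here; the paper's own evaluation may use different measures.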