Data Mining
Essay by Aarti Singh • April 9, 2016 • Research Paper • 3,344 Words (14 Pages) • 2,284 Views
PVA – FUNDRAISING
IDS 572 – DATA MINING
ASSIGNMENT 2
Arati Singh
Nirav Dedhia
Sachin Pandey
Table of Contents
Question 1: Data Analysis
Question 2: Modelling
Question 3: Classification under Asymmetric Response and Cost
Question 1: Data Analysis
- Data Statistics:
Table 1 in the appendix gives the mean, number of missing values, and minimum and maximum values for each attribute. The appendix also includes the distribution of each variable.
The following are some inferences from the data analysis:
- The Target_B variable shows 65% donors and 35% non-donors.
- There are no missing values for Target_B, State, DOB, or HIT; however, there are missing values in AGE, HOMEOWNR, NUMCHILD, INCOME, and GENDER.
- The donors consist of 5367 females and 4107 males.
- The average age of the population is 61.6. The minimum age among the donors is 25.
- The largest number of donors come from the state CA.
- The average number of children is 1-2.
- The number of homeowners is 5478.
Note: Click on DATA SUMMARY: for detailed view
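As an illustration of the per-attribute statistics described above (mean, missing count, minimum, and maximum), here is a minimal Python sketch. It is not the tool used in this assignment; the `summarize` helper, the set of missing markers, and the sample rows are assumptions made only for the example.

```python
from statistics import mean

# Assumption: these markers denote a missing value in the raw data.
MISSING = {None, "", "?"}

def summarize(rows, column):
    """Return missing count, mean, min and max for one numeric column."""
    values = [row.get(column) for row in rows]
    present = [float(v) for v in values if v not in MISSING]
    return {
        "missing": len(values) - len(present),
        "mean": mean(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

# Hypothetical sample rows, not taken from the PVA data set.
rows = [
    {"AGE": "60", "TARGET_B": "1"},
    {"AGE": "", "TARGET_B": "0"},
    {"AGE": "25", "TARGET_B": "1"},
    {"AGE": "80", "TARGET_B": "0"},
]
age_summary = summarize(rows, "AGE")
print(age_summary)
```

Running the same helper over every column would give a table comparable in shape to the one in the appendix.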
- Data cleaning and transformation model
Below we describe each process we performed on the attributes:
- Generate New Attribute: We used the generate new attribute step to replace missing values with 0 and non-missing values with 1, creating a new variable for every variable transformed. The variables generated and the functions applied are listed in appendix table 4.
Note: Click here Generate attribute for detailed view.
- Remove old attributes: For every attribute that received a new attribute in the previous step, we removed the original variable. The attributes removed are listed below.
Removed old attributes:
CHILD03 | CHILD07 | CHILD12 |
CHILD18 | DOMAIN | GENDER |
HOMEOWNR | MAJOR | PEPSTRFL |
PVASTATE | RECINHSE | RECP3 |
RECPGVG | RECSWEEP |
Note: Click here Remove old variables for detailed view.
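The two steps above (generating a 0/1 attribute from missingness and then removing the original attribute) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `add_indicator` helper, the `HOMEOWNR_FLAG` name, and the missing markers are hypothetical and are not the names produced in our process.

```python
# Assumption: these markers denote a missing value in the raw data.
MISSING = {None, "", " ", "?"}

def add_indicator(rows, column, new_column):
    """Return new rows where `column` is replaced by a 0/1 flag.

    The flag is 0 when the original value is missing and 1 otherwise;
    the original column is dropped from the returned rows.
    """
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[new_column] = 0 if row.get(column) in MISSING else 1
        out.append(new_row)
    return out

# Hypothetical sample rows.
rows = [
    {"HOMEOWNR": "H", "AGE": 61},
    {"HOMEOWNR": "", "AGE": 47},
    {"HOMEOWNR": "?", "AGE": 70},
]
flagged = add_indicator(rows, "HOMEOWNR", "HOMEOWNR_FLAG")
print(flagged)
```

The helper builds new rows instead of modifying the input, so the original data stays available for comparison.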
- Eliminating less relevant Variables:
- Remove useless attributes: In this step, we eliminated variables that seemed less relevant for predicting the donor/non-donor target. We deleted variables with more than 50% null values and variables that were highly skewed. We also deleted variables such as past donation dates and donation amounts, as these did not contribute much to modelling.
The less relevant attributes removed are listed below.
Attributes | Reason for removal |
ADATE_1-ADATE_24 | These are all historical promotion values, which we found irrelevant for prediction. |
MDMAUD, MDMAUD_A, MDMAUD_F, MDMAUD_R | These variables represent the major donor matrix for donors who have given gifts previously, which we think is not necessary for our target prediction. The values of these fields are also highly skewed. |
WEALTH2 | WEALTH1 is already considered, making this redundant. |
ANC1 - ANC15 | The ancestry of persons is of no significance in predicting donors and non-donors. |
LSC1 - LSC4 | The language of persons is of no significance in predicting donors and non-donors. |
ODATEDW, OSOURCE, TCODE, DOB, NOEXCH, AGEFLAG, DATASOURCE, GEOCODE, LIFESRC, HPHONE_D, MAILCODE | These variables, covering geocode, zip, phone number, and other basic donor information, are also irrelevant for predicting our target variable. |
RECINHSE, RECPGVG, RECP3, RECSWEEP, NUMCHILD, CHILD03-CHILD18, SOLP3, SOLIH, MAJOR, COLLECT1, VETERANS, BIBLE, CATALOG, HOME, PETS, CDPLAY, STEREO, FISHER, GARDEN, BOATS, WALKER, PEPSTRFL | These variables were highly skewed and would over-predict the results, so we eliminated them. |
HHAGE1-HHAGE3, DW1-DW9, HV1-HV4, HU1-HU5, HHD1-HHD12, HHAS1-HHAS4, MC1-MC3, TPE1-TPE13, LFC1-LFC10, AFC1-AFC3, HC1-HC21 | We also removed some neighborhood population attributes with skewed values and some that we found redundant for predicting the target variable. |
TARGET_D | As suggested in the case, we discarded this variable. |
Note: Click Here Remove useless variable for detailed view.
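The two screening criteria mentioned above (more than 50% null values, and high skew) can be sketched in Python as follows. This is a hedged illustration, not the procedure we ran: the helper names, the missing markers, and especially the skewness cutoff of 2.0 are assumptions, and the sketch only handles numeric columns.

```python
# Assumption: these markers denote a missing value in the raw data.
MISSING = {None, "", "?"}

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    values = [row.get(column) for row in rows]
    return sum(v in MISSING for v in values) / len(values)

def skewness(values):
    """Population skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

def columns_to_drop(rows, columns, max_missing=0.5, max_abs_skew=2.0):
    """List numeric columns that are mostly missing or strongly skewed.

    max_abs_skew is a hypothetical cutoff chosen for this example.
    """
    drop = []
    for col in columns:
        if missing_fraction(rows, col) > max_missing:
            drop.append(col)
            continue
        present = [float(r[col]) for r in rows if r.get(col) not in MISSING]
        if present and abs(skewness(present)) > max_abs_skew:
            drop.append(col)
    return drop

# Hypothetical columns: A is 60% missing, B is skewed, C is symmetric.
rows = [
    {"A": "" if i < 6 else "1", "B": "100" if i == 9 else "0", "C": str(i + 1)}
    for i in range(10)
]
dropped = columns_to_drop(rows, ["A", "B", "C"])
print(dropped)
```

Several of the removals in the table were judgment calls about relevance and redundancy, which a mechanical filter like this one does not capture.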
- How we handled missing values:
- Map: In this step we replaced the value "?" in nominal variables with "N".
- Replace Missing Value: In this step we replaced unknown values in all numeric variables with 0.
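The Map and Replace Missing Value steps above can be sketched in Python as follows. This is an illustrative equivalent, not the operators we used: the `fill_missing` helper, the choice of which markers count as missing for numeric columns, and the sample rows are assumptions.

```python
def fill_missing(rows, nominal_columns, numeric_columns):
    """Return cleaned copies of the rows.

    Nominal columns: the marker "?" becomes "N".
    Numeric columns: missing values (None, "" or "?") become 0.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in nominal_columns:
            if new_row.get(col) == "?":
                new_row[col] = "N"
        for col in numeric_columns:
            if new_row.get(col) in (None, "", "?"):
                new_row[col] = 0
        cleaned.append(new_row)
    return cleaned

# Hypothetical sample rows.
rows = [
    {"COLLECT1": "Y", "INCOME": 5},
    {"COLLECT1": "?", "INCOME": None},
]
cleaned = fill_missing(rows, nominal_columns=["COLLECT1"], numeric_columns=["INCOME"])
print(cleaned)
```

Note that replacing a numeric missing value with 0 makes it indistinguishable from a true 0, which is one reason a separate 0/1 missingness attribute can be useful.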
Below is the summary table of variables with the missing-value replacement techniques used.
Attribute | Original Value | Missing Values Replaced By | Transformation Technique |
Domain | 1st byte = U, C, S, T, R; 2nd byte = 1, 2, 3 | N/A | cut(Domain,0,1) |
collect1, cards, kidstuff | Y / N | 0 | if(value=="Y",1,0) |
CHILD03, CHILD07, CHILD12, CHILD18 | M, F, B | 0 | if(value=="M" || value=="F" || value=="B",1,0) |
...
...