KNN Model Data Analysis
1. Executive Summary
After using a robust method to fill in the blanks in the dataset, we were able to fully utilize the data we had rather than compromise by deleting or averaging records. We could then work with a larger dataset and have more confidence in the results.
For StumbleUpon to differentiate between pages that are evergreen and pages that are not, the KNN model is the most helpful given the data we currently have. Even though there is a possibility that the model is overfit, the KNN model performs well against the other models. If understanding the contribution of individual variables within the model is important to StumbleUpon, then the logistic regression model should be used instead of the KNN model.
2. Data Preprocessing
The data for each numeric variable was plotted with box plots or histograms to determine its shape. A number of distributions had significant outliers, but these were not removed because they appeared to be correct data that simply did not fit the overall distribution well. For categorical variables, pivot tables were used to show the number of categories and the number of observations in each category.
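As an illustration, these exploratory plots and pivot counts could be produced with a short script like the one below. This is a minimal sketch only: the file name stumbleupon.csv and the column name alchemy_category are assumptions for illustration, not taken from the report.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: the dataset lives in "stumbleupon.csv"; column names
# below are illustrative.
df = pd.read_csv("stumbleupon.csv")

# Histogram for each numeric variable, to inspect its shape and
# spot outliers.
for col in df.select_dtypes(include="number").columns:
    df[col].plot(kind="hist", title=col)
    plt.show()

# Pivot-table-style counts for a categorical variable: one row per
# category with its number of observations.
print(df["alchemy_category"].value_counts())
```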
2.1. Data Cleaning
2.1.1. Label:
The response variable “Label” had missing values. Those records were deleted, reducing the dataset from 10,000 to 7,396 records, including a header.
2.1.2. Is_News:
The variable “Is_News” contained only one value, so it was deleted.
2.1.3. Frame_Based:
The variable “Frame_Based” also contained only one value, so it was deleted.
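A minimal sketch of these three deletion steps (2.1.1 through 2.1.3), reusing the same hypothetical DataFrame and illustrative lowercase column names as above:

```python
import pandas as pd

df = pd.read_csv("stumbleupon.csv")  # hypothetical file name

# 2.1.1: drop rows where the response variable is missing.
df = df.dropna(subset=["label"])

# 2.1.2 / 2.1.3: drop columns that hold only a single value and so
# carry no information.
for col in ["is_news", "frame_based"]:
    if df[col].nunique(dropna=True) <= 1:
        df = df.drop(columns=col)
```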
2.1.4. Alchemy_Category:
The missing values were assigned by parsing the “title” section of the boilerplate column and conducting a word search to determine which category the URL belonged to. To do this, the title of each URL was separated from the rest of the boilerplate text and copied to a new sheet. The titles were then checked to see if they contained words specific to any of the categories. For example, a URL title was searched for words such as “racing”, “car”, and “medal” to determine whether the URL belonged to the “sports” category; the same title was then searched for words relevant to each of the remaining categories. Overall, 3,189 titles were cross-referenced against 300 words per category.

Each URL was then assigned “word match” points based on the number of matched words found in its title. For example, if the title contained the three words “racing”, “car”, and “medal”, it received 3 “sports” points; if the same title additionally contained the words “calorie” and “body”, which belong to the “health” category, it also received 2 “health” points. Finally, the URL was assigned the Alchemy category with the highest number of points. If a URL had 3 “sports” points, for example, and fewer points for every other category, it was assigned “sports” as its Alchemy_Category. In the case of a tie in category points, for instance 3 for sports and 3 for health, the category was chosen at random between the tied categories, sports and health in this case.
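The word-match scoring described above can be sketched as follows. The keyword sets here are illustrative stand-ins for the actual 300-word lists per category.

```python
import random

# Illustrative keyword lists; the actual analysis used roughly 300
# words per category.
CATEGORY_KEYWORDS = {
    "sports": {"racing", "car", "medal"},
    "health": {"calorie", "body"},
    # ... remaining categories
}

def assign_alchemy_category(title):
    """Count keyword matches per category and return the category
    with the most points, breaking ties at random."""
    words = set(title.lower().split())
    points = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(points.values())
    if best == 0:
        return None  # no keywords matched; leave unassigned
    tied = [cat for cat, pts in points.items() if pts == best]
    return random.choice(tied)
```

For example, assign_alchemy_category("Olympic racing car wins gold medal") would return "sports" under these illustrative keyword lists.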
2.1.5. Alchemy_Category_Score:
After assigning Alchemy categories to the missing values, the Alchemy scores were imputed by taking the average alchemy score, by category, of the non-missing entries and allocating that value, with some penalization, to the newly estimated category values. The category with the highest average “word match” points received a score equal to the average “alchemy_score” of that category (non-missing entries only) less one standard deviation. For example, all newly predicted “recreation” categories received a score equal to the average alchemy_score of the non-missing “recreation” values minus one standard deviation of the score. Category scores were penalized more heavily the lower their average “word match” points were.
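As a sketch, the baseline imputation for one category amounts to the following; the varying penalty for lower word-match averages is simplified here to a single standard deviation.

```python
import statistics

def imputed_alchemy_score(known_scores):
    """Imputed score for one Alchemy category: the mean of that
    category's non-missing scores minus one standard deviation.
    (The report penalizes categories with lower average word-match
    points more heavily; that extra scaling is not shown here.)"""
    return statistics.mean(known_scores) - statistics.stdev(known_scores)
```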
2.1.6. Is_News_Front:
The frequencies of the 0’s and 1’s were determined to be .95 and .05, respectively. Missing values were replaced using a random number generated between 0 and 1: if the random number was greater than .95, the missing value was replaced with a 1; otherwise it was replaced with a 0.
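This random imputation can be sketched in a couple of lines:

```python
import random

def impute_is_news_front():
    """Replace a missing Is_News_Front value: roughly 5% of imputed
    values become 1 and 95% become 0, matching the observed frequencies."""
    return 1 if random.random() > 0.95 else 0
```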
2.2. Additional Data Set
A second dataset, with all records containing missing values deleted, was also created.
...
...