KNN Model Data Analysis

1. Executive Summary

After using a robust method to fill in the blanks in the dataset, we were able to fully utilize the data we had rather than compromise by deleting records or averaging values. We could then work with a larger dataset and have more confidence in the results.

For StumbleUpon to differentiate between pages that are evergreen and pages that are not, the KNN model is the most helpful given the data we currently have. Even though there is a possibility that the model is overfit, the KNN model performs well against the other models. If the contribution of individual variables within the model is important to StumbleUpon, then the logistic regression model should be used instead of the KNN model.

2. Data Preprocessing

The data for each numeric variable was plotted with box plots or histograms to determine its shape. A number of distributions had significant outliers, but these were not removed because they appeared to be correct data, just poorly fitting values. For categorical variables, pivot tables were used to show the number of categories and the number of observations in each category.
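
This profiling step could be reproduced with a short script. The following is a minimal pandas sketch, assuming the data sits in a hypothetical stumbleupon_train.csv file with types pandas can infer:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("stumbleupon_train.csv")  # hypothetical file name

    # One histogram per numeric variable, to inspect shape and spot outliers.
    for col in df.select_dtypes(include="number").columns:
        df[col].plot.hist(bins=30, title=col)
        plt.show()

    # Pivot-table-style counts: categories and observations per category.
    for col in df.select_dtypes(include="object").columns:
        print(df[col].value_counts(dropna=False))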

2.1. Data Cleaning

2.1.1. Label:

The response variable “Label” had missing values. The records containing them were deleted, reducing the dataset from 10,000 to 7,396 records, including a header.

2.1.2. Is_News:

The variable “Is_News” contained only a single value, so it was deleted.

2.1.3. Frame_Based:

The variable “Frame_Based” also contained only a single value, so it was deleted.
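
Steps 2.1.1 through 2.1.3 amount to dropping incomplete responses and uninformative columns. A minimal sketch, again assuming the hypothetical stumbleupon_train.csv file:

    import pandas as pd

    df = pd.read_csv("stumbleupon_train.csv")  # hypothetical file name

    # 2.1.1: drop records whose response ("Label") is missing.
    df = df.dropna(subset=["Label"])

    # 2.1.2 / 2.1.3: drop columns that carry only a single value.
    for col in ["Is_News", "Frame_Based"]:
        if col in df.columns and df[col].nunique(dropna=True) <= 1:
            df = df.drop(columns=col)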

2.1.4. Alchemy_Category:

The missing values were assigned by parsing the “title” section of the boilerplate column and conducting a word search to determine which category each URL belonged to. To do this, the title of the URL was separated from the rest of the boilerplate text and copied to a new sheet. Each title was then checked to see if it contained words specific to any of the categories. For example, a URL title was searched for words such as “racing”, “car”, and “medal” to determine whether the URL belonged to the “sports” category; the same title was then searched for words relevant to each of the remaining categories. Overall, 3,189 titles were cross-referenced against 300 words for each category.

Each URL was then assigned “word match” points based on the number of matched words found in its title. For example, if the title contained the three words “racing”, “car”, and “medal”, it received 3 “sports” points; if the same title additionally contained the words “calorie” and “body”, which belong to the “health” category, it also received 2 “health” points. Finally, the URL was assigned the Alchemy category with the highest number of points: if a URL had 3 “sports” points, for example, and fewer points for every other category, that URL was assigned “sports” as its Alchemy_Category. In the case of a tie in category points, for instance 3 for “sports” and 3 for “health”, the category was chosen at random from the tied categories, sports and health in this case.
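
The word-match scoring can be summarized in a short function. This is a sketch, not the original spreadsheet logic: the keyword lists below are illustrative stand-ins for the roughly 300 words per category used in the analysis, and ties are broken at random as described above:

    import random

    # Hypothetical keyword lists; the actual analysis used ~300 words per category.
    KEYWORDS = {
        "sports": {"racing", "car", "medal"},
        "health": {"calorie", "body"},
        # ... one entry per remaining Alchemy category
    }

    def assign_category(title: str) -> str:
        """Award one point per keyword match; break ties at random."""
        words = set(title.lower().split())
        points = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
        best = max(points.values())
        tied = [cat for cat, score in points.items() if score == best]
        return random.choice(tied)

    print(assign_category("racing car wins gold medal"))  # -> "sports" (3 points to 0)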

2.1.5. Alchemy_Category_Score:

After assigning Alchemy categories to the missing values, the Alchemy scores were determined by taking the average Alchemy score, by category, of the non-missing entries and allocating it, with some penalization, to the newly estimated category assignments. The category with the highest average “word match” points received the average “alchemy_score” of that category (non-missing entries only) less one standard deviation. For example, all the newly predicted “recreation” categories received the average alchemy_score of the non-missing “recreation” values minus one standard deviation of that score. Category scores were penalized more heavily the lower their average “word match” points were.
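
A sketch of this imputation in the simplest case, where every imputed score is penalized by exactly one standard deviation (the report scales the penalty further by each category's average “word match” points); column names and the file are assumptions carried over from the earlier sketches:

    import pandas as pd

    df = pd.read_csv("stumbleupon_train.csv")  # hypothetical file name

    # Per-category mean and standard deviation of the observed (non-missing) scores.
    stats = (
        df.loc[df["Alchemy_Category_Score"].notna()]
          .groupby("Alchemy_Category")["Alchemy_Category_Score"]
          .agg(["mean", "std"])
    )

    # Impute: category mean minus one standard deviation, i.e. the penalized average.
    penalized = stats["mean"] - stats["std"]
    missing = df["Alchemy_Category_Score"].isna()
    df.loc[missing, "Alchemy_Category_Score"] = df.loc[missing, "Alchemy_Category"].map(penalized)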

2.1.6. Is_News_Front:

The frequencies of 0's and 1's were determined to be .95 and .05, respectively. Each missing value was replaced using a random number drawn between 0 and 1: if the random number was greater than .95, the missing value was replaced with a 1; otherwise it was replaced with a 0.
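
This threshold rule translates directly into code. A sketch, assuming the same hypothetical file and using a fixed seed for reproducibility:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("stumbleupon_train.csv")  # hypothetical file name

    # Draw u ~ Uniform(0, 1) per missing entry; impute 1 only when u > .95,
    # so about 5% of imputed values are 1, matching the observed frequency.
    rng = np.random.default_rng(seed=42)
    missing = df["Is_News_Front"].isna()
    draws = rng.uniform(size=int(missing.sum()))
    df.loc[missing, "Is_News_Front"] = (draws > 0.95).astype(int)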

2.2. Additional Data Set

A dataset with all records containing missing values deleted was also created.
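
A sketch of this listwise deletion, assuming the same hypothetical file:

    import pandas as pd

    df = pd.read_csv("stumbleupon_train.csv")  # hypothetical file name
    df_complete = df.dropna()  # listwise deletion: keep only fully observed records
    print(len(df_complete), "fully observed records remain")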

...
