Multiple Regression Analysis: "locating New Pam and Susan's Stores"
Essay by TrudyGirl10 • January 23, 2012 • Research Paper • 2,110 Words (9 Pages) • 5,510 Views
Essay Preview: Multiple Regression Analysis: "locating New Pam and Susan's Stores"
Introduction
The main purpose of this study is to identify the factors that significantly increase the sales at any of the sites of Pam and Susan's stores, which are mostly located throughout the South. In order to estimate the sales, the data that has been considered is given in an Excel sheet "pamsue-1.xls", includes 250 observations and 33 variables including annual sales ($1000s) in each of the 250 stores. Using Excel tools, the required analysis has been carried out with this data and the corresponding results are used to make prediction of sales at two sites A and B.
Data
The given data consists of 250 observations and 33 variables, of which some of the variables quantitative and some are qualitative. The variable "store #" can be used for the analysis as it is refers to numerals/labels for the store number that have no inherent meaning. Also, the variable 'comtype' cannot be used directly for the analysis as it is given in ordinal level and as such the dummy variables can be created to utilize this information in the analysis. The description of each variable including the units of data values is presented in the following table:
Sl. No. Variable Description
store # Stores are numbered consecutively
%black % of population that is black
%spanishsp % of population that is Spanish speaking
%inc0-10 % families in each of the following family income categories (in $1,000s):
0-10; 10+-14; 14+-20, 20+-30; 30+-50; 50+-100; >100
%inc10-14
%inc14-20
%inc20-30
%inc30-50
%inc50-100
%inc100+
medianinc median yearly family income (in $)
medianrent median rent per month (in $)
medianhome median home value (in $)
%owners % home owners
%nocars % with no cars
%1car % with 1 car
%tvs % households with TV
%washers % households with washer
%dryers % households with dryer
%dishw % households with dishwasher
%aircond %households with air conditioner
%freezer % households with freezer
%sechome % households with second home
%sch0-8 % adults (over age 25) with the following years of education:
0-8; 9-11; 12; 13-15; > 16
%sch9-11
%sch12
%sch12+
population total population
familysize average family size
Selling Sq Ft (in 1000s) square feet of selling area (in thousands)
sales (in $1000s) annual sales (in thousands of dollars)
% hard goods % goods stocked that are hard goods
comtype Competitive type category number
From the above table, it can be clearly seen that most of the variables are given in percentages and those variables are said to be ratio variables. Also, the variables such as 'medianinc', 'medianrent', 'medianhome', sales that are given in dollars and also the variables 'population', 'familysize', and 'Selling Sq Ft' are all measured in ratio scale and hence these are also called as ratio variables. But, the variable 'comtype' is an ordinal variable with seven categories in order (from 1 to 7).
With these variables, the required analysis has been carried out using Excel tools (Data Data Analysis and Insert bar charts or scatter plots). Our main purpose of this study is to identify the independent variables which are significantly affecting the annual sales. As a first step, the correlation analysis has been carried out with all, except 'store #' and 'comtype', to identify the variables which are highly correlated with the dependent variable 'Sales ($1000s)'. After identifying the 10 independent variables, which are highly correlated with 'Sales ($1000s)', the regression analysis has been carried by treating 'Sales' as the dependent variables and these 10 highly correlated variables, including comtype1, comtype2, and comtype7 as the independent variables. Therefore, there would be 13 independent variables and one dependent variable 'sales' to carryout regression analysis. After conducting the regression analysis, the most insignificant variables have been identified and eliminated from the dataset. Then again the regression analysis has been carried out with the remaining variables and applied the same procedure, as earlier, until all the independent variables in the model are statistically significant (say at 5% level of significance). Once we obtained the final model with only significant independent variables, the regression equation has been constructed and then the estimated sales has been calculated to identify the Site that will have the higher predicted sales, according to our regression.
Results and Discussion:
Scatter Plot:
First the scatter plot has been constructed for 'Sales' against 'comtype' using Excel tool (Insert Scatter) and is presented below:
Figure 1
The above scatter plot shows that the sales is in the middle categories 3, 4, 5, and 6 are in similar ranges on the vertical axis, while the categories 1 and 2 have fairly higher sales and category 7 appears to have somewhat lower sales. This information would be very much useful for creating dummy variables for the variable 'comtype' while conducting multiple-regression analysis. So, we have created 7 dummy variables, comtype1, comtype2,..., and comtype7 to be included in the multiple-regression model. The above scatter plot shows that when we create dummy variables
...
...