Analysis of How Different Covariates Affect the Presence and Volume of Brown Fat in Patients
Essay by people • July 9, 2011 • Case Study • 7,438 Words (30 Pages) • 1,943 Views
Brown fat is a type of fat found primarily in human infants and in small animals living in cold environments, where it plays a major role in maintaining body heat. Using newly developed body-scanning technology, researchers recently discovered that brown fat is also present in some adult humans, and that one way of triggering its production is increased exposure to cold ambient temperatures. However, other factors may also contribute to brown fat production. In this project, we investigated several factors that may significantly affect the presence and volume of brown fat in adult humans.
Method
Data Collection
We obtained the dataset from the Statistical Society of Canada website; it was originally collected at the Molecular Imaging Center at the University of Sherbrooke. The dataset contains 20 predictor variables and six response variables, with 4,842 observations for each variable. Each row of observations pertains to one patient selected at random for the study.
The categorical predictor variables are sex (Female=1 and Male=2), presence of diabetes (No=0 and Yes=1), season (Spring=1, Summer=2, Autumn=3 and Winter=4), presence of cancer (No=0 and Yes=1), and type of cancer (No=0, lung=1, digestive=2, Oto-Rhino-Laryngology=3, breast=4, gynaecological (female)=5, genital (male)=6, urothelial=7, kidney=8, brain=9, skin=10, thyroid=11, prostate=12, non-Hodgkin lymphoma=13, Hodgkin=14, Kaposi=15, Myeloma=16, Leukemia=17 and other=18). The numerical predictor variables are age (in years), day of current year, month of observation, external temperature, average temperature (within last 2 days), average temperature (within last 3 days), average temperature (within last 7 days), average temperature (within last month), duration of sunshine, weight (in kilograms), size (in centimetres), body mass index (BMI), glycemia, lean body weight (LBW) and TSH. One of the six response variables is the presence of brown fat in the patient, a categorical variable denoted BrownFat in the dataset with two levels (No=0 and Yes=1). The remaining five response variables are the volumes of total, cervical, paravertebral, mediastinal and perirenal brown fat in the patient, which are all continuous numerical variables. They are denoted Total_Vol, Cervical_vol, Paravertebral_vol, Mediastinal_vol and Perirenal_vol respectively in the dataset.
Missing values
Analyzing datasets with missing observations can be very problematic. (Put reference) discussed three ways to deal with this issue: case or pairwise deletion, parameter estimation, and imputation. Case or pairwise deletion removes every row that has at least one missing observation, and it is the most popular way of dealing with missing observations. This method is only useful if the dataset is large enough that losing some rows makes little difference, and it should only be used if the observations are missing completely at random. Parameter estimation uses algorithms such as the Expectation-Maximization (EM) algorithm to calculate maximum likelihood estimates of the relevant parameters for datasets with missing observations. Imputation techniques estimate the missing values from the rest of the dataset, which fills in all of the gaps.
For our dataset, we used case or pairwise deletion to deal with the missing values, omitting the affected rows and columns in R. This choice is reasonable because our dataset is very large, with more than four thousand rows of observations and approximately two hundred rows with missing values. First, we omitted the predictor TSH, because the vast majority of observations in that column are missing. With so few observed values, it cannot usefully explain the variation in the response variables. After we excluded this predictor from the dataset, there were approximately two hundred rows with missing observations. These missing observations corresponded only to the Cancer_status and Cancer_type predictors. We omitted the 250 rows with NA's for both Cancer_status and Cancer_type, and then omitted the remaining 119 rows with NA's for Cancer_type alone.
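The deletion steps described above can be illustrated with a small sketch. The authors worked in R; the Python below is only an illustration under assumptions: the four toy rows, their values, and the helper names `drop_column` and `drop_incomplete` are hypothetical and do not come from the study's dataset or code.

```python
# Hypothetical toy rows; None stands in for a missing value (NA in R).
rows = [
    {"Age": 54, "TSH": None, "Cancer_status": 1, "Cancer_type": 4},
    {"Age": 61, "TSH": None, "Cancer_status": None, "Cancer_type": None},
    {"Age": 37, "TSH": 1.8, "Cancer_status": 1, "Cancer_type": None},
    {"Age": 45, "TSH": None, "Cancer_status": 0, "Cancer_type": 0},
]


def drop_column(rows, name):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def drop_incomplete(rows):
    """Case deletion: keep only rows that have no missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Step 1: remove the mostly-missing predictor (TSH in the text).
without_tsh = drop_column(rows, "TSH")
# Step 2: remove every row that still has a missing value.
complete = drop_incomplete(without_tsh)

print(len(rows), "rows before,", len(complete), "rows after deletion")
```

Dropping the sparse column first matters: if rows were deleted while TSH was still present, almost every row would be lost.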
Partition Data Set
We also divided our dataset into two subsets. One subset consisting of 20% of the original dataset was randomly selected without replacement, and is known as the test dataset. The other subset consisting of the remaining 80% of the original data is known as the training dataset. We built our model based on the training dataset and then predicted the outcome based on the test dataset.
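A random 80/20 partition without replacement can be sketched as follows. This is a minimal Python illustration, not the authors' R code; the function name `train_test_split`, the seed, and the stand-in list of 100 patient indices are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly partition rows into (train, test) without replacement."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_test = round(test_fraction * len(rows))
    # rng.sample draws distinct indices, so no row appears in both subsets.
    test_idx = set(rng.sample(range(len(rows)), n_test))
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical stand-in for the cleaned dataset: 100 patient indices.
patients = list(range(100))
train, test = train_test_split(patients)
print(len(train), len(test))
```

Fixing the seed keeps the same rows in the test set across runs, so the model fitted on the training set is always evaluated on the same held-out patients.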
Statistical Analysis
Penalized logistic regression for response: Brownfat Presence
In pursuit of the best model, we used penalized logistic regression. The generic logistic regression model is of the form
$$\log\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \beta_0 + \sum_{i=1}^{19} \beta_i X_i$$
In this equation, the Xi's are our 19 predictors, P(Y=1|X) is the probability that brown fat is present (i.e. brownfat=Yes) and P(Y=0|X) is the probability that brown fat is not present (i.e. brownfat=No). We also noticed that there appear to be relationships between age & bmi, age & glycemy, age & lbw, weight & glycemy, weight & lbw, bmi & glycemy, bmi & lbw, diabetes & age, and glycemy & lbw. We therefore included the corresponding interaction terms in our model. The summary of the generic logistic regression output (Table 1) suggests that the predictors sex, diabetes, month of February, month of January, digestive cancer, ORL cancer, kidney cancer, Type I skin cancer, Type K skin cancer, Type M skin cancer, Type N NHL cancer, other type of cancer, external temperature, 7 days temperature, age&bmi, age&lbw and diabetes&age may affect the presence of brown fat in patients.
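To make the role of an interaction term concrete, the sketch below evaluates a logistic model for one patient. It assumes the standard logistic form, in which the log-odds of brown fat being present is a linear function of the predictors. The coefficient values, the intercept, and the patient's age and BMI are made up for illustration and are not estimates from Table 1.

```python
import math


def predicted_probability(intercept, coefs, x):
    """P(Y=1 | X) for a linear predictor on the log-odds scale."""
    eta = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical patient; the interaction is the product of the two predictors.
patient = {"age": 50.0, "bmi": 24.0}
patient["age:bmi"] = patient["age"] * patient["bmi"]

# Made-up coefficients, chosen only to illustrate the calculation.
coefs = {"age": -0.03, "bmi": -0.05, "age:bmi": 0.0005}
p = predicted_probability(0.5, coefs, patient)

# Taking the log-odds of p recovers the linear predictor.
log_odds = math.log(p / (1.0 - p))
print(round(p, 3), round(log_odds, 3))
```

With an age:bmi term in the model, the effect of age on the log-odds is no longer a single constant: it shifts with the patient's BMI, which is what the interaction is meant to capture.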
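Logistic regression coefficients are normally estimated by maximum likelihood. For penalized regression, we instead maximize the log-likelihood subject to a size constraint on the L1 norm of the coefficients. This is done by minimizing the negative log-likelihood plus an L1 penalty,
$$-\ell(\beta_0, \beta) + \lambda \sum_{i \ge 1} |\beta_i|$$
where $\ell$ denotes the log-likelihood of the logistic model.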
In this equation, λ is the penalty parameter that
...
...