Recommendations, Analyses, and Data for Health Services Research, Diagnosis, and Prognosis Clinic Notes 2014

2014 Dec 15

Amos M. Sakwe, Department of Biochemistry and Cancer Biology, Meharry Medical College

2014 Nov 24

Paula DeWitt, Center for Biomedical Ethics and Society; Madhu Murphy, pediatric cardiac intensive care unit

Email: We are wanting to test the effectiveness of a “journey board” (see attachment) designed to better prepare parents for their child’s stay and eventual discharge from Vanderbilt’s pediatric cardiac intensive care unit. This will entail giving self-administered surveys with preparedness and satisfaction items to parents of children hospitalized in the pediatric ICU immediately before and immediately after the parent has been exposed to a 15-minute educational intervention using the journey board. The intervention will take place in the child’s hospital room (or another room in the unit) and will consist of a clinician walking the parent through the journey board, and answering any questions the parent may have. Immediately before this, a researcher (not the clinician) will approach the parent, explain the study, and ask if the parent would like to participate. If yes, the parent will be given two short (5-10 minute) self-administered surveys to complete. He/she will be asked to complete one prior to the intervention and one immediately after the intervention. The data will be used to assess the effectiveness of the journey board in preparing parents, and we would like your advice concerning numbers of parents we will need to interview to obtain statistical significance and statistical techniques to be used.
  • Frank's note: The design is confounded with time/fatigue/learning. Also, there is little precedent for doing a pre-post study with so little time between pre and post. I think you will need to do a randomized study to attribute any effect to the intervention. Randomize 1/2 of families to get the intervention, 1/2 to get the prevailing treatment, and give the survey at the "after" time point for both groups.

Amma N. Bosompem, MPH graduate student

I'm a graduate student in the Masters of Public Health program who will be looking at treatment completion rates in a specific population for my thesis. I need some guidance on study design and what analyses are appropriate given my sample size.
  • TB clinic in Nashville, 203 cases (information on case only) in year 2013.
  • Treatment: completed vs not completed (refused, lost to follow up, etc.)
  • Research question: 1. treatment completion rate. 2. the association between patient's characteristics and treatment completion.
  • Prepare data set as
  • Is there an association between country of origin and acceptance of treatment (accepted vs. not accepted)?
  • Apply logistic regression analysis (dichotomous or binary response variable) and include the variables of interest.
    • General rule of thumb: the size of the smaller outcome group divided by 10 helps you gauge the power of the regression, i.e., roughly how many variables you can include in the regression model.
    • N=48 refused treatment; this will be the limiting sample size in the regression analysis.
  • Country of origin is the main factor; will have to think of the best way of grouping.
    • The covariates of interest: age (continuous, modeled as non-linear), gender, marital status (married vs. non-married), and country of origin.
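
The rule of thumb above can be sketched in a few lines. This is only an illustration: the counts 203 and 48 come from the notes, the divisor of 10 is the rule as stated, and the function name is hypothetical.

```python
def parameter_budget(n_total, n_in_one_group, per_parameter=10):
    """Return (limiting sample size, rough number of model parameters).

    The limiting sample size for a binary outcome is the size of the
    smaller outcome group; dividing it by `per_parameter` gives a rough
    ceiling on how many regression parameters the data can support.
    """
    limiting = min(n_in_one_group, n_total - n_in_one_group)
    return limiting, limiting // per_parameter


# 203 cases in total, 48 of whom refused treatment (figures from the notes)
limiting, budget = parameter_budget(203, 48)
print(f"limiting sample size: {limiting}, parameter budget: about {budget}")
```

Note that a non-linear term for age and a multi-level grouping of country of origin each use more than one parameter, so they count more than once against this budget.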

2014 Nov 17

Mary Van Meter, medical student

My project is looking at the cost of packaging surgical instrument trays (possibly using regression) and calculating the percentage of instruments used in trays across various Gyn specialties.
  • Total number of instruments per tray (25-100); usually less than 50% are used.
  • Will compare unnecessary cost between specialties
  • Will apply for VICTR voucher. A standard $2000 is appropriate.

Tomas DaVee, GI

  • Patients who underwent liver transplant and had a plastic stent to treat a leak; about 20-30% needed a metal stent later
  • Want to predict early whether a patient needs a metal stent so the pt does not need to suffer pain
  • The current data only gives the conditional need for a metal stent given that the patient already had a plastic one
  • Suggest do descriptive statistics and plan bigger study to develop prediction model
  • Use R for internal validation and calibration using bootstrapping method (rms package)

2014 Nov 3

Monica Ledoux

I am an adjunct at Vanderbilt's Dermatology department, working with Zhengzheng Tang from biostats on microbiome and skin and would like to know the biostat budget for VICTR application(s)
  • Want to know the relationship between Cortisone treatment and bacterial change.
  • Each subject will be his own control: cortisone on one arm and no cortisone on the other. Each arm will be tested at two sites, one on normal skin and one on tape-stripped skin. Observe bacterial change. Therefore, each subject will have 4 tested samples and each sample will be measured twice (total 8 per person)
  • Look at treatment effect on normal skin. Suggest amount of $2000.

2014 Sept 29

Bryan I. Hartley, MD Department of Radiology

Review an abstract.
  • There are limitations of pre post design. Many factors will affect the outcome besides the intervention like time.
  • Box plot with raw data to explore the distribution
  • Can use Wilcoxon signed rank test to compare continuous outcomes before and after
  • Consider ANCOVA (Analysis of Covariance) to analyze the post measurement while adjusting for pre-specified covariates like previous experience

Adam P. Bregman, MD, MBA, Annette Ilg, Vanderbilt Internal Medicine Resident

I am starting a project with Dr. Anthony Langone in the renal transplant division and I have come across some questions with my redcap database that I would like to ask
  • Retrospective study of post renal transplant patients. Follow those patients for two years to observe a rare event.
  • Describe users' characteristics. Some pts took the medication for the entire 6 months, some stopped before 6 months for certain reasons, some restarted it later. Can consider using a certain amount of time on the medication to define a user.
  • Binary/categorical variables can be described as frequency and percentage

2014 Sept 22

Kasim Ortiz, Sociology, Doctoral Student

I’d like to obtain assistance with a project I completed this summer while attending the Fenway Summer Institute on LGBT Population Health. I used a dataset that permitted me to analyze the effects of states having restrictions on same-sex marriage on smoking, extending previous work by examining these impacts among racial minorities that are sexual minorities. My findings were counterintuitive to previously published work, insofar that I did not find where states restricting same-sex marriage policies had a negative impact on smoking among racial sexual minorities. The dataset is the Social Justice Sexuality Project.

Most previously published papers examining the effects of same-sex marriage policies on the health of LGBT populations have utilized GEE in their statistical analyses. However, I was having problems getting GEE models to converge with the cross sectional data I am using and thus I utilized GLM for the binomial family with adjustments for state clustering (vce) using STATA. I’m apprehensive about wrapping up the manuscript out of fear that I might get a ton of push back because of the methodology I’ve chosen. Also I was having problems with conducting post-hoc analyses on some of the interactions that I included in the analysis (sample sizes kept changing across models preventing me from appropriately conducting likelihood ratio tests). Thus, I wanted to obtain statistical advice on the differences between the two and consultation on how I should proceed. It is my understanding that R is utilized predominantly and since I am learning R in my statistical sequence this year, I thought this would be a great opportunity to compare STATA with R on a question that I’m invested in and a dataset that I’m familiar with.

  • Logistic regression model with robust standard error is appropriate.

Robin Jones, Assistant professor

We have two questions regarding our project on respiratory sinus arrhythmia (RSA) and stuttered disfluencies.

We are interested in how RSA (respiratory sinus arrhythmia, parasympathetic activity on the heart) is associated with speech disfluencies. RSA during various tasks is influenced by its baseline value, thus, in our model we need to account for baseline RSA in some way. When running a model predicting stuttered disfluencies with RSA during an emotional speaking task, should we use a) RSA during the task and RSA at baseline as covariates, or b) create a residualized difference score of RSA during task relative to baseline as a covariate? With option “a” we would be using two covariates in the model (RSA during task & RSA at baseline), whereas with option “b” we would be using only one (RSA difference score).
  • Could include baseline RSA if collinearity is not an issue.

I would like to verify that our analytic plan is the most appropriate. We are using a Generalized Linear Regression with Negative Binomial Distribution. We are using this distribution because stuttered disfluencies are not normally distributed.
  • Generalized Linear Regression with Negative Binomial Distribution is good.

The model for our 3rd research question: Is the relationship between RSA and stuttered disfluencies different for children who stutter versus children who do not stutter during various emotion conditions. We used a 3-way interaction (as well as the various 2-way permutations) to assess this: RSA*Group*Condition. However, when running this model, I am not getting an intercept term. Is the model flawed in some way?
  • Probably.

Lastly, we have some outliers, and I am curious on what is the preferable way to handle this: 1) next case imputation, 2) log transformation, 3) square root transformation?
  • Not a real outlier

  • Take into account the correlation within each subject.
  • Might have carry-over effects between different periods. Could test on equivalent carry-over effects.

2014 Sept 15

Stacy S. Klein-Gardner, Ph.D. Biomedical Engineering

My data that I would like to analyze is survey data from a Likert scale that I believe to generate interval data. I am attaching the raw data to this message. We can focus only on the first spreadsheet as the analysis will be the same for all. The data that is shown includes pre and post data from each of three summers of educational intervention. On the spreadsheet, I have entered the data first under the summer in which the intervention and testing took place. I then moved the data around to indicate which summer it was for the student. For example a student may have come for the first time in 2012, 2013, or 2014. She may have come only one summer or she may have repeated the program. My question is: Does repeated participation in the intervention improve the impact that the program has on the measures indicated by this survey? Is one summer sufficient to increase this measure?

  • Outreach for engineering education
  • looking at data from engineering camp for girls (looking for changes in self-efficacy)
  • self-efficacy: feeling that you can accomplish something in your life (this scale has been validated)
  • some girls have participated for one year, some have participated for about three years
Research Question:
  • interested to see if there is an effect of attending camp for more than one year
  • also interested in differences in pre-post scores for the one year attended by student
  • descriptive statistics: consider summary statistics across different categories (i.e. pre and post scores by different school types, year of study, grades)
  • Consider repeated measures type analysis (longitudinal data analysis) for assessment of slope over time (year) of self-efficacy variable
  • Per question of interest - may need to reformat data to “long style or vertical format” (i.e. have row 1 id=1: 2012 post self-efficacy score, row 2 id=1: 2013 post, row 3 id=1 2014 post self-efficacy score, etc. for each of the girls)
  • adjust for age , school type (consider the role of additional potential confounders)
  • Consider applying for VICTR funding– for assistance in repeated measures type of analysis.
  • Need to account for the correlated nature of data and verification of assumptions (such as in Mixed effects modeling or generalized least squares)
  • Account for the missing data

  • Limitation: lack of control group (there is no way to conclude that the program is the only thing that is improving self efficacy)

  • pre vs. post score for any given year
  • consider doing boxplots for each of the pre and post scores for each year (these can serve as your summary statistics)
  • Univariate analysis: Wilcoxon Signed rank test to see if there is a difference between the distributions of pre and post scores (data in horizontal format works)
  • cautioned combining the pre-scores over the three years, and post scores over the three years (year may be a confounder and impact trend of the data)
  • pre vs. post study (may see a difference, however no guarantee that improvement is from the program; not a randomized controlled trial)
  • Motivated and selected group of girls and may have higher self-efficacy baseline score (pre) - consider comparing self-efficacy scores with those reported in other studies among girls.

2014 Sep 8

Jason Winnick, MPB

I perform biomedical research where the sample sizes are generally approximately 6 to 10 per group. The primary outcome variables are usually something like counterregulatory hormones levels, glucose infusion rates, or muscle glucose uptake. I will be submitting an R01 in October and I would like to receive your advice regarding the estimation of sample size. In addition there may be the potential to receive statistical assistance over the life of the grant if it's funded.
  • Two primary endpoints; the one with the greatest variability (glucose infusion rate needed to maintain desired blood glucose level) is the conservative one to plan for
  • 10 with type I DM 10 without
  • Other covariates: age, insulin required to maintain blood glucose, HbA1c
  • Baseline liver glycogen assessment
  • Start with 3 hour fructose infusion to stimulate liver glucose uptake vs. saline infusion (randomized), then insulin infusion, then 2 hour period where subjects become hypoglycemic (using clamp)
  • Need a good estimate of the standard deviation across patients for infusion rate - use the dog data taking all relevant time periods and stratify by liver glycogen to compute 12 SDs; then we can compute an average SD by averaging the variances and taking the square root
  • Need clinically relevant difference (in mean infusion rates) not to miss: estimate 1 mg/Kg/min
  • Language for grant application something like: The power calculation was based on a 2-sample t-test without covariate adjustment for HbA1c, age, etc. The actual statistical test will be ANCOVA adjusting for these factors, which will increase the actual power a bit (increase would be more had the sample size been larger; the sample size chosen has a penalty for estimating the effects of the baseline covariates).
  • Last aim: the most general way to assess this is to fit a smooth function of time to the longitudinal (serial) measurements, separately for each of two groups, and test for differences in shape of the two curves. A convenient choice is to fit a quadratic function of time to each curve. This increases power over individual time point tests. Suggested statistical method: generalized least squares or mixed effects linear model.
  • Suggested contacting Li Wang to tell her that a VICTR voucher is in the works
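
A minimal sketch of the calculation outlined above: pool several SDs by averaging their variances, then size a two-group comparison of means. The 1 mg/kg/min difference comes from the notes; the SD values, the 90% power, and the function names are assumptions for illustration only. The sample size uses the normal approximation, so a calculation based on the t distribution would ask for slightly more subjects.

```python
import math
from statistics import NormalDist


def pooled_sd(sds):
    """Average the variances (equal weights) and take the square root."""
    return math.sqrt(sum(s * s for s in sds) / len(sds))


def n_per_group(sd, delta, alpha=0.05, power=0.90):
    """Approximate subjects per group for a two-sided two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Hypothetical SDs standing in for the 12 strata from the dog data
example_sds = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 0.9, 1.0, 1.1, 0.8, 1.0, 0.9]
sd = pooled_sd(example_sds)
print(f"pooled SD: {sd:.2f}")
print(f"n per group to detect 1 mg/kg/min: {n_per_group(sd, delta=1.0)}")
```

Equal weighting of the variances is appropriate only if the strata contribute similar numbers of observations; otherwise weight each variance by its degrees of freedom.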

2014 Aug 18

Cecilia Di Pentima, Zachary Willis, Pediatric

  • Want to assess the impact of the implementation of an ASP on antibiotics use.
  • Monthly antimicrobials (AMs) use in days from 2009-2012 April. Data from many hospitals including Vanderbilt. Want to compare Vanderbilt to ALLCHA.
  • ASP intervention started 2012 March at Vanderbilt. Can see less use of AMs after intervention.
  • The comparison of pre and post might be biased by other factors like time not just by intervention. Institution effect is hard to assess since all institutions started intervention at different times.
  • Also needs to adjust for other factors like date for seasonal effect.
  • Linear model of VCG ~ intervention + rcs(time)
  • Better to have individual data for all the hospitals which had both pre- and post-intervention data to assess the intervention effect using a mixed-effects model, or just compare between hospitals using data after intervention to see whether Vanderbilt does better than others
  • Consider getting Vanderbilt's rank among all CHA

2014 July 21

Kelvin Moses, Urologic Surgery

  • wants to do a pilot study to get preliminary results for a grant submission.
  • requesting data from Southern Community Cohort Study (SCCS). Needs power analysis and statistical plan for the data request.
  • applying for VICTR biostats support for funding for this prelim project. Needs estimate.
  • about 3200 men enrolled. max follow-up 10 years. about half finished the whole study period.
  • Prediction of screening frequency by baseline characteristics. Association between prostate cancer stage and frequency of screening.
  • all patient self-reported data, at 5 year and 10 year. (have you had screening within the last year?)
  • GEE model of screening frequency (recent screening yes/no at 5 year, 10 year) on age, race, interaction between age and race, ...
  • Ordinal logistic regression model of prostate cancer stage/grade on screening frequency (need be carefully defined) prior to diagnosis. Need consider different follow-up of the patients.
  • Contact Li Wang for budget estimate.

Austin Beason, summer medical student, MOON Group

  • How can I determine the required sample size (i.e. number of subjects or raters) for interval estimation of the Kappa statistic for an intraobserver and interobserver study with multiple raters? Our number of subjects is currently 20 (N=20) and our current number of raters is 27 (n=27). Further, we are hoping the given sample size will give at least 80% power at the 0.05 level of significance (two-sided).
  • R package for sample size estimation in interobserver agreement studies: library(kappaSize)


April Barnado, Leslie Crofford, Division of Rheumatology

  • Email: I am submitting an early career grant for a starter type project due August 1 and needed help with performing and writing up power/sample size calculations.
  • Specific Aim #1: identify group of lupus patients of about 1135. Lupus nephritis patients of about 400. Nephritis is severity indicator.
  • Specific Aim #2: Determine the association between ED use and meeting standards of quality of care in management of SLE and in the treatment of SLE nephritis, as defined by the Quality Indicator Set for SLE. For aims #2, I would likely be performing Chi squared tests comparing 3 groups (non, occasional, and frequent ER users) for most of those sub-aims.
  • Specific Aim #3: Determine the association between ED use and corticosteroid use in SLE and SLE nephritis. For aim #3, I would likely be using multiple linear regression.
  • For binary outcomes, use logistic regression with adjustment of other confounders.
  • Ratio will be treated as continuous variable and will be analyzed using general linear model.
  • Hypothesis: more ED use will be associated with higher steroid dose. Will analyze current steroid dose and #ED visits in the past 12 months. Steroid dose will be an ordered categorical variable with 4 levels. Can use Chi-square test. Proportional odds model can be used to adjust for other confounders.
  • Grant due Aug 1st, need to be done July 21st.

Taylor Leath, Trauma in Surgical Sciences

  • Survey on quality of life (N=1000). There are 7 GOSE questions about health states (0-100). Can describe the distribution for each GOSE. Predictors include gender, age, and years of education.
  • Want to compare between GOSE scores. Multiple comparison issues (21 comparisons).
  • Can use a mixed-effects model that takes within-subject correlation into account.

Ola Oluwole, Medicine

  • N=36 patients who had CLL transplant with two types (8 vs. 27). Want to compare survival between two groups.
  • Time from transplant to death or relapse. Sample size is limited. Mainly descriptive. Want to write manuscript.
  • Can apply for voucher of $4000.


Neelam Patel, Medical Student

  • I am fourth year medical student doing a project for dermatology. We are doing a meta-analysis of pediatric vitiligo patients to assess which populations need thyroid studies performed. I have a spreadsheet of the data. I need help analyzing it.
  • Research question: the percentage of thyroid abnormalities in pediatric vitiligo patients.
  • Only have aggregated data. Could have an overall estimate of percentage. Also could explore the variability between studies.
  • Apply for a $2000 Voucher.

Tyler Kendrick, Anesthesiology, Medical Student

  • One-year prospective study. Will record the numbers of surgeries in Ethiopia (an African country) and the number of perioperative mortalities.
  • Sample size calculation to reach a desirable precision of mortality rate estimate.


Wei Xie, Computer Science, Brad Malin, DBMI

  • we want to find out if the IRLS estimation algorithm is reversible -- e.g., given only the Fisher information matrix and scoring function (and \beta coefficients), can we go back to the original Y or X matrices
  • Context is confidentiality with data coming from multiple sites, with each site's data maintained independently, and controlled
  • How to do model diagnostics without residuals?
  • Does the distributed computing model lead to good statistical modeling practice? E.g.: covariate transformations, Y transformation, normality of residuals (could compute residual vector separately by center and share an ECDF of the residuals)
  • How often are practitioners of distributed statistical analysis assuming linearity of covariate effects? Being careful about transforming Y or modeling Y robustly?
  • Can't reverse the process to solve for an individual's datum if model is full rank, n > p, no parameter is devoted to only one subject, residual vector is secret
    • If a single parameter is devoted to 5 subjects at one site, may possibly be able to solve for a summary statistic for the 5 (e.g., race has 4 levels and one of the levels only applies to 5 subjects at a site)
  • May be able to discern that one site has an overall better level of Y than another site
  • Not able to get a robust sandwich covariance matrix estimator if residual vector is not provided; sandwich estimation requires U matrix not just U vector
  • Even if residuals are available, it may not be possible to work backwards to an individual from a given site because estimates come from a global beta vector over all sites
  • We seldom use OLS with health care data; the need for weighted X'X (X'VX) instead of X'X as used in OLS makes the identification problem more difficult in general, because V is a function of the current beta estimate (for all sites combined)
  • Worthwhile working out the special case where Y is binary and there is a single X that is binary or polytomous, and there is no special knowledge (e.g., k subjects are of type x and all have the same Y)
  • Worth taking another look at data squashing
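
To make the setting concrete, here is a minimal sketch of distributed IRLS (Fisher scoring) for logistic regression with an intercept and a single covariate. Each site shares only its information matrix and score vector at the current beta, and the coordinating center sums them and takes a Newton step. The two-site data, the function names, and the fixed iteration count are all hypothetical; this is an illustration of the idea, not the algorithm used by any particular distributed system.

```python
import math


def site_summaries(xs, ys, beta):
    """One site's contribution: (2x2 information matrix, score vector)."""
    info = [[0.0, 0.0], [0.0, 0.0]]
    score = [0.0, 0.0]
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        w = p * (1.0 - p)          # IRLS weight, a function of current beta
        row = (1.0, float(x))      # design row: intercept, covariate
        for j in range(2):
            score[j] += row[j] * (y - p)
            for k in range(2):
                info[j][k] += w * row[j] * row[k]
    return info, score


def fit_from_summaries(sites, n_iter=25):
    """Center-side loop: sum site summaries, then update the global beta."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        info = [[0.0, 0.0], [0.0, 0.0]]
        score = [0.0, 0.0]
        for xs, ys in sites:
            site_info, site_score = site_summaries(xs, ys, beta)
            for j in range(2):
                score[j] += site_score[j]
                for k in range(2):
                    info[j][k] += site_info[j][k]
        # Solve the 2x2 system info * step = score explicitly
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step0 = (info[1][1] * score[0] - info[0][1] * score[1]) / det
        step1 = (info[0][0] * score[1] - info[1][0] * score[0]) / det
        beta = [beta[0] + step0, beta[1] + step1]
    return beta


# Hypothetical data: binary X and binary Y at two sites
site_a = ([0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 0, 1, 1, 1])
site_b = ([0, 0, 1, 1, 1], [0, 1, 0, 1, 1])
print(fit_from_summaries([site_a, site_b]))
```

In this special case (binary Y, one binary X) the fitted coefficients are just the log odds from the pooled 2x2 table, which fits the point made above that what can be worked back to is a summary statistic rather than an individual's datum.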

Neil Templeton, Engineering, CHBE

  • Metabolic flux analysis
  • Rate of metabolite turnover
  • Which metabolic phenotypes are produced in high titre-achieving production processes
  • Protein therapeutics; cost of production
  • 14 conditions (cell lines); correlations between fluxes (80 reactions- flux, mass spec); looking for up-regulation
  • 80 Spearman rank correlations x 14; each correlation 10 observations (clones)
  • Two controls; secondary controls
  • Independent experimental units: clones, manipulations of cell lines
  • See if a unified model would be a better approach than pairwise analysis
  • Must be able to precisely estimate a quantity such as a correlation coefficient in order to be reliable in picking "winners" across reactions
  • Low precision (low number of independent experimental units) implies low probability of selecting the optimum reaction/condition
  • Dimensionality is high enough that an "omics" method may be needed
  • Recommend continuing discussion at a Tuesday or Friday clinic
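
The precision concern can be illustrated with a rough confidence interval for a correlation based on 10 observations. This sketch uses the Fisher z transformation, which is derived for the Pearson correlation and is only an approximation for Spearman's rank correlation; the observed value of 0.5 and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# 10 clones per correlation, as in the notes; r = 0.5 is made up
low, high = fisher_z_interval(0.5, 10)
print(f"n=10:  ({low:.2f}, {high:.2f})")

low, high = fisher_z_interval(0.5, 100)
print(f"n=100: ({low:.2f}, {high:.2f})")
```

With 10 observations the interval for a moderate correlation includes zero, so ranking 80 such correlations to pick "winners" is likely to be dominated by noise.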


Elizabeth Morse, RN, MSN, FNP-BC, MPH Vanderbilt University School of Nursing

  • My project involves survey data of 220 Spanish and Arabic-speaking patients in the Center for Women's Health. I've completed all of the descriptive statistics but need help with the correlations. For example, I know from having surveyed patients myself that those patients who reported speaking "Arabic only" at home were more likely to self-report speaking English "not very well", but I don't know how to express this statistically.
  • To test association between two variables A and B,
    • If A is a continuous variable and B is categorical variable, use Kruskal Wallis test (or Wilcoxon rank-sum test)
    • If A and B are both categorical variables, use chi-square test
    • If A is ordinal variable and B is binary, use chi-square trend test
    • If A and B are both continuous variables, use spearman's correlation coefficient.
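
As an illustration of the categorical-by-categorical case, here is a sketch of the Pearson chi-square test for a 2x2 table (no continuity correction), using only the standard library. The counts are invented and are not the survey data; the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = language at home (Arabic only, other),
# columns = self-reported English ("not very well", better)
stat, p_value = chi_square_2x2(20, 5, 5, 20)
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")
```

With small expected cell counts, an exact test would be a safer choice than this large-sample approximation.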


Brett Byram, BME

  • Clinical image degradation with ultrasound
  • What are major factors of degradation? Pulling apart mechanisms.
  • Clinical target: liver tumors/biopsy; visualize needle
  • What is the best study design?
  • Ask trained readers to assess utility of image
  • Discussed hypothesis testing vs estimation study
  • One estimand could be the mean absolute number of levels different
  • Can relate an ordinal measure to quantitative measures of image quality
  • Can estimate # patients needed if have a reliable estimate of the standard deviation of an absolute difference of interest
  • May consider progressively ruining an image to see when it becomes uninterpretable
  • One goal is to develop a model to predict expert's quality rating from multiple quantitative physics-based measures
    • May consider an ordinal response model / multinomial model


Steve Kahn, General Surgery Fellow

  • can't arrive before 1pm on Wednesdays, so attending Monday clinic
  • "I am going to perform an email survey of surgical residents (approx 5500 in the US) and wanted to know what you think an appropriate response rate would be and the best method to do statistical analysis (rough draft of survey attached). Or should the questions be revised to facilitate a better statistical analysis?"
  • make the variable as continuous as possible using sliding bar

Philip Budge, Fellow, Division of Infectious Diseases

  • grant proposal relating to the development of new diagnostic technologies for neglected tropical diseases

LIsaMarit Wands, nursing

  • Survey on two cohorts, VA-based cohort and university-based cohort.
  • Outcome: global physical and mental health score. Pain is part of global score, and also a barrier to level of reintegration success. Could calculate a global score without pain. Could examine how pain correlates with reintegration and outcome.
  • A specific question (meaning of life) appears in two standardized questionnaires. Could include both in the model predicting outcome.


Stephanie Fecteau, Psychiatry Post-Doc

  • Cortisol measures 3 per day
  • % of increase because times not noted accurately
  • Need Bland-Altman plot to check proper transformation: post - pre vs. (post + pre)/2 or log(post) - log(pre) vs. geometric mean of pre and post
    • want the transformation that makes the graph flat and random
  • 1/2 of families received a service dog after 3 weeks
  • Suggest longitudinal analysis using 3 daily x 15 weeks, allowing for correlation; only one day per week
  • Correlation structure based on approximate time of measurements in days + fraction of day
  • Model smooth time trend, allowing for separate trend in those randomized to service dog; check for shape change between two groups
  • Easiest-to-interpret method: generalized least squares with AR1 continuous-time correlation structure
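
For reference, a continuous-time AR1 structure sets the correlation between two measurements on the same subject to rho raised to the absolute time gap between them. The sketch below builds that matrix for times expressed in days plus fraction of day, as suggested above; the value of rho, the example times, and the function name are assumptions for illustration.

```python
def car1_correlation(times, rho):
    """Correlation matrix with entries rho ** |t_i - t_j|."""
    return [[rho ** abs(ti - tj) for tj in times] for ti in times]


# Three cortisol samples on day 0 and three on day 7 (hypothetical clock times)
times = [0.30, 0.55, 0.85, 7.30, 7.55, 7.85]
matrix = car1_correlation(times, rho=0.8)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

Samples taken on the same day are highly correlated under this structure, while samples a week apart are only weakly correlated, which is why irregular or inexactly recorded times can still be handled.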

David Dantzler and Donald Lynch, Cardiovascular Medicine

  • ECMO: what predicts survival to hospital discharge; initiated by cardiac surgeons
  • Collecting patients from last 2 years (N=60 so far)
  • Discussed margin of error of 0.1 in estimating a single probability with n=96
  • Alternate endpoints: LOS, censor on death, i.e. Y=time to successful discharge
  • Or: ordinal outcome Y=1, 2, 3, ... longest LOS, dead = longest LOS + 1; effective sample size almost equal to # subjects
  • Also have Glasgow coma scale at discharge; could factor into ordinal outcome
  • May be possible to use a complex high-information scale to derive a severity of illness-based score that is then used to predict mortality
    • Has reduced many variables to one
  • What to do with patients who died before ECMO was available?
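
The margin of error figure discussed above can be checked with the usual large-sample formula for a single proportion, taking the worst case p = 0.5. This is a sketch under that assumption (normal approximation, 95% confidence); the function names are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(n, p=0.5, level=0.95):
    """Half-width of the normal-approximation interval for a proportion."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return crit * math.sqrt(p * (1 - p) / n)


def n_for_margin(margin, p=0.5, level=0.95):
    """Unrounded sample size that achieves the requested margin of error."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return crit ** 2 * p * (1 - p) / margin ** 2


print(f"n needed for a margin of 0.1: {n_for_margin(0.1):.1f}")
print(f"margin at n=96: {margin_of_error(96):.3f}")
print(f"margin at n=60: {margin_of_error(60):.3f}")
```

With the 60 patients collected so far, the margin of error is wider than 0.1, and modeling several predictors of survival requires far more information than estimating one probability.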


Mitchell Odom, VUMC 3rd year medical student, Department of Neurosurgery.

I am currently helping with a project that requires a survey be employed, and we are creating an original one to send out. I would like to get some expert opinions on the questions that we ask, and to make sure we are honing in on what we're really looking for.
  • CTE - Chronic Traumatic Encephalopathy caused by multiple concussions. Survey is designed to ask questions about awareness of CTE among parents of young athletes (junior high and high school). The plan is to distribute the survey using Vanderbilt connections with local high schools.
  • Recommendations:
    • Maximize response rate (by giving parents incentives of some sort)
    • Ensure that the survey is brief
    • Make sure the responses are anonymous
    • Use numbers instead of categories
    • Simplify the language
    • Branch questions
    • Incorporate visual analog scale (instead of categories)
    • Order questions in a logical way
Topic revision: r1 - 15 Jan 2021, DalePlummer
