Vanderbilt Biostatistics Wiki: MondayClinicNotes2020 (15 Jan 2021, HeatherPrigmore)

Notes 2020

Project Title: Impact of Hypertension on Diabetic Macular Edema (DME) development among adult VUMC patients in the Synthetic Derivative Database with a prior diagnosis of diabetic retinopathy. Cohorts: Controlled HTN, uncontrolled HTN (defined by mean systolic/diastolic pressures + use of antihypertensives). Questions: 1) This is my second biostats clinic. I have now established my confounders and have preliminary numbers on my cohort sizes. My question is what type of analysis should I consider/how large should my sample size be? 2) Am I eligible for a VICTR Voucher with biostatistical support? I would use this voucher to establish my cohorts via an automated pull of key variables (e.g. mean systolic & diastolic pressure) & then analyze the data once all confounders (I am manually reviewing for) have been collected. Mentor confirmed. VICTR biostatistics voucher.

Clinic Notes:

- N is ~1800. This is total in the SD, won't be able to get more.
- Time zero - date of diagnosis for diabetic eye disease.
- End point - minimum of 1 year, study end (12/31/20), death, DME, LTFU.
- Do the time points make sense? Need to be consistent with data/trustworthy dates in the data you are getting & consistent with study itself. Could consider two "time zeros", time zero and "landmark time".
- Consideration related to 5 means - you have 5 groups classification that handles tricky problem (uncontrolled vs controlled), depends on how physiology works if you even want to look at systolic bp vs trusting need for therapy as marker of underlying process. Pull raw BP so that you can summarize different ways if need be.
- Covariates - with 450 outcomes you can have almost as many as you currently have - prioritize which are non-clinically strong predictors of edema, can use propensity scores to collapse others down
- Recommend come to clinic one more time to clear out what you need for VICTR voucher. Generally students are not eligible to get one. No one available for general contract work in department.

Clinic Notes:

- Wanting to compare surgeon prediction vs prediction tool (NSQIP) in terms of surgical/clinical outcomes and mortality. Hypothesis - surgeons are better at predicting outcomes than the calculator. Possibly restate the problem as: to what degree do surgeon predictions differ from the prediction tool? N ~123 patients.
- You can get Vizient data at individual level
- What is a good platform?
- Depending on tests needed you can use Stata, Excel
- Biostatistics does offer an R clinic for questions about coding if you choose to use R (https://biostat.app.vumc.org/wiki/Main/RClinic)

- What tests would you recommend?
- Recommend to start visually to see relationship between surgeon prediction and tool prediction.
- Possible: C-index (concordance probability)

- For return visit - no benchmark you have to achieve to return, come back to clinic anytime. Visual analyses you can probably do on your own, if you want more in depth/complicated analyses might be a good idea to go for VICTR voucher.
- VICTR Award for biostatistics support (90 hours). Application website (https://starbrite.app.vumc.org/) and research proposal template (https://starbrite.app.vumc.org/funding/templatesforms/).
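
The concordance probability (C-index) suggested above can be computed directly from pairs of patients. The following is a minimal sketch, assuming each patient has a predicted risk (from the surgeon or from the calculator) and a binary observed outcome; the function name and the toy data are hypothetical and are not from the clinic.

```python
def c_index(predicted, outcome):
    """Concordance probability for a binary outcome.

    Among all (event, non-event) pairs, the fraction in which the patient
    with the event received the higher predicted risk; ties count as 1/2.
    """
    events = [p for p, y in zip(predicted, outcome) if y == 1]
    non_events = [p for p, y in zip(predicted, outcome) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(non_events))


# Hypothetical toy data: predicted risks from two sources, same patients.
observed = [0, 0, 1, 0, 1, 1]
surgeon = [0.10, 0.30, 0.40, 0.20, 0.80, 0.60]
calculator = [0.20, 0.50, 0.40, 0.10, 0.70, 0.30]

c_surgeon = c_index(surgeon, observed)
c_calculator = c_index(calculator, observed)
```

Each value is best read as a description of one predictor on its own; a scatter plot of surgeon prediction against tool prediction remains the recommended first step.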

Clinic Notes:

- Postpartum IUD and complication in one year post complications. There was a previous meeting in August to prepare for the study. Suggest computing 95% CIs, rather than P values. Estimate effect size. Sample size is 74. Patient could have more than one complication. Would be simpler to reduce this to complication per patient, rather than look at each complication. Longitudinal model, with random intercept per patient. Sample includes people who had a device placed, and then who subsequently had a follow up billed. Suggest gathering a complete sample, including those lost to follow up. Also, would structure dataset so that the patient is the object of analysis. SPSS has an option, PROPOR.

We aim to correlate known single nucleotide polymorphisms (SNPs) in CXCL8:CXCR1/2 with Diabetic Retinopathy (DR) susceptibility and progression in cohorts of patients with Diabetes Mellitus (DM). We have developed 5 cohorts of patients with varying degrees of DR (No DR, Non-proliferative DR, Proliferative DR, Macular Edema, Vision-threatening DR) using Vanderbilt’s synthetic derivative (SD). From previous Biostats Clinic sessions, we developed a plan to test our hypothesis by comparing the prevalence of SNPs in CXCL8:CXCR1/2 between our groups. We have yet to obtain genotyping data, and are interested in analyzing baseline clinical characteristics between our groups of interest. This will help us assess the validity of our cohorts, by allowing us to confirm that the cohorts exhibit expected clinical characteristics. My goal is to show that, consistent with what we might expect, these diagnoses/treatments are more prevalent/frequent in patients with more severe DR. Using Excel, I have applied Pearson’s chi-square test to compare prevalence of diagnoses (e.g. nephropathy) seen with Diabetes, as well as frequency of treatments (e.g. vitrectomy) provided for Diabetic Retinopathy (DR) between my groups of interest. I would like to discuss the appropriateness and interpretation of my analyses, and identify different analyses that I may need to work on as well. I understand the Chi-square analysis is a test for any difference and cannot comment on whether prevalence is higher in one group than the other. Thus I used the phrase “significantly different” and added that prevalence was highest in PDR. This same wording (“significantly different, with unreported ethnicity highest in children < 3 years old”) was used in my ARF manuscript. Is this valid? For the comparisons between No DR -> NPDR -> PDR, we would expect a stepwise increase in prevalence/frequency: lowest in no DR, highest in PDR, with NPDR in the middle.
My initial approach was to use a 3x2 contingency table Pearson’s test to show significant difference, then comment that prevalence was highest in PDR. Would it be better to execute separate Pearson’s tests between No DR and NPDR, as well as NPDR and PDR, to confirm differences between the individual groups? Is there another test that is more appropriate in this case?

Clinic Notes:

- Looking to correlate genetics with diabetic eye disease. 5 cohorts with varying severity of diabetic eye disease.
- Completed comparisons via chi-square test, is this correct? Just looking at baseline without controlling for anything (conducting analyses themselves)
- Are there any confounding variables (ex. age)? Put more focus on difference in proportions (absolute effect) instead of p-value (can also do odds ratios for relative effect)

- Comparing 2 factors (at least 1 with more than 2 levels [2x3 table]) - use chi-square, if chi-square is large you can talk about apparent differences (proportions) and not worry about statistical tests once you do the "overall" chi-square.
- Confidence intervals (CI) are a good way to present results. The limitations are self-contained: if the sample size for a comparison is not large, that could be why the chi-square statistic is small; in that case the CI would be wide and let you know that you cannot conclude anything - it is uncertain. For descriptive studies it is sometimes better to present CIs. Present them any time you are tempted to say the p-value is big so we think there is no difference. "Absence of evidence is not evidence of absence". A big p-value means you need more data, not that there is no difference (i.e. should say, we do not have enough evidence to conclude that there is no difference, or we were unable to find evidence that there is a difference).
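
To make the point about wide intervals concrete, here is a minimal sketch of a confidence interval for a difference in two proportions. It assumes the simple large-sample (Wald) interval, which is only one of several possible methods; the counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_in_proportions_ci(x1, n1, x2, n2, level=0.95):
    """Wald confidence interval for p1 - p2 (large-sample approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: same observed proportions (40% vs 25%), two sample sizes.
small = diff_in_proportions_ci(8, 20, 5, 20)      # 20 patients per group
large = diff_in_proportions_ci(80, 200, 50, 200)  # 200 patients per group
```

With 20 per group the interval spans zero and is wide, so nothing can be concluded; with 200 per group the same observed difference gives an interval that excludes zero.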

Clinic Notes:

- Grant proposal due in June. Cluster randomized, with families in the cluster. Intervention is a curriculum, measuring effect of telling PMs at community centers they can modify a tested curriculum. Outcome is BMI of kids of all ages. Collaborator wants to use BMI percentile (per CDC), which is problematic. You should percentile something only if 1) you don't understand measurement or 2) if there is competition. Can make value judgments based on height and weight in regards to BMI.
- Should look at the outcome you want, intended audience, potential journal (might be more of a policy type study, can advise on how to design). Suggest having a design studio.
- Helpful link: http://hbiostat.org/papers/RCTs/cluster

Clinic Notes:

- Want guidance on practicality of the project, multivariable logistic model.
- Sample size is >2,000. ~60 have the outcome of interest. Need roughly 15 events of the outcome per predictor in the model. Might need 2-3 terms for a continuous variable if not linear relationship between it and the outcome (ex. age - if outcome increases exponentially at a certain age). Age should be controlled for, in months or even days. Resource: https://hbiostat.org/rms/.
- Outcome: Binary (need for follow-up/not due to presence of amblyopia on eye exam). Suggest using graded scale if possible, then convert clinical model (binary).
- Validation of model: resampling method (ex. bootstrap) or cross validation, rms allows you to do this (ex. "validate()"). Do not exclude variable based on how it correlates to outcome.
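
The 15-events-per-predictor guideline above translates into a quick budget calculation. This is a rough sketch using the numbers from the notes; the helper name and the example of how the budget is spent are hypothetical.

```python
def max_model_parameters(n_events, n_non_events, events_per_parameter=15):
    """Rough parameter budget for a binary-outcome model.

    The limiting sample size is the smaller of events and non-events.
    """
    limiting = min(n_events, n_non_events)
    return limiting // events_per_parameter


# Numbers from the notes: >2,000 patients, ~60 with the outcome.
budget = max_model_parameters(n_events=60, n_non_events=2000 - 60)

# A continuous predictor modeled nonlinearly may use 2-3 parameters,
# so the budget is spent quickly.
spent = 3 + 1  # e.g. age with 3 terms plus one binary predictor (hypothetical)
remaining = budget - spent
```

With about 60 events the model can support roughly 4 parameters in total, which is why the predictors need to be prioritized on clinical grounds before looking at the outcome.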

Clinic Notes:

- Cohort defined by ICD10, confirmed by chart. This is a sub-study of a larger study.
- Patients with DM type 1 and 2, with hypertension. Confounders include type of DM and hypertension (exposure), age, race. Outcome of interest is diabetic macular edema. PHEWAS project.
- Propensity score or regression?
- Either could work, would depend on the number of confounders. Rule of thumb for number of confounders in regression - have absolute minimum 5 events per confounder/category of confounder, typically use 15-20 events. Propensity score is used as a data reduction method - typically used when you have a low sample size or a large number of confounders.
- Do not categorize continuous variables (A1C, age). Don't let size influence confounders--include all that are medically relevant.

- How many patients needed for prelim power analysis?
- Go with max number you can get and then find the power you have if you include them all. Since you are not allowed to choose your sample size, you are finding how adequate the sample size is for your questions. Max power is having equal non-event to events, but do not manipulate that, go for the max you can get.

- First steps:
- Establish and define confounders (discuss with 5 clinical experts)
- Document all decisions with detail
- Understand capacity of EMR and what you can get

Follow up on statistical planning for the project to review the impact of the selection of smooth versus textured tissue expanders in prosthetic breast reconstruction in post-operative complications and outcomes at the Vanderbilt University Medical Center. Mentor confirmed. VICTR voucher request.

Clinic Notes:

- Retrospective study, ~1600 possible participants. Would like VICTR voucher. Primary question is safety differences. Frank: we want to be sure to capture the entire population, don't filter out cases on data pull side. Multiple complications are possible. Follow up typically once a year, after initial healing period. Could do a time to event analysis with KM curve by group. Could also estimate risk over time by group. Kent: could do early complications, perhaps in 6 months post surgery?
- Recommendations:
- Descriptive statistics (Kaplan-Meier estimator)
- Need to take into consideration time to event - Cumulative incidence curve (1 - Kaplan-Meier survival curve)

- Primary Analysis: Cox proportional hazards model
- Sample size: Safety outcomes look at absolute risk, not relative risk. Typically, size is based on the number of complications needed to detect differences (rough rule - for a Cox model need at least 15 events per variable you want to study as a covariate). For estimating absolute incidence, 1600 will give good precision of incidence estimates, but for estimating the relative rate of complications between groups, would need (15 events x # of variables).
- VICTR application: include inclusion criteria (how you get analysis population), work out definition of chart review outcome, have plan to ensure that you are doing chart reviews correctly.
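
The Kaplan-Meier estimator and the cumulative incidence curve (1 - Kaplan-Meier) recommended above can be sketched in a few lines. This is a minimal illustration for a single group, assuming one row per patient with a follow-up time and an event indicator; the function name and the toy data are hypothetical, and in practice a survival analysis package would be used.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time for each patient
    events : 1 if the complication occurred at that time, 0 if censored
    Returns a list of (time, survival, cumulative_incidence) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Count events and all patients leaving the risk set at time t
        d = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)
        if d > 0:
            survival *= 1 - d / at_risk
            curve.append((t, survival, 1 - survival))
        at_risk -= leaving
        i += leaving
    return curve


# Hypothetical follow-up in months; 1 = complication, 0 = censored.
months = [2, 3, 3, 5, 8, 8, 12, 12]
complication = [1, 0, 1, 1, 0, 1, 0, 0]
curve = kaplan_meier(months, complication)
```

Running the function once per expander type gives the by-group curves discussed in clinic.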

- Current VICTR voucher for this. This is an early stage study. Prelim data for a larger study.
- Q: How big of a sample size do we need?
- Minimum of 96 patients to estimate a probability (sensitivity) no worse than +/- 0.1 of the truth. Every time you want to be twice as precise it takes 4x the sample size (ex. +/- 0.05 = 96x4 = 384). Notes showing formula for 96: https://hbiostat.org/doc/bbr.pdf Page 5-45.
- Consider indeterminates, rather than just clear yes/no. Indeterminates are often the "gray area" where we need more help identifying the correct answer. Would suggest not limiting to only sensitivity or specificity, but do both. ~200 samples as suggested may be prohibitively expensive. Could do less, but need to be clear about reason why it is lower. Need to set a minimum sample size for the hardest category to sample as this would be your limiting case - other categories can be more.
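
The figure of 96 and the "twice as precise costs 4x" rule can be checked with the usual normal-approximation formula for the margin of error of a proportion. This sketch assumes the worst case p = 0.5 and a 95% confidence level; the function name is hypothetical.

```python
from statistics import NormalDist


def n_for_margin(margin, p=0.5, level=0.95):
    """Sample size so a proportion is estimated within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2,
    with p = 0.5 as the worst case (largest variance).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z ** 2 * p * (1 - p) / margin ** 2


n_01 = n_for_margin(0.10)    # about 96
n_005 = n_for_margin(0.05)   # about 384, i.e. 4x as many
```

Because the margin enters the formula squared, halving it always multiplies the required sample size by four, and the calculation has to be satisfied separately for the hardest category to sample.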

- Q: How to compute 95% CI for probability estimate - Stata gives 95% confidence interval around coefficient.
- This function should be available in Stata. Ignore CI on coefficients, need CI on predicted value. Email Tom if questions persist, as he is the resident Stata user.
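
As an illustration of a CI on the predicted value rather than on a coefficient, the sketch below builds the interval on the log-odds scale and then transforms it to the probability scale. It assumes the fitted linear predictor and its standard error have already been obtained from the model; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def predicted_probability_ci(linear_predictor, se_linear_predictor, level=0.95):
    """CI for a predicted probability from a logistic model.

    The interval is built on the log-odds scale, then each endpoint is
    transformed to a probability with the inverse logit.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)

    def expit(v):
        return 1 / (1 + math.exp(-v))

    lo = linear_predictor - z * se_linear_predictor
    hi = linear_predictor + z * se_linear_predictor
    return expit(linear_predictor), expit(lo), expit(hi)


# Hypothetical values: log-odds of 0.0 (probability 0.5) with standard error 0.4.
estimate, lower, upper = predicted_probability_ci(0.0, 0.4)
```

Transforming the endpoints keeps the interval inside 0 to 1, which a symmetric interval computed directly on the probability scale does not guarantee.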

- Comparing safety and efficacy of textured vs smooth tissue expander. Rare cancers linked to textured expander, recall issued in 2019.
- Retrospective chart review, EMR with CPT code, will use RedCap
- Clearly specify the definitions for the outcomes and how those outcomes will be defined using ICD/CPT codes.
- When extracting data - be sure to include calendar time in data (dates). Consider factors that go into patient and provider decision making. May need to get physician level data.
- Timeline - finish by December
- Analysis Notes:
- Emphasize descriptive methods and difference in groups, 95% CI should be used for differences in proportions. Univariate analyses are not really a good way to inform multivariate analyses.
- Could try to predict which device they received by using profile of the outcomes (secondary to main). Logistic regression model. This says there is a difference in the complication rates among the two. Looks at joint action of all the complications together.

- VICTR Award for biostatistics support (90 hours): application website (https://starbrite.app.vumc.org/) and research proposal template (https://starbrite.app.vumc.org/funding/templatesforms/).

- Summer Research Program (completed 11 years, 15-20 students per year) - tracked every student and their career progression in STEM careers
- Been told need to identify matched cohort and track that cohort as well
- Have IRB submission but want to ask questions prior to submission - possible VICTR voucher in future but need "planning" right now.
- Are we on track for length of study and numbers we need?
- Are we missing data points?
- Outcomes:
- Primary outcome - do they persist in STEM career?
- HS Students - how many go out of state to 4-year college (largely from rural locations)?
- Other: Where they go to college, what they do when they complete college (workforce, advance STEM degree, non-STEM workforce, etc.).

- Need to define/finalize outcomes
- Could base sample size on specific outcomes; however, since you are limited in funding, sample size (dichotomous outcomes) may be too high to achieve.

- 1. “Do NOT treat them as distinct cohorts but as gradation (Outcome variable is 1-4, not 0 vs 1 vs 2, vs, 3, vs 4). If strongly ordered, Y=1-4 and use ordinal regression like proportional odds model (for ordered outcomes where no equal spacing between groups without/do not need to make assumption of what distance there is). Not assume anything about spacing between those 4 categories. Best when ordinal phenotypes. If oversimplify as if binary outcome instead of 4 levels the number would be 1/15 as number of variables in model. ie IF 30 in a cohort would only be able to adjust 2 variables”
- Two of our 4 groups (NPDR&DME vs. PDR only) cannot be ordered as initially suggested (see attached figure). I recognize your interest in reducing the use of binary outcomes, but in this case, feel it cannot be done as Diabetic Macular Edema (DME) and Diabetic Retinopathy (DR) represent two, albeit related, disease processes. I’d love to hear your thoughts. At the 7/23 Biostatistics Clinic, they suggested splitting this analysis into two variables:
- i. 3 categories- no DR vs. Non-proliferative DR (NPDR) vs. Proliferative DR (PDR)
- ii. 2 categories- no DME vs. DME

- 2. “Get as much info from icd 9 and 10 as can (mild, mod, severe if coded as such), Then will have manual curation (RH brought up challenge of subjectivity if 5 doing this, DAPC thinking ideally one person can be the master curator). The sd creates binary decisions of icd9 in its design(ie if captures a group with fair sensitivity and specificity, will assume that all belong in that group, but will not specify the sen'y and spec'y of that classification, as much as we can give a a score to our assumptions of phenotype, then it would be easier to rank) In analysis should have these 0.9, 0.85, 0.4 (instead record how certain you feel the dx is instead of says cohort 1, 2,3 4 is what they belong to. SHOULD ASSIGN A VALUE OF CERTAINTY. Most importantly DIVIDE Into finest gradation that can get from the data, Don’t call it a phenotype but set of conditions, If in data can break these 4 to 11 then break them up For analysis purpose and focus the power to detect snp associations. CAUTION: Break each component into more than + or – in a cohort. If breaks down within cohort but not between cohorts this method could backfire…good that we are upcoding. Some studies what they do is get description of each case and put in index card, then have someone to rank them 1-250 then analyse subjective rankings as outcomes; come out with as fine grading as can to increase power while still feasible for student s to extract. If use those rules from available data, maybe can end up being 6 vs 4 levels”
- Can we review these processes in greater detail? I was not present for this first Biostatistics Clinic session, and so, have trouble imagining the manual curation, the interpretations of ICD codes, and the 1-250 subjective rankings as outcomes. We have not been collecting the ICD codes in the figure above as of yet, as we have noticed that coding is sometimes inconsistent and incomplete. Thus, we have been manually extracting clinical histories through chart review in the SD. However, if there is a useful application of the ICD codes, they can be easily extracted.
- With the 15:1 ratio of patients to parameters/coefficients, whereby parameters = categories – 1, would ranking 250 patients into 11, 6, etc. categories only work to decrease our power and increase the necessary sample size? Determining how many patients we need to extract data from will be pivotal in structuring our workflows, and potentially, needs to pay for more genotyping of more patients.

- 15:1 rule - 1 predictor parameter for every 15 occurrences of the outcome you have in the model.
- Case-control study makes more sense here due to budget concerns
- Think about how covariates come in to play when sampling from SD
- Check if department has collaboration plan with biostats
- Would probably look at most 30-40% of data as controls - don't want a lot of "ties" in the data (records with the same value for a variable)
- How to rank 1-250? Good for proof of concept things but hard to reproduce - you ask the experts to break the ties.
- If you have a bunch of categories with clinical consensus of order, use that as it will give you more power.
- Possibly rank SNPs by how they separate across the categories
- Do not do power calculation for testing hypothesis, do sample size to estimate something (ex. correlation coefficient, rank)
- www.hbiostat.org/bbr - chapter on challenges on data/power calculations
- Check in to applying for VICTR Award for biostatistics support (90 hours): application website (https://starbrite.app.vumc.org/) and research proposal template (https://starbrite.app.vumc.org/funding/templatesforms/).

- Q: How to further characterize E. coli isolates - is there a statistical way I can put these isolates in target phenotype group? Is there a particular patient characteristic that these emerged from (i.e. can phenotype be predicted)?
- Note that this type of data will have a lot of noise, which will hamper our analysis; must keep this in mind as we proceed and interpret. Per visit analysis: patients may be represented multiple times, but we can't tell the ID of individual patients (limitation of data).
- More exploratory. Trying to find connections of isolate and clinical data. Need clear direction for this study--define question, future direction.

- Specificity/Sensitivity is for retrospective analyses (not prospectively actionable)
- Don't think about how often something triggers, think when can it detect something and what probability can it assign to the child.
- Regression trees are only useful if you have nothing but categorical variables and a very large sample size (trees are very unreliable and not reproducible).

- Create point system for the characteristics - certain characteristics should be given more weight.
- Frame it as prospective and think about probabilities for individual children.
- Rethink project (from prospective view) and come back to clinic.
- Chris Slaughter works with Neonatology, Pediatrics also has biostat support.

- Retrospective, large database, want to know how oxygen exposure in OR affects downstream outcomes, analysis plan is mainly logistic regression
- Instead of logistic regression - do time to event analysis, survival analysis, competing risk?
- Competing risk models are easy to do but hard to interpret. Used for terminating events. Was not meant to be used when something interrupting the event is not fatal. Can only use 1 event. Clean interpretation can only happen if you count the event or any event worse than that as the event (possibly use this for sensitivity analysis).

- Since you don't have timing information, if you ignore LOS and say we want a model for what happens during hospitalization - could do ordinal model that has a place for all those individual events. Need clinical consensus on order of events.

Clinic Notes:

- Q: How best to evaluate outcome, how to treat age (continuous vs categorical)? Treat toxicity as ordinal outcome utilizing all organ-specific toxicities and use proportional odds model. Would model age as continuous using a cubic spline with at least 5 knots (can always add more at specific ages if need be, choose knots at ages that are clinically significant).
- Q: What would be role of cubic splines, help with understanding them? Do not try to interpret coefficients, interpret the function/picture itself. Splines are effective because the relationship in the underlying biology is probably smooth (increases over time). They allow you to show the true, smooth underlying relationship between the 2 variables.

- Looking into whether organ-specific toxicities are likely to impact a certain age
- If outcome is ordinal - proportional odds model, if continuous - ordinary regression
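
To show what a cubic spline for age looks like as model input, here is a minimal sketch of a restricted cubic spline basis in its truncated-power form, which is constrained to be linear beyond the outer knots. The terms are left unnormalized (software often rescales them), and the knot ages are hypothetical placeholders.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for one value x.

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots, using the
    truncated-power form that is constrained to be linear in both tails.
    """
    k = len(knots)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def cube(v):
        return max(v, 0.0) ** 3

    t_last, t_prev = knots[-1], knots[-2]
    span = t_last - t_prev
    terms = [float(x)]
    for t_j in knots[:-2]:
        terms.append(
            cube(x - t_j)
            - cube(x - t_prev) * (t_last - t_j) / span
            + cube(x - t_last) * (t_prev - t_j) / span
        )
    return terms


# Hypothetical knot ages (years); in practice choose clinically meaningful ages.
age_knots = [5, 20, 40, 60, 80]
row = rcs_basis(50, age_knots)   # 1 linear term + 3 nonlinear terms
```

Five knots cost four regression parameters for age. The individual coefficients have no useful interpretation, so the fitted curve of outcome against age is what gets plotted and read.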

Clinic Notes:

- Disproportionality analysis. Make sure to get negative controls.
- Possible violation of missing at random assumption. Imputation works when you have variables that predict missingness or timing variable when it is missing. If you do not have surrogate for the variable(s) to predict missingness, then imputation will not be well-informed/very effective.
- Can chat with pharmacoepidemiology group (Wayne Ray, Maria Griffin, Bill Cooper)

A: Could use log rank or Cox model with no censoring (would be slightly better). Kruskal-Wallis is second best non-parametric test (tests overall significance). Pairwise comparison - Wilcoxon test (compares 2 groups).

- Discussed what could be accomplished with a sample this size, and how to frame expectations. https://hbiostat.org/doc/bbr.pdf section 5.12.1, 5.12.2, 8.5.2

- Outcome (type of injury) is a check all that apply option. Suggest logistic regression or multinomial logistic regression.
- Plan to apply for VICTR voucher, this will fit into a voucher framework. Deadline ~Early September (or as soon as possible).

- Feedback: Advise to not use p<0.05. Will be limited by sample size for splines and interactions and clustering. Rather than AUC, suggest likelihood ratio from the logistic regression model.
- Re: VICTR voucher. Biostat support should come from NIH grant, VICTR may not fund. Could reach out to VICTR and ask about the specifics of this case.

- Project judged appropriate for a VICTR grant for biostatistical support.

- Q: What is something to describe spaghetti plot to make it easier to interpret? A: challenge is growth curve ends at random point in time, could use generalized least squares (look at F. Harrell's course notes), if death is common can also have it as an ordinal longitudinal analysis where death is worst outcome (surface area).
- Q: How to check if model is stable? A: Look at generalized R-square (most powerful way to do it) but not easy to interpret. Could also use c-index, pseudo R-square, model chi-square.
- Q: How to approach convergence issue? A: This is more of a data issue, once you get below certain number of events the ability to do an analysis is severely restricted.
- Q: Is there a way to evaluate how much of the variability in survival is explained by covariates? A: The assumptions made in the Cox model are not testable due to current sample size (# of events). Just have to make them.
- Q: Can c-index be used to compare models? A: Not a sensitive index, it will miss real differences. Rank measure is not meant to be compared, good to look at for individual model.
- Helpful Link: http://fharrell.com/post/addvalue

- Show distribution of score between the 4 groups using histograms, if clear differences in groups they will appear in visual displays
- Non-parametric hypothesis test - Kruskal-Wallis test
- Another option is ordinal (logistic) regression - score as outcome, group as covariate
- Due to amount of scores (0-36) it is possible to treat it as continuous
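
For reference, the Kruskal-Wallis statistic mentioned above can be computed from pooled ranks. This is a minimal sketch using midranks and the standard tie correction; the function name and the scores are hypothetical, and in practice a library routine (for example `scipy.stats.kruskal`) would also return the p-value.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with midranks and tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    midrank = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        count = j - i
        midrank[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        tie_term += count ** 3 - count
        i = j
    h = sum(sum(midrank[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:      # every observation identical
        return 0.0
    return h / correction


# Hypothetical scores (0-36 scale) in the 4 groups.
scores_by_group = [
    [4, 7, 9, 12, 15],
    [10, 14, 18, 20, 22],
    [8, 11, 16, 19, 25],
    [21, 24, 28, 30, 33],
]
h_stat = kruskal_wallis_h(scores_by_group)
```

H is compared with a chi-square distribution on (number of groups - 1) degrees of freedom, which is 3 here. The histograms by group should still come first.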

- Without having sources of variability (age, height, etc), cannot take current sample size very far.
- Suggest using simple regression model predicting log of pancreas volume from log of height, log of weight, age and anything else that is structural. Want to know what is relative contribution of those items on volume.
- With items that you normalize by taking ratios it is best to use logs. Will get confidence interval for difference in logs, and when you anti-log the result you get the fold change with its margin of error.
- Possibly state question as: how much bigger is the pancreas volume in one group than the other. Moving toward differences makes it more of an estimate problem not hypothesis. Asking matter of degree (precision) generalizes the hypothesis test approach.
- Comparing before and after treatment so patient acts as own control improves power.
- Use diabetes as continuous scale instead of binary (no,yes).
- Sample size notes section 5.9.2 - formula for pooled variance in section 5.9.1 (http://hbiostat.org/doc/bbr.pdf)
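
The log-scale approach above can be sketched as follows: take logs, form the difference in means with a confidence interval, then anti-log to get a ratio of geometric means. This is a simplified illustration that assumes two independent groups, unpooled variances, and a normal critical value (a t critical value would be somewhat wider in small samples); the volumes are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def fold_change_ci(group_a, group_b, level=0.95):
    """Ratio of geometric means (a / b) with a CI, via differences in logs."""
    log_a = [math.log(v) for v in group_a]
    log_b = [math.log(v) for v in group_b]
    diff = mean(log_a) - mean(log_b)
    se = math.sqrt(variance(log_a) / len(log_a) + variance(log_b) / len(log_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(diff), math.exp(diff - z * se), math.exp(diff + z * se)


# Hypothetical pancreas volumes (mL) in two groups.
controls = [70.0, 82.0, 65.0, 90.0, 77.0]
cases = [52.0, 60.0, 48.0, 66.0, 58.0]
ratio, lower, upper = fold_change_ci(controls, cases)
```

The result answers "how many times bigger is the volume in one group than the other", which is the estimation framing suggested in clinic. For before-and-after data on the same patients, the same idea applies to within-patient differences in logs.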

- Statistical models: Ordinal regression model, cumulative probability with logit link.
- Outcome: Creating a hierarchical system (ordinal scale) for grading how much medical care is needed, reflecting severity of patients' conditions, would give more signal in data. Do not want to censor on death. Want to count death as worse than all other things. Death should be included in outcome otherwise interpretation is difficult. Need outcome that does not get interrupted by other outcomes (ex. infection could get interrupted by death). Example outcome: count number of days in hospital, rank death higher than those numbers. Talk to clinical experts to provide insight on ranking scales.
- Predictors: Recommend using health literacy as continuous variable and putting it in regression model as 2nd-degree polynomial. Include interaction of health literacy and rural/urban.
- VICTR Award for biostatistics support (90 hours):
- Application website (https://starbrite.app.vumc.org/)
- Research proposal template (https://starbrite.app.vumc.org/funding/templatesforms/)
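
The example outcome above (days in hospital, with death ranked worse than any day count) can be coded as a single ordinal variable. This is a minimal sketch; the one-year cap, the function name, and the patients are hypothetical, and the actual ranking should come from clinical experts.

```python
def ordinal_outcome(days_in_hospital, died, max_days=365):
    """Ordinal severity score: more hospital days is worse, death is worst.

    Death is placed above every possible day count instead of being censored.
    """
    if died:
        return max_days + 1
    return min(days_in_hospital, max_days)


# Hypothetical patients: (days in hospital, died during follow-up)
patients = [(0, False), (12, False), (3, True), (40, False)]
scores = [ordinal_outcome(d, died) for d, died in patients]

# Patient indices ordered from best to worst outcome
order = sorted(range(len(patients)), key=lambda i: scores[i])
```

Only the ordering of the scores matters to an ordinal regression model, so the exact value assigned to death is arbitrary as long as it exceeds every other level.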

- MDC95 - typically useful for doing power calculation to say what difference you would like to detect, not something used in decision making (raw measurements are best)
- Wanting better understanding of how U-statistics (pairwise difference) relate to MDC95 and ICC calculation - Likely a scaling issue, factor of 2 is built in to MDC95 not in U-statistic
- Suggestions/Notes:
- Reference the way U-statistic is calculated is more akin to SD unit than a 95% confidence interval, MDC is aimed at 95% CI
- Responsiveness index might have too many assumptions

- Reporting Options:
- Remove 1.96 from MDC95 and redo calculation (add another column)
- Only report what ICC is, what SD is, and the different correlations
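
The first reporting option (removing 1.96 from MDC95) can be sketched as below. The notes do not spell out the formulas, so this assumes the commonly used definitions SEM = SD x sqrt(1 - ICC) and MDC95 = 1.96 x sqrt(2) x SEM; the function names and input values are hypothetical.

```python
import math


def sem(sd, icc):
    """Standard error of measurement from the between-subject SD and the ICC."""
    return sd * math.sqrt(1 - icc)


def mdc(sd, icc, z=1.96):
    """Minimal detectable change; z=1.96 gives MDC95, z=1 drops the 95% scaling."""
    return z * math.sqrt(2) * sem(sd, icc)


# Hypothetical values: SD of 10 units, ICC of 0.84.
mdc95 = mdc(10.0, 0.84)
mdc_unscaled = mdc(10.0, 0.84, z=1.0)
```

The unscaled column is on an SD-like scale, which should be closer to how the U-statistic is expressed than the 95% version is.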

- Preparing to submit IRB for ecological study and would like help on how to do analysis for project.
- Looking at 1) number of deliveries and 2) number of terminations at Vanderbilt, by year, over last 20 years. Want to see relationship between the two over time. Also planning to look at the breakdown of type of termination procedure over time. Project aims to grow OB volume as a department.

- Suggestions:
- Recommend using quarterly and cumulative counts
- Plot data and look at correlation (number of deliveries by number of terminations)
- Plot curves over time (1 line for delivery, 1 line for terminations)
- Plot curves over time (1 line for each procedure type)
- Reference points for external sources - any significant events (legislation, clinic closing, etc.) notate at time on plots

- Community water system survey - over 3 year period (2017-2019)
- Most water systems don't incur violations
- Chose to use 3 year period to expand possibility of violation occurring - could occur based on staffing changes, etc.
- Dependent: 1. Total violations (combination of MR & HB) 2. MR violations 3. HB Violations (all 0/1)
- Concern: long term period only looking at first violation in the period
- Recommendations:
- 1. Use ordinal logistic regression to account for repeats (count of violations) or have smaller time intervals (monthly, quarterly) - repeats will be captured in multiple periods
- 2. Change "Total Violations" to "Violation of any type"
- 3. Anti-log results
- 4. Highlight confidence intervals of OR instead of p-values
- 5. Possibly look at how long in between each violation (inter-violation time)
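
Recommendations 3 and 4 amount to exponentiating each logistic regression coefficient and the endpoints of its confidence interval. This is a minimal sketch, assuming the coefficient and its standard error have been read off the model output; the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(coef, se, level=0.95):
    """Anti-log a logistic regression coefficient and its CI to the OR scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# Hypothetical coefficient (log odds ratio) and standard error.
or_est, or_lo, or_hi = odds_ratio_ci(coef=0.693, se=0.30)
```

The interval is computed on the log-odds scale and then transformed, so it is not symmetric around the odds ratio.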

- Most interested in NIS (National Inpatient Sample); sample is a probability sample with unequal probabilities
- Outcomes analysis is secondary for now, primary interest is in utilization and access
- Population receiving procedures tends to skew towards older
- Can match diagnoses to procedures for those diagnoses
- Ambulatory surgery is in a separate database (SASD); Sharon has used these state-specific databases and has data from 4 states
- Sharon had 1998-2008 NIS data; Tom reported that Neurosurgery (Peter Moroni) has data until 2016; also check Russell Rothman/Amy Graves; also David Penson
- New users at VUMC can sign a document to get access to data already at VUMC
- Data years needed depends on how rare procedure is (possibly a couple years needed)
- Most recent (2012-) NIS does not have hospital identifier; would know state and region (4 regions)
- 2015 is when ICD9 switched to ICD10 codes
- AHRQ advises against using NIS for state-specific statistics and recommends using SID; NIS can still make general understanding of trends (by region)
- Ideally want database of surgeons with information about where they practice and procedures performed
- What about CMS data? (can only get about 10% sample of data now, have to specify codes upfront)
- These types of projects are typically best utilizing inter-department resources (find out what bigger groups have already done, what has been purchased, see if you can pool resources)
- CMS (research identifiable files - carrier files - 2015-2016) is probably better for granular analysis, NIS for more general information
- For HCUP (NIS data) - can query database "https://hcupnet.ahrq.gov/"
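
Because NIS is a probability sample with unequal probabilities, national estimates need the discharge weights supplied with the data. The sketch below shows only the point estimates (a weighted total and a weighted proportion). The function names are hypothetical, and valid standard errors would also need the survey design (strata and clusters), which this sketch does not handle.

```python
def weighted_total(indicator, weights):
    """Estimated number of discharges with the procedure.

    indicator: 1/0 (or True/False) per sampled discharge
    weights:   discharge weight per sampled discharge
    """
    return sum(w for y, w in zip(indicator, weights) if y)

def weighted_proportion(indicator, weights):
    """Estimated share of all discharges that had the procedure."""
    return weighted_total(indicator, weights) / sum(weights)
```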

- There is growing interest in developing interventions that promote beiging of white adipose tissue in order to increase energy expenditure and possibly promote weight loss in obese individuals. Moreover, because beige fat dissipates energy as heat, an increase in beige fat could lead to increased utilization of glucose and lipids, and thus could have metabolic benefits such as increased insulin sensitivity.
- We will enroll 100 adult men and women (ages 18-55 years) without major medical problems. We will recruit approximately 50 lean subjects (18.5≤BMI<25 kg/m²) and 50 otherwise healthy obese subjects (BMI≥30 kg/m²).
- Aim 1: Determine whether beige adipose tissue markers differ between obese and lean individuals.
- Aim 2: Determine whether beige adipose tissue markers and energy expenditure are associated with natriuretic peptide receptor expression in adipose tissue and circulating natriuretic peptide levels in humans.

- Project was deemed eligible for a VICTR grant for biostatistical support

Clinic Notes:

- Random effects=subjects. One aim is to determine individual variability for each subject. Model time trajectory using flexible technique. Time to peak is the most important measure. Model time flexibly using e.g. restricted cubic spline, adjusted for random effects, and estimate the average time trajectory. Each subject will have their own intercept.
- Try a variety of within-subject summary statistics e.g. mean, median, 0.1 and 0.9 quantiles, SD or Gini's mean difference, mean absolute successive difference, slope from a linear spline fit per subject with a single knot at zero (infusion time).
- As an exploratory analysis, use the summary statistics to predict group membership with a binary logistic model. Do a chunk test for overall association, and if "significant", drill down to see which are apparently the biggest predictors of group. This is a univariate way of doing a multivariate model.
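
A minimal sketch of several of the within-subject summary statistics listed above, computed for one subject's series of measurements in time order. The function names are hypothetical; the quantiles use the standard library's "inclusive" interpolation, which is one of several conventions. The per-subject linear spline slope is not covered here.

```python
from statistics import mean, median, quantiles, stdev

def gini_mean_difference(x):
    """Mean absolute difference over all pairs of distinct observations."""
    n = len(x)
    total = sum(abs(a - b) for i, a in enumerate(x) for b in x[i + 1:])
    return 2 * total / (n * (n - 1))

def mean_abs_successive_difference(x):
    """Mean absolute change from one measurement to the next (time order)."""
    return mean(abs(b - a) for a, b in zip(x, x[1:]))

def subject_summaries(values):
    """Summary statistics for one subject's measurements, in time order."""
    deciles = quantiles(values, n=10, method="inclusive")  # 9 cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "q10": deciles[0],
        "q90": deciles[-1],
        "sd": stdev(values),
        "gini_md": gini_mean_difference(values),
        "masd": mean_abs_successive_difference(values),
    }
```

Each subject then contributes one row of summaries, which can be used as the predictors of group membership in the exploratory logistic model.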

- This is a pilot study using MRI images with contrast in patients with MS and healthy controls. Hypothesis is there will be increased permeability of the BBB in patients with MS. Outcome measure is mean of hundreds of measures. We could calculate mean and median and SD, see what measure would be best, and assess the measure over time to assess the stability of the process.
- Suggest that a pilot study should confirm feasibility and document observed variance, not assess a difference.
- Please come back to clinic with changes and for more feedback. Also see: https://www.youtube.com/watch?v=txMBa_Aa7RQ&feature=youtu.be

- Would suggest repeating the completed analysis using all variable levels. Suggest using all data, with multiple outcomes per patient (long file), and using measurement time as a covariate. Account for the within-patient clustering with the cluster sandwich variance estimator (robcov). Bootstrap estimates for fit. Come back to a future clinic if needed.
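
With several rows per patient in a long file, resampling has to respect the clustering. The sketch below is a cluster bootstrap (resampling whole patients with replacement) with a percentile interval. It is a stand-in for illustration, not the sandwich estimator itself, and the function name, arguments, and `(patient_id, value)` row layout are assumptions.

```python
import random
from statistics import mean

def cluster_bootstrap_ci(rows, stat=mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval that resamples patients, not rows.

    rows: iterable of (patient_id, value) pairs from a long-format file
    stat: function applied to the pooled values of each resample
    """
    rng = random.Random(seed)
    by_patient = {}
    for pid, value in rows:
        by_patient.setdefault(pid, []).append(value)
    ids = list(by_patient)
    estimates = []
    for _ in range(n_boot):
        sampled = rng.choices(ids, k=len(ids))  # patients, with replacement
        pooled = [v for pid in sampled for v in by_patient[pid]]
        estimates.append(stat(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Resampling individual rows instead would treat repeated measurements on the same patient as independent and would tend to give intervals that are too narrow.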

- Our project is a case series of 653 patients who have been treated for age-related macular degeneration under the step therapy protocol at VEI. We have completed data collection and preliminary analysis. We would like guidance on our analysis to confirm the statistical tests we are running are appropriate. Third clinic, mentor unable to attend.

- I’m submitting a VICTR resource request. Project will include RD custom data pull, and I have that quote ready. I would like to include biostatistics request on same application.

- To determine HRs for % increases and decreases in skeletal muscle mass and density - with survival as the outcome. The assumptions involved in this question are large, and may not hold up if studied in a larger study. Suggest looking at alternate data sources to study change over time. One option using the current data would be to describe the shape of the relationship.

- I would like to address if the demographic variables of age, race, ethnicity, income level, and preferred language correlate with acceptance to the Undiagnosed Diseases Network (UDN). I have little to no background in biostatistics so I am hoping to discuss what statistical analysis would be best to perform as well as the methodology of selecting subjects. There are over 300 applications, so I’m wondering if it would be best to include every application in the study or if I should randomly select applicants (and if so, how should I do that?). VICTR voucher/mentor confirmed. Masters project (no voucher support), may return to clinic as often as needed. Best approach for these data may be descriptive. Describe differences, and describe relationships between variables. Graphical means preferred.

- We are looking to assess the rates of infections in persons who inject drugs in Tennessee, compared to the nation as a whole. We are using HCUP data and plan to use TDoH HDDS as well. VICTR Voucher, Mentor confirmed. Study suitable for voucher. Analysis of infection rate in the state of TN.

- K23 with June deadline; will likely return with VICTR voucher request
- technical issues: (1) age normalization of RBANS will be inconsistent with the fact that EEG is not normalized. solution: direct adjustment for age, i.e. make age a part of the model. (2) make magnitudes the key quantities of interest, not whether an effect exists or not. (3) mean absolute difference from one measurement to the next, as a measure of entropy; or just multiple measures as opposed to one measure; which combo of summary measures is the most predictive?

- data is longitudinal, so need to consider within-patient correlation: baseline & 6 weeks (+/- 7 days)
- simplest design: within-person change on some marker. everybody gets the drug at baseline and evaluate difference at 6wks
- parallel-group design: baseline should cancel out and look at 6week value of controls vs 6week value of test group, controlling for baseline values.
- will return to clinic with previous study's standard deviation of CD4 to help estimate sample size required.
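
Once a standard deviation is available from the previous study, a first-pass sample size for the parallel-group design can be sketched with the usual normal-approximation formula for comparing two means, n per group = 2σ²(z₁₋α/₂ + z_power)²/δ². The function name and defaults are hypothetical, and the formula ignores the gain from adjusting for baseline, so it is a rough starting point only.

```python
import math
from statistics import NormalDist

def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Approximate n per arm to detect a mean difference `delta`.

    sd:    assumed standard deviation of the outcome (e.g. from a prior study)
    delta: smallest difference between arms worth detecting, in the same units
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sd * (z_alpha + z_power) / delta) ** 2)
```

For instance, `n_per_group(sd=1.0, delta=0.5)` corresponds to detecting a difference of half a standard deviation with 80% power at a two-sided 0.05 level.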

- myoton lymphedema device: comparing the device's ability to differentiate levels of stiffness etc. in the arm with lymphedema and the arm without lymphedema
- myoton handheld device - looking at lymphedema patients after breast cancer. skin stiffness/muscle stiffness/creep/frequency; lymphedema in one arm typically; n=11 patients; high noise in the device; possible skew in some area measurements
- recommendations: rather than taking ratios, use the difference of the logs of the per-arm sums and perform a nonparametric test; OR, don't transform: use the difference of the per-arm sums and then perform a nonparametric test (to help with interpretation)
- pearson correlation: in skin v muscle, beware of within-patient correlation. could average the correlation over each patient.
- suggestion: scatterplot of affected vs unaffected for each measurement (x=unaffected, y=affected) --> this will help with the decision of whether to transform
- other study: ambulation study with fitbit; 2wks prior to surgery and 8 wks after surgery; RQ: when do women return to baseline after surgery? Q1: when do you define returning to baseline? hitting once, or hitting and maintaining? Q2: how many patients?
- ratio may not be the best way to look at these data (it means something different for an inactive person to double their steps than for an active person to double their steps; ratios eliminate that baseline activity level information)
- consider looking at a one-week average of scores (will be more stable) rather than individual days; calculate a 7-day moving average and don't consider patients returned to baseline until it has returned to their pre-surgery average; there is too much variation in individual days
- analysis: for each patient, calculate number of days to return to baseline; then you can use that measure to run analyses; sample size will depend on variability and how long you can follow up patients (ideally to return to baseline or until you're confident they won't return to baseline)
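
A minimal sketch of the moving-average rule for the ambulation study: a patient counts as returned to baseline on the first post-surgery day whose trailing 7-day average step count reaches the pre-surgery average. The function names are hypothetical, and "hitting once" is assumed as the definition; a "hitting and maintaining" rule would need an extra condition.

```python
from statistics import mean

def moving_average(values, window=7):
    """Trailing moving average; entry k averages values[k : k + window]."""
    return [mean(values[i - window + 1:i + 1])
            for i in range(window - 1, len(values))]

def days_to_return(pre_steps, post_steps, window=7):
    """First post-surgery day (1-based) on which the trailing `window`-day
    average reaches the pre-surgery average, or None if it never does."""
    baseline = mean(pre_steps)
    for offset, avg in enumerate(moving_average(post_steps, window)):
        if avg >= baseline:
            return offset + window  # the day on which that window ends
    return None
```

A result of `None` means the patient had not returned to baseline by the end of follow-up, so that patient is censored at the last observed day rather than dropped.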

- The project is about whether a particular lab value (red blood cell distribution width, RDW) improves prognostic modeling for patients with glioblastoma multiforme. We have preop and multiple postop values for each patient, as well as the most important clinical variables predicting survival in this population. The main study question will involve a chunk test to determine if this lab contributes to predictive modeling. We also want to create a spaghetti plot of the lab values over time, and assess the trend in the lab value leading up to time of death.
- Project is eligible for a VICTR grant for biostatistical support
- Estimate of 40-80 hours required.

- RQ: how restrictions on work affect how they use restrooms -> urinary symptoms
- survey: pilot study done over 14 days / good response / recall data
- analyses: individual level and across groups; 5 groups of subjects, within hospital
- will likely return to clinic; considering VICTR voucher

- review of study; external validation

- Research Q: whether change in body surface area (%) between visit 1 &2 is correlated with survival.
- Question for clinic regarding units on hazard ratios.
- Looking at: (1) change in bsa between visits - percentage scale; (2) percent change in bsa between visits - will have more resistance accepting this measure
- Recommended to use log BSA1 and log BSA2 in the model instead. Put them both in the regression model, rather than using their difference. (Combining them into one variable forces the two visits to be weighted equally.)
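
On the units question: when log BSA enters a Cox model, the coefficient is per one-unit increase in log BSA, which is awkward to report. A hazard ratio for a chosen fold change is usually easier to read. The sketch below assumes natural logs and a hypothetical helper name. With both terms in the model, β₁·log BSA1 + β₂·log BSA2, the log ratio (change) is the special case β₁ = -β₂.

```python
import math

def hr_for_fold_change(beta_log_bsa, fold):
    """Hazard ratio for multiplying BSA by `fold`, other terms held fixed.

    beta_log_bsa: Cox coefficient for natural-log BSA
    fold:         e.g. 1.10 for a 10% increase, 0.90 for a 10% decrease
    """
    return math.exp(beta_log_bsa * math.log(fold))
```

This is the same as `fold ** beta_log_bsa`, so a 10% increase and a 10% decrease are not exact mirror images on the hazard ratio scale.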

- We are conducting a secondary analysis of data collected regarding adherence with an ICU care bundle (i.e., the ABCDE bundle). In this hypothesis-generating secondary analysis, I’m attempting to use PYMC3 to develop a hierarchical regression model (both binary logistic & ordinal logistic) to determine if physical distances from some pieces of equipment are influential on bundle adherence. I took Chris Fonnesbeck’s class last Spring, but this is my first “real” analysis using pymc3. I was hoping to get feedback from Bayesian folks on whether I have appropriately parameterized my model(s). For the hierarchical component, we have multiple observations per patient and multiple patients within each of 10 units.
- Adherence measured per patient per day, binary yes/no.
- Currently looking at binomial model and hierarchical model.
- More likely to adhere the longer they're in hospital.
- Actual distance per patient not collected, so they're looking at close vs far scale, or possibly mean of longest/shortest distance.
- Distance measured in inches...
- Imputation: try R methods
- Use indicator variable to absorb lack of fit of distance variable and put prior on it; use additively.
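
To make the hierarchical structure concrete, the sketch below writes out one plausible linear predictor for daily adherence: a unit-level intercept, a patient-level intercept nested in the unit, a hospital-day effect, and a close/far indicator. This is plain Python rather than PyMC3, every name is hypothetical, and the actual parameterization and priors discussed in clinic may differ.

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def adherence_probability(intercept, unit_effect, patient_effect,
                          beta_day, day, beta_far, far):
    """P(bundle adherence) for one patient-day.

    unit_effect:    random intercept for the patient's unit (10 units)
    patient_effect: random intercept for the patient within the unit
    day:            hospital day of the observation
    far:            1 if the equipment is classed as far, 0 if close
    """
    eta = (intercept + unit_effect + patient_effect
           + beta_day * day + beta_far * far)
    return inv_logit(eta)
```

In a Bayesian fit, `unit_effect` and `patient_effect` would each get a zero-centered prior with its own standard deviation, and each patient-day outcome would be Bernoulli with this probability.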

- VA project to assess performance of stress test in patients undergoing stress testing prior to low risk surgery.
- Looking for recommendations on variables to collect. Hoping for n=200 patients. With so few patients, the probabilities cannot come from their own data. A published risk model, with details of its coefficients and scoring rules, will need to be found, and the variables included in that model collected. Then they will return to discuss analyzing probabilities and showing the difference post versus pre test.
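
One reading of the pre versus post test comparison, sketched under loud assumptions: the published model is a logistic model whose coefficients give a pre-test probability, and the stress test result updates it through a likelihood ratio. The function names are hypothetical, and no coefficients, sensitivity, or specificity values are implied by the notes.

```python
import math

def risk_from_model(intercept, coefficients, values):
    """Pre-test probability from a published logistic risk model."""
    lp = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-lp))

def post_test_probability(pre_test_prob, sensitivity, specificity, positive=True):
    """Update a pre-test probability with a test result via its likelihood ratio."""
    if positive:
        lr = sensitivity / (1.0 - specificity)
    else:
        lr = (1.0 - sensitivity) / specificity
    odds = pre_test_prob / (1.0 - pre_test_prob) * lr
    return odds / (1.0 + odds)
```

The per-patient difference between the post-test and pre-test probability is then the quantity to summarize. In a low-risk population it is often small.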


Topic revision: r2 - 15 Jan 2021, HeatherPrigmore

Copyright © 2013-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
