Using the fill.missing() function

The function fill.missing() is automatically called by gendistance(), so the user will often be able to skip calling it directly. However, there are times when the user will need more control over the construction of the dataset with the missing values imputed and missingness indicators added. In those cases, the user may use fill.missing() directly to build their dataset.

The following example is written in R. Explanations are given in comment lines, which begin with a # sign. The entire example should run when copied and pasted into R.

The example dataset is based on the study "Clinical inertia: a common barrier to changing provider prescribing behavior" (Roumie et al. 2007). The dataset has been simulated using the variables, expectations, covariance structure, and missingness patterns of the actual data. The study's data is protected; no real patient or provider data is presented here. The primary study intervention consisted of an electronic alert system notifying providers when they were about to meet with a patient who qualified for hypertension therapy intensification under existing guidelines. The dataset consists of 1,341 patients from 182 providers. There are 2 deidentified ID variables (PatientDeID, ProvDeID), 5 provider characteristics (ProviderMale, Physician, NonPhysicianClinician, Resident, ProviderAge), and 19 patient characteristics (last_sys1_pre, last_dia1_pre, Active_medsPRE, ACE, CCB, BB, Diuretic, OtherDrug, dm_pre, lipids_pre, serum_cr_pre, stage2htnpre, currentSmoker, charlson0, charlson1or2, charlson3plus, PatientMale, PatientBlackorhispanic, PatientAge).

Missingness exists in the following variables:
  • systolic and diastolic blood pressure (if one is missing, the other is also missing)
  • number of active medications pre-study-intervention
  • serum creatinine pre-study-intervention
  • patient smoking status
  • patient race
  • provider age
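
One way to confirm which variables contain missing values is to count the NA entries in each column. The sketch below uses a small made-up data frame (the object name toy and all of its values are invented for illustration; only the column names echo the study variables). The same two calls could be applied to the study dataset once it has been read in.

# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  last_sys1_pre = c(150, NA, 162, 171),
  last_dia1_pre = c(92, NA, 88, 95),
  PatientAge    = c(61, 70, 58, 66)
)

# Count the missing values in each column.
colSums(is.na(toy))

# List only the variables that have at least one missing value.
names(toy)[colSums(is.na(toy)) > 0]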

Important - Update Package

As of September 2012, the nbpMatching package is being updated frequently. To avoid running into a bug that has already been fixed, update the package with the code below. Note that you may need the latest version of R, especially if installing from R-Forge.

# Remove the old version and install nbpMatching and the essential Hmisc package.
install.packages("nbpMatching", repos="")

# Load the package so that fill.missing() is available.
library(nbpMatching)


# Read in the data.
d1 <- read.csv("", na.strings= c("NA"))

# Remove the ID variables; otherwise fill.missing() will use them in the imputation model.
# Remove the reference categories for the medications, Charlson-Deyo scores, and provider categories.
# Note that the medications (ACE, CCB, BB, Diuretic, OtherDrug) are broken into separate indicators.
# Using indicators is helpful because fill.missing() may treat a coding such as 1=ACE, 2=CCB, etc.
# as a continuous variable, depending on the number of levels. However, including indicators for all
# levels may result in a singular covariate matrix (in this case every patient was on exactly one of
# those medications). Thus, one indicator must be chosen as the referent and dropped for the imputation.
d2 <- subset(d1, select=-c(PatientDeID, ProvDeID, ACE, charlson0, Physician))
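
# Illustration of the singularity issue (a sketch on made-up data, not the study dataset,
# and not a reproduction of fill.missing()'s internals). When every row has exactly one
# indicator set, the indicators sum to 1 in every row, so together with an intercept
# column they are linearly dependent. The names toy.meds, X.all, and X.ref are hypothetical.
toy.meds <- data.frame(ACE = c(1, 0, 0, 1), CCB = c(0, 1, 0, 0), BB = c(0, 0, 1, 0))
# Intercept plus all three indicators: 4 columns.
X.all <- cbind(Intercept = 1, as.matrix(toy.meds))
# Intercept plus two indicators, with ACE dropped as the referent: 3 columns.
X.ref <- cbind(Intercept = 1, as.matrix(toy.meds[, c("CCB", "BB")]))
# The rank falls short of the column count when all indicators are kept ...
qr(X.all)$rank < ncol(X.all)
# ... and matches the column count once the referent is dropped.
qr(X.ref)$rank == ncol(X.ref)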

# Impute the missing values.
d3 <- fill.missing(d2)


# Put the ID variables back in.
d4 <- cbind(d1$PatientDeID, d1$ProvDeID, d3)
names(d4)[1:2] <- c("PatientDeID", "ProvDeID")

# Recall that if systolic BP is missing, diastolic will be missing too, so only one of the two missingness indicators is needed.
# Identify any duplicate variables by name.
names(d4)[duplicated(t(d4))]

# Drop any duplicate missingness indicators.
d5 <- subset(d4, select=names(d4)[!duplicated(t(d4))])
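
# Illustration of how the duplicate check works (a sketch on made-up data; toy.dup, kept,
# and the column names are hypothetical). t() turns the columns into rows, and duplicated()
# then flags any row that is identical to an earlier one, so the first copy of a repeated
# column is kept and later copies are dropped.
toy.dup <- data.frame(x = c(1, 2, 3), sys.missing = c(0, 1, 0), dia.missing = c(0, 1, 0))
# Only the third column repeats an earlier column.
duplicated(t(toy.dup))
# Keep the columns that are not flagged as duplicates.
kept <- subset(toy.dup, select=names(toy.dup)[!duplicated(t(toy.dup))])
names(kept)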

# If needed, save the dataset as a csv file.
write.csv(d5, "C:\\DataWithImputedValue.csv", row.names=FALSE)

Topic revision: r3 - 10 Sep 2012, RobertGreevy
