What is the `dataMaid` package?

Goals

Examples of errors in data and data cleaning checks

Workflow with `dataMaid`

Summarize/Visualize

makeDataReport {dataMaid}

R Documentation

Produce a data report

Description

Make a data overview report that summarizes the contents of a dataset and flags potential problems. The potential problems are identified by running a set of class-specific validation checks, so that different checks are performed on different variable types. The checking steps can be customized according to user input and/or the data type of the inputted variable. The checks are saved to an R markdown file which can be rendered into an easy-to-read data report in pdf, html or word formats. This report also includes summaries and visualizations of each variable in the dataset.

Usage

makeDataReport(data, output = NULL, render = TRUE, useVar = NULL,
  ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE,
  labelled_as = c("factor"), mode = c("summarize", "visualize", "check"),
  smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"),
  file = NULL, replace = FALSE, vol = "", standAlone = TRUE,
  twoCol = TRUE, quiet = TRUE, openResult = TRUE,
  summaries = setSummaries(), visuals = setVisuals(),
  checks = setChecks(), listChecks = TRUE, maxProbVals = 10,
  maxDecimals = 2, addSummaryTable = TRUE, codebook = FALSE,
  reportTitle = NULL, treatXasY = NULL, ...)

Arguments

`data`	The dataset to be checked. This dataset should be of class `data.frame`, `tibble` or `matrix`. If it is of class `matrix`, it will be converted to a `data.frame`.
`output`	Output format. Options are `"pdf"`, `"word"` (.docx) and `"html"`. If `NULL` (the default), the output format depends on two sequential checks. First, whether a LaTeX installation is available, in which case `pdf` output is chosen. Secondly, if no LaTeX installation is found, then if the operating system is Windows, `word` output is used. Lastly, if neither of these checks is positive, `html` output is used.
`render`	Should the output file be rendered (defaults to `TRUE`), i.e. should a pdf/word/html document be generated and saved to disk?
`useVar`	Variables to describe in the report. If `NULL` (the default), all variables in `data` are included. If a vector of variable names is supplied, only the variables in `data` that are also in `useVar` are included in the data report.
`ordering`	Choose the ordering of the variables in the variable presentation. The options are “asIs” (ordering as in the dataset) and “alphabetical” (alphabetical order).
`onlyProblematic`	A logical. If `TRUE`, only the variables flagged as problematic in the check step will be included in the variable list.
`labelled_as`	A string explaining the way to handle labelled vectors. Currently `"factor"` (the default) is the only possibility. This means that labelled variables that appear factor-like (by having a non-`NULL` `labels`-attribute) will be treated as factors, while other labelled variables will be treated as whatever base variable class they inherit from.
`mode`	Vector of tasks to perform among the three categories "summarize", "visualize" and "check". The default, `c("summarize", "visualize", "check")`, implies that all three steps are performed. The steps selected in `mode` will be performed for each variable in `data` and their results are presented in the second part of the outputted data report. The "summarize" step is responsible for creating the summary table, the "visualize" step is responsible for creating the plot and the "check" step is responsible for performing checks on the variable and printing the results if any problems are found.
`smartNum`	If `TRUE` (the default), numeric and integer variables with fewer than 5 unique values are treated as factor variables in the checking, visualization and summary steps, and a message notifying the reader of this is printed in the data summary.
`preChecks`	Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks find problems, the variable will not be summarized nor visualized nor checked.
`file`	The filename of the outputted rmarkdown (.Rmd) file. If set to `NULL` (the default), the filename will be the name of `data` prefixed with "dataMaid_", if this qualifies as a valid file name (e.g. no special characters allowed). Otherwise, `makeDataReport()` tries to create a valid filename by substituting illegal characters. Note that a valid file is of type .Rmd, hence all filenames should have a ".Rmd"-suffix.
`replace`	If `FALSE` (the default), an error is thrown if one of the files that are about to be created (the .Rmd overview file and possibly also a .html, .pdf or .docx file) already exists. If `TRUE`, no checks are performed and files on disk thus might be overwritten.
`vol`	Extra text string or numeric that is appended on the end of the output file name(s). For example, if the dataset is called “myData”, no file argument is supplied and `vol=2`, the output file will be called “dataMaid_myData2.Rmd”
`standAlone`	A logical. If `TRUE`, the document begins with a markdown YAML preamble such that it can be rendered as a stand alone rmarkdown file, e.g. by calling `render`. If `FALSE`, this preamble is removed. Moreover, no matter the input to the `render` argument, the document will now not be rendered, as it has no preamble.
`twoCol`	A logical. Should the results from the summarize and visualize steps be presented in two columns? Defaults to `TRUE`.
`quiet`	A logical. If `TRUE` (the default), only a few messages are printed to the screen as `makeDataReport` runs. If `FALSE`, no messages are suppressed. The third option, `silent`, renders the function completely silent, such that only fatal errors are printed.
`openResult`	A logical. If `TRUE` (the default), the last file produced by `makeDataReport` is automatically opened by the end of the function run. This means that if `render = TRUE`, the rendered pdf, word or html file is opened, while if `render = FALSE`, the .Rmd file is opened.
`summaries`	A list of summaries to use on each supported variable type. We recommend using `setSummaries` for creating this list and refer to the documentation of this function for more details.
`visuals`	A list of visual functions to use on each supported variable type. We recommend using `setVisuals` for creating this list and refer to the documentation of this function for more details.
`checks`	A list of checks to use on each supported variable type. We recommend using `setChecks` for creating this list and refer to the documentation of this function for more details.
`listChecks`	A logical. Controls whether the checks that were used for each possible variable type are summarized in the output. Defaults to `TRUE`.
`maxProbVals`	A positive integer or `Inf`. Maximum number of unique values printed from check-functions. In the case of `Inf`, all problematic values are printed. Defaults to `10`.
`maxDecimals`	A positive integer or `Inf`. Number of decimals used when printing numerical values in the data summary and in problematic values from the data checks. If `Inf`, no rounding is performed.
`addSummaryTable`	A logical. If `TRUE` (the default), a summary table of the variable checks is added between the Data Cleaning Summary and the Variable List. Only one of `addSummaryTable` and `addCodebookTable` can be `TRUE`.
`codebook`	A logical. Defaults to `FALSE`. If `TRUE` then the document is tweaked to better represent a codebook.
`reportTitle`	A text string. If supplied, this will be the printed title of the report. If left unspecified, the title is based on the name of the supplied dataset.
`treatXasY`	A list that indicates how non-standard variable classes should be treated. This parameter allows you to include variables that are not of class `factor`, `character`, `labelled`, `numeric`, `integer`, `logical` nor `Date` (or a class that inherits from any of these classes). The names of the list are the new classes and the entries are the names of the classes they should be treated as. If `makeDataReport()` should e.g. treat variables of class `raw` as characters and variables of class `complex` as numeric, you should put `treatXasY = list(raw = "character", complex = "numeric")`.
`…`	Other arguments that are passed on to the precheck, checking, summary and visualization functions.
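
The default choice of `output` described in the argument list can be read as a small decision rule. The sketch below only illustrates that rule in base R: `chooseOutput()` and its two logical arguments are hypothetical names, not part of `dataMaid`, and how the package itself detects a LaTeX installation is not shown here.

```r
# Hypothetical helper mirroring the documented fallback order for output = NULL.
# Both arguments are assumptions standing in for whatever detection dataMaid performs.
chooseOutput <- function(hasLatex, isWindows) {
  if (hasLatex) {
    "pdf"    # first check: a LaTeX installation is available
  } else if (isWindows) {
    "word"   # second check: no LaTeX, but the operating system is Windows
  } else {
    "html"   # neither check is positive
  }
}

# Example call: pretend no LaTeX was found and look at the current platform.
chooseOutput(hasLatex = FALSE, isWindows = .Platform$OS.type == "windows")
```

Passing `output` explicitly skips this fallback entirely, which may be preferable when a report has to look the same on every machine.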

Details

For each variable, a set of pre-check functions (controlled by the `preChecks` argument) is first run, and then a battery of functions is applied depending on the variable class. For each variable type, the summarize/visualize/check functions are applied and the results are written to an R markdown file.
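
The two-stage flow described here (pre-checks first, then class-specific work) can be sketched in a few lines of base R. This is a simplified illustration, not the package's implementation: every function name below is hypothetical, and the pre-checks are crude stand-ins for `isSingular` and `isSupported`.

```r
# Simplified, hypothetical stand-ins for pre-check functions: TRUE means "problem found".
preChecksSketch <- list(
  singular    = function(v) length(unique(na.omit(v))) <= 1,
  unsupported = function(v) is.list(v)
)

# Class-specific work, reduced here to a one-line description per variable.
describeByClass <- function(v) {
  if (is.numeric(v)) {
    sprintf("numeric, median %s", format(median(v, na.rm = TRUE)))
  } else {
    sprintf("%s, %d unique values", class(v)[1], length(unique(v)))
  }
}

processVariable <- function(v) {
  failed <- vapply(preChecksSketch, function(f) f(v), logical(1))
  if (any(failed)) {
    # A failed pre-check means the variable is not summarized, visualized or checked.
    return(paste("skipped:", paste(names(failed)[failed], collapse = ", ")))
  }
  describeByClass(v)
}

df <- data.frame(id = c("a", "b", "c"), x = c(1, 5, 9), const = c(7, 7, 7),
                 stringsAsFactors = FALSE)
report <- vapply(df, processVariable, character(1))
report
```

The constant column is stopped at the pre-check stage, while the other two columns reach the class-specific step.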

Value

The function does not return anything. Its side effect (the production of a data report) is the reason for running the function.

Examples

data(testData)
data(toyData)

check(toyData)

 ## Not run: 
DF <- data.frame(x = 1:15)
makeDataReport(DF)

## End(Not run)

## Not run: 
data(testData)
makeDataReport(testData)

## End(Not run)

# Overwrite any existing files generated by makeDataReport
## Not run: 
makeDataReport(testData, replace=TRUE)

## End(Not run)

# Change output format to Word/docx:
## Not run: 
makeDataReport(testData, replace=TRUE, output = "word")

## End(Not run)

# Only include problematic variables in the output document
## Not run: 
makeDataReport(testData, replace=TRUE, onlyProblematic=TRUE)

## End(Not run)

# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
## Not run: 
wheresWally <- function(v, ...) {
     res <- grepl("wally", v, ignore.case=TRUE)
     problem <- any(res)
     message <- "Wally was found in these data"
     checkResult(list(problem = problem,
                      message = message,
                      problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
                             description = "Search for the string 'wally' ignoring case",
                             classes = c("character")
                             )
# Add the newly defined function to the list of checks used for characters.
makeDataReport(testData, 
      checks = setChecks(character = defaultCharacterChecks(with = "wheresWally")),
      replace=TRUE)

## End(Not run)

#Handle non-supported variable classes using treatXasY: treat raw as character and
#treat complex as numeric. We also add a list variable, but as lists are not 
#handled through treatXasY, this variable will be caught in the preChecks and skipped:
## Not run: 
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
makeDataReport(toyData, replace  = TRUE, 
    treatXasY = list(raw = "character", complex = "numeric"))

## End(Not run)

[Package dataMaid version 1.1.0 Index]

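library(dataMaid)
library(data.table) ## for as.data.table() and the DT[, .(...)] syntax used below
library(magrittr)   ## for the %>% pipe used below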
head(bigPresidentData)

##      lastName firstName orderOfPresidency   birthday dateOfDeath  stateOfBirth       party presidencyBeginDate presidencyEndDate assassinationAttempt
## 38       Ford    Gerald                38 1913-07-14  2006-12-26      Nebraska  Republican          1974-08-09        1977-01-20                    1
## 2       Adams      John                 2 1735-10-30  1826-07-04 Massachusetts  Federalist          1797-03-04        1801-03-04                    0
## 31     Hoover   Herbert                31 1874-08-10  1964-10-20          Iowa  Republican          1929-03-04        1933-03-04                    0
## 1  Washington    George                 1 1732-02-22  1799-12-14      Virginia Independent          1789-04-30        1797-03-04                    0
## 13   Fillmore   Millard                13 1800-01-07  1874-03-08      New York        Whig          1850-07-09        1853-03-04                    0
## 42    Clinton   William                42 1946-08-19        <NA>      Arkansas  Democratic          1993-01-20        2001-01-20                    0
##     sex ethnicity presidencyYears ageAtInauguration favoriteNumber
## 38 Male Caucasian               2                61           2+0i
## 2  Male Caucasian               3                61           4+0i
## 31 Male Caucasian               4                54           5+0i
## 1  Male Caucasian               7                57           3+0i
## 13 Male Caucasian               2                50           7+0i
## 42 Male Caucasian               8                46           7+0i

bigPresidentData <- as.data.table(bigPresidentData)

makeDataReport(bigPresidentData)

Default report:

Identifies miscoded missing values
Identifies prefixed and suffixed whitespace (values that appear with leading or trailing whitespace)
Checks whitespaces
Identifies levels with < 6 observations
Identifies case issues
Identifies misclassified numeric or integer variables
Identifies outliers
Provides a summary table of variable class, number of unique observations, missingness and any problems
“Column-wise” checks

Using `dataMaid` interactively

Check

allCheckFunctions()

name	description	classes
identifyCaseIssues	Identify case issues	character, factor
identifyLoners	Identify levels with < 6 obs.	character, factor
identifyMissing	Identify miscoded missing values	character, Date, factor, integer, labelled, logical, numeric
identifyNums	Identify misclassified numeric or integer variables	character, factor, labelled
identifyOutliers	Identify outliers	Date, integer, numeric
identifyOutliersTBStyle	Identify outliers (Turkish Boxplot style)	Date, integer, numeric
identifyWhitespace	Identify prefixed and suffixed whitespace	character, factor, labelled
isCPR	Identify Danish CPR numbers	character, Date, factor, integer, labelled, logical, numeric
isEmpty	Check if the variable contains only a single value	character, Date, factor, integer, labelled, logical, numeric
isKey	Check if the variable is a key	character, Date, factor, integer, labelled, logical, numeric
isSingular	Check if the variable contains only a single value	character, Date, factor, integer, labelled, logical, numeric
isSupported	Check if the variable class is supported by dataMaid.	character, Date, factor, integer, labelled, logical, numeric

## syntax

## numeric class
check(bigPresidentData$presidencyYears
      , numericChecks = c("identifyMissing","identifyOutliers"))

## $identifyMissing
## The following suspected missing value codes enter as regular values: Inf.
## $identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.

check(bigPresidentData$presidencyYears
      , checks = setChecks(numeric = c("identifyMissing")))

## $identifyMissing
## The following suspected missing value codes enter as regular values: Inf.

check(bigPresidentData$presidencyYears
      , checks = setChecks())

## $identifyMissing
## The following suspected missing value codes enter as regular values: Inf.
## $identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.

## factor class
check(bigPresidentData$ethnicity
      , factorChecks = c("identifyCaseIssues","identifyLoners"))

## $identifyCaseIssues
## No problems found.
## $identifyLoners
## Note that the following levels have at most five observations: African American.

check(bigPresidentData$ethnicity
      , checks = setChecks(factor = c("identifyLoners")))

## $identifyLoners
## Note that the following levels have at most five observations: African American.

check(bigPresidentData$ethnicity
      , checks = setChecks())

## $identifyMissing
## No problems found.
## $identifyWhitespace
## No problems found.
## $identifyLoners
## Note that the following levels have at most five observations: African American.
## $identifyCaseIssues
## No problems found.
## $identifyNums
## No problems found.

## both
check(bigPresidentData[,.(ethnicity, presidencyYears)]
      , checks = setChecks(numeric = "identifyOutliers"
                         , factor  = "identifyLoners"))

## $ethnicity
## $ethnicity$identifyLoners
## Note that the following levels have at most five observations: African American.
## 
## $presidencyYears
## $presidencyYears$identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.

identifyWhitespace(bigPresidentData$firstName)

## No problems found.

identifyWhitespace(bigPresidentData$lastName)

## The following values appear with prefixed or suffixed white space:  Truman.

check(bigPresidentData[,.(firstName, lastName)]
      , checks = setChecks(character = "identifyWhitespace"))

## $firstName
## $firstName$identifyWhitespace
## No problems found.
## 
## $lastName
## $lastName$identifyWhitespace
## The following values appear with prefixed or suffixed white space:  Truman.
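
Besides printing a message, a check result can be inspected programmatically. The sketch below assumes the result carries the same three fields that the `checkResult()` call in the `makeDataReport` examples fills in (`problem`, `message`, `problemValues`). The input vector is made up for illustration.

```r
library(dataMaid)

# Made-up input: one value has a leading space.
lastNames <- c("Ford", " Truman", "Adams")

res <- identifyWhitespace(lastNames)

res$problem        # logical flag: was anything found?
res$problemValues  # the offending values, useful for follow-up cleaning
```

Working with the fields rather than the printed message makes it easier to feed the flagged values into a cleaning step such as `trimws()`.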

Visualize

allVisualFunctions()

name	description	classes
basicVisual	Histograms and barplots using graphics	character, Date, factor, integer, labelled, logical, numeric
standardVisual	Histograms and barplots using ggplot2	character, Date, factor, integer, labelled, logical, numeric

Summarize

allSummaryFunctions()

name	description	classes
centralValue	Compute median for numeric variables, mode for categorical variables	character, Date, factor, integer, labelled, logical, numeric
countMissing	Compute proportion of missing observations	character, Date, factor, integer, labelled, logical, numeric
minMax	Find minimum and maximum values	integer, numeric, Date
quartiles	Compute 1st and 3rd quartiles	Date, integer, numeric
uniqueValues	Count number of unique values	character, Date, factor, integer, labelled, logical, numeric
variableType	Data class of variable	character, Date, factor, integer, labelled, logical, numeric

summarize(bigPresidentData$lastName)

## $variableType
## Variable type: character
## $countMissing
## Number of missing obs.: 0 (0 %)
## $uniqueValues
## Number of unique values: 40
## $centralValue
## Mode: "Adams"

# summarize(bigPresidentData)

lapply(bigPresidentData, variableType) %>% head

## $lastName
## Variable type: character
## $firstName
## Variable type: character
## $orderOfPresidency
## Variable type: factor
## $birthday
## Variable type: Date
## $dateOfDeath
## Variable type: Date
## $stateOfBirth
## Variable type: character

Extending `dataMaid`

Customize your own checking, visual, or summary functions!

Templates:

mySummaryFunction <- function(v, ...){
  val <- [ result of whatever summary you want to do ]
  res <- [ properly escaped version of val ]
  summaryResult(list( feature = "[Feature name]"
                    , result  = res
                    , value   = val
                ))
}
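
As an illustration, the summary template can be filled in as follows. This is a minimal sketch: `countZeros` is a name chosen for this example, and restricting it to integer and numeric variables is an assumption made here.

```r
library(dataMaid)

# Count how many entries of a numeric variable are exactly zero.
countZeros <- function(v, ...) {
  val <- sum(v == 0, na.rm = TRUE)
  summaryResult(list(feature = "No. zeros", result = val, value = val))
}

# Attach a description and the classes the function is meant for.
countZeros <- summaryFunction(countZeros,
                              description = "Count number of zeros",
                              classes = c("integer", "numeric"))

z <- countZeros(c(0, 3, 0, NA, 7))
z$value
```

Here `result` and `value` hold the same number. For summaries that need escaping or rounding before printing, `result` would hold the formatted version and `value` the raw one.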

isSSN <- function(v, nMax = NULL, ...){
  out <- list(problem = FALSE
            , message = ""
            , problemValues = NULL)
  ## inherits() is safe when class(v) has more than one element
  if (inherits(v, c("character","factor","labelled"))){
    ## SSN pattern: 3 digits, 2 digits, 4 digits
    if(any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
      out$problem <- TRUE
      out$message <- "Warning: may contain SSNs"
      out$problemValues <- "Will not show"
    }
  }
  checkResult(out)
}

Examples:

Basic example:

refCat <- function(v, ...) {
  out <- list(factor = FALSE
            , reference = ""
            , problemValues = NULL)
  if(class(v) %in% c("factor")) {
    out$factor <- TRUE
    out$reference <- levels(v)[1]
    out$problemValues <- "Not applicable"
  }
  out
}

refCat <- summaryFunction(refCat
                        , description = "Identifies reference level"
                        , classes = c("factor"))

check(bigPresidentData$sex, factorChecks = "refCat")

## $refCat
## $refCat$factor
## [1] TRUE
## 
## $refCat$reference
## [1] "Male"
## 
## $refCat$problemValues
## [1] "Not applicable"

More advanced example:

identifyNonStartCase <- function(v, nMax = 10, ...){
  v <- as.character(unique(na.omit(v))) ## omit NA values, keep unique values; as.character() lets factors through strsplit()
  vSplit <- strsplit(v, split = " ") ## split around blank spaces
  vSplitAllLower <- lapply(vSplit, tolower) ## make all lowercase; lapply() keeps one list element per value
  helper <- function(x){ ## helper function to make first letter capital
    capFirstLetters <- toupper(substring(x, 1, 1))
    x <- paste(capFirstLetters, substring(x, 2), sep = "")
    x
  }
  vSplitStartCase <- lapply(vSplitAllLower, helper) ## start-case version of each word
  vStartCase <- vapply(vSplitStartCase, paste, character(1), collapse = " ") ## re-join the words
  
  ## find where v and vStartCase differ
  problemPlaces <- v != vStartCase
  
  if(any(problemPlaces)){
    problemValues <- v[problemPlaces]
  } else {
    problemValues <- NULL
  }
  
  problem <- any(problemPlaces)
  
  problemStatus <- list(problem = problem
                      , problemValues = problemValues)
  problemMessage <- "The following variables were not in start case:"
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
  
  checkResult(list(problem = problem
                 , message = outMessage
                 , problemValues = problemValues))
}

identifyNonStartCase <- checkFunction(identifyNonStartCase
                                    , description = "Identifies entries that are not written in start case"
                                    , classes = c("character", "factor"))

check(bigPresidentData$stateOfBirth, checks = setChecks(character = "identifyNonStartCase"))

## $identifyNonStartCase
## The following variables were not in start case: New york.

allCheckFunctions()

name	description	classes
identifyNonStartCase	Identifies entries that are not written in start case	character, factor
identifyCaseIssues	Identify case issues	character, factor
identifyLoners	Identify levels with < 6 obs.	character, factor
identifyMissing	Identify miscoded missing values	character, Date, factor, integer, labelled, logical, numeric
identifyNums	Identify misclassified numeric or integer variables	character, factor, labelled
identifyOutliers	Identify outliers	Date, integer, numeric
identifyOutliersTBStyle	Identify outliers (Turkish Boxplot style)	Date, integer, numeric
identifyWhitespace	Identify prefixed and suffixed whitespace	character, factor, labelled
isCPR	Identify Danish CPR numbers	character, Date, factor, integer, labelled, logical, numeric
isEmpty	Check if the variable contains only a single value	character, Date, factor, integer, labelled, logical, numeric
isKey	Check if the variable is a key	character, Date, factor, integer, labelled, logical, numeric
isSingular	Check if the variable contains only a single value	character, Date, factor, integer, labelled, logical, numeric
isSupported	Check if the variable class is supported by dataMaid.	character, Date, factor, integer, labelled, logical, numeric

Customizing document

Validate

library(validate)
library(knitr) ## for kable()

The `validate` package mostly serves for logic checks (e.g., that a death date is later than the birth date) and other row-wise checks.
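
Before turning to the president data, the idea can be tried on a tiny data set. The two-row data frame and the rule name `deathAfterBirth` below are made up for illustration. `validator()`, `confront()` and `summary()` are the same calls used in the rest of this section.

```r
library(validate)

# Hypothetical two-row data set: the second row has a death date before the birth date.
toy <- data.frame(
  birthday    = as.Date(c("1900-01-01", "1950-06-15")),
  dateOfDeath = as.Date(c("1980-01-01", "1940-01-01"))
)

# One named rule, evaluated row by row.
rules <- validator(deathAfterBirth = dateOfDeath > birthday)

toyResult <- summary(confront(toy, rules))
toyResult[, c("name", "items", "passes", "fails")]
```

Each row of the summary corresponds to one rule, so a single glance shows how many records pass and how many fail it.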

First, create a validator object:

validator1 <- validator(
  ageAtDeath := floor((dateOfDeath - birthday)/365.25)
  , `Adult president` = ageAtInauguration >= 18
  , `Alive at inauguration` = ageAtDeath >= ageAtInauguration
  , `Positive first name` = firstName*2 > firstName
  , `Death by assassination` = 
             if (dateOfDeath == presidencyEndDate) 
               assassinationAttempt == 1
  , `Begin date` = difftime(presidencyEndDate, as.Date("176-08-04")) > 0
)

confront_messy <- confront(bigPresidentData, validator1)
summary(confront_messy) %>% kable

name	items	passes	fails	nNA	error	warning	expression
Adult.president	47	47	0	0	FALSE	FALSE	ageAtInauguration >= 18
Alive.at.inauguration	47	40	1	6	FALSE	FALSE	floor((dateOfDeath - birthday)/365.25) >= ageAtInauguration
Positive.first.name	0	0	0	0	TRUE	FALSE	firstName * 2 > firstName
Death.by.assassination	47	38	3	6	FALSE	FALSE	!(dateOfDeath == presidencyEndDate) | (abs(assassinationAttempt - 1) < 1e-08)
Begin.date	47	45	0	2	FALSE	FALSE	difftime(presidencyEndDate, as.Date("176-08-04")) > 0

errors(confront_messy)

## $Positive.first.name
## [1] "non-numeric argument to binary operator"

bpd_clean <- readRDS("bigPresidentData_cleaned.rds")
confront_clean <- confront(bpd_clean, validator1)
summary(confront_clean) %>% kable

name	items	passes	fails	nNA	error	warning	expression
Adult.president	45	45	0	0	FALSE	FALSE	(ageAtInauguration - 18) >= -1e-08
Alive.at.inauguration	45	39	0	6	FALSE	FALSE	floor((dateOfDeath - birthday)/365.25) >= ageAtInauguration
Positive.first.name	0	0	0	0	TRUE	FALSE	firstName * 2 > firstName
Death.by.assassination	45	36	3	6	FALSE	FALSE	!(dateOfDeath == presidencyEndDate) | (abs(assassinationAttempt - 1) < 1e-08)
Begin.date	45	44	0	1	FALSE	FALSE	difftime(presidencyEndDate, as.Date("176-08-04")) > 0

orderVal <- validator(rank(presidencyBeginDate) == rank(orderOfPresidency))

orderCon_messy <- confront(
  na.omit(bigPresidentData[,.(presidencyBeginDate, orderOfPresidency)])
  , orderVal
)
summary(orderCon_messy) %>% kable

name	items	passes	fails	nNA	error	warning	expression
V1	46	44	2	0	FALSE	FALSE	rank(presidencyBeginDate) == rank(orderOfPresidency)

orderCon_clean <- confront(
  na.omit(bpd_clean[,.(presidencyBeginDate, orderOfPresidency)])
  , orderVal
)
summary(orderCon_clean) %>% kable

name	items	passes	fails	nNA	error	warning	expression
V1	45	43	2	0	FALSE	FALSE	rank(presidencyBeginDate) == rank(orderOfPresidency)

Overview of confrontation results:

summary(confront_messy)

##                     name items passes fails nNA error warning                                                                    expression
## 1        Adult.president    47     47     0   0 FALSE   FALSE                                                       ageAtInauguration >= 18
## 2  Alive.at.inauguration    47     40     1   6 FALSE   FALSE                   floor((dateOfDeath - birthday)/365.25) >= ageAtInauguration
## 3    Positive.first.name     0      0     0   0  TRUE   FALSE                                                     firstName * 2 > firstName
## 4 Death.by.assassination    47     38     3   6 FALSE   FALSE !(dateOfDeath == presidencyEndDate) | (abs(assassinationAttempt - 1) < 1e-08)
## 5             Begin.date    47     45     0   2 FALSE   FALSE                         difftime(presidencyEndDate, as.Date("176-08-04")) > 0

Compute the proportion of passes, fails and NAs:

aggregate(confront_messy)

##                        npass nfail nNA  rel.pass   rel.fail     rel.NA
## Adult.president           47     0   0 1.0000000 0.00000000 0.00000000
## Alive.at.inauguration     40     1   6 0.8510638 0.02127660 0.12765957
## Death.by.assassination    38     3   6 0.8085106 0.06382979 0.12765957
## Begin.date                45     0   2 0.9574468 0.00000000 0.04255319

Sort results by problem prevalence:

sort(confront_messy)

##                        npass nfail nNA  rel.pass   rel.fail     rel.NA
## Death.by.assassination    38     3   6 0.8085106 0.06382979 0.12765957
## Alive.at.inauguration     40     1   6 0.8510638 0.02127660 0.12765957
## Begin.date                45     0   2 0.9574468 0.00000000 0.04255319
## Adult.president           47     0   0 1.0000000 0.00000000 0.00000000

For each observation and each check: TRUE/FALSE/NA

values(confront_messy)

##       Adult.president Alive.at.inauguration Death.by.assassination Begin.date
##  [1,]            TRUE                  TRUE                   TRUE       TRUE
##  [2,]            TRUE                  TRUE                   TRUE       TRUE
##  [3,]            TRUE                  TRUE                   TRUE       TRUE
##  [4,]            TRUE                  TRUE                   TRUE       TRUE
##  [5,]            TRUE                  TRUE                   TRUE       TRUE
##  [6,]            TRUE                    NA                     NA       TRUE
##  [7,]            TRUE                  TRUE                   TRUE       TRUE
##  [8,]            TRUE                  TRUE                   TRUE       TRUE
##  [9,]            TRUE                  TRUE                   TRUE       TRUE
## [10,]            TRUE                  TRUE                   TRUE       TRUE
## [11,]            TRUE                  TRUE                   TRUE       TRUE
## [12,]            TRUE                  TRUE                   TRUE       TRUE
## [13,]            TRUE                  TRUE                   TRUE       TRUE
## [14,]            TRUE                  TRUE                   TRUE       TRUE
## [15,]            TRUE                  TRUE                   TRUE       TRUE
## [16,]            TRUE                  TRUE                   TRUE       TRUE
## [17,]            TRUE                    NA                     NA       TRUE
## [18,]            TRUE                    NA                     NA       TRUE
## [19,]            TRUE                  TRUE                  FALSE       TRUE
## [20,]            TRUE                  TRUE                   TRUE       TRUE
## [21,]            TRUE                  TRUE                  FALSE       TRUE
## [22,]            TRUE                  TRUE                   TRUE       TRUE
## [23,]            TRUE                  TRUE                   TRUE       TRUE
## [24,]            TRUE                  TRUE                   TRUE       TRUE
## [25,]            TRUE                  TRUE                   TRUE       TRUE
## [26,]            TRUE                  TRUE                   TRUE       TRUE
## [27,]            TRUE                  TRUE                   TRUE       TRUE
## [28,]            TRUE                  TRUE                   TRUE       TRUE
## [29,]            TRUE                  TRUE                   TRUE       TRUE
## [30,]            TRUE                    NA                     NA         NA
## [31,]            TRUE                 FALSE                   TRUE         NA
## [32,]            TRUE                  TRUE                   TRUE       TRUE
## [33,]            TRUE                  TRUE                   TRUE       TRUE
## [34,]            TRUE                  TRUE                   TRUE       TRUE
## [35,]            TRUE                  TRUE                   TRUE       TRUE
## [36,]            TRUE                  TRUE                   TRUE       TRUE
## [37,]            TRUE                  TRUE                   TRUE       TRUE
## [38,]            TRUE                  TRUE                  FALSE       TRUE
## [39,]            TRUE                  TRUE                   TRUE       TRUE
## [40,]            TRUE                  TRUE                   TRUE       TRUE
## [41,]            TRUE                  TRUE                   TRUE       TRUE
## [42,]            TRUE                    NA                     NA       TRUE
## [43,]            TRUE                  TRUE                   TRUE       TRUE
## [44,]            TRUE                  TRUE                   TRUE       TRUE
## [45,]            TRUE                  TRUE                   TRUE       TRUE
## [46,]            TRUE                  TRUE                   TRUE       TRUE
## [47,]            TRUE                    NA                     NA       TRUE
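
The logical matrix returned by `values()` can be used to pull out the records that fail a given rule. The sketch below uses a made-up three-row data frame and a hypothetical rule named `adult`. It assumes, as in the output above, that `values()` returns a matrix with one column per rule.

```r
library(validate)

# Hypothetical data: row 2 breaks the rule, row 3 cannot be evaluated (NA).
toy <- data.frame(age = c(45, 12, NA))
cf  <- confront(toy, validator(adult = age >= 18))

vals <- values(cf)                          # logical matrix, one column per rule
failing <- unname(which(!vals[, "adult"]))  # which() drops NA, so only definite failures remain
toy[failing, , drop = FALSE]
```

Rows where the rule evaluates to `NA` are deliberately left out here. They can be listed separately with `which(is.na(vals[, "adult"]))`.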

Visual overview of check results:

barplot(confront_messy)

What errors were there:

errors(confront_messy)

## $Positive.first.name
## [1] "non-numeric argument to binary operator"

What warnings were there:

warnings(confront_messy)

## named list()

Make Codebook

Pros and Cons

Resources

Ekstrøm, C.T., Petersen, A.H. (2018, February). Cleaning Up the Data Cleaning Process. Short course at the Conference on Statistical Practice, Portland, OR.

Ekstrøm, C.T., Petersen, A.H. (2018, February). Cleaning Up the Data Cleaning Process. Short course at the Conference on Statistical Practice, Portland, OR [Course slides].
