What is the dataMaid package?
Short answer:
- A data cleaning assistant that is able to provide a document to be read and evaluated by a human.
- A tool to aid in logic/error checks both column and row-wise
Long answer: read the documentation
Goals
- Data screening/cleaning with the dataMaid package
- Extend data cleaning checks with dataMaid
- Use validate to inspect row-wise errors
Examples of errors in data and data cleaning checks
- Incorrect class
- Duplicates
- Capitalization consistency (new York vs New York)
- Unlikely value (BMI = 0.1, age = 201)
- White spaces
- Unrecognized missingness indicators
- Amount of missingness
- Unique observations / categories with low count
- Inaccurate data (death date before birth date)
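Several of the problems listed above can be spotted by hand in base R before reaching for a package. The following is a minimal sketch on a made-up data frame (the column names and values are hypothetical), not part of dataMaid itself:

```r
# Toy data containing several of the problems listed above (all values are made up).
df <- data.frame(
  city  = c("New York", "new York", " Boston", "Boston", "Boston"),
  age   = c(34, 201, 45, 45, 45),
  birth = as.Date(c("1980-01-01", "1990-05-05", "1975-03-03",
                    "1975-03-03", "1975-03-03")),
  death = as.Date(c(NA, "1985-01-01", NA, NA, NA)),
  stringsAsFactors = FALSE
)

# Duplicates: rows that are identical to an earlier row.
dup_rows <- which(duplicated(df))

# White space: values that change when leading/trailing blanks are trimmed.
ws_rows <- which(df$city != trimws(df$city))

# Capitalization consistency: spellings that differ only by case.
key <- tolower(trimws(df$city))
case_issue <- tapply(df$city, key, function(x) length(unique(trimws(x))) > 1)

# Unlikely values: the cutoff of 120 years is an arbitrary choice for this sketch.
unlikely_rows <- which(df$age > 120)

# Inaccurate data: death date before birth date (which() drops NA comparisons).
date_rows <- which(df$death < df$birth)
```

Each object holds the row numbers (or, for `case_issue`, the lower-cased values) that deserve a closer look.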
Workflow with dataMaid
Summarize: What information does the variable have?
Visualize: What does the distribution of the variable look like?
Check: What potential problems are there with the variable?
Fix: Fix problems.
Validate: Did we actually fix the problem? Are there any row-wise errors?
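As a rough illustration of this loop on a single variable, the sketch below runs the summarize, check, fix, and validate steps on a hypothetical toy vector. It assumes dataMaid is installed; `trimws()` is just one possible fix.

```r
library(dataMaid)

# Hypothetical toy vector; the third value has a leading blank.
lastName <- c("Ford", "Adams", " Truman", "Hoover")

# Summarize: variable type, missingness, unique values, mode.
summarize(lastName)

# Visualize (draws a plot, so it is left commented out here):
# visualize(lastName, vnam = "lastName")

# Check: look for prefixed or suffixed white space.
before <- identifyWhitespace(lastName)

# Fix: trim the white space.
lastName_fixed <- trimws(lastName)

# Validate: rerun the same check on the fixed vector.
after <- identifyWhitespace(lastName_fixed)

before$problem
after$problem
```

Rerunning the same check after the fix is the simplest form of validation; row-wise logic checks need the validate package.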
Summarize/Visualize
makeDataReport {dataMaid} | R Documentation |
Produce a data report
Description
Make a data overview report that summarizes the contents of a dataset and flags potential problems. The potential problems are identified by running a set of class-specific validation checks, so that different checks are performed on different variable types. The checking steps can be customized according to user input and/or data type of the inputted variable. The checks are saved to an R markdown file which can be rendered into an easy-to-read data report in pdf, html or word formats. This report also includes summaries and visualizations of each variable in the dataset.
Usage
makeDataReport(data, output = NULL, render = TRUE, useVar = NULL, ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE, labelled_as = c("factor"), mode = c("summarize", "visualize", "check"), smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"), file = NULL, replace = FALSE, vol = "", standAlone = TRUE, twoCol = TRUE, quiet = TRUE, openResult = TRUE, summaries = setSummaries(), visuals = setVisuals(), checks = setChecks(), listChecks = TRUE, maxProbVals = 10, maxDecimals = 2, addSummaryTable = TRUE, codebook = FALSE, reportTitle = NULL, treatXasY = NULL, ...)
Arguments
data
|
The dataset to be checked. This dataset should be of class |
output
|
Output format. Options are |
render
|
Should the output file be rendered (defaults to |
useVar
|
Variables to describe in the report. If |
ordering
|
Choose the ordering of the variables in the variable presentation. The options are “asIs” (ordering as in the dataset) and “alphabetical” (alphabetical order). |
onlyProblematic
|
A logical. If |
labelled_as
|
A string explaining the way to handle labelled vectors. Currently |
mode
|
Vector of tasks to perform among the three categories “summarize”, “visualize” and “check”. The default, |
smartNum
|
If |
preChecks
|
Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks find problems, the variable will not be summarized nor visualized nor checked. |
file
|
The filename of the outputted rmarkdown (.Rmd) file. If set to |
replace
|
If |
vol
|
Extra text string or numeric that is appended on the end of the output file name(s). For example, if the dataset is called “myData”, no file argument is supplied and |
standAlone
|
A logical. If |
twoCol
|
A logical. Should the results from the summarize and visualize steps be presented in two columns? Defaults to |
quiet
|
A logical. If |
openResult
|
A logical. If |
summaries
|
A list of summaries to use on each supported variable type. We recommend using |
visuals
|
A list of visual functions to use on each supported variable type. We recommend using |
checks
|
A list of checks to use on each supported variable type. We recommend using |
listChecks
|
A logical. Controls whether what checks that were used for each possible variable type are summarized in the output. Defaults to |
maxProbVals
|
A positive integer or |
maxDecimals
|
A positive integer or |
addSummaryTable
|
A logical. If |
codebook
|
A logical. Defaults to |
reportTitle
|
A text string. If supplied, this will be the printed title of the report. If left unspecified, the title with the name of the supplied dataset. |
treatXasY
|
A list that indicates how non-standard variable classes should be treated. This parameter allows you to include variables that are not of class |
…
|
Other arguments that are passed on to the precheck, checking, summary and visualization functions. |
Details
For each variable, a set of pre-check functions (controlled by the preChecks argument) is first run, and then a battery of functions is applied depending on the variable class. For each variable type, the summarize/visualize/check functions are applied and the results are written to an R markdown file.
Value
The function does not return anything. Its side effect (the production of a data report) is the reason for running the function.
Examples
data(testData)
data(toyData)
check(toyData)

## Not run:
DF <- data.frame(x = 1:15)
makeDataReport(DF)
## End(Not run)

## Not run:
data(testData)
makeDataReport(testData)
## End(Not run)

# Overwrite any existing files generated by makeDataReport
## Not run:
makeDataReport(testData, replace=TRUE)
## End(Not run)

# Change output format to Word/docx:
## Not run:
makeDataReport(testData, replace=TRUE, output = "word")
## End(Not run)

# Only include problematic variables in the output document
## Not run:
makeDataReport(testData, replace=TRUE, onlyProblematic=TRUE)
## End(Not run)

# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
## Not run:
wheresWally <- function(v, ...) {
  res <- grepl("wally", v, ignore.case=TRUE)
  problem <- any(res)
  message <- "Wally was found in these data"
  checkResult(list(problem = problem, message = message, problemValues = v[res]))
}
wheresWally <- checkFunction(wheresWally,
  description = "Search for the string 'wally' ignoring case",
  classes = c("character")
)
# Add the newly defined function to the list of checks used for characters.
makeDataReport(testData,
  checks = setChecks(character = defaultCharacterChecks(with = "wheresWally")),
  replace=TRUE)
## End(Not run)

# Handle non-supported variable classes using treatXasY: treat raw as character and
# treat complex as numeric. We also add a list variable, but as lists are not
# handled through treatXasY, this variable will be caught in the preChecks and skipped:
## Not run:
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
makeDataReport(toyData, replace = TRUE,
  treatXasY = list(raw = "character", complex = "numeric"))
## End(Not run)
require(dataMaid)
require(data.table) ## provides as.data.table(), used below
head(bigPresidentData)
## lastName firstName orderOfPresidency birthday dateOfDeath stateOfBirth party presidencyBeginDate presidencyEndDate assassinationAttempt
## 38 Ford Gerald 38 1913-07-14 2006-12-26 Nebraska Republican 1974-08-09 1977-01-20 1
## 2 Adams John 2 1735-10-30 1826-07-04 Massachusetts Federalist 1797-03-04 1801-03-04 0
## 31 Hoover Herbert 31 1874-08-10 1964-10-20 Iowa Republican 1929-03-04 1933-03-04 0
## 1 Washington George 1 1732-02-22 1799-12-14 Virginia Independent 1789-04-30 1797-03-04 0
## 13 Fillmore Millard 13 1800-01-07 1874-03-08 New York Whig 1850-07-09 1853-03-04 0
## 42 Clinton William 42 1946-08-19 <NA> Arkansas Democratic 1993-01-20 2001-01-20 0
## sex ethnicity presidencyYears ageAtInauguration favoriteNumber
## 38 Male Caucasian 2 61 2+0i
## 2 Male Caucasian 3 61 4+0i
## 31 Male Caucasian 4 54 5+0i
## 1 Male Caucasian 7 57 3+0i
## 13 Male Caucasian 2 50 7+0i
## 42 Male Caucasian 8 46 7+0i
bigPresidentData <- as.data.table(bigPresidentData)
makeDataReport(bigPresidentData)
Default report:
- Identifies miscoded missing values
- Identifies prefixed and suffixed whitespace
- Identifies levels with < 6 observations
- Identifies case issues
- Identifies misclassified numeric or integer variables
- Identifies outliers
- Provides a summary table of variable class, number of unique observations, missingness and any problems
- “Column-wise” checks
Using dataMaid interactively
Check
allCheckFunctions()
name | description | classes |
---|---|---|
identifyCaseIssues | Identify case issues | character, factor |
identifyLoners | Identify levels with < 6 obs. | character, factor |
identifyMissing | Identify miscoded missing values | character, Date, factor, integer, labelled, logical, numeric |
identifyNums | Identify misclassified numeric or integer variables | character, factor, labelled |
identifyOutliers | Identify outliers | Date, integer, numeric |
identifyOutliersTBStyle | Identify outliers (Turkish Boxplot style) | Date, integer, numeric |
identifyWhitespace | Identify prefixed and suffixed whitespace | character, factor, labelled |
isCPR | Identify Danish CPR numbers | character, Date, factor, integer, labelled, logical, numeric |
isEmpty | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isKey | Check if the variable is a key | character, Date, factor, integer, labelled, logical, numeric |
isSingular | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isSupported | Check if the variable class is supported by dataMaid. | character, Date, factor, integer, labelled, logical, numeric |
## syntax
## numeric class
check(bigPresidentData$presidencyYears
, numericChecks = c("identifyMissing","identifyOutliers"))
## $identifyMissing
## The following suspected missing value codes enter as regular values: Inf.
## $identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.
check(bigPresidentData$presidencyYears
, checks = setChecks(numeric = c("identifyMissing")))
## $identifyMissing
## The following suspected missing value codes enter as regular values: Inf.
check(bigPresidentData$presidencyYears
, checks = setChecks())
## $identifyMissing
## The following suspected missing value codes enter as regular values: Inf.
## $identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.
## factor class
check(bigPresidentData$ethnicity
, factorChecks = c("identifyCaseIssues","identifyLoners"))
## $identifyCaseIssues
## No problems found.
## $identifyLoners
## Note that the following levels have at most five observations: African American.
check(bigPresidentData$ethnicity
, checks = setChecks(factor = c("identifyLoners")))
## $identifyLoners
## Note that the following levels have at most five observations: African American.
check(bigPresidentData$ethnicity
, checks = setChecks())
## $identifyMissing
## No problems found.
## $identifyWhitespace
## No problems found.
## $identifyLoners
## Note that the following levels have at most five observations: African American.
## $identifyCaseIssues
## No problems found.
## $identifyNums
## No problems found.
## both
check(bigPresidentData[,.(ethnicity, presidencyYears)]
, checks = setChecks(numeric = "identifyOutliers"
, factor = "identifyLoners"))
## $ethnicity
## $ethnicity$identifyLoners
## Note that the following levels have at most five observations: African American.
##
## $presidencyYears
## $presidencyYears$identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.
identifyWhitespace(bigPresidentData$firstName)
## No problems found.
identifyWhitespace(bigPresidentData$lastName)
## The following values appear with prefixed or suffixed white space: Truman.
check(bigPresidentData[,.(firstName, lastName)]
, checks = setChecks(character = "identifyWhitespace"))
## $firstName
## $firstName$identifyWhitespace
## No problems found.
##
## $lastName
## $lastName$identifyWhitespace
## The following values appear with prefixed or suffixed white space: Truman.
allVisualFunctions()
name | description | classes |
---|---|---|
basicVisual | Histograms and barplots using graphics | character, Date, factor, integer, labelled, logical, numeric |
standardVisual | Histograms and barplots using ggplot2 | character, Date, factor, integer, labelled, logical, numeric |
visualize(bigPresidentData$ageAtInauguration, vnam = "Age at inauguration")
basicVisual(bigPresidentData$party, vnam = "Party")
standardVisual(bigPresidentData$party, vnam = "Party") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
allSummaryFunctions()
name | description | classes |
---|---|---|
centralValue | Compute median for numeric variables, mode for categorical variables | character, Date, factor, integer, labelled, logical, numeric |
countMissing | Compute proportion of missing observations | character, Date, factor, integer, labelled, logical, numeric |
minMax | Find minimum and maximum values | integer, numeric, Date |
quartiles | Compute 1st and 3rd quartiles | Date, integer, numeric |
uniqueValues | Count number of unique values | character, Date, factor, integer, labelled, logical, numeric |
variableType | Data class of variable | character, Date, factor, integer, labelled, logical, numeric |
summarize(bigPresidentData$lastName)
## $variableType
## Variable type: character
## $countMissing
## Number of missing obs.: 0 (0 %)
## $uniqueValues
## Number of unique values: 40
## $centralValue
## Mode: "Adams"
# summarize(bigPresidentData)
lapply(bigPresidentData, variableType) %>% head
## $lastName
## Variable type: character
## $firstName
## Variable type: character
## $orderOfPresidency
## Variable type: factor
## $birthday
## Variable type: Date
## $dateOfDeath
## Variable type: Date
## $stateOfBirth
## Variable type: character
Extending dataMaid
Customize your own checking, visual, or summary functions!
Templates:
mySummaryFunction <- function(v, ...){
val <- [ result of whatever summary you want to do ]
res <- [ properly escaped version of val ]
summaryResult(list( feature = "[Feature name]"
, result = res
, value = val
))
}
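To make the summary template concrete, here is a minimal sketch of a filled-in version. The function name `countZeros` and its feature label are chosen for illustration only; the sketch assumes dataMaid is installed and uses `summaryResult()` and `summaryFunction()` as in the template.

```r
library(dataMaid)

# Illustrative summary function: count how many entries equal zero.
countZeros <- function(v, ...) {
  val <- sum(v == 0, na.rm = TRUE)   # the raw summary value
  res <- val                         # a plain number needs no escaping
  summaryResult(list(feature = "No. zeros",
                     result  = res,
                     value   = val))
}

# Register it as a summary function for numeric-type variables.
countZeros <- summaryFunction(countZeros,
                              description = "Count number of zeros",
                              classes = c("integer", "numeric"))

# The wrapped function can still be called directly on a vector.
zeros <- countZeros(c(0, 3, 0, NA))
zeros$value
```

Here `value` carries the raw number for further computation, while `result` is what gets printed in a report.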
isSSN <- function(v, nMax = NULL, ...){
  out <- list(problem = FALSE
            , message = ""
            , problemValues = NULL)
  ## inherits() also works when v has more than one class
  if (inherits(v, c("character","factor","labelled"))){
    ## SSN pattern: 3 digits, 2 digits, 4 digits
    if(any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
      out$problem <- TRUE
      out$message <- "Warning: may contain SSNs"
      out$problemValues <- "Will not show"
    }
  }
  out
}
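The check template above returns a plain list. As a hedged sketch of how such a check might be finished for use with dataMaid, the version below returns a `checkResult` and is registered with `checkFunction()`. The name `flagSSN` is hypothetical, and the pattern assumes SSNs written as 3-2-4 digit groups.

```r
library(dataMaid)

# Hypothetical check: flag values that look like US Social Security numbers.
flagSSN <- function(v, nMax = 10, ...) {
  # grepl() returns FALSE for NA, so missing values are never flagged.
  hit <- grepl("\\d{3}-\\d{2}-\\d{4}", as.character(v))
  problem <- any(hit)
  checkResult(list(
    problem = problem,
    message = if (problem) "Warning: may contain SSNs" else "",
    # Deliberately do not echo the sensitive values back.
    problemValues = if (problem) "Will not show" else NULL
  ))
}

# Register the function so it can be named in setChecks().
flagSSN <- checkFunction(flagSSN,
                         description = "Flag values that look like SSNs",
                         classes = c("character", "factor"))

# Direct calls on made-up vectors:
with_ssn    <- flagSSN(c("123-45-6789", "n/a"))
without_ssn <- flagSSN(c("Ford", "Adams"))
with_ssn$problem
without_ssn$problem
```

Once registered, the name can be passed to `check()` through `setChecks(character = "flagSSN")`, in the same way as the built-in checks.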
Examples:
Basic example:
refCat <- function(v, ...) {
out <- list(factor = FALSE
, reference = ""
, problemValues = NULL)
if(class(v) %in% c("factor")) {
out$factor <- TRUE
out$reference <- levels(v)[1]
out$problemValues <- "Not applicable"
}
out
}
refCat <- summaryFunction(refCat
, description = "Identifies reference level"
, classes = c("factor"))
check(bigPresidentData$sex, factorChecks = "refCat")
## $refCat
## $refCat$factor
## [1] TRUE
##
## $refCat$reference
## [1] "Male"
##
## $refCat$problemValues
## [1] "Not applicable"
More advanced example:
identifyNonStartCase <- function(v, nMax = 10, ...){
v <- unique(na.omit(as.character(v))) ## coerce factors to character, omit NA values and keep only unique values
vSplit <- strsplit(v, split = " ") ## split around blank spaces
vSplitAllLower <- lapply(vSplit, tolower) ## make all lowercase
helper <- function(x){ ## helper function to make first letter capital
capFirstLetters <- toupper(substring(x, 1, 1))
x <- paste(capFirstLetters, substring(x, 2), sep = "")
x
}
vSplitStartCase <- lapply(vSplitAllLower, helper) ## start case version of each word, built from the lowercase words
vStartCase <- sapply(vSplitStartCase, function(x) paste(x, collapse = " "))
## find where v and vStartCase differ
problemPlaces <- v != vStartCase
if(any(problemPlaces)){
problemValues <- v[problemPlaces]
} else {
problemValues <- NULL
}
problem <- any(problemPlaces)
problemStatus <- list(problem = problem
, problemValues = problemValues)
problemMessage <- "The following variables were not in start case:"
outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
checkResult(list(problem = problem
, message = outMessage
, problemValues = problemValues))
}
identifyNonStartCase <- checkFunction(identifyNonStartCase
, description = "Identifies entries that are not written in start case"
, classes = c("character", "factor"))
check(bigPresidentData$stateOfBirth, checks = setChecks(character = "identifyNonStartCase"))
## $identifyNonStartCase
## The following variables were not in start case: New york.
allCheckFunctions()
name | description | classes |
---|---|---|
identifyNonStartCase | Identifies entries that are not written in start case | character, factor |
identifyCaseIssues | Identify case issues | character, factor |
identifyLoners | Identify levels with < 6 obs. | character, factor |
identifyMissing | Identify miscoded missing values | character, Date, factor, integer, labelled, logical, numeric |
identifyNums | Identify misclassified numeric or integer variables | character, factor, labelled |
identifyOutliers | Identify outliers | Date, integer, numeric |
identifyOutliersTBStyle | Identify outliers (Turkish Boxplot style) | Date, integer, numeric |
identifyWhitespace | Identify prefixed and suffixed whitespace | character, factor, labelled |
isCPR | Identify Danish CPR numbers | character, Date, factor, integer, labelled, logical, numeric |
isEmpty | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isKey | Check if the variable is a key | character, Date, factor, integer, labelled, logical, numeric |
isSingular | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isSupported | Check if the variable class is supported by dataMaid. | character, Date, factor, integer, labelled, logical, numeric |
Customizing document
Validate
require(validate)
The validate package mostly serves for logic checks (e.g., the death date is later than the birth date) and other row-wise checks.
First, create a validator object:
validator1 <- validator(
ageAtDeath := floor((dateOfDeath - birthday)/365.25)
, `Adult president` = ageAtInauguration >= 18
, `Alive at inauguration` = ageAtDeath >= ageAtInauguration
, `Positive first name` = firstName*2 > firstName
, `Death by assassination` =
if (dateOfDeath == presidencyEndDate)
assassinationAttempt == 1
, `Begin date` = difftime(presidencyEndDate, as.Date("176-08-04")) > 0
)
confront_messy <- confront(bigPresidentData, validator1)
summary(confront_messy) %>% kable
name | items | passes | fails | nNA | error | warning | expression |
---|---|---|---|---|---|---|---|
Adult.president | 47 | 47 | 0 | 0 | FALSE | FALSE | ageAtInauguration >= 18 |
Alive.at.inauguration | 47 | 40 | 1 | 6 | FALSE | FALSE | floor((dateOfDeath - birthday)/365.25) >= ageAtInauguration |
Positive.first.name | 0 | 0 | 0 | 0 | TRUE | FALSE | firstName * 2 > firstName |
Death.by.assassination | 47 | 38 | 3 | 6 | FALSE | FALSE | !(dateOfDeath == presidencyEndDate) | (abs(assassinationAttempt - 1) < 1e-08) |
Begin.date | 47 | 45 | 0 | 2 | FALSE | FALSE | difftime(presidencyEndDate, as.Date("176-08-04")) > 0 |
errors(confront_messy)
## $Positive.first.name
## [1] "non-numeric argument to binary operator"
bpd_clean <- readRDS("bigPresidentData_cleaned.rds")
confront_clean <- confront(bpd_clean, validator1)
summary(confront_clean) %>% kable
name | items | passes | fails | nNA | error | warning | expression |
---|---|---|---|---|---|---|---|
Adult.president | 45 | 45 | 0 | 0 | FALSE | FALSE | (ageAtInauguration - 18) >= -1e-08 |
Alive.at.inauguration | 45 | 39 | 0 | 6 | FALSE | FALSE | floor((dateOfDeath - birthday)/365.25) >= ageAtInauguration |
Positive.first.name | 0 | 0 | 0 | 0 | TRUE | FALSE | firstName * 2 > firstName |
Death.by.assassination | 45 | 36 | 3 | 6 | FALSE | FALSE | !(dateOfDeath == presidencyEndDate) | (abs(assassinationAttempt - 1) < 1e-08) |
Begin.date | 45 | 44 | 0 | 1 | FALSE | FALSE | difftime(presidencyEndDate, as.Date("176-08-04")) > 0 |
orderVal <- validator(rank(presidencyBeginDate) == rank(orderOfPresidency))
orderCon_messy <- confront(
na.omit(bigPresidentData[,.(presidencyBeginDate, orderOfPresidency)])
, orderVal
)
summary(orderCon_messy) %>% kable
name | items | passes | fails | nNA | error | warning | expression |
---|---|---|---|---|---|---|---|
V1 | 46 | 44 | 2 | 0 | FALSE | FALSE | rank(presidencyBeginDate) == rank(orderOfPresidency) |
orderCon_clean <- confront(
na.omit(bpd_clean[,.(presidencyBeginDate, orderOfPresidency)])
, orderVal
)
summary(orderCon_clean) %>% kable
name | items | passes | fails | nNA | error | warning | expression |
---|---|---|---|---|---|---|---|
V1 | 45 | 43 | 2 | 0 | FALSE | FALSE | rank(presidencyBeginDate) == rank(orderOfPresidency) |
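For readers who want to try the validator/confront pattern without the presidents data, here is a small self-contained sketch. The data frame `people` and the rule name are made up for illustration; the sketch assumes the validate package is installed and uses its `violating()` helper to pull out the offending rows.

```r
library(validate)

# Hypothetical toy records; person "B" has a death date before the birth date.
people <- data.frame(
  id    = c("A", "B", "C"),
  birth = as.Date(c("1900-01-01", "1950-06-01", "1980-03-15")),
  death = as.Date(c("1970-01-01", "1940-06-01", NA)),
  stringsAsFactors = FALSE
)

# One named row-wise rule: death must not precede birth.
rules <- validator(
  death_after_birth = death >= birth
)

# Confront the data with the rules and tabulate the outcome.
cf <- confront(people, rules)
res <- summary(cf)
res[, c("name", "items", "passes", "fails", "nNA")]

# Rows that fail at least one rule (rows evaluating to NA are not included by default).
bad <- violating(people, cf)
bad
```

A missing death date yields NA rather than a failure, which is why `nNA` is reported separately from `fails`.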
Overview of confrontation results:
summary(confront_messy)
## name items passes fails nNA error warning expression
## 1 Adult.president 47 47 0 0 FALSE FALSE ageAtInauguration >= 18
## 2 Alive.at.inauguration 47 40 1 6 FALSE FALSE floor((dateOfDeath - birthday)/365.25) >= ageAtInauguration
## 3 Positive.first.name 0 0 0 0 TRUE FALSE firstName * 2 > firstName
## 4 Death.by.assassination 47 38 3 6 FALSE FALSE !(dateOfDeath == presidencyEndDate) | (abs(assassinationAttempt - 1) < 1e-08)
## 5 Begin.date 47 45 0 2 FALSE FALSE difftime(presidencyEndDate, as.Date("176-08-04")) > 0
Compute percentage pass/fail/NA:
aggregate(confront_messy)
## npass nfail nNA rel.pass rel.fail rel.NA
## Adult.president 47 0 0 1.0000000 0.00000000 0.00000000
## Alive.at.inauguration 40 1 6 0.8510638 0.02127660 0.12765957
## Death.by.assassination 38 3 6 0.8085106 0.06382979 0.12765957
## Begin.date 45 0 2 0.9574468 0.00000000 0.04255319
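The relative columns are simply the counts divided by the number of items checked. As a quick base R check of that arithmetic, using the counts printed above (typed in by hand for this sketch):

```r
# Counts copied from the aggregate() output above.
counts <- data.frame(
  npass = c(47, 40, 38, 45),
  nfail = c(0, 1, 3, 0),
  nNA   = c(0, 6, 6, 2),
  row.names = c("Adult.president", "Alive.at.inauguration",
                "Death.by.assassination", "Begin.date")
)

items <- 47  # number of rows confronted with each rule

# Passes, fails and NAs should account for every item.
stopifnot(all(rowSums(counts) == items))

# Relative frequencies: each count divided by the number of items.
rel <- counts / items
names(rel) <- c("rel.pass", "rel.fail", "rel.NA")
round(rel, 7)
```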
Sort results by problem prevalence:
sort(confront_messy)
## npass nfail nNA rel.pass rel.fail rel.NA
## Death.by.assassination 38 3 6 0.8085106 0.06382979 0.12765957
## Alive.at.inauguration 40 1 6 0.8510638 0.02127660 0.12765957
## Begin.date 45 0 2 0.9574468 0.00000000 0.04255319
## Adult.president 47 0 0 1.0000000 0.00000000 0.00000000
For each observation and each check: TRUE/FALSE/NA
values(confront_messy)
## Adult.president Alive.at.inauguration Death.by.assassination Begin.date
## [1,] TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE
## [6,] TRUE NA NA TRUE
## [7,] TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE
## [9,] TRUE TRUE TRUE TRUE
## [10,] TRUE TRUE TRUE TRUE
## [11,] TRUE TRUE TRUE TRUE
## [12,] TRUE TRUE TRUE TRUE
## [13,] TRUE TRUE TRUE TRUE
## [14,] TRUE TRUE TRUE TRUE
## [15,] TRUE TRUE TRUE TRUE
## [16,] TRUE TRUE TRUE TRUE
## [17,] TRUE NA NA TRUE
## [18,] TRUE NA NA TRUE
## [19,] TRUE TRUE FALSE TRUE
## [20,] TRUE TRUE TRUE TRUE
## [21,] TRUE TRUE FALSE TRUE
## [22,] TRUE TRUE TRUE TRUE
## [23,] TRUE TRUE TRUE TRUE
## [24,] TRUE TRUE TRUE TRUE
## [25,] TRUE TRUE TRUE TRUE
## [26,] TRUE TRUE TRUE TRUE
## [27,] TRUE TRUE TRUE TRUE
## [28,] TRUE TRUE TRUE TRUE
## [29,] TRUE TRUE TRUE TRUE
## [30,] TRUE NA NA NA
## [31,] TRUE FALSE TRUE NA
## [32,] TRUE TRUE TRUE TRUE
## [33,] TRUE TRUE TRUE TRUE
## [34,] TRUE TRUE TRUE TRUE
## [35,] TRUE TRUE TRUE TRUE
## [36,] TRUE TRUE TRUE TRUE
## [37,] TRUE TRUE TRUE TRUE
## [38,] TRUE TRUE FALSE TRUE
## [39,] TRUE TRUE TRUE TRUE
## [40,] TRUE TRUE TRUE TRUE
## [41,] TRUE TRUE TRUE TRUE
## [42,] TRUE NA NA TRUE
## [43,] TRUE TRUE TRUE TRUE
## [44,] TRUE TRUE TRUE TRUE
## [45,] TRUE TRUE TRUE TRUE
## [46,] TRUE TRUE TRUE TRUE
## [47,] TRUE NA NA TRUE
Visual overview of check results:
barplot(confront_messy)
What errors were there:
errors(confront_messy)
## $Positive.first.name
## [1] "non-numeric argument to binary operator"
What warnings were there:
warnings(confront_messy)
## named list()
Pros and Cons
Pros
Provides a solution and workflow for data cleaning
Fairly fast, even with millions of observations
Can help prevent coding errors (rms really has trouble with categories that have few observations; these can be found before hitting errors)
- Helps with collaborations (especially consultations!)
- Variables in a dataset can usually only be understood in the proper context of their origin
- Requires a collaborative effort between an expert in the field and a statistician
Readable to both parties
Easy for statistician to document what was or was not done
Cons
Reports are hard to parse through when the dataset has many variables
Checks are tied to variable class; constraints that do not depend on class cannot be specified
Resources
Ekstrøm, C.T., Petersen, A.H. (2018, February). Cleaning Up the Data Cleaning Process. Short course at the Conference on Statistical Practice, Portland, OR [Course slides].