The Statistical Computing Series

The Statistical Computing Series is a monthly event for learning various aspects of modern statistical computing from practitioners in the Department of Biostatistics. We focus on topics related to the R language, Python, and related tools, but we include the broadest possible range of content related to effective statistical computation. The format varies, depending on the speaker and the topic, from lectures to demonstrations to hands-on workshops.

If you have a particular topic you would like to see covered, please send a request.

There have been several requests for coverage of various topics. Here is a short list, if you are interested in contributing but are seeking inspiration:

  • writing R functions with formula arguments
  • writing R functions with methods
  • using makefiles
  • other graphics packages (base graphics)
  • lme4/nlme
  • reshape (package not function)/plyr
  • R data structures
  • bootstrapping / random number generation
  • imputation (using various packages and functions)
  • bibtex
  • software for slide presentations

Time & Location

Virtually on the fourth Friday of each month at 1:30 pm, unless otherwise indicated.

Email Notification

We send out email notifications the week of a particular presentation. If you would like to be added to the list, please let us know.

Fall 2022 Schedule

R Workflow (VIRTUAL)
23 September, 2022 Frank Harrell

This workshop is based on the R Workflow electronic book at hbiostat.org/rflow. Here I outline an analysis project workflow that I’ve found to be efficient in making reproducible research reports using R with RMarkdown and now Quarto. I start by covering importing data, creating annotated analysis files, examining the extent and patterns of missing data, and running descriptive statistics on them with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed, and some flexible bivariate and 3-variable analysis methods are presented with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process, several useful report writing methods are exemplified, including program-controlled creation of multiple report tabs. The methods presented capitalize on 31 years of experience with the R language and its precursor S.

Winter/Spring 2022 Schedule

A practical tutorial to geocoding (VIRTUAL)
25 February, 2022 Ryan Moore

Geocoding is the process of taking a text description of a location, such as an address, and converting it to geocodes that can be joined to geomarker data of interest. In biomedical research, geocoding can be used to study a variety of geomarker data such as air quality indices or socioeconomic indicators.

In this month’s computing series, I will give a tutorial on how to use geocoding to join population-level census tract data to address data. Additionally, I will give a brief tutorial on how to plot geographic data on a choropleth map in R.
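
As a rough illustration of the join step described above, here is a minimal Python sketch (the talk itself uses R). It assumes the addresses have already been geocoded to a census tract identifier; the field names `tract_id` and `median_income`, the helper `join_tract_data`, and all values are hypothetical placeholders, not real census data.

```python
# Hypothetical geocoder output: each address already resolved to a tract ID.
geocoded = [
    {"address": "100 Example St", "tract_id": "T001"},
    {"address": "200 Sample Ave", "tract_id": "T002"},
    {"address": "300 Unknown Rd", "tract_id": None},  # geocoding failed
]

# Hypothetical tract-level data keyed by tract ID (made-up values).
tract_data = {
    "T001": {"median_income": 52000},
    "T002": {"median_income": 71000},
}


def join_tract_data(records, tracts):
    """Left-join tract-level fields onto geocoded address records."""
    joined = []
    for rec in records:
        # Addresses without a matching tract keep a missing value.
        extra = tracts.get(rec["tract_id"], {})
        joined.append({**rec, "median_income": extra.get("median_income")})
    return joined


joined = join_tract_data(geocoded, tract_data)
for row in joined:
    print(row["address"], row["median_income"])
```

A left join is used so that addresses that fail to geocode stay in the result with a missing value, which keeps the failures visible instead of silently dropping them.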

Fall 2021 Schedule

Practical Security
22 October, 2021 Shawn Garbett

Private health information faces an ever-growing risk of leaking to hackers and thieves. Practical security tips are shared to help block phishing attacks. Example code is shared showing how to generate reports from REDCap without storing the data locally.

Customizable Table Building with tangram.pipe
27 August, 2021 Andrew Guide

In this presentation, I will introduce the tangram.pipe package, which allows for fully customizable summary tables in R. I will show how to use this package to create well-formatted tables that allow users to specify the features and formatting for each row in the table. These features include comparison tests, missing data handling, row summaries by a column variable, and summaries of subsets.

Winter/Spring 2021 Schedule

as.data.table(data.frame)
30 April, 2021 Cole Beck

The most common data structure in R is the data.frame. While it has its flaws, it's simple to use and generally works as expected. It's the default tool you should use to store data. As data sets grow in size, the default isn't always good enough, and alternatives are available. One popular alternative is the data.table package. We'll discuss its syntax and how to replace data.frame functionality. We'll look at some of its features, and find some examples where data.table doesn't work as expected.

The content of Cole's presentation is available on GitHub: Presentation and YouTube (set quality to 1080p).

Advanced R Reporting
19 March, 2021 Frank Harrell

This talk illustrates the following:
  • parallel processing to speed up simulations
  • using a hash to only run simulations when an input parameter or the source code changes
  • using the data.table package for aggregating and reshaping data tables
  • auto-sensing when html format is being produced
  • automatically switching to interactive plotly graphics when creating the html version of the report
  • dynamic creation of a sequence of R markdown knitr code chunks each with its own figure caption, using Hmisc::markupSpecs$html$mdchunk
  • use of the beautiful readthedown report template (rmdformats package) when producing html
  • use of special LaTeX options when producing pdf
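
The hash-based rerun idea in the list above can be sketched as follows. This is a minimal Python illustration, not the speaker's R implementation: the helper names `input_hash` and `run_if_changed`, the in-memory `cache` dictionary, and the `SOURCE_V1` string standing in for the simulation's source code are all assumptions made for the example.

```python
import hashlib
import json
import random


def input_hash(params, source_text):
    """Hash the input parameters together with the source code text."""
    payload = json.dumps(params, sort_keys=True) + "\n" + source_text
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_if_changed(simulate, params, source_text, cache):
    """Rerun `simulate` only when the parameters or source text changed."""
    key = input_hash(params, source_text)
    if cache.get("key") != key:
        cache["key"] = key
        cache["result"] = simulate(**params)
    return cache["result"]


calls = []  # records each real run so cache hits are visible


def simulate(n, seed):
    """Toy simulation: mean of n uniform draws from a seeded generator."""
    calls.append((n, seed))
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n


SOURCE_V1 = "simulate, version 1"  # stand-in for the real source code text
cache = {}

first = run_if_changed(simulate, {"n": 1000, "seed": 1}, SOURCE_V1, cache)
second = run_if_changed(simulate, {"n": 1000, "seed": 1}, SOURCE_V1, cache)
third = run_if_changed(simulate, {"n": 2000, "seed": 1}, SOURCE_V1, cache)
print(len(calls))  # the second call was served from the cache
```

In a real report the cache would normally be stored on disk next to the report so it survives between renders; the dictionary here only keeps the sketch self-contained.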

The report to be discussed and its complete RMarkdown file and service functions may be found at https://hbiostat.org/R/Hmisc/markov

RStudio addins and open discussion
26 February, 2021 Josh DeClercq

RStudio addins are extensions that can both simplify and enhance the user’s ability to write R code. They are executable from within RStudio and are accessible just like any other R package. They can provide a wide range of useful functions, including interactive plotting using ggplot, help with regular expressions, or styling cluttered code. I will provide a brief overview of a few of these features.

In addition, I encourage people to participate in an open discussion regarding the statistical computing series. I am curious to get input on topics that people may be interested in, ideas for presentations or presenters, or any general ideas that can help improve the series. Any feedback or insight is welcome.

R code

Repello: Reports from Trello in R
22 January, 2021 Andrew Guide

Trello is an application for project management that has been utilized in a wide variety of settings. R is statistical computing software widely used to process data and generate reports. In this presentation, I will introduce the Repello package for R, which reads in data via the Trello API in order to generate a customizable progress report of all projects in a Trello board. I will also show how one can set up a cron job in Linux so that the progress report is automatically generated on a repeating schedule.

I will also show how I use this tool within the nephrology-epidemiology-biostatistics collaboration, which is a large research group with multiple, simultaneous ongoing research projects. The goal in creating this tool was to keep all the investigators informed about the progress of individual projects in the collaboration. I will speak about why I think keeping the entire team in the loop about all the projects is helpful.

Presentation slides

Winter/Spring 2020 Schedule

The "Harrellverse"
24 January, 2020 Frank Harrell

Frank will describe the systems he has in place for managing communications, blog, operating system and R package updates, news feeds, file sharing, synchronizing computers, and other aspects of computer life.

Fall 2019 Schedule

Creating and Distributing R Packages: A Case Study
25 October, 2019 Jeff Jetton

The thousands of available add-on packages are one of the reasons R has been so widely adopted. This discussion will cover how (and why) R users might develop their own packages and, as an example, trace the journey of the “greenclust” package from its initial creation through to its recent submission to CRAN.

Intro to Bayesian Regression Modeling in R using rstanarm
27 September, 2019 Nathan James

Presentation slides

API Construction for Scalable Delivery of Model Predictions using R and the Plumber package
23 August, 2019 Shawn Garbett

Spring 2019 Schedule

Tangram: Tools for Reproducible Tables
29 March, 2019 Shawn Garbett

Introduction to Docker
26 April, 2019 Nick Strayer


Fall 2018 Schedule

Getting Started with Bayesian Modeling in PyMC3

24 August, 2018 Chris Fonnesbeck


Improving organization and collaboration with Trello and integration with R

28 September, 2018 Molly Olson


Data Cleaning with dataMaid

26 October, 2018 Molly Olson and Omair Khan

Data screening is an important first step of any statistical analysis. dataMaid autogenerates a customizable data report with a thorough summary of the checks and the results that a human can use to identify possible errors. It provides an extensible suite of tests for common potential errors in a dataset. Molly will provide an introductory tutorial for using dataMaid in your data analysis workflow. Omair will present an example of dataMaid's extensibility.



Topic attachments
  • Addins.Rmd (5.5 K, 26 Feb 2021 - 14:04, JoshDeClercq)
  • bayes_reg_rstanarm.html (2666.0 K, 02 Oct 2019 - 09:08, JoshDeClercq): Intro to Bayesian analysis in R
  • Repello_–_R_Reports_from_Trello.pptx (1428.6 K, 25 Jan 2021 - 09:27, JoshDeClercq): Repello slides
Topic revision: r169 - 21 Sep 2022, RyanMoore
 
