# The Statistical Computing Series

The Statistical Computing Series is a monthly event for learning various aspects of modern statistical computing from practitioners in the Department of Biostatistics. We focus on topics related to the R language, Python, and related tools, but we include the broadest possible range of content related to effective statistical computation. The format varies, depending on the speaker and the topic, from lectures to demonstrations to hands-on workshops.

If you have a particular topic you would like to see covered, please send a request.

There have been several requests for coverage of various topics. Here is a short list, if you are interested in contributing but are seeking inspiration:

- writing R functions with formula arguments
- writing R functions with methods
- using makefiles
- other graphics packages (base graphics)
- lme4/nlme
- reshape (package not function)/plyr
- R data structures
- bootstrapping / random number generating
- imputation (using various packages and functions)
- bibtex
- software for slide presentations

### Time & Location

Virtually on the fourth Friday of each month at 1:30 pm, unless otherwise indicated.

### Email Notification

We send out email notifications the week of a particular presentation. If you would like to be added to the list, please let us know.

### Fall 2022 Schedule

##### R Workflow (VIRTUAL)

*23 September, 2022* **Frank Harrell**
This workshop is based on the R Workflow electronic book at hbiostat.org/rflow. Here I outline an analysis project workflow that I’ve found to be efficient in making reproducible research reports using R with RMarkdown and now Quarto. I start by covering importing data, creating annotated analysis files, examining extent and patterns of missing data, and running descriptive statistics on them with goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and to produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed and some flexible bivariate and 3-variable analysis methods are presented with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process several useful report writing methods are exemplified, including program-controlled creation of multiple report tabs. The methods presented capitalize on 31 years of experience with the R language and its precursor S.

### Winter/Spring 2022 Schedule

##### A practical tutorial to geocoding (VIRTUAL)

*25 February, 2022* **Ryan Moore**
Geocoding is the process of taking a text description of a location, such as an address, and converting it to geocodes that can be joined to geomarker data of interest. In biomedical research, geocoding can be used to study a variety of geomarker data such as air quality indices or socioeconomic indicators.

In this month’s computing series, I will give a tutorial on how to use geocoding to join population level census tract data to address data. Additionally, I will give a brief tutorial on how to plot geographic data on a choropleth map in R.
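
The join step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from the tutorial: the record layout, the tract identifiers, and the `median_income` values are all made up, and the geocoding itself (address to tract ID) is assumed to have been done already.

```python
# Hypothetical geocoder output: each address already mapped to a census tract ID.
geocoded = [
    {"patient_id": 1, "tract_id": "tract_A"},
    {"patient_id": 2, "tract_id": "tract_B"},
    {"patient_id": 3, "tract_id": None},  # address could not be geocoded
]

# Hypothetical tract-level geomarker table (values are invented for illustration).
tract_data = {
    "tract_A": {"median_income": 52000},
    "tract_B": {"median_income": 61000},
}


def join_geomarkers(records, geomarkers):
    """Left-join tract-level attributes onto address-level records by tract ID."""
    joined = []
    for rec in records:
        # Records with a missing or unknown tract keep their own fields only.
        attrs = geomarkers.get(rec["tract_id"], {})
        joined.append({**rec, **attrs})
    return joined


linked = join_geomarkers(geocoded, tract_data)
```

A left join is used so that addresses that failed to geocode stay in the data set and can be counted rather than silently dropped.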

### Fall 2021 Schedule

##### Practical Security

*22 October, 2021* **Shawn Garbett**
Private health information faces an ever-growing threat of leaks caused by hackers and thieves. Practical security tips are shared to help block phishing attacks. Example code is shared showing how to generate reports from REDCap without storing the data locally.

##### Customizable Table Building with tangram.pipe

*27 August, 2021* **Andrew Guide**
In this presentation, I will introduce the tangram.pipe package, which allows for fully customizable summary tables in R. I will show how to use this package to create well-formatted tables that allow users to specify the features and formatting for each row in the table. These features include comparison tests, missing data handling, row summaries by a column variable, and summaries of subsets.

### Winter/Spring 2021 Schedule

##### as.data.table(data.frame)

*30 April, 2021* **Cole Beck**
The most common data structure in R is the data.frame. While it has its flaws, it's simple to use and generally works as expected. It's the default tool you should use to store data. As data sets grow in size, the default isn't always good enough, and alternatives are available. One popular alternative is the data.table package. We'll discuss its syntax and how to replace data.frame functionality. We'll look at some of its features, and find some examples when data.table doesn't work as expected.

The content of Cole's presentation is available on GitHub: Presentation and YouTube (set quality to 1080p).

##### Advanced R Reporting

*19 March, 2021* **Frank Harrell**
This talk illustrates the following:

- parallel processing to speed up simulations.
- using a hash to only run simulations when an input parameter or the source code changes
- using the data.table package for aggregating and reshaping data tables
- auto-sensing when html format is being produced
- automatically switching to interactive plotly graphics when creating the html version of the report
- dynamic creation of a sequence of R markdown knitr code chunks each with its own figure caption, using Hmisc::markupSpecs$html$mdchunk
- use of the beautiful rmdreadthedown report template when producing html
- use of special LaTeX options when producing pdf

The report to be discussed and its complete RMarkdown file and service functions may be found at https://hbiostat.org/R/Hmisc/markov
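
The hash-based caching idea from the list above can be sketched as follows. This is a minimal illustration in Python rather than R, and it is not the code used in the talk: `run_cached` and `simulate` are hypothetical names, and hashing the function's bytecode and constants is an assumed stand-in for hashing the source file.

```python
import hashlib
import json
import pickle
import random
from pathlib import Path


def run_cached(fn, params, cache_dir):
    """Run fn(**params) unless a result for the same code and inputs is cached.

    Returns (result, was_cached).
    """
    # Key = hash of the inputs plus the function's bytecode and constants
    # (a simple stand-in for hashing the source file; suits simple functions).
    code = fn.__code__
    material = (
        json.dumps(params, sort_keys=True).encode()
        + code.co_code
        + repr(code.co_consts).encode()
    )
    cache_file = Path(cache_dir) / (hashlib.sha256(material).hexdigest() + ".pkl")

    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes()), True

    result = fn(**params)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(pickle.dumps(result))
    return result, False


def simulate(n, seed):
    """Toy simulation: mean of n standard normal draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += rng.gauss(0, 1)
    return total / n
```

Calling `run_cached(simulate, {"n": 1000, "seed": 1}, "cache")` twice runs the simulation only the first time. Changing a parameter value or editing the body of `simulate` produces a different key and triggers a fresh run.
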
##### RStudio addins and open discussion

*26 February, 2021* **Josh DeClercq**
RStudio addins are extensions which can both simplify and enhance the user’s ability to write R code. They are executable from within RStudio and are accessible just like any other R package. They can provide a wide range of useful functions, including interactive plotting using ggplot, help with regular expressions, or styling cluttered code. I will provide a brief overview of a few of these features.

In addition, I encourage people to participate in an open discussion regarding the statistical computing series. I am curious to get input on topics that people may be interested in, ideas for presentations or presenters, or any general ideas that can help improve the series. Any feedback or insight is welcome.

R code
##### Repello: Reports from Trello in R

*22 January, 2021* **Andrew Guide**
Trello is an application for project management that has been utilized in a wide variety of settings. R is statistical computing software widely used to process data and generate reports. In this presentation, I will introduce the Repello package for R, which reads in data via the Trello API in order to generate a customizable progress report of all projects in a Trello board. I will also show how one can set up a cron job in Linux so that the progress report is automatically generated on a repeating schedule.

In addition, I will also show how I use this tool within the nephrology-epidemiology-biostatistics collaboration, which is a large research group with multiple, simultaneous ongoing research projects. The goal in creating this tool was to keep all the investigators informed about the progress of individual projects in the collaboration. I will speak about why I think keeping the entire team in the loop about all the projects is helpful.

Presentation slides
### Winter/Spring 2020 Schedule

##### The "Harrellverse"

*24 January, 2020* **Frank Harrell**
Frank will describe the systems he has in place for managing communications, blog, operating system and R package updates, news feeds, file sharing, synchronizing computers, and other aspects of computer life.

### Fall 2019 Schedule

##### Creating and Distributing R Packages: A Case Study

*25 October, 2019* **Jeff Jetton**
The thousands of available add-on packages are one of the reasons R has been so widely adopted. This discussion will cover how (and why) R users might develop their own packages and, as an example, trace the journey of the “greenclust” package from its initial creation through to its recent submission to CRAN.

##### Intro to Bayesian Regression Modeling in R using rstanarm

*27 September, 2019* **Nathan James**
Presentation slides
##### API Construction for Scalable Delivery of Model Predictions using R and the Plumber package

*23 August, 2019* **Shawn Garbett**
### Spring 2019 Schedule

##### Tangram: Tools for Reproducible Tables

*29 March, 2019* **Shawn Garbett**
##### Introduction to Docker

*26 April, 2019* **Nick Strayer**

### Fall 2018 Schedule

##### Getting Started with Bayesian Modeling in PyMC3

*24 August, 2018* **Chris Fonnesbeck**

##### Improving organization and collaboration with Trello and integration with R

*28 September, 2018* **Molly Olson**

##### Data Cleaning with dataMaid

*26 October, 2018* **Molly Olson and Omair Khan**
Data screening is an important first step of any statistical analysis. dataMaid autogenerates a customizable data report with a thorough summary of the checks and the results that a human can use to identify possible errors. It provides an extendable suite of tests for common potential errors in a dataset. Molly will provide an introductory tutorial for using dataMaid in your data analysis workflow. Omair will present an example of dataMaid's extendability.

### Spring 2018 Schedule

##### Running R Scripts in Batch on Remote Servers

*27 April, 2018* **Cole Beck**
We'll discuss motivations for using remote servers (statcomp, ACCRE) to run R code in non-interactive mode. We'll cover various tips, including how to develop code on another machine, access remote files, debug errors, and avoid the dreaded "Broken pipe" (lost SSH connection) message.

### Fall 2017 Schedule

##### Introduction to Jupyter Notebooks for Interactive and Reproducible Research!

*29 September, 2017* **Chris Fonnesbeck**
Jupyter (formerly IPython) notebooks are a flexible and powerful tool for data science in both local and cloud-based environments. The notebooks allow data analyses to be integrated with markdown text, html, math, multimedia and other supporting materials and technologies to make scientific programming more literate and the generation of reports, web pages and even presentation slides seamless. While originally designed as a Python front-end, Jupyter works with R, Julia, Spark, and dozens of other languages via custom-built kernels. This presentation will introduce Jupyter notebooks and demonstrate how they can provide a powerful platform for reproducible quantitative research.

GitHub repository with notebook

##### Intermediate Version Control and Collaboration Workflows using Git and GitHub

*27 October, 2017* **Chris Fonnesbeck**
Git has become a standard tool for version control of code for scientific computing and software development. Its effectiveness as a collaborative system is enhanced by commercial repository management services such as GitHub, BitBucket and GitLab, which provide remote repositories for working with teams on larger projects, as well as services for managing issues and code contributions from users. This tutorial will cover intermediate Git functionality required to use remote repositories effectively, including branching, cloning, merging and rebasing. I will also demonstrate best practices for participating in collaborative GitHub projects, such as creating issues and pull requests. This tutorial will assume participants are familiar with elementary Git usage.

### Spring 2017 Schedule

##### Use R to animate travel history!

*27 January, 2017* **Minchun Zhou**
Plotting is not sufficient for data visualization in many cases, such as travel history data. People travel a lot, with family, with friends or alone. If you can animate your travel history, it’s like reviving good memories. In this talk, I will briefly demonstrate how to plot data on Google Maps, draw great circles, and make animations using R!

http://www.minchunzhou.com/travelhistory.html

##### Using the R package GMD to do collaborative statistical document construction

*24 February, 2017* **Nicholas Strayer**
Lucy and I have recently made the R package GMD to solve the problem “how do you construct a statistical report/homework while working simultaneously with collaborators?” GMD is an alpha-level package that allows you to keep a local .Rmd file in sync with a remote Google Doc. Simply paste the share URL of the Google Doc into the function and R will automatically pull the Google Doc, put it into an .Rmd on your local machine, and render the results. This effectively lets you use Google Docs as your text editor, with all its benefits of history and multi-user editing, while avoiding the hassle of continuously copying and pasting the text into R to check for syntax errors etc.

##### Introduction to Variational Bayesian Methods

*24 March, 2017* **David Schlueter**
In Bayesian analysis, the most common strategy for computing posterior quantities is through Markov Chain Monte Carlo (MCMC). Despite recent advances in efficient sampling, MCMC methods still remain computationally intensive for more than a few thousand observations. A more scalable alternative to sampling is Variational Inference (VI), which re-frames the problem of computing the posterior distribution as a minimization of the Kullback-Leibler divergence between the true posterior and a member of some approximating family. In this talk, we provide a basic overview of the VI framework as well as practical examples of its implementation using the Automatic Differentiation Variational Inference (ADVI) engine in PyMC3.
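
For reference, the optimization described in this abstract is usually written as follows. The notation is chosen here and is not taken from the talk: $y$ is the data, $\theta$ the parameters, and $\mathcal{Q}$ the approximating family.

$$
q^{*}(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid y)\big)
$$

$$
\log p(y) = \underbrace{\mathbb{E}_{q}\big[\log p(y, \theta)\big] - \mathbb{E}_{q}\big[\log q(\theta)\big]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid y)\big)
$$

Because $\log p(y)$ does not depend on $q$, minimizing the divergence is equivalent to maximizing the ELBO, which can be evaluated without knowing the normalizing constant $p(y)$.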

##### Gaussian Processes Made Easy

*28 April, 2017* **Chris Fonnesbeck**
A common applied statistics task involves building regression models to characterize non-linear relationships between variables. It is possible to fit such models by assuming a particular non-linear structure, such as a sinusoidal, exponential, or polynomial function, to describe a given response by one variable to another. Unless this relationship is obvious from the outset, however, it involves possibly extensive model selection procedures to ensure the most appropriate model is retained. Alternatively, a non-parametric approach can be adopted by defining a set of knots across the variable space and using a spline or kernel regression to describe arbitrary non-linear relationships. However, knot layout procedures are somewhat ad hoc and can also involve variable selection. A third alternative is to adopt a Bayesian non-parametric strategy, and directly model the unknown underlying function. For this, we can employ Gaussian process models. I will compare three packages for fitting GP models in Python that make building Bayesian non-parametric models easier than they have ever been.
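
As a brief sketch of what "directly model the unknown underlying function" means, a Gaussian process regression model is commonly written as below. The notation and the Gaussian noise term are assumptions made here for illustration, not details from the talk.

$$
f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \qquad
y_i = f(x_i) + \varepsilon_i, \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
$$

$$
\big(f(x_1), \ldots, f(x_n)\big)^{\top} \sim \mathcal{N}(\mathbf{m}, \mathbf{K}), \qquad
\mathbf{m}_i = m(x_i), \quad \mathbf{K}_{ij} = k(x_i, x_j)
$$

The second line is the defining property: the function values at any finite set of inputs are jointly multivariate normal, so the choice of covariance function $k$ takes the place of choosing knots or a parametric form.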

### Fall 2016 Schedule

##### A Not-so-gentle Introduction to Git

*2 September, 2016* **Dr. Christopher Fonnesbeck**

##### Using RMarkdown to quickly make and maintain an attractive website

*28 October, 2016* **Mr. Nick Strayer and Ms. Lucy D'Agostino**

##### A Tour of the TensorFlow Playground

*2 December, 2016* **Dr. Christopher Fonnesbeck**

### Fall 2015 Schedule

##### High-performance Computing with ACCRE

*25 September 2015* **Dr. Will French**
ACCRE presentation slides

##### Effective Text Editing with TextMate

*30 October, 2015* **Dr. Christopher Fonnesbeck**
Presentation summary

##### Computing on Larger-than-memory Datasets using Dask

*20 November, 2015* **Dr. Christopher Fonnesbeck**

##### A Primer on Regular Expressions

*18 December, 2015* **Mr. Jeremy Stephens**

### Spring 2016 Schedule

##### A Primer on Branching in Git

*22 January 2016* **Mr. Nick Strayer**
Slides

##### From Big to Lite Data with R/sqlite

*26 February 2016* **Mr. Cole Beck**

##### Getting Started with Jupyter Notebooks

*22 April 2016* **Dr. Chris Fonnesbeck**
Static Jupyter notebook of presentation

##### Pretty Data Visualization in R

*20 May 2016* **Mr. Nick Strayer**

##### Loop Efficiency in R

*13 March 2015* **Ms. Svetlana Eden**

##### Analyzing Geospatial Data using Python

*10 April 2015* **Dr. Chris Fonnesbeck**

### Presentations from Previous Years

##### Creating Interactive Visualizations with Bokeh

*19 September, 2014* **Dr. Chris Fonnesbeck**
HTML Notebook

##### Enhanced Features of the Thunderbird Email Client

*26 September, 2014* **Dr. Frank Harrell**

##### High Performance Computing in R Using the SNOW Package

*10 October, 2014* **Mr. Minchun Zhou**
Slides

##### Using the REDCap API

*7 November, 2014* **Ms. JoAnn Alvarez and Dr. Chris Fonnesbeck**

##### Computing Clinic

Open forum for asking and answering statistical computing questions.

*21 November, 2014*

##### Introduction to String Matching and Modification in R Using Regular Expressions

*17 January, 2014* **Ms. Svetlana Eden**
presentation PDF

##### 12 TextMate Tips for Effective Coding

*24 January, 2014* **Dr. Christopher Fonnesbeck**
List of tips (Markdown format)

##### Efficiency Tips for a Basic R Loop

*31 January, 2014* **Ms. Svetlana Eden**

##### An Introduction to Data Wrangling with Pandas

*7 February, 2014* **Dr. Christopher Fonnesbeck**

##### Writing Functions in R

*14 February, 2014* **Ms. Svetlana Eden**

##### Really Easy Slide Presentations with Slidify and RStudio

*28 February, 2014* **Ms. Laurie Samuels**
Looking for a simpler, cleaner alternative to Beamer? Slidify might be just the thing you're looking for. This short tutorial will cover just the basics; but even with just the basics, you can quickly make a nice-looking slide presentation with R code, graphs, tables, and even a formula or two.

Sample slide deck

##### Data manipulation with the `apply` functions in R, part I: `apply`, `tapply`, and `lapply`

*23 August, 2013* **Ms. Laurie Samuels**
PDF of Sage notebook with demo code

##### A Gentle Introduction to Git and GitHub

*30 August, 2013* **Dr. Chris Fonnesbeck**
Learn the basics of version control and code management!

HTML5 slideshow

##### Increasing your leisure time as a biostatistician: Using the application programming interface (API) to automate exports in REDCap

*6 September, 2013* **Ms. JoAnn Alvarez**

##### Data Science and Big Data Analytics

*13 September, 2013*
Part I of a video training class by EMC.

##### Recreating Minard: An introduction to base graphics in R

*20 September, 2013* **Mr. Nathan Mercaldo**

##### Who is Stan?

*25 October, 2013* **Dr. Christopher Fonnesbeck**

##### Introduction to the `SparseM` package for ordinal models

*1 November, 2013* **Dr. Frank Harrell**

##### Length of the Beatles' Songs: An introduction to base graphics in R

*8 November, 2013* **Dr. Tatsuki Koyama**

##### Ten Simple Rules for Reproducible Computational Research

A discussion of the Sandve et al. 2013 paper

*12 November, 2013* **Dr. Christopher Fonnesbeck**

##### Using Plotly for Interactive and Collaborative Data Visualization

Plotly is a collaborative data analysis and graphing platform. I will introduce its main features for generating high-quality, interactive scientific graphics using its APIs for Python and R.

*6 December, 2013* **Dr. Christopher Fonnesbeck**

##### Evaluating and (automatically) typesetting symbolic calculus and linear algebra expressions using Sage and LaTeX

*11 January, 2013* **Ms. Laurie Samuels**
Resources:

UsingSage

##### An Introduction to Graphics with D3

*25 January, 2013* **Dr. Chris Fonnesbeck**

##### Creating Heatmaps in R

*15 February, 2013* **Mr. Pengcheng Lu**
Resources:

Report Sample R script

##### Plotting with `ggplot2`

*1 March, 2013* **Ms. Jennifer Thompson**
Resources:

Slides Example R code

##### Fitting Bayesian Survival Models in Python (Part 1)

*15 March, 2013* **Dr. Chris Fonnesbeck**
Resources:

iPython Notebook

##### Fitting Bayesian Survival Models in Python (Part 2)

*22 March, 2013* **Dr. Chris Fonnesbeck**
Resources:

iPython Notebook

##### Manipulating Structured Data with Pandas

*12 April, 2013* **Dr. Chris Fonnesbeck**

##### Improving research using advanced REDCap interfaces

*19 April, 2013* **Mr. Scott Burns**
Resources:

PDF HTML slides

##### An Introduction to the `rms` Package

*19 May, 2011* **Prof. Frank E. Harrell, Jr.**

##### Version Control using Git

*26 May, 2011* **Dr. Chris Fonnesbeck**

##### How to create an R package

*9 June, 2011* **Asst. Prof. Matthew Shotwell**
HowToCreateAnRPackage

##### Plotting with `ggplot2`

*16 June, 2011* **Ms. Jennifer Thompson**

##### Implementing Bayesian statistical models in Python

*23 June, 2011* **Dr. Christopher Fonnesbeck**

##### String manipulation and text-mining

*8 September, 2011* **Mr. Pengcheng Lu**
Resources:

Slides Sample R script

##### Using `weaver` to cache intermediate results with Sweave

*15 September, 2011* **Ms. JoAnn Alvarez**

##### Using markup languages for scientific document creation: reStructuredText and MultiMarkdown

*22 September, 2011* **Dr. Christopher Fonnesbeck**

##### A primer on regular expressions

*29 September, 2011* **Mr. Jeremy Stephens**

##### R Functions and Related Topics

*20 October, 2011* **Ms. Svetlana Eden**
Example R code
Efficiency tips for looping in R

##### Automating data-integrity checks in R

*27 October, 2011* **Ms. Laurie Samuels**
RDataIntegrityChecks code on GitHub. Also see the editrules package on CRAN.

##### Visualizing geospatial data using R

*3 November, 2011* **Mr. Frank Fan**
Code and Data Slides

##### Running simple Bayesian models in JAGS using R

*17 November, 2011* **Dr. Chris Fonnesbeck**

##### Programming Tips For Statisticians

*8 December, 2011* **Dr. Robert Greevy**

##### Pandas: Powerful data structures for statistical analysis in Python

*9 February, 2012* **Dr. Chris Fonnesbeck**

##### Confidence Estimates Using the `rms` Package

*16 February, 2012* **Prof. Frank Harrell**

##### Introduction to Survival Modeling in JAGS

*1 March, 2012* **Dr. Chris Fonnesbeck**

##### Five useful graphical functions in R

*8 March, 2012* **Dr. Tatsuki Koyama**

##### R is Dead! Long Live Julia!

*29 March, 2012* **Dr. Chris Fonnesbeck**

##### Speeding up statistical computations in Python using `numexpr` and `cython`

*17 May, 2012* **Dr. Chris Fonnesbeck**
Static version of iPython notebook used in numexpr/cython demo

##### Integrating R and Markdown using `knitr` for elegant scientific documents

*24 May, 2012* **Dr. Chris Fonnesbeck**

### Fall 2012 Schedule

##### Handling date-times in R

*30 August, 2012* **Mr. Cole Beck** Date-Time tutorial

##### Introductory Command Line Usage in Mac OS X

*7 September, 2012* **Dr. Chris Fonnesbeck**

##### Mastering the TextMate Editor

*14 September, 2012* **Dr. Chris Fonnesbeck**

##### Statistical Computing Clinic

*12 October, 2012* General troubleshooting and Q&A for R and other tools.

##### Five Favorite Functions

*19 October, 2012* **Mr. Cole Beck** Five Functions

##### How to Create Nomograms

*2 November, 2012* **Dr. Frank Harrell**

##### An Introduction to Version Control Using Git

*9 November, 2012* **Dr. Chris Fonnesbeck**

##### How to increase reproducibility by freezing R

*16 November, 2012* **Mr. Jeremy Stephens**