The Statistical Computing Series
The Statistical Computing Series is a monthly event for learning various aspects of modern statistical computing from practitioners in the Department of Biostatistics. We focus on topics related to the R language, Python, and related tools, but we include the broadest possible range of content related to effective statistical computation. The format varies, depending on the speaker and the topic, from lectures to demonstrations to hands-on workshops.
If you have a particular topic you would like to see covered, please send a request.
There have been several requests for coverage of various topics. Here is a short list, if you are interested in contributing but are seeking inspiration:
- writing R functions with formula arguments
- writing R functions with methods
- using makefiles
- other graphics packages (base graphics)
- lme4/nlme
- reshape (package not function)/plyr
- R data structures
- bootstrapping / random number generating
- imputation (using various packages and functions)
- bibtex
- software for slide presentations
Time & Location
Virtually on the fourth Friday of each month at 1:30 pm, unless otherwise indicated.
Email Notification
We send out email notifications the week of a particular presentation. If you would like to be added to the list, please let us know.
Fall 2022 Schedule
R Workflow (VIRTUAL)
23 September, 2022 Frank Harrell
This workshop is based on the R Workflow electronic book at hbiostat.org/rflow. Here I outline an analysis project workflow that I’ve found to be efficient in making reproducible research reports using R with RMarkdown and now Quarto. I start by covering importing data, creating annotated analysis files, examining extent and patterns of missing data, and running descriptive statistics on them with goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and to produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed and some flexible bivariate and 3-variable analysis methods are presented with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process several useful report writing methods are exemplified, including program-controlled creation of multiple report tabs. The methods presented capitalize on 31 years of experience with the R language and its precursor S.
Winter/Spring 2022 Schedule
A practical tutorial to geocoding (VIRTUAL)
25 February, 2022 Ryan Moore
Geocoding is the process of taking a text description of a location, such as an address, and converting it to geocodes that can be joined to geomarker data of interest. In biomedical research, geocoding can be used to study a variety of geomarker data such as air quality indices or socioeconomic indicators.
In this month’s computing series, I will give a tutorial on how to use geocoding to join population level census tract data to address data. Additionally, I will give a brief tutorial on how to plot geographic data on a choropleth map in R.
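The join step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R workflow the tutorial uses. The addresses, tract identifiers, and index values are made-up placeholders, and the `geocoded` lookup stands in for the output of a real geocoder.

```python
# Hypothetical geocoder output: address -> census tract identifier.
# A real geocoder would produce these; the values here are placeholders.
geocoded = {
    "100 Example St": "TRACT-A",
    "200 Sample Ave": "TRACT-B",
    "300 Unknown Rd": None,  # address the geocoder could not match
}

# Hypothetical tract-level geomarker data (e.g. a deprivation index).
tract_data = {
    "TRACT-A": {"deprivation_index": 0.42},
    "TRACT-B": {"deprivation_index": 0.17},
}


def join_addresses_to_tracts(geocoded, tract_data):
    """Left-join tract-level data onto addresses via the tract identifier."""
    rows = []
    for address, tract in geocoded.items():
        # Unmatched addresses keep a row with a missing geomarker value.
        markers = tract_data.get(tract, {})
        rows.append({
            "address": address,
            "tract": tract,
            "deprivation_index": markers.get("deprivation_index"),
        })
    return rows


joined = join_addresses_to_tracts(geocoded, tract_data)
for row in joined:
    print(row)
```

Keeping unmatched addresses in the result (a left join) makes geocoding failures visible instead of silently dropping them.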
Fall 2021 Schedule
Practical Security
22 October, 2021 Shawn Garbett
Private health information faces an ever-growing threat of leaks caused by hackers and thieves. Practical security tips are shared to help block phishing attacks. Example code is shared showing how to generate reports from REDCap without storing the data locally.
Customizable Table Building with tangram.pipe
27 August, 2021 Andrew Guide
In this presentation, I will introduce the tangram.pipe package, which allows for fully customizable summary tables in R. I will show how to use this package to create well-formatted tables that allow users to specify the features and formatting for each row in the table. These features include comparison tests, missing data handling, row summaries by a column variable, and summaries of subsets.
Winter/Spring 2021 Schedule
as.data.table(data.frame)
30 April, 2021 Cole Beck
The most common data structure in R is the data.frame. While it has its flaws, it's simple to use and generally works as expected. It's the default tool you should use to store data. As data sets grow in size, the default isn't always good enough, and alternatives are available. One popular alternative is the data.table package. We'll discuss its syntax and how to replace data.frame functionality. We'll look at some of its features, and find some examples where data.table doesn't work as expected.
The content of Cole's presentation is available on GitHub: Presentation and YouTube (set quality to 1080p).
Advanced R Reporting
19 March, 2021 Frank Harrell
This talk illustrates the following:
- parallel processing to speed up simulations
- using a hash to only run simulations when an input parameter or the source code changes
- using the data.table package for aggregating and reshaping data tables
- auto-sensing when html format is being produced
- automatically switching to interactive plotly graphics when creating the html version of the report
- dynamic creation of a sequence of R markdown knitr code chunks each with its own figure caption, using Hmisc::markupSpecs$html$mdchunk
- use of the beautiful rmdreadthedown report template when producing html
- use of special LaTeX options when producing pdf
The report to be discussed and its complete RMarkdown file and service functions may be found at
https://hbiostat.org/R/Hmisc/markov
RStudio addins and open discussion
26 February, 2021 Josh DeClercq
RStudio addins are extensions which can both simplify and enhance the user’s ability to write R code. They are executable from within RStudio and are accessible just like any other R package. They can provide a wide range of useful functions, including interactive plotting using ggplot, help with regular expressions, or styling cluttered code. I will provide a brief overview of a few of these features.
In addition, I encourage people to participate in an open discussion regarding the statistical computing series. I am curious to get input on topics that people may be interested in, ideas for presentations or presenters, or any general ideas that can help improve the series. Any feedback or insight is welcome.
R code
Repello: Reports from Trello in R
22 January, 2021 Andrew Guide
Trello is an application for project management that has been utilized in a wide variety of settings. R is statistical computing software widely used to process data and generate reports. In this presentation, I will introduce the Repello package for R, which reads in data via the Trello API in order to generate a customizable progress report of all projects in a Trello board. I will also show how one can set up a cron job in Linux so that the progress report is automatically generated on a repeating schedule.
I will also show how I use this tool within the nephrology-epidemiology-biostatistics collaboration, which is a large research group with multiple, simultaneous ongoing research projects. The goal in creating this tool was to keep all the investigators informed about the progress of individual projects in the collaboration. I will speak about why I think keeping the entire team in the loop about all the projects is helpful.
Presentation slides
Winter/Spring 2020 Schedule
The "Harrellverse"
24 January, 2020 Frank Harrell
Frank will describe the systems he has in place for managing communications, blog, operating system and R package updates, news feeds, file sharing, synchronizing computers, and other aspects of computer life.
Fall 2019 Schedule
Creating and Distributing R Packages: A Case Study
25 October, 2019 Jeff Jetton
The thousands of available add-on packages are one of the reasons R has been so widely adopted. This discussion will cover how (and why) R users might develop their own packages and, as an example, trace the journey of the “greenclust” package from its initial creation through to its recent submission to CRAN.
Intro to Bayesian Regression Modeling in R using rstanarm
27 September, 2019 Nathan James
Presentation slides
API Construction for Scalable Delivery of Model Predictions using R and the Plumber package
23 August, 2019 Shawn Garbett
Spring 2019 Schedule
Tangram: Tools for Reproducible Tables
29 March, 2019 Shawn Garbett
Introduction to Docker
26 April, 2019 Nick Strayer
Fall 2018 Schedule
Getting Started with Bayesian Modeling in PyMC3
24 August, 2018 Chris Fonnesbeck
Improving organization and collaboration with Trello and integration with R
28 September, 2018 Molly Olson
Data Cleaning with dataMaid
26 October, 2018 Molly Olson and Omair Khan
Data screening is an important first step of any statistical analysis. dataMaid autogenerates a customizable data report with a thorough summary of the checks and the results that a human can use to identify possible errors. It provides an extendable suite of tests for common potential errors in a dataset. Molly will provide an introductory tutorial for using dataMaid in your data analysis workflow. Omair will present an example of dataMaid's extendability.
Spring 2018 Schedule
Running R Scripts in Batch on Remote Servers
27 April, 2018 Cole Beck
We'll discuss motivations for using remote servers (statcomp, ACCRE) to run R code in non-interactive mode. We'll cover various tips, including how to develop code on another machine, access remote files, debug errors, and avoid the dreaded "Broken pipe" (lost SSH connection) message.
Fall 2017 Schedule
Introduction to Jupyter Notebooks for Interactive and Reproducible Research!
29 September, 2017 Chris Fonnesbeck
Jupyter (formerly IPython) notebooks are a flexible and powerful tool for data science in both local and cloud-based environments. The notebooks allow data analyses to be integrated with markdown text, html, math, multimedia and other supporting materials and technologies to make scientific programming more literate and the generation of reports, web pages and even presentation slides seamless. While originally designed as a Python front-end, Jupyter works with R, Julia, Spark, and dozens of other languages via custom-built kernels. This presentation will introduce Jupyter notebooks and demonstrate how they can provide a powerful platform for reproducible quantitative research.
GitHub repository with notebook
Intermediate Version Control and Collaboration Workflows using Git and GitHub
27 October, 2017 Chris Fonnesbeck
Git has become a standard tool for version control of code for scientific computing and software development. Its effectiveness as a collaborative system is enhanced by commercial repository management services such as GitHub, BitBucket and GitLab, which provide remote repositories for working with teams on larger projects, as well as services for managing issues and code contributions from users. This tutorial will cover intermediate Git functionality required to use remote repositories effectively, including branching, cloning, merging and rebasing. I will also demonstrate best practices for participating in collaborative GitHub projects, such as creating issues and pull requests. This tutorial will assume participants are familiar with elementary Git usage.
Spring 2017 Schedule
Use R to animate travel history!
27 January, 2017 Minchun Zhou
Plotting is not sufficient for data visualization in many cases, such as travel history data. People travel a lot, with family, with friends or alone. If you can animate your travel history, it’s like reviving good memories. In this talk, I will briefly demonstrate how to plot data on Google Maps, draw great circles, and make animations using R!
http://www.minchunzhou.com/travelhistory.html
Using the R package GMD to do collaborative statistical document construction
24 February, 2017 Nicholas Strayer
Lucy and I have recently made the R package GMD to solve the problem “how do you construct a statistical report/homework while working simultaneously with collaborators?” GMD is an alpha-level package that allows you to keep a local .Rmd file in sync with a remote Google Doc. Simply paste the share URL of the Google Doc into the function and R will automatically pull the Google Doc, put it into an .Rmd file on your local machine, and render the results. This effectively lets you use Google Docs as your text editor, with all its benefits of history and multi-user editing, while avoiding the hassle of continuously copying and pasting the text into R to check for syntax errors.
Introduction to Variational Bayesian Methods
24 March, 2017 David Schlueter
In Bayesian analysis, the most common strategy for computing posterior quantities is through Markov Chain Monte Carlo (MCMC). Despite recent advances in efficient sampling, MCMC methods still remain computationally intensive for more than a few thousand observations. A more scalable alternative to sampling is Variational Inference (VI), which re-frames the problem of computing the posterior distribution as a minimization of the Kullback-Leibler divergence between the true posterior and a member of some approximating family. In this talk, we provide a basic overview of the VI framework as well as practical examples of its implementation using the Automatic Differentiation Variational Inference (ADVI) engine in PyMC3.
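For reference, the standard identity behind this reframing is shown below. The notation is chosen here for illustration and is not taken from the talk: $y$ is the data, $\theta$ the parameters, and $q$ a member of the approximating family.

```latex
\log p(y)
  = \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(y, \theta) - \log q(\theta)\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\!\left(q(\theta) \,\|\, p(\theta \mid y)\right)
```

Because $\log p(y)$ does not depend on $q$, maximizing the evidence lower bound (ELBO) over the approximating family is equivalent to minimizing the Kullback-Leibler divergence from $q$ to the true posterior, and the ELBO can be evaluated without knowing the normalizing constant $p(y)$.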
Gaussian Processes Made Easy
28 April, 2017 Chris Fonnesbeck
A common applied statistics task involves building regression models to characterize non-linear relationships between variables. It is possible to fit such models by assuming a particular non-linear structure, such as a sinusoidal, exponential, or polynomial function, to describe a given response by one variable to another. Unless this relationship is obvious from the outset, however, it involves possibly extensive model selection procedures to ensure the most appropriate model is retained. Alternatively, a non-parametric approach can be adopted by defining a set of knots across the variable space and using a spline or kernel regression to describe arbitrary non-linear relationships. However, knot layout procedures are somewhat ad hoc and can also involve variable selection. A third alternative is to adopt a Bayesian non-parametric strategy, and directly model the unknown underlying function. For this, we can employ Gaussian process models. I will compare three packages for fitting GP models in Python that make building Bayesian non-parametric models easier than they have ever been.
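To make the idea of directly modeling the unknown function concrete, here is a minimal sketch of the posterior mean of a zero-mean Gaussian process in one dimension. It uses only the Python standard library and none of the packages compared in the talk. The squared-exponential kernel, the length scale, the noise level, and the toy sine data are all assumptions made for illustration.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential (RBF) covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_posterior_mean(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean k(x_new, X) (K + noise * I)^{-1} y of a zero-mean GP."""
    gram = [
        [rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_train)]
        for i, a in enumerate(x_train)
    ]
    alpha = solve(gram, y_train)
    return [
        sum(rbf(x, xt) * weight for xt, weight in zip(x_train, alpha))
        for x in x_new
    ]


# Toy data: noiseless observations of a sine curve.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(x) for x in x_train]

fitted = gp_posterior_mean(x_train, y_train, x_train)
predicted = gp_posterior_mean(x_train, y_train, [1.5])
print(fitted, predicted)
```

No functional form is assumed for the curve here: the kernel alone controls how smooth the fitted function is, which is what dedicated GP packages let you specify and tune.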
Fall 2016 Schedule
A Not-so-gentle Introduction to Git
2 September, 2016 Dr. Christopher Fonnesbeck
Using RMarkdown to quickly make and maintain an attractive website
28 October, 2016 Mr. Nick Strayer and Ms. Lucy D'Agostino
A Tour of the TensorFlow Playground
2 December, 2016 Dr. Christopher Fonnesbeck
Fall 2015 Schedule
High-performance Computing with ACCRE
25 September 2015 Dr. Will French
ACCRE presentation slides
Effective Text Editing with TextMate
30 October, 2015 Dr. Christopher Fonnesbeck
Presentation summary
Computing on Larger-than-memory Datasets using Dask
20 November, 2015 Dr. Christopher Fonnesbeck
A Primer on Regular Expressions
18 December, 2015 Mr. Jeremy Stephens
Spring 2016 Schedule
A Primer on Branching in Git
22 January 2016 Mr. Nick Strayer
Slides
From Big to Lite Data with R/sqlite
26 February 2016 Mr. Cole Beck
Getting Started with Jupyter Notebooks
22 April 2016 Dr. Chris Fonnesbeck
Static Jupyter notebook of presentation
Pretty Data Visualization in R
20 May 2016 Mr. Nick Strayer
Loop Efficiency in R
13 March 2015 Ms. Svetlana Eden
Analyzing Geospatial Data using Python
10 April 2015 Dr. Chris Fonnesbeck
Presentations from Previous Years
Creating Interactive Visualizations with Bokeh
19 September, 2014 Dr. Chris Fonnesbeck
HTML Notebook
Enhanced Features of the Thunderbird Email Client
26 September, 2014 Dr. Frank Harrell
High Performance Computing in R Using the SNOW Package
10 October, 2014 Mr. Minchun Zhou
Slides
Using the REDCap API
7 November, 2014 Ms. JoAnn Alvarez and Dr. Chris Fonnesbeck
Computing Clinic
Open forum for asking and answering statistical computing questions.
21 November, 2014
Introduction to String Matching and Modification in R Using Regular Expressions
17 January, 2014 Ms. Svetlana Eden
presentation PDF
12 TextMate Tips for Effective Coding
24 January, 2014 Dr. Christopher Fonnesbeck
List of tips (Markdown format)
Efficiency Tips for a Basic R Loop
31 January, 2014 Ms. Svetlana Eden
An Introduction to Data Wrangling with Pandas
7 February, 2014 Dr. Christopher Fonnesbeck
Writing Functions in R
14 February, 2014 Ms. Svetlana Eden
Really Easy Slide Presentations with Slidify and RStudio
28 February, 2014 Ms. Laurie Samuels
Looking for a simpler, cleaner alternative to Beamer? Slidify might be just the thing you're looking for. This short tutorial will cover just the basics; but even with just the basics, you can quickly make a nice-looking slide presentation with R code, graphs, tables, and even a formula or two.
Sample slide deck
Data manipulation with the apply functions in R, part I: apply, tapply, and lapply
23 August, 2013 Ms. Laurie Samuels
PDF of Sage notebook with demo code
A Gentle Introduction to Git and GitHub
30 August, 2013 Dr. Chris Fonnesbeck
Learn the basics of version control and code management!
HTML5 slideshow
Increasing your leisure time as a biostatistician: Using the application programming interface (API) to automate exports in REDCap
6 September, 2013 Ms. JoAnn Alvarez
Data Science and Big Data Analytics
13 September, 2013
Part I of a video training class by EMC.
Recreating Minard: An introduction to base graphics in R
20 September, 2013 Mr. Nathan Mercaldo
Who is Stan?
25 October, 2013 Dr. Christopher Fonnesbeck
Introduction to the SparseM package for ordinal models
1 November, 2013 Dr. Frank Harrell
Length of the Beatles' Songs: An introduction to base graphics in R
8 November, 2013 Dr. Tatsuki Koyama
Ten Simple Rules for Reproducible Computational Research
A discussion of the Sandve et al. 2013 paper
12 November, 2013 Dr. Christopher Fonnesbeck
Using Plotly for Interactive and Collaborative Data Visualization
Plotly is a collaborative data analysis and graphing platform. I will introduce its main features for generating high-quality, interactive scientific graphics using its APIs for Python and R.
6 December, 2013 Dr. Christopher Fonnesbeck
Evaluating and (automatically) typesetting symbolic calculus and linear algebra expressions using Sage and LaTeX
11 January, 2013 Ms. Laurie Samuels
Resources:
UsingSage
An Introduction to Graphics with D3
25 January, 2013 Dr. Chris Fonnesbeck
Creating Heatmaps in R
15 February, 2013 Mr. Pengcheng Lu
Resources:
Report Sample R script
Plotting with ggplot2
1 March, 2013 Ms. Jennifer Thompson
Resources:
Slides Example R code
Fitting Bayesian Survival Models in Python (Part 1)
15 March, 2013 Dr. Chris Fonnesbeck
Resources:
iPython Notebook
Fitting Bayesian Survival Models in Python (Part 2)
22 March, 2013 Dr. Chris Fonnesbeck
Resources:
iPython Notebook
Manipulating Structured Data with Pandas
12 April, 2013 Dr. Chris Fonnesbeck
Improving research using advanced REDCap interfaces
19 April, 2013 Mr. Scott Burns
Resources:
PDF HTML slides
An Introduction to the rms Package
19 May, 2011 Prof. Frank E. Harrell, Jr.
Version Control using Git
26 May, 2011 Dr. Chris Fonnesbeck
How to create an R package
9 June, 2011 Asst. Prof. Matthew Shotwell
HowToCreateAnRPackage
Plotting with ggplot2
16 June, 2011 Ms. Jennifer Thompson
Implementing Bayesian statistical models in Python
23 June, 2011 Dr. Christopher Fonnesbeck
String manipulation and text-mining
8 September, 2011 Mr. Pengcheng Lu
Resources:
Slides Sample R script
Using weaver to cache intermediate results with Sweave
15 September, 2011 Ms. JoAnn Alvarez
Using markup languages for scientific document creation: reStructuredText and MultiMarkdown
22 September, 2011 Dr. Christopher Fonnesbeck
A primer on regular expressions
29 September, 2011 Mr. Jeremy Stephens
R Functions and Related Topics
20 October, 2011 Ms. Svetlana Eden
Example R code
Efficiency tips for looping in R
Automating data-integrity checks in R
27 October, 2011 Ms. Laurie Samuels
RDataIntegrityChecks code on GitHub. Also see the editrules package on CRAN.
Visualizing geospatial data using R
3 November, 2011 Mr. Frank Fan
Code and Data Slides
Running simple Bayesian models in JAGS using R
17 November, 2011 Dr. Chris Fonnesbeck
Programming Tips For Statisticians
8 December, 2011 Dr. Robert Greevy
Pandas: Powerful data structures for statistical analysis in Python
9 February, 2012 Dr. Chris Fonnesbeck
Confidence Estimates Using the rms Package
16 February, 2012 Prof. Frank Harrell
Introduction to Survival Modeling in JAGS
1 March, 2012 Dr. Chris Fonnesbeck
Five useful graphical functions in R
8 March, 2012 Dr. Tatsuki Koyama
R is Dead! Long Live Julia!
29 March, 2012 Dr. Chris Fonnesbeck
Speeding up statistical computations in Python using numexpr and cython
17 May, 2012 Dr. Chris Fonnesbeck
Static version of iPython notebook used in numexpr/cython demo
Integrating R and Markdown using knitr for elegant scientific documents
24 May, 2012 Dr. Chris Fonnesbeck
Fall 2012 Schedule
Handling date-times in R
30 August, 2012 Mr. Cole Beck
Date-Time tutorial
Introductory Command Line Usage in Mac OS X
7 September, 2012 Dr. Chris Fonnesbeck
Mastering the TextMate Editor
14 September, 2012 Dr. Chris Fonnesbeck
Statistical Computing Clinic
12 October, 2012
General troubleshooting and Q&A for R and other tools.
Five Favorite Functions
19 October, 2012 Mr. Cole Beck
Five Functions
How to Create Nomograms
2 November, 2012 Dr. Frank Harrell
An Introduction to Version Control Using Git
9 November, 2012 Dr. Chris Fonnesbeck
How to increase reproducibility by freezing R
16 November, 2012 Mr. Jeremy Stephens