The Statistical Computing Series

The Statistical Computing Series is a monthly event for learning various aspects of modern statistical computing from practitioners in the Department of Biostatistics. We focus on the R language, Python, and related tools, but we welcome the broadest possible range of content on effective statistical computation. The format varies with the speaker and the topic, from lectures to demonstrations to hands-on workshops.

If you have a particular topic you would like to see covered, please send a request.

There have been several requests for coverage of various topics. Here is a short list, if you are interested in contributing but are seeking inspiration:

writing R functions with formula arguments
writing R functions with methods
using makefiles
other graphics packages (base graphics)
lme4/nlme
reshape (package not function)/plyr
R data structures
bootstrapping / random number generating
imputation (using various packages and functions)
bibtex
software for slide presentations

Time & Location

Virtually on the fourth Friday of each month at 1:30 pm, unless otherwise indicated.

Email Notification

We send out email notifications the week of a particular presentation. If you would like to be added to the list, please let us know.

Fall 2022 Schedule

R Workflow (VIRTUAL)

23 September, 2022 Frank Harrell

This workshop is based on the R Workflow electronic book at hbiostat.org/rflow. Here I outline an analysis project workflow that I've found to be efficient in making reproducible research reports using R with RMarkdown and now Quarto. I start by covering importing data, creating annotated analysis files, examining the extent and patterns of missing data, and running descriptive statistics on them, with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed, and some flexible bivariate and 3-variable analysis methods are presented with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process several useful report-writing methods are exemplified, including program-controlled creation of multiple report tabs. The methods presented capitalize on 31 years of experience with the R language and its precursor S.

Winter/Spring 2022 Schedule

A practical tutorial to geocoding (VIRTUAL)

25 February, 2022 Ryan Moore

Geocoding is the process of taking a text description of a location, such as an address, and converting it to geocodes that can be joined to geomarker data of interest. In biomedical research, geocoding can be used to study a variety of geomarker data such as air quality indices or socioeconomic indicators.

In this month's computing series, I will give a tutorial on how to use geocoding to join population-level census tract data to address data. Additionally, I will give a brief tutorial on how to plot geographic data on a choropleth map in R.

Fall 2021 Schedule

Practical Security

22 October, 2021 Shawn Garbett

Private health information faces an ever-growing threat of leaks from hackers and thieves. Practical security tips are shared to help block phishing attacks. Example code is shared showing how to generate reports from REDCap without storing the data locally.

Customizable Table Building with tangram.pipe

27 August, 2021 Andrew Guide

In this presentation, I will introduce the tangram.pipe package, which allows for fully customizable summary tables in R. I will show how to use this package to create well-formatted tables that allow users to specify the features and formatting for each row in the table. These features include comparison tests, missing data handling, row summaries by a column variable, and summaries of subsets.

Winter/Spring 2021 Schedule

as.data.table(data.frame)

30 April, 2021 Cole Beck

The most common data structure in R is the data.frame. While it has its flaws, it's simple to use and generally works as expected. It's the default tool you should use to store data. As data sets grow in size, the default isn't always good enough, and alternatives are available. One popular alternative is the data.table package. We'll discuss its syntax and how to replace data.frame functionality. We'll look at some of its features, and find some examples when data.table doesn't work as expected.

The content of Cole's presentation is available on GitHub (Presentation) and on YouTube (set quality to 1080p).

Advanced R Reporting

19 March, 2021 Frank Harrell

This talk illustrates the following:

parallel processing to speed up simulations
using a hash to only run simulations when an input parameter or the source code changes
using the data.table package for aggregating and reshaping data tables
auto-sensing when html format is being produced
automatically switching to interactive plotly graphics when creating the html version of the report
dynamic creation of a sequence of R markdown knitr code chunks each with its own figure caption, using Hmisc::markupSpecs$html$mdchunk
use of the beautiful rmdreadthedown report template when producing html
use of special LaTeX options when producing pdf

The report to be discussed and its complete RMarkdown file and service functions may be found at https://hbiostat.org/R/Hmisc/markov
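
The hash-based idea in the list above (re-run a simulation only when an input parameter or the code changes) can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the talk, which used R. The names `run_if_changed`, `code_fingerprint`, and `simulate`, and the pickle-file cache layout, are all hypothetical. The compiled bytecode and constants stand in for the source text here, which is an assumption made so that the sketch also works when the source file is not available.

```python
import hashlib
import json
import pickle
import random
import tempfile
from pathlib import Path


def code_fingerprint(func):
    """Bytes that change when the function body changes (assumed stand-in for its source)."""
    code = func.__code__
    return code.co_code + repr(code.co_consts).encode() + repr(code.co_names).encode()


def run_if_changed(func, params, cache_dir):
    """Run func(**params) only if the parameters or the function code differ from the cached run.

    Returns (result, cache_hit).
    """
    # Hash the sorted parameters together with the code fingerprint.
    hasher = hashlib.sha256()
    hasher.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    hasher.update(code_fingerprint(func))
    digest = hasher.hexdigest()

    cache_file = Path(cache_dir) / f"{func.__name__}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            saved = pickle.load(fh)
        if saved["digest"] == digest:
            return saved["result"], True  # nothing changed: reuse the stored result

    # Inputs or code changed (or no cache yet): run and store the result with its hash.
    result = func(**params)
    with cache_file.open("wb") as fh:
        pickle.dump({"digest": digest, "result": result}, fh)
    return result, False


def simulate(n, seed):
    """Toy simulation: mean of n standard normal draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += rng.gauss(0.0, 1.0)
    return total / n


cache_dir = tempfile.mkdtemp()
first, hit1 = run_if_changed(simulate, {"n": 1000, "seed": 1}, cache_dir)
second, hit2 = run_if_changed(simulate, {"n": 1000, "seed": 1}, cache_dir)
third, hit3 = run_if_changed(simulate, {"n": 2000, "seed": 1}, cache_dir)
print(hit1, hit2, hit3)  # False True False
```

The second call reuses the stored result because neither the parameters nor the function changed. The third call re-runs because `n` differs. Only the most recent run per function is kept in this sketch, so alternating between two parameter sets would re-run each time.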

RStudio addins and open discussion

26 February, 2021 Josh DeClercq

RStudio addins are extensions that can both simplify and enhance the user's ability to write R code. They are executable from within RStudio and are accessible just like any other R package. They can provide a wide range of useful functions, including interactive plotting using ggplot, help with regular expressions, or styling cluttered code. I will provide a brief overview of a few of these features.

In addition, I encourage people to participate in an open discussion regarding the statistical computing series. I am curious to get input on topics that people may be interested in, ideas for presentations or presenters, or any general ideas that can help improve the series. Any feedback or insight is welcome.

R code

Repello: Reports from Trello in R

22 January, 2021 Andrew Guide

Trello is an application for project management that has been utilized in a wide variety of settings. R is statistical computing software widely used to process data and generate reports. In this presentation, I will introduce the Repello package for R, which reads in data via the Trello API in order to generate a customizable progress report of all projects in a Trello board. I will also show how one can set up a cron job in Linux so that the progress report is automatically generated on a repeating schedule.

In addition, I will show how I use this tool within the nephrology-epidemiology-biostatistics collaboration, which is a large research group with multiple, simultaneous ongoing research projects. The goal in creating this tool was to keep all the investigators informed about the progress of individual projects in the collaboration. I will speak about why I think keeping the entire team in the loop about all the projects is helpful.

Presentation slides

Winter/Spring 2020 Schedule

The "Harrellverse"

24 January, 2020 Frank Harrell

Frank will describe the systems he has in place for managing communications, blog, operating system and R package updates, news feeds, file sharing, synchronizing computers, and other aspects of computer life.

Fall 2019 Schedule

Creating and Distributing R Packages: A Case Study

25 October, 2019 Jeff Jetton

The thousands of available add-on packages are one of the reasons R has been so widely adopted. This discussion will cover how (and why) R users might develop their own packages and, as an example, trace the journey of the greenclust package from its initial creation through to its recent submission to CRAN.

Intro to Bayesian Regression Modeling in R using rstanarm

27 September, 2019 Nathan James

Presentation slides

API Construction for Scalable Delivery of Model Predictions using R and the Plumber package

23 August, 2019 Shawn Garbett

Spring 2019 Schedule

Tangram: Tools for Reproducible Tables

29 March, 2019 Shawn Garbett

Introduction to Docker

26 April, 2019 Nick Strayer

Fall 2018 Schedule

Getting Started with Bayesian Modeling in PyMC3

24 August, 2018 Chris Fonnesbeck

Improving organization and collaboration with Trello and integration with R

28 September, 2018 Molly Olson

Data Cleaning with dataMaid

26 October, 2018 Molly Olson and Omair Khan

Data screening is an important first step of any statistical analysis. dataMaid autogenerates a customizable data report with a thorough summary of the checks and the results that a human can use to identify possible errors. It provides an extendable suite of tests for common potential errors in a dataset. Molly will provide an introductory tutorial for using dataMaid in your data analysis workflow. Omair will present an example of dataMaid's extendability.

bayes_reg_rstanarm.html: Intro to Bayesian analysis in R


Topic revision: r169 - 21 Sep 2022, RyanMoore


Spring 2018 Schedule

Running R Scripts in Batch on Remote Servers

Fall 2017 Schedule

Introduction to Jupyter Notebooks for Interactive and Reproducible Research!

Intermediate Version Control and Collaboration Workflows using Git and GitHub

Spring 2017 Schedule

Use R to animate travel history!

Using the R package GMD to do collaborative statistical document construction

Introduction to Variational Bayesian Methods

Gaussian Processes Made Easy

Fall 2016 Schedule

A Not-so-gentle Introduction to Git

Using RMarkdown to quickly make and maintain an attractive website

A Tour of the TensorFlow Playground

Fall 2015 Schedule

High-performance Computing with ACCRE

Effective Text Editing with TextMate

Computing on Larger-than-memory Datasets using Dask

A Primer on Regular Expressions

Spring 2016 Schedule

A Primer on Branching in Git

From Big to Lite Data with R/sqlite

Getting Started with Jupyter Notebooks

Pretty Data Visualization in R

Loop Efficiency in R

Analyzing Geospatial Data using Python

Presentations from Previous Years

Creating Interactive Visualizations with Bokeh

Enhanced Features of the Thunderbird Email Client

High Performance Computing in R Using the SNOW Package

Using the REDCap API

Computing Clinic

Introduction to String Matching and Modification in R Using Regular Expressions

12 TextMate Tips for Effective Coding

Efficiency Tips for a Basic R Loop

An Introduction to Data Wrangling with Pandas

Writing Functions in R

Really Easy Slide Presentations with Slidify and RStudio

Data manipulation with the apply functions in R, part I: apply, tapply, and lapply

A Gentle Introduction to Git and GitHub

Increasing your leisure time as a biostatistician: Using the application programming interface (API) to automate exports in REDCap

Data Science and Big Data Analytics

Recreating Minard: An introduction to base graphics in R

Who is Stan?

Introduction to the SparseM package for ordinal models

Length of the Beatles' Songs: An introduction to base graphics in R

Ten Simple Rules for Reproducible Computational Research

Using Plotly for Interactive and Collaborative Data Visualization

Evaluating and (automatically) typesetting symbolic calculus and linear algebra expressions using Sage and LaTeX

An Introduction to Graphics with D3

Creating Heatmaps in R

Plotting with ggplot2

Data manipulation with the `apply` functions in R, part I: `apply`, `tapply`, and `lapply`

Introduction to the `SparseM` package for ordinal models

Plotting with `ggplot2`

An Introduction to the `rms` Package

Plotting with `ggplot2`

Using `weaver` to cache intermediate results with Sweave

Confidence Estimates Using the `rms` Package

Speeding up statistical computations in Python using `numexpr` and `cython`

Integrating R and Markdown using `knitr` for elegant scientific documents