Programming Tips For Statisticians

Programming is a huge part of what modern statisticians do, yet few of us ever take time to study good programming practices. Some may complain, but I'm a statistician, not a programmer. Consider that while we are statisticians and not mathematicians, we spend a lot of time developing our mathematical skills because they are essential to our jobs. In this age, programming is just as essential to being a good statistician.

This page contains foundational advice every statistician should know and practice.

Key phrases: Good programming advice for statisticians. Good programming practices for statisticians. Programming Style Guide for Statisticians.

Meaningful Variable Names

Good variables names have appropriate meanings. Bad variable names have no meaning or inappropriate meanings. Compare the following three lines.

c = 703*a/b^2

BMI = 703*Weight/Height^2

BMI = 703*WeightInLbs/HeightInInches^2

Notice how the three lines decrease in ambiguity. The first line is highly ambiguous; the second is clear though leaves open the question of whether the right constant, 703, is being used; and the third is completely unambiguous. Good programming minimizes the need for a good memory. Choose a variable naming format and be consistent. What format you choose is a matter of personal preference, but read up on the style guides for nonstatistical programming languages for good advice. Programmers of nonstatistical languages have thought carefully about what makes a good name. Co-opt good style practices from other languages for your own use. Taking advantage of capitalization allows readable names like glyburideFilldate and PersistenceEndDate.

If creating variables that become part of a dataset, then a good variable name should work in many languages. As statisticians, we frequently use several programming languages on a given dataset. We may do one analysis in R and another in Stata. We use the best program we have available for a job. A good variable name will work seamlessly with all the languages we use. This means being mindful of using special characters like _ and . in variable names that may work well with one language and poorly with another.

A possible exception comes in that we are very comfortable using single letters for indexes. Consider this script comparing the estimated power of an equal variance t-test to a Wilcoxon-Mann-Whitney test. The p-values are easily updated using a simple index for the loop. nLoops = 10^6 pvalsTtest = rep( NA, Nloops ) pvalsWilcoxon = rep( NA, Nloops ) for( i in 1:nLoops ){ areaPeanutMandM = rnorm( n=60, mean=492, sd=15 ) areaPretzelMandM = rnorm( n=30, mean=500, sd=7 ) pvalsTtest[i] = t.test( areaPeanutMandM, areaPretzelMandM, var.equal=T)$p.value pvalsWilcoxon[i] = wilcox.test( areaPeanutMandM, areaPretzelMandM )$p.value } powerTtest = sum( pvalsTtest < 0.05 )/nLoops powerWilcoxon = sum( pvalsWilcoxon < 0.05 )/nLoops

Non-significant Whitespace

Put spaces (and sometimes linebreaks) between things for readability. Noonelikestoreadasentencewithoutanyspacebetweenthewords. It's still intelligible, but it takes longer to understand. Besides, storage is cheap.

Consider changing the following a<-4;b<-6;c<-a+b for(x in a:b){print(c:x)} f<-function(x,y){return(x^2+y/3)} to a <- 4 b <- 6 c <- a + b

for (x in a:b) { print(c:x) }

f <- function(x, y) { x^2 + y/3 }
and your eyes will thank you.

Indentation

Think of what punctuation and spaces did for making the written language discernible. In programming, indentation play a huge role in making code readable.

The specifics of a good scheme will vary based on personal aesthetics, but here are some basic pointers.
  • indent the body of for loops, foreach loops, and while loops
  • indent the body of conditional statements (if statements), especially if the conditional statement has several actions within the body
  • choose an indent, e.g. 2 spaces or 4 spaces, and stick with it
  • don't mix tabs and spaces, use one or the other for indentation
  • be consistent with nested indentation, i.e. if indenting with 4 spaces, the body of an if statement within a while loop will be indented 8 spaces, 4 for the while loop and 4 for the if statement.

Vertical Alignment

Long, sprawling horizontal lines of code, especially ones that contain long lists of variables, are very hard to proofread. For a very simple example of employing a vertical structure in coding, compare the following. d1 <- subset( d0, select = c( StudyID, SiteID, SiteName, SubjectID, GroupAssignment, ExaminerID, Reviewer, Reviewer.Initials, Examiner.Diagnosis, Claim.Number, Revdiagnosis, Utdscale, DateExam, ExamStartTime, ExamEndTime, ExamTimeCalculated ) )

## versus ##

d1 <- subset( d0, select = c( StudyID, SiteID, SiteName, SubjectID, GroupAssignment, ExaminerID, Reviewer, Reviewer.Initials, Examiner.Diagnosis, Claim.Number, Revdiagnosis, Utdscale, DateExam, ExamStartTime, ExamEndTime, ExamTimeCalculated ) )

The previous example aligned the 16 variables vertically to make them easier to read. For a more advanced example illustrating how vertical alignment can make comparisons between related elements much easier, see http://en.wikipedia.org/wiki/Programming_style#Vertical_alignment. Like indentation, personal preference will play a large role in how you will apply vertical alignment to your own writing.

Comments

How much to comment is up for debate. Some will say "comment copiously", while others will say "don't comment - write clear code". Consider this line of code: $line =~ s/,/\;/g; # substitute commas with semicolons everywhere in the line

To an expert Perl programmer, the comment is annoying. But to a novice Perl programmer, the comment is helpful. To help you decide how much to comment, ask yourself the following questions?
  • Who needs to be able to read my code? What's their level of experience with the language I'm using?
  • Do these comments make my code easier for me to read? ... easier to read for everyone else who needs to read this code?
  • Will I remember what I'm doing in this section of code a year from now? Will these comments help me remember?
  • Does this comment provide information that the line of code does not?
In the example above, a comment like # commas will be used only as data delimiters provides new information that helps explain why that line of code was written.

When commenting functions, consider including the following.
  1. The function's purpose (if the name doesn't make this obvious).
  2. The arguments (including variable types).
  3. The return value(s).

Code Reuse

Don't reinvent the wheel if you can help it. Often, what you need to do has been done before and is available in your statistical program by default or online as an add on. The time you save not reinventing the wheel can be spent testing and understanding the add on. More often than not, you'll break even on time and get better, more reliable results. That said, don't trust everything you find online. Some code has been thoroughly tested and debugged; some hasn't. You have to do the work to ensure you understand and trust the code. Ask the obvious questions. Is this a default function or did you pick up the code off the street? Did the code go through a review process? Does it have good documentation? Do you understand it? You may not need to understand every detail of a default function in a validated statistical package, but you do need to for an unvalidated script you grabbed off the web.

Don't Repeat Yourself (DRY)

If you find yourself copying and pasting a lot when writing code, you should think about creating a function instead. The theory behind this has to do with making your code easier to change in the future. If you have 10 copies of the same bit of code, when you want to make a change, you have to make 10 changes. And what if you miss one? Those kinds of errors can take time to track down. You can save future time by creating a function and calling it instead of copying and pasting. DRY up your code! smile

Naming Scripts

Personal preference will vary widely here. Here are some questions to ask yourself?
  • Does the name have to be short because you have to type it 50 more times, or can it be longer and contain important information?
  • Should version be part of the script name?
  • Does the script name convey the script's purpose?
  • Will you remember the name a year from now?
  • A year from now, will you remember the purpose of the script from just its name?
  • Does the script name appear in a good place in the alphabetically sorted folder where it lives? Do you want the script to appear next to the data it processes? ... next to the data or analyses it outputs? Do you want different versions of the script to alphabetize in proper order?

Documentation

Good code documentation is as important as good knowledge of a programming language. Documentation may be done in files that accompany your script or in the script itself. Good documentation should:
  • Make it easier for someone else to understand what your code is for and what it does.
  • Make it easier to reproduce your results.
  • Make you write better code by forcing you to clarify your thinking.

Exactly what makes good documentation will depend on your audience and the purpose of the documentation. Good documentation frequently contains:
  • The purpose of the code.
  • Brief logic for each major block of your code.
  • Where things are located. If you are working on a massive project with many dataset and dozens of scripts, things can be hard to find when you restarting on a project after months away from it.
  • How to reproduce your results. In complex projects, you don't go from raw uncleaned data to the final report with just one script. Several scripts are run in a sequence. Good documentation captures that so that if you had nothing but the raw data, the documentation, and your scripts, you could reproduce your final report without having to remember everything you did.
  • Good programming minimizes the need for a good memory. So does good documentation.
  • Creation and modification dates, along with why the modification was required. Although ideally, this is better part of a good version control system.

Aside: For a higher level discussion of making code truly readable, see Literate Programming and Web. These ideas led to the ideas of Noweb, Sweave, and Literate Data Analysis (Sweave, Leisch 2002 pdf).

Version Control System

There is nothing worse than making a major overhaul to a script, realizing it wasn't the right thing to do, and then having to figure out how to undo all your changes because you didn't save the old version of your script. Have a system where you routinely save versions of your script and track what changes are made in each version. In general, before making any new changes to a script, save a working version according to your version control system. Consider checking out GitHub or Subversion. Wikipedia maintains a comparison chart of numerous options.

Be Clear and Correct First; Fast and Clever Second

Before you spend time finding clever ways to make your code run fast, first make your code as clear as you can and make sure it is correct. Then you can optimize your code as time permits.

Get Bang for Your Buck

Commenting, documenting, tracking versions, all take time. In the long run, the time should pay off with better code, more readable code, more reproducible analyses, faster restart times when coming back to a project after months away from it, the ability to quickly undo bad choices, and so on. A modest time investment in developing and maintaining these practices can pay off big over time. Find the right amount of time to spend so you get bang for your buck.

Test With a Sample of Your Data

If you are working with a big data set, it can be difficult to test your code if it takes a long time to run. In this case, take a sample of you data and use that for testing.

Testing and debugging

Many programming languages have debuggers available. Debuggers are programs that will allow you to trace your code as it is run. In other words you can see what value a particular variable contains at a specific point in your code. You may also consider using print statements within your code. Here are a few places where print statements can be helpful:
  1. Entering and leaving a function
  2. The length of a vector before looping through it
  3. After processing each row within a dataset (print the record ID or a count)

Unclutter Where The Language and Your Experience Level Allow

Some languages have very strict syntax. Others like R are more flexible and provide opportunities to unclutter your code. Consider the following R code. a <- 3; b <- 5; if (a > b) { "a is greater than b"; } else if (a < b) { "a is less than b"; } else { "a and b are equal"; }

The following code is read by most people more easily and by R just as well as the code cluttered with semicolons and curly braces. a <- 3 b <- 5 if (a > b) "a is greater than b" else if (a < b) "a is less than b" else "a and b are equal"

When you can unclutter your code requires some understanding of how R reads your code. Consequently, the novice R programmer may prefer to write the first code because it requires less understanding of subtleties in R. The experienced R programmer may prefer the second code because it is easier to read. An experienced programmer may prefer a middle ground, dropping nonessential elements they find extraneous by keeping those they find clarifying, e.g. dropping the semicolons but keeping the braces. Learning where you can unclutter your code can pay off with more readable code.

Performance

If you know how many elements will be in a vector, create the vector with that size. This is much more efficient than appending values onto the end of the vector. Also see MemoryEfficientGeneration.

Here's an example: squaredBad<-function(size) {squared<-c();for(i in 1:size){squared<-append(squared, i^2)};squared} squaredGood<-function(size) {squared<-numeric(size);for(i in 1:size) {squared[i]<-i^2};squared} x<-squaredBad(100000) y<-squaredGood(100000)

It would be even more efficient if you could accomplish this without a for loop: squaredBest <- function(size) { seq(size)^2 } z <- squaredBest(100000)

Other ways to speed up code that takes a long time to run

  • In some cases pmin.int should run faster than pmin. From the help file for pmin: "pmax.int and pmin.int are faster internal versions only used when all arguments are atomic vectors and there are no classes: they drop all attributes."
  • Don't use ifelse when it's not necessary:

currentdata$delta <- ifelse(currentdata$observedt == currentdata$times, 1, 0)

Is better represented as

currentdata$delta <- current_data$observedt == currentdata$times

This can eliminate an extra computation later

dat$newtime[dat$delta == 1] dat$newtime[dat$delta == 0]

Gets replaced with

dat$newtime[dat$delta] dat$newtime[!dat$delta]

Other reasons to not use ifelse

x <- rnorm(10000) #Compare runtimes of system.time(ifelse(x > 0, 1, 0)) # with system.time(as.numeric(x > 0)) # and system.time(ifelse(x > 0, "Yes", "No")) # with system.time(c("No","Yes")[(x > 0)+1])

  • Adding variables to a data.frame is slow. Sometimes, working with multiple vectors works just as well.

Code profiling

Use the function Rprof to start profiling your code and find which functions are taking the most time.

An example: http://www.stat.berkeley.edu/~nolan/stat133/Fall05/lectures/profilingEx.html

Floating-point Arithmetic

Computers use a binary representation to store numbers, thus making it difficult to store the exact representation of decimal numbers. This can lead to cases where two numbers that should be equal are not.

0.68 == 0.68 #=> [1] TRUE which(seq(0.6, 0.7, 0.01) == 0.68) #=> numeric(0) In the R code above, the sequence (0.60, 0.61, …, 0.68, 0.69, 0.70) should have a match found at the ninth element, 0.68. Instead, the code returns no matches.

In such cases, you may want to check for "closeness".

tol <- function(x, y, delta) { (x < y + delta) & (x > y - delta) } which(tol(seq(0.6, 0.7, 0.01), 0.68, .00000001)) #=> [1] 9 This R code rephrases the question as are these two numbers within +0.00000001 of each other. This close enough question allows for the small imprecision caused by how the computer stores numbers.

Another way to do this is:

Replace a == b with abs(a - b) < .Machine$double.eps

Keep in mind that if efficiency is an issue, it may be worth looking for an alternative to this calculation, which can take lots of time.

Code Review

Get more eyes than your own on your code. Depending on the complexity of the code and project, you may want anything from a simple over the shoulder review where you walk someone through your code on the monitor to a formal walk-through where you assemble a group to review your code. Peer reviews and formal walk-throughs are a big time investment, but they have a big payoff. The errors you catch as a team alone will make this worthwhile. It is also an opportunity to get personalized feedback on your programming practices, as well as a chance for everyone to learn from each other. One caveat. Just like when reviewing someone's writing, artwork, cooking, etc., feelings can easily be bruised. It is important for all participants to be respectful towards each other. In more formal reviews, it is helpful for everyone to have a checklist of things they should check in the code and supporting documents. See the References and Useful Links section below for more information on code reviews.

One of the principles under which the Department of Biostatistics operates is that faculty and staff make themselves available for helping each other. A review of samples of your work will be done in a friendly fashion. This is also true at the R Clinic, which is an excellent opportunity to obtain feedback on small sections of your code that might represent larger sections.


References and Useful Links

Edit | Attach | Print version | History: r32 | r30 < r29 < r28 < r27 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r29 - 13 Feb 2014, JoAnnAlvarez
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback