Some Models for Incorporating Software Usage in Teaching Non-Statisticians
This is a list, in no particular order, of ways to incorporate statistical software into teaching non-statisticians who are taking an introductory or second biostatistics course. Not all of the items below are mutually exclusive.
- Do not mention software or give assignments requiring its use; focus only on statistical thinking, approaches, methods, interpretation of results, reading the literature.
- Do not make students responsible for learning statistical software other than interpreting its output. Have a biostatistics graduate student, MS biostatistician, or a computer systems analyst skilled in statistical computing be on call during the course. This person will receive analysis specifications from students, and use suitable software for obtaining the student's desired results. The person will not assist the student in selection of the statistical method to use; this is what the student is engaged in active learning about. The student will be responsible for understanding the output and drawing conclusions from it.
- Allow students to get a maximum grade of B if they do not do assignments or projects requiring the use of software. Students who self-select to do assignments or projects using statistical software (typically with their own data) can earn a maximum grade of A+. Students selecting the second option (A+ max) would either use software that is supported by the instructor or select any software they wish, knowing that some packages will receive less handholding than others. This is essentially the model that Tom Stewart used at UNC. This approach would be especially appropriate if there is an identified subset of students who will not go into research. These students would likely select the "no computing" option.
- Use software throughout the course, letting students choose whatever package they wish to use.
- Use software throughout the course, where a single package is chosen by the degree program and instructor.
- Use a mixture of statistical software for Biostatistics I and II that is based on the needs of the physician-scientist students. Use course evaluation feedback to be evidence-based about this mixture. For Biostatistics II, have each student bring their research project data set to one class and have the instructor teach practical data analysis by showing how to analyze that particular data set. The lessons would cover sample size, data screening, missing data, univariate analysis, regression models, and graphics. In addition to the current students, the instructor could invite students from previous years to bring a data set/paper to a class for a demonstration of the various statistical techniques used. Add a new course, Biostatistics III, that is an advanced regression modeling course taught with only R. About 15% of the students from various programs will self-select to take this course. In this way the students who want to learn R will learn it, and those who do not will not be forced to learn it.
- Have a biostatistics graduate student, MS biostatistician, or a computer systems analyst skilled in statistical computing conduct a laboratory, offer office hours, or participate in class so that students receive assistance as needed with computing difficulties. The person will not assist the student in selection of the statistical method to use; this is what the student is engaged in active learning about.
- Clearly define the purpose of teaching these courses. Is it to empower physician-scientists to perform some data analysis throughout their careers so that they can be successful at writing papers and grants? If so, we need to develop metrics to track, and perform experiments to test, the best ways to accomplish this goal. For example, we might conduct a REDCap survey of those who have graduated from these programs in the past 15 years and get their feedback on what we did right and what we need to improve on. We also need to conduct a REDCap survey of their mentors, since these are the paying customers.
For approach #5 (a single package chosen by the degree program and instructor), there are three primary choices of statistical computing software: SPSS, Stata, and R. The commercial systems SPSS and Stata have very expensive individual licenses, which becomes a problem if the student moves to another institution that does not have a deluxe license (or if a Stata user wants to upgrade Stata in the future). R will be perpetually free to all students, and has the slight advantage of making it a bit easier to work with most statisticians (R is the system most used by academic statisticians, Stata is the second most used, and SPSS is very seldom used by them). For SPSS, Stata, or R there are two primary modes of use: menu-driven data manipulation and analysis, and command scripting. For R, an SPSS-like menu system is available using the R Commander package. See here and here for examples of the many video tutorials available. Some of the many video tutorials for learning the R language in the context of the RStudio graphical user interface may be found here and here. There are also many online videos for Stata and SPSS. The analysis template approach to learning the R language is described in Chapter 1 of BBR.
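To make the contrast between menu-driven use and command scripting concrete, the following is a minimal sketch of what a self-contained analysis script looks like. It is written in Python with only the standard library purely for illustration; the same pattern applies to R, Stata, or SPSS command scripts. The data values, group names, and function names are hypothetical and do not come from any real study or from BBR.

```python
import statistics

# Hypothetical illustrative data (made up, not from a real study):
# systolic blood pressure (mmHg) in two invented treatment groups.
DATA = {
    "control": [142, 138, 150, 145, 139, 148],
    "treated": [131, 136, 129, 140, 133, 127],
}


def summarize(values):
    """Return n, mean, and sample standard deviation for one group."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


def run_analysis(data):
    """Carry out every step from raw data to the reported numbers."""
    summaries = {group: summarize(values) for group, values in data.items()}
    difference = summaries["treated"]["mean"] - summaries["control"]["mean"]
    return summaries, difference


if __name__ == "__main__":
    summaries, difference = run_analysis(DATA)
    for group, s in summaries.items():
        print(f"{group}: n={s['n']} mean={s['mean']:.1f} sd={s['sd']:.1f}")
    print(f"difference in means (treated - control): {difference:.1f}")
```

Because every step is recorded in the script, rerunning it on the same data reproduces the same numbers, which is harder to guarantee when the analysis exists only as a sequence of menu clicks.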
Draft Software Survey
In teaching biostatistics to non-statisticians there is a diversity of opinion on whether to concentrate solely on concepts vs. spending significant time on the "how to" of statistical analysis. For students seeking "how to" there is significant diversity of opinion about the extent to which software should be incorporated into introductory courses, and if so, which software. There are two classes of software: proprietary commercial (e.g., Stata, SPSS, SAS) and free open-source software (e.g., R, Python, Julia). Commercial software has a unified menu system for new users and a more hidden command-line structure. Open-source software has a command console, an integrated environment (e.g., RStudio), and moderately complete menu front-ends (e.g., R Commander, which provides the most commonly used SPSS menus as a front-end manager for R; see http://socserv.mcmaster.ca/jfox/Misc/Rcmdr). R is the most popular statistical software among statisticians and has been extended by >5000 user-contributed packages to cover specialty areas (e.g., flow cytometry, image analysis, genomics, REDCap interfaces). SPSS and Stata both have mature menu front-ends for statistical analysis and graphics. A summary of advantages (+) and disadvantages (-) for each software class is found below.
| Class | +/- | Description |
|---|---|---|
| Proprietary | + | More comprehensive interactive menu system |
| | + | More unified documentation |
| | + | Quicker time to first analyses |
| | - | Does not instill reproducible research practices by having researchers create self-contained analysis scripts |
| | - | Cost may be prohibitive if student leaves for an institution not having a site license for the commercial software package they learned |
| | - | Due to splintering of the commercial market and cost of software, does not facilitate intra-campus learning of computational methods |
| Open Source | + | Always free wherever you go |
| | + | Full implementation of scripted reproducible research workflow |
| | + | Allows for better communication with statisticians |
| | + | Facilitates learning from a wide user base and collaboration with researchers at other institutions that do not make commercial software affordable |
| | + | Newer analytical methods, including those for specialized research such as flow cytometry and genomics, are available more rapidly |
| | + | Superior graphics |
| | - | Less comprehensive menus |
| | - | Documentation is more scattered |
| | - | Longer learning curve |
The purpose of this survey is to elicit your opinions about whether, how, and which software should be incorporated into biostatistics courses for non-biostatisticians.
- How do you think that statistical software should be incorporated into coursework?
  - Do not mention software or give assignments requiring its use; focus mainly on statistical thinking and interpretation of statistical analysis.
  - Incorporate exercises/assignments that require the use of a statistics software package to do the computations and graphics.
- If usage of software is required in a course, select by whom it should be used:
  - Have a programmer or staff biostatistician be on call during the course to receive analysis specifications from students and use suitable software for obtaining a student's desired result, requiring the student to be solely responsible for selection of the statistical analysis method to use. The student must interpret the results and draw conclusions from them.
  - Each student is responsible for figuring out how to run the analyses they desire using software, in addition to interpreting results and drawing conclusions.
- What is your opinion of instead having a two-tiered system within the course that recognizes that some students have an easier time learning software tools than others? Select from the choices below.
  - Students can self-select at the start of the course to receive a maximum grade of B and not be responsible for doing computations/analyses requiring the use of software. Such students may have additional literature and interpretation assignments. Other students can receive a grade up to A+ by using software that is supported by the instructor or by selecting any software they wish, knowing that users of some packages will get less handholding than others. These students will be expected to complete a project (usually using their own data) that requires moderate computational tasks to be completed.
  - Fashion the course so that one size fits all.
- What is your recommendation for choosing software if you feel that software should be a necessity for the course?
  - Each student has the freedom to choose her own software package (but not Excel) and is responsible for finding help.
  - Every student is required to use the same software package as chosen by the program and course director.
- What is your recommendation for the particular software choice?
  - SPSS
  - R
  - Stata
- If you have general ideas to share please write them here: . . .
Thank you for your time.