Here was the question:
Do you know of any good stats software packages that our department can purchase to use for some of our research projects? We are looking for something that is fairly straightforward/easy to use, and provides tools to perform survival analysis in addition to other basic analysis sets. We also would like this software to make publication quality tables/graphs. If such a creature exists, I figured you might know what it is. Can you help?
Here was Rafe's response:
You know, I was just thinking of writing you a letter. Our department is in need of some simple orthopedic tools, you know, fancy hammers, drills, titanium pins, bone saws, that sort of thing, for some of our projects here, especially for some of our more athletic members who tend to keep tearing menisci and ACLs and stuff like that. Do you know where we could get some of those tools? Maybe as a kit, with all of them matching? We would also like to be able to open and close wounds, so we might need scalpels and sutures and stuff, to make good quality incisions and close them up all neat and pretty. Oh, and those neat little scope things would be great too. If such a tool kit exists, I figured you might be able to help me find one. Can you help?

No, not really, but I hope you understand the point.
There are lots of great stat software packages (SAS, SPSS, R, S-Plus, STATA, etc.). R is even free. They all do survival analysis. Most of them can be used to spin the world backwards in the hands of a talented statistician/programmer, but therein lies the rub. Even if you gave me all the tools that you guys have, I wouldn't be able to do a decent job at what you do. I might be able to tie one of those fancy knots in a suture after a couple tries, but I certainly wouldn't be able to do what I would need to do.
About the software products, though:
I believe none of them is easy to use for someone who is not completely sure of what he or she is doing with an analysis. In that sense, all of them can be brutally difficult to use for someone who isn't a statistician.
Rafe
Based on the news article, the analysis corrected for age, smoking, and other potential risk factors for coronary heart disease. In such an analysis, the results for the coffee-drinking variables reflect their additional effects after the other factors have been taken into account. The results show that there are no additional effects in the data, NOT that there are no effects. Real effects may still exist, but since coffee drinking is associated with smoking, some of them could be explained away by the smoking variable.
The estimates of risk ratios are all around one (i.e. no difference from the baseline, which presumably is no coffee drinking at all), with 95% CIs including one. This means we can't tell if they differ from one. But the news article picked the values that happened to have point estimates smaller than one, ignoring the precision associated with the estimates, and claimed: "In fact, men and women who drank six or more cups of coffee a day for up to 20 years had a slightly lower relative risk". This statement will mislead people into thinking "if I want to drink coffee, then drink a lot". There was only one 95% CI that didn't include one (and only slightly off from one), but given that several CIs were reported, it is quite possible that by chance alone one of them fails to cover the true value.
The same principle applies to other model-comparison methods. For example, when using cross-validation to compare models, we estimate prediction performance of the models. When missing data is present, models may have been evaluated based on different subsets of data, and more complex models tend to be based on fewer subjects.
Suppose we fit a regression with a categorical variable h with four categories {h1, h2, h3, h4}, a continuous variable x, and their interaction effects. As we explained in the blog item Significance of a category in a categorical variable, we effectively fit a model allowing the four categories to have their own intercepts and slopes. The main effects are reflected in the intercepts and the interaction effects are reflected in the slopes. Again, an intercept and a slope clearly are not comparable.
Suppose we fit a regression with two categorical variables and their full interaction terms. Again, the coefficients of main and interaction effects are not comparable. Let me explain it in the simplest case: both input variables x1 and x2 are binary and are coded as 0 and 1. The model β0 + β1I{x1=1} + β2I{x2=1} + β3I{x1=1}I{x2=1} is equivalent to specifying the four cell means:

x1=0, x2=0: β0
x1=1, x2=0: β0 + β1
x1=0, x2=1: β0 + β2
x1=1, x2=1: β0 + β1 + β2 + β3

The main-effect coefficients are differences between cell means, while β3 is the extra, non-additive adjustment for a single cell; the two kinds of coefficients do not play comparable roles.

Suppose we study the effects of two variables on an outcome. Variable h is categorical with four categories {h1, h2, h3, h4} and variable x is binary (coded as 0 and 1). An investigator may be interested in a specific category, say h1, and wonder whether the effect of h1 is the same for any value of x; in other words, whether there is an interaction between h1 and x. If you carry out a regression analysis, you might include the interaction terms between x and all the categories of h and look at the z-test (or t-test) corresponding to the interaction term between h1 and x. In this situation, what you are doing effectively is this: Given that the effects of the combinations of x=1 and all categories but h1 are allowed the flexibility to be non-additive (in the scale of the right-hand side), I am testing whether allowing such flexibility for the combination of x=1 and h1 will significantly improve the model fit.

Alternatively, if you include only the interaction term between x=1 and category h1 in your regression analysis and look at its corresponding z-test (or t-test), what you are doing effectively is this: Given that the effects of the combinations of x=1 and all categories of h are constrained to be additive, I am testing whether allowing the effect of only the combination of x=1 and h1 to be non-additive will significantly improve the model fit.

Some other ways of quantifying the original question exist, although they may not be as justifiable as the two above. In fact, there is no single optimal way of quantifying the original question.
In this situation, it is prudent to carry out analyses under the different potential quantifications and see if they lead to the same conclusion; in other words, carry out sensitivity analyses to see whether the results are sensitive to how you quantify the question.
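Returning to the binary-by-binary model: the interaction model is just a reparameterization of the four cell means, which can be checked numerically. Here is a sketch in Python (the simulated data, the seed, and the coefficient values 1.0, 0.5, 2.0, 1.5 are all invented for illustration):

```python
import random

# Simulate a 2x2 design: outcome depends on binary x1, x2 and their interaction.
rng = random.Random(1)
data = []
for _ in range(400):
    x1, x2 = rng.randint(0, 1), rng.randint(0, 1)
    y = 1.0 + 0.5 * x1 + 2.0 * x2 + 1.5 * x1 * x2 + rng.gauss(0, 0.3)
    data.append((x1, x2, y))

def cell_mean(a, b):
    vals = [y for x1, x2, y in data if (x1, x2) == (a, b)]
    return sum(vals) / len(vals)

m00, m10, m01, m11 = cell_mean(0, 0), cell_mean(1, 0), cell_mean(0, 1), cell_mean(1, 1)

# For a saturated model, least squares reproduces the cell means exactly, so the
# coefficient estimates are just differences of cell means:
b0 = m00                          # intercept: the (0,0) cell
b1 = m10 - m00                    # main effect of x1 (within x2 = 0)
b2 = m01 - m00                    # main effect of x2 (within x1 = 0)
b3 = (m11 - m01) - (m10 - m00)    # extra, non-additive adjustment for the (1,1) cell
```

The interaction estimate b3 is a difference of differences, while b1 and b2 are single differences; they live on different footings, which is why comparing their magnitudes directly is not meaningful.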
An investigator once asked if calculations could be done to show the power of an approach to detect an interaction effect between two variables. When asked to quantify the target magnitude of the interaction effect, he said this: We want to see the power to detect interaction when the interaction explains half of the main effect. This goal sounds noble but it is hard to quantify. It is quantifiable in some artificial ways, but I doubt it is quantifiable in any really meaningful way, because the coefficients for main and interaction effects are not comparable (see my other blog item).
Signed rank test: When we have one sample, such as the differences from a paired data set, we rank the absolute values of all the numbers from the lowest to the highest. [The textbook we use excludes zero from the ranking while some software, like Stata, includes zero; they will lead to different statistics and different reference distributions under the null, but they should lead to the same p-values.] Then we sum the ranks of the negative values and the ranks of the positive values, and pick the smaller of the two sums as the statistic. The null hypothesis is that the median of the underlying distribution is zero and the distribution is symmetric about the median (the second part often is missed in most textbooks). If the null is true, then the two sets of ranks in the positive and negative groups of a random sample should behave as if both the ranks and the signs had been randomly assigned, with the group sizes varying as a result (although the sum of the group sizes is fixed). For each random assignment of ranks and signs, we can calculate the sums of ranks for the resulting two groups and pick the smaller sum. After enumerating all possible assignments, we have a reference distribution, and the statistic of the real data can then be compared with this reference distribution to obtain a p-value (the proportion of values in the reference distribution that are smaller than or equal to the real statistic). Since we always pick the smaller sum regardless of whether it comes from the positive or the negative group, the p-value is two-sided.
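The procedure just described can be sketched in Python. The eight paired differences below are hypothetical (with no zeros and no ties, to keep the ranking simple), chosen small enough that all 2^8 sign assignments can be enumerated exactly:

```python
from itertools import product

# Hypothetical paired differences (small sample so we can enumerate exactly).
diffs = [1.2, -0.4, 2.1, 0.7, -0.3, 1.5, 0.9, 1.8]

# Rank the absolute values; no ties here, so the ranks are simply 1..n.
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for r, i in enumerate(order):
    ranks[i] = r + 1

def statistic(signs, ranks):
    # The smaller of the two rank sums (positive group vs. negative group).
    pos = sum(r for s, r in zip(signs, ranks) if s > 0)
    neg = sum(r for s, r in zip(signs, ranks) if s < 0)
    return min(pos, neg)

observed = statistic([1 if d > 0 else -1 for d in diffs], ranks)

# Under the null (symmetry about zero), each sign is +/- with probability 1/2:
# enumerate all 2^n sign assignments to build the reference distribution.
ref = [statistic(signs, ranks) for signs in product([1, -1], repeat=len(diffs))]
p_value = sum(stat <= observed for stat in ref) / len(ref)
```

Because the statistic is the smaller sum regardless of sign, this p-value is two-sided, matching the description above.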
Note that the symmetry part of the hypothesis is important and is the basis of our treating the ranks equally regardless of the signs. When we think the underlying distribution is far from symmetric, neither the t-test nor the signed rank test is suitable for detecting whether the central location of the distribution differs from zero. Thus, the choice of the signed rank test over the t-test is mainly driven by concern about outliers in the data.
Rank sum test: When we have two samples to compare (without pairing in the design), we rank all the values (not absolute values) from the lowest to the highest, then sum the ranks within each sample, and pick the sum for the smaller group. The null hypothesis is that the two distributions underlying the two samples are the same. If the null is true, then the two sets of ranks should behave as if the ranks had been randomly assigned. For each random assignment of ranks, we can calculate the sums of ranks for the two groups and pick the sum for the smaller group (always the same group). After enumerating all possible assignments, we have a reference distribution, and the statistic of the real data can then be compared with this reference distribution to obtain a p-value. A one-sided p-value is the proportion of values in the reference distribution that are equal to or more extreme (i.e. farther from the center) than the real statistic; a two-sided p-value is twice the one-sided p-value. Unless there is a strong a priori preference, the two-sided p-value should be reported.
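Here is a matching sketch in Python with two hypothetical samples. The samples are small enough that, instead of random permutations, all C(7, 3) assignments of ranks to the smaller group can be enumerated:

```python
from itertools import combinations

# Two hypothetical samples, no pairing; group a is the smaller group.
a = [3.1, 4.5, 2.8]
b = [5.0, 6.2, 4.9, 7.1]

pooled = sorted(a + b)
rank = {v: i + 1 for i, v in enumerate(pooled)}  # no ties in this toy data
observed = sum(rank[v] for v in a)               # rank sum of the smaller group

n, k = len(pooled), len(a)
# Under the null (identical distributions), every subset of k ranks is equally
# likely to belong to the smaller group: enumerate all C(n, k) assignments.
ref = [sum(s) for s in combinations(range(1, n + 1), k)]

center = k * (n + 1) / 2   # mean of the reference distribution
if observed <= center:
    one_sided = sum(s <= observed for s in ref) / len(ref)
else:
    one_sided = sum(s >= observed for s in ref) / len(ref)
two_sided = min(1.0, 2 * one_sided)
```

With larger samples, exact enumeration becomes infeasible and one would sample random assignments instead; the logic stays the same.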
Note that there is no symmetry requirement in the null hypothesis. Thus, the choice of the rank sum test over the t-test may be driven by concern about either outliers in the data or an asymmetric underlying distribution.
The textbook describes these tests as comparing medians, but now you can see there is little about the median in either test. In the signed rank test, the median and the mean are the same because of the symmetry requirement on the distribution. The rank sum test is about whether the two distributions underlying the two groups of observations are the same or not.
The repair shop wants to know if the new procedure worked. One way of looking at the data is that there is a change of the fraction of cars with good engines from 500/900 to 580/900. Another way of looking at the data is a 2x2 table with initial engine status (good/problem) as one factor and engine repair status (yes/no) as the other factor; the four cell counts will then be 0, 500, 80, and 320. However, if you naively jump at an analysis by carrying out a test of independence, you are wrong. The repair shop's question is not answerable based on the data, because the data are collected in a way that will always show improvement. Moreover, when two variables are dependent on each other by design, a test of independence doesn't provide any information, whether the result is significant or not. In fact, the repair shop's question is even ambiguous, with the comparison groups not clearly defined; once the comparison group is well defined, the need for collecting more data will be apparent.
Here is another example. An investigator was relatively savvy at statistics; she could run some analyses herself. Once she asked me to help analyze her data, which had an ordinal outcome variable (like disease stages and many measures in psychiatric diseases). She would do ANOVA, which treated the outcome as if it were continuous and thus was inappropriate. I told her the proportional odds model might be suitable, and she wondered if she could fit it in her favorite software. Later, she e-mailed me the computer output she generated while trying out her software and asked if it was correct. Even if the output appeared to be okay, I hesitated to say yes, because I wasn't sure she understood what the model did and what assumptions it made. She might have viewed statistics as a field that lives by generating results.
This view of statistics is quite common. Some investigators want to send their post-docs to workshops to learn data analysis techniques, with the hope that the post-docs will replace the statisticians, who often have higher salaries. Well, the newly trained post-docs will replace the statisticians when it comes to generating computer output, but probably not when it comes to correctly formulating the problems and correctly interpreting the output. [Another advantage of having post-docs do the work is that they are more obedient, unlike statisticians, who tend to hold back and not endorse the investigator's new discoveries through analysis, a.k.a. data-dredging.]
Sometimes, it can. Interactions among the input variables may be a reason for such a phenomenon. But this phenomenon also can be observed when the input variables are independent of each other and there is no interaction effect between the input variables. A more fundamental reason is precision.
Suppose the outcome is really determined by two independent input variables x and y, linearly and without interactions. If you put both variables into the analysis, great. If you put only x into the analysis, the signal due to y has to be shouldered by both x and the error term. If x and y are not correlated, that signal will mainly be attributed to the error term. This makes the variance estimate larger and leads to higher variation of the coefficient estimate. In some situations, the coefficient estimates from individual simple regressions may both swing towards zero, appearing to give weaker effects.
In Stata, you can run the following code a few times and observe the coefficient estimates and the associated standard error estimates.

clear
set obs 20
gen x1 = invnormal(uniform())
gen x2 = invnormal(uniform())
gen z = x1 + x2 + 0.1*invnormal(uniform())
regress z x1 x2
regress z x1
regress z x2
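For readers without Stata, here is an equivalent sketch in Python (pure standard library, so least squares is solved by hand via the normal equations; the seed and sample size are arbitrary choices):

```python
import random

# Mimic the Stata simulation: z = x1 + x2 + small noise, n = 20.
rng = random.Random(7)
n = 20
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
z = [a + b + 0.1 * rng.gauss(0, 1) for a, b in zip(x1, x2)]

def ols(X, y):
    """Least squares via the normal equations; X rows include an intercept column."""
    m, p = len(y), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(m)) for k in range(p)] for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(m)) for j in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv], c[col], c[piv] = A[piv], A[col], c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (c[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return beta

def rss(X, y, beta):
    return sum((yi - sum(b * xj for b, xj in zip(beta, xi))) ** 2 for xi, yi in zip(X, y))

X_full = [[1.0, a, b] for a, b in zip(x1, x2)]
X_x1 = [[1.0, a] for a in x1]
beta_full = ols(X_full, z)  # both slope estimates land near 1
beta_x1 = ols(X_x1, z)      # with x2 dropped, its signal is pushed into the error term
```

Dropping x2 can never reduce the residual sum of squares, so the reduced model's variance estimate is inflated, which is exactly the mechanism described above.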
(1) The significance level is the false positive rate we are willing to tolerate. When all the other factors are fixed, the higher the significance level, the higher the power. But that higher power is achieved by sacrificing the false positive rate; thus, in power calculations, we often fix the significance level. (2) Obviously, when all the other factors are fixed, the larger the sample size, the more information we have and the higher the power. (3) In a t-test, the target effect is the presumed difference in the means; in a regression, the target effect may be the absolute value of the coefficient for an input variable; in a case-control study, the target effect may be the odds ratio or the difference in the proportions of exposure between the case and control groups. When all the other factors are fixed, the larger the target effect, the easier it is for a method to detect, and thus the higher the power. (4) The variance of the outcome can be viewed as the noise level. In a t-test, it is the within-group sample variance; in a regression, it is the true residual variance. The lower the noise level, the easier it is for a method to detect the signal, and thus the higher the power. Often the variance can be estimated from past results or a pilot study. If not, we have to speculate or consider multiple potential values for the variance. Sometimes there is no need to specify the variance because it is determined by the underlying unknown parameters; for example, in binomial sampling, the variance is determined by the underlying success probability. (5) An example of the type of the target effect is the choice among dominant, additive, and recessive effects of an allele in genetic studies. An example of the baseline of the target effect is the allele frequency in the control group in genetic association studies.
A full statement about the power of a method should be: When the variance and the other relevant factors are given in aa, the power of the method to detect an effect of bb at sample size cc is dd at significance level ee. The power is a function of multiple factors: dd = f(aa, bb, cc, ee). For example: When the effect of the risk allele in a biallelic polymorphism is dominant (aa) and the risk allele frequency is 0.2 (aa again), the power of the method to detect an odds ratio of 1.5 (bb) with 200 cases and 200 controls (cc) is 0.85 (dd) at significance level 0.05 (ee).

Many people think the power is a function of solely the sample size. This is true only when all the other factors, especially the magnitude of the target effect, are fixed. Power also is about a method. Some people even forget this and routinely ask, "What is the power of the study?" I don't know what the question means unless the study employs a single data analysis method and all the other factors are fixed. This comes back to a common theme: all concepts have context and should be talked about with their context. Power also has context.
In sample size calculation, we want to determine the sample size necessary to reach a certain level of power. As described above, the power is a function of multiple factors: dd = f(aa, bb, cc, ee). In sample size calculation, we want to find, given the other factors, the sample size cc such that the resulting power achieves some desired level. Only in very simple situations can we derive a closed-form formula (as in EMS Table 35.1) for the sample size as a function of the other factors: cc = g(aa, bb, dd, ee). The formula may be exact or approximate. In more complicated situations, we have to use simulations to carry out the task.

There is another BIG assumption when talking about power or sample size: We often assume the sample is homogeneous, or that the admixture structure of the sample is fully specified and taken into account in the power calculation. In reality, this turns out to be a big assumption and is almost always wrong. As a result, sample size calculations almost always over-simplify the reality and lead to under-powered studies.
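As an illustration of estimating power by simulation, here is a sketch for a two-sample z-test with known variance (all the numbers are hypothetical; a real calculation would use the test actually planned for the analysis):

```python
import math
import random

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def simulated_power(delta, n_per_group, sigma=1.0, alpha=0.05, reps=1000, seed=0):
    """Estimate power of a two-sample z-test (sigma known) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        m0 = sum(rng.gauss(0.0, sigma) for _ in range(n_per_group)) / n_per_group
        m1 = sum(rng.gauss(delta, sigma) for _ in range(n_per_group)) / n_per_group
        z = (m1 - m0) / (sigma * math.sqrt(2.0 / n_per_group))
        p = 2 * (1 - normal_cdf(abs(z)))
        hits += p < alpha          # count rejections at level alpha (ee)
    return hits / reps

# Same sample size (cc) and noise level (aa), different target effects (bb):
power_big_effect = simulated_power(delta=0.5, n_per_group=100)
power_small_effect = simulated_power(delta=0.1, n_per_group=100)
```

The two calls differ only in the target effect, and the estimated powers differ dramatically, which is the point of the "dd = f(aa, bb, cc, ee)" notation above.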
All data analysis methods have assumptions. This is true even when you calculate a simple average, because by adding up numbers you are assuming the numbers are comparable.
We can evaluate a model structure by estimating its prediction performance. In a ten-fold cross-validation, we randomly divide the dataset into 10 exclusive parts with equal size. For each part, we leave it out and fit the model using the other nine parts, and then use the fitted model to predict the outcome for the observations in the part we have left out. We then compare the predicted and the observed outcomes and summarize the prediction performance (e.g. mean squared errors for continuous outcomes and misclassification rate for binary outcomes). The model structure with the best performance will then be selected as the best possible model and will be used to fit all the data to obtain a final model fit.
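The ten-fold procedure can be sketched in Python for a simple linear regression (hypothetical data; in this toy example the model structure is correct, so the cross-validated mean squared error should land near the true noise variance of 0.25):

```python
import random

# Hypothetical data: one predictor with a linear effect plus noise (sd 0.5).
rng = random.Random(3)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]
data = list(zip(xs, ys))

def fit_simple(train):
    # Closed-form simple linear regression: returns (intercept, slope).
    tx = [x for x, _ in train]
    ty = [y for _, y in train]
    mx, my = sum(tx) / len(tx), sum(ty) / len(ty)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in tx)
    return my - slope * mx, slope

rng.shuffle(data)
folds = [data[i::10] for i in range(10)]   # ten (nearly) equal parts

squared_errors = []
for k in range(10):
    # Fit on the other nine parts, predict the held-out part.
    train = [pt for j, fold in enumerate(folds) if j != k for pt in fold]
    intercept, slope = fit_simple(train)
    squared_errors += [(y - (intercept + slope * x)) ** 2 for x, y in folds[k]]

cv_mse = sum(squared_errors) / len(squared_errors)   # estimated prediction error
```

Each observation is predicted exactly once, by a model that never saw it during fitting; that is what makes the error estimate honest for a single, pre-specified model structure.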
Some people may have the perception that once cross-validation is built into a model selection procedure, over-fitting is no longer a problem. This is wrong, and here is why. Suppose a new outcome variable has been generated randomly, with no connection to any of the other variables. For each model structure, we still can use cross-validation to evaluate its performance in predicting this newly generated outcome. The measure of performance is a statistic and has its own variation. Thus, by chance alone, some model structure will appear to have good prediction performance. If the set of variable combinations is large, some variable combination will appear to have great performance, again by chance alone. If you claim it as your winner, you are over-fitting the data, because the outcome had nothing to do with any of the other variables.
I was involved in a debate on this, and one investigator even stated that sometimes he saw some tendency of correlation between the two observations from an animal, but he would do a statistical test to see if the correlation was significantly different from zero, and if not, he would treat the observations as if they were independent. This is dangerous. If you suspect there is correlation, take it into account in your analysis. The correlation may not be strong enough, or the sample size may not be big enough, to yield a statistically significant correlation. But absence of evidence is not evidence of absence. If correlation exists but is not treated accordingly, the analysis often gives you inflated confidence in the results, because it tends to yield a smaller variance estimate than reality and thus leads to a higher false positive rate than what you expected to tolerate.
The same principle applies to other organs in the body. Correlated data should be treated differently from what we have covered in an introductory course. Seek help from statisticians if you have correlated data.
One drawback of treating variables as continuous is the linearity assumption, which may be too strong in some situations. One solution is to use restricted cubic splines, which are much more flexible than assuming just four or five flat levels for the variable.
Similarly, tables are useful tools to display data and results. But the need for using tables to display data and analysis results shouldn't dictate that our analysis has to be based on tabulated numbers (i.e. categorized variables) instead of the original, more informative observations.
Don't strip results of their context when trying to interpret them. Context includes structural and distributional assumptions (e.g. linearity, the logit link, normality, exchangeability, etc.), the units of variables, the ranges of variables, transformations, the other variables included in the analysis, how the data were collected, etc.
This dependence on a variables unit is not limited to logistic regressions. It is everywhere.
You can be fooled if you are not careful. Suppose you carry out simulations to evaluate the performance of a method to detect genetic, environmental, and interaction effects, and you choose BMI and a genetic marker as input variables (BMI ranging from 20 to 35 and genotype coded as 0, 1, 2). To generate simulated data, you need to assign effects to the input variables. You may set the coefficients of the two variables to the same number, say 1.1, and think you have assigned equal effects to the variables. Well, yes and no. The effects corresponding to the units of the variables appear to be equal. But since BMI has a numerically much larger range than genotype does, you have simulated a much stronger effect for BMI than for genotype, and your method probably will appear more powerful in detecting the BMI effect than the genetic effect.
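A quick back-of-the-envelope check of the example above (the coefficient 1.1 and the variable ranges come from the text; the "effect span" is just the coefficient times the range of the variable):

```python
# "Equal" coefficients do not mean equal effects when variable ranges differ.
coef = 1.1
bmi_span = 35 - 20       # BMI ranges over 15 units
genotype_span = 2 - 0    # genotype coded 0, 1, 2 ranges over 2 units

bmi_effect = coef * bmi_span            # outcome change across the full BMI range
genotype_effect = coef * genotype_span  # outcome change across the genotype range
# 16.5 vs. 2.2: the simulated BMI effect is about 7.5 times as strong.
```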
A small number doesn't mean a small effect, and a large number doesn't mean a large effect.
A transformation may totally change the unit of a variable and may make result interpretation difficult. While it is true that all data analysis methods have assumptions, keep in mind that in an analysis using a transformed variable, we are making the relevant assumptions in the transformed scale, and you need to think about whether it is reasonable to do so.
If this still isn't clear, here is an analogy. Suppose you have 50 coins. You may carefully examine them, pick the one coin that looks problematic, flip it 10 times, and see 9 heads and 1 tail. The corresponding p-value is 0.021. Alternatively, you may flip each of the 50 coins 10 times and see 9 heads and 1 tail for the coin that was suspected to be problematic. But now the p-value is different, because the chance of seeing any coin with 9 heads and 1 tail or a more extreme result is much bigger. Even if all coins were fair, the probability of seeing at least one coin with 9 heads and 1 tail, 9 tails and 1 head, 10 heads, or 10 tails would be 66%; that is, the p-value now is 0.66.
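Both coin probabilities can be verified directly (0.021 for a single pre-selected coin, about 0.66 for the most extreme-looking of 50 coins):

```python
from math import comb

# One pre-selected coin: P(>= 9 of either face in 10 flips of a fair coin).
p_single = 2 * (comb(10, 9) + comb(10, 10)) / 2 ** 10   # 22/1024, about 0.021

# Fifty coins flipped 10 times each: P(at least one coin looks that extreme).
p_any_of_50 = 1 - (1 - p_single) ** 50                  # about 0.66
```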
All these tell us that pre-specification, if possible, is extremely important. Once you prioritize your goals and act on them faithfully, you won't suffer the corrections for the many relatively less important tests, and you won't feel guilty for searching for a result without correcting for multiple comparisons. In any situation, including large-scale analyses, parallel data analysis with equal treatment of the variables or tests should be your last resort. Don't let the machines replace your mind.
You may think of a data set as consisting of both signal and noise, in unknown proportions. We try to extract the signal from the data. If a procedure extracts more "signal" than is embedded in the data, it is over-fitting. Although we don't know exactly how much of the extracted signal is real, there are principles to follow to guard against over-fitting.
Over-fitting is mainly due to searching. Suppose you have an outcome variable and 20 potential explanatory variables and you think two of these variables should explain the outcome. You may be tempted to look at all pairs of variables and see which pair gives you the best fit to the data. The resulting model, with the pair of variables selected to have the best fit, will probably over-fit the data. Similarly, step-wise variable selection almost always leads to over-fitting.
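Here is a sketch of searching-driven over-fitting in Python. For brevity it searches over single variables rather than pairs; the outcome is generated as pure noise, yet the selected "best" variable looks far better than a typical one:

```python
import random

# An outcome generated as pure noise, unrelated to any of 20 candidate predictors.
rng = random.Random(11)
n = 30
y = [rng.gauss(0, 1) for _ in range(n)]
candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(20)]

def r_squared(x, y):
    # R-squared of a simple linear regression = squared sample correlation.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

r2s = sorted(r_squared(x, y) for x in candidates)
best, typical = r2s[-1], r2s[len(r2s) // 2]
# The selected "winner" beats a typical candidate by chance alone, even though
# no variable has any real relationship with the outcome.
```

Searching over all pairs of variables only makes this worse, because the number of candidate models grows from 20 to 190.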
Over-parameterization also can cause over-fitting. If you have data from 100 subjects and you want to fit a model with 15 parameters, you are on the edge of over-fitting. This is because you won't have enough accuracy to estimate those parameters, and the parameter estimates will tend to have large variances and tailor themselves to the 100 data points at hand.
In machine learning, there seems to be a common perception that once cross-validation is built into a model/variable selection process, over-fitting is no longer a concern. This is wrong. Cross-validation does help curb against over-fitting, but it cannot fully protect you from it.
Now, suppose there are two ordinal input variables (e.g. genotypes at two markers). One may follow the same logic and think it is great to treat all two-way combinations separately and estimate combination-specific risks and use them as the basis of his inference. In this situation, all the perceived advantages of being model-free, being free from all the linearity assumptions, and being non-parametric are again misperceptions. A saturated logistic regression, in which the input variables are treated as categorical and the interaction terms are full interactions (not product interaction), will capture all the information he is capturing and can do all he wants to do in his own approach. This fact also holds for multiple input variables. You can make up a test dataset and see this equivalence yourself.
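To see the equivalence, consider a saturated logistic regression on a hypothetical 2x3 table (the counts below are invented for illustration). A saturated model has one parameter per cell, so its fitted probabilities are exactly the observed cell proportions, and the coefficients are just differences of observed cell logits:

```python
from math import exp, log

# Hypothetical counts of (cases, controls) for binary x and genotype g in {0, 1, 2}.
counts = {
    (0, 0): (10, 40), (0, 1): (20, 30), (0, 2): (25, 25),
    (1, 0): (15, 35), (1, 1): (30, 20), (1, 2): (40, 10),
}

# Observed log odds (logit) in each cell.
logit = {cell: log(a / b) for cell, (a, b) in counts.items()}

# Coefficients of the saturated logistic regression, reference cell (x=0, g=0):
b0 = logit[(0, 0)]
bx = logit[(1, 0)] - b0
bg1 = logit[(0, 1)] - b0
bg2 = logit[(0, 2)] - b0
bxg1 = logit[(1, 1)] - b0 - bx - bg1   # full interaction terms
bxg2 = logit[(1, 2)] - b0 - bx - bg2

# Reconstruct the fitted risk for the (x=1, g=2) cell from the coefficients:
eta = b0 + bx + bg2 + bxg2
p_fit = 1 / (1 + exp(-eta))
p_obs = 40 / (40 + 10)   # observed proportion in that cell
```

The regression carries exactly the cell-specific information of the "model-free" approach, while also offering the option to move toward a more parsimonious structure when the data are thin.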
The power of regression analysis is that it is a unifying framework. It can be pushed to one extreme to be saturated and to the other extreme to be parsimonious, which often is needed when we don't have enough data to achieve both complexity and robustness.
Note that we have seen this phenomenon in homework 3 problem 3, in which we did analyses on lung capacity using age and cadmium exposure as input variables. When cadmium exposure was considered alone, it was significant; when both cadmium exposure and age were considered but without interaction, cadmium exposure was no longer significant; when interaction was included in analysis, the "main effect" for cadmium exposure was significant again. In the first analysis, we only looked at the marginal effects of cadmium exposure; in other words, we forced the slopes with respect to age for both exposure groups to be zero. In the second analysis, we forced the slopes to be the same, but not necessarily zero. In the third analysis, we relaxed the constraint on the slopes and allowed them to be any values. You can see that assumptions on one part of the analysis (i.e. slope) may have big effects on another part of the analysis (i.e. intercept).
This is because counties have vastly different population sizes serving as denominators in the calculation of incidence rates. As a result, the precision of the estimates differs across counties. Counties with fewer people tend to show extremely high or extremely low incidence rates because of the higher variation in their incidence rate estimates. For example, in a U.S. mortality map for kidney cancer, among the 10% of counties with the highest kidney cancer mortality, counties with small populations, such as those in the mountains, tend to be included, while some of their neighboring counties (also with small populations) were among the 10% of counties with the lowest kidney cancer mortality. Large metropolitan areas and coastal areas rarely belong to either group. This is because the variation in the ranking of incidence rates increases as the population size decreases. As an extreme example, suppose there were a county with only one person; that county would end up with either the highest or the lowest mortality rate.
Such a phenomenon also happens when we want to compare schools or hospitals on their performance. For example, there were 118 ophthalmology residency training programs in the U.S. between 1999 and 2003. After the trainees graduated, they would take and pass exams to be certified. If we simply rank the programs by failure rate, small programs tend to be overly rewarded when they happen to have few or no failures, or overly punished when they happen to have a few failures. After all, a program with 10 trainees can easily move up or down 10 percentage points in failure rate on the strength of one trainee's exam result.
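A small simulation illustrates the phenomenon (the populations, the number of counties, and the common true rate are all invented). Every county shares the same underlying rate, yet small counties dominate the extreme observed rates:

```python
import math
import random

def poisson(lam, rng):
    # Knuth's method; adequate for the modest rates used here.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(5)
true_rate = 1e-4   # the SAME underlying rate in every county
counties = []
for _ in range(2000):
    pop = rng.choice([1000, 10000, 100000, 1000000])
    deaths = poisson(pop * true_rate, rng)
    counties.append((deaths / pop, pop))   # (observed rate, population)

counties.sort(reverse=True)                # sort by observed rate, highest first
top_decile = counties[:200]
mean_pop_top = sum(p for _, p in top_decile) / len(top_decile)
mean_pop_all = sum(p for _, p in counties) / len(counties)
# Counties in the top decile of observed rates are, on average, much smaller
# than a typical county, even though no county truly differs from any other.
```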
Colored/shaded maps also can mislead people, as demonstrated in the book How to Lie with Maps by Mark Monmonier.

A correct thought process is this: I have a short list of suspected effects that I want to see if my data (1) support strongly (then I can write it up), or (2) show some consistent trends for (then I am encouraged to follow it up), or (3) suggest no or opposite effects to the extent that there is little chance the result is just due to an unlucky sample (then I need to think why, and may consider giving it up). In the last situation, I may still write up a report if many people have the same suspicion of the effects as I did.
The range of a variable should be considered when interpreting results. We have seen this when I cautioned on the interpretation of correlation coefficient in a simple linear regression.
When we want to infer the effect of a variable for a value outside the range of our data, we are extrapolating. Extrapolation often is dangerous because of lack of data support around the value at which we are making the inference. It heavily depends on how the fitted curve extends outside the data range, and thus it heavily depends on the model structure we chose to fit or ended up having.
When we want to infer the effect of a variable at a value inside the range of our data, we are interpolating. If we have data for subjects around 10 and 20 years old but very few or no subjects around 15 years old, an inference about the age effect at 15 may still be problematic, because it also depends on the model structure we chose to fit or ended up having. But in general, interpolation is less of a problem than extrapolation.
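A sketch of how extrapolation depends on the model structure: two models that agree closely inside the data range (hypothetical data following a square-root curve over x = 1..20) diverge sharply outside it:

```python
import math

# Hypothetical curved relationship observed only for x in 1..20.
xs = list(range(1, 21))
ys = [math.sqrt(x) for x in xs]

def fit_line(us, ys):
    # Closed-form simple least squares of y on u: returns (intercept, slope).
    mu, my = sum(us) / len(us), sum(ys) / len(ys)
    b = sum((u - mu) * (y - my) for u, y in zip(us, ys)) / sum((u - mu) ** 2 for u in us)
    return my - b * mu, b

a1, b1 = fit_line(xs, ys)                          # model 1: linear in x
a2, b2 = fit_line([math.sqrt(x) for x in xs], ys)  # model 2: linear in sqrt(x)

def pred1(x): return a1 + b1 * x
def pred2(x): return a2 + b2 * math.sqrt(x)

inside_gap = abs(pred1(10) - pred2(10))     # the models nearly agree in the data range
outside_gap = abs(pred1(200) - pred2(200))  # their extrapolations diverge sharply
```

Nothing in the data at hand can arbitrate between the two models at x = 200; the prediction there is driven by the structural choice, not by evidence.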
The reason is similar to the relationship between interaction effect and main effect. When you include an interaction effect but not the corresponding main effects in an analysis, you are imposing strong and often unrealistic structural assumptions, making it hard to interpret the results. Similarly, when you include a quadratic term but not the linear term, you are imposing a strong structural assumption.
One reason for such a failure is that we looked at so many results before reaching a final, best model. No mechanism is built in to account for such searching. Thus, the final model and its parameter estimates and standard error estimates tend to be too optimistic. To evaluate the variation of a step-wise variable selection result, you can generate multiple bootstrap data sets, apply the same selection protocol to each data set, and see how often the results across data sets agree.
Another reason is similar to Simpson's paradox. Variables may appear to be significant or non-significant depending on what other variables have been included. Thus, the result of adding and dropping variables sometimes depends on the order in which variables are considered.
When you see a paper reporting a model that resulted from step-wise variable selection, be wary. Some papers report a model in which all variables are significant; this often is a sign that the model was derived through step-wise selection and is too good to be true.
If you have to use step-wise variable selection, use your common sense. Think what variables should always be included, or what variables should be preferable to some others, or what variables should be logically included/excluded as long as some other variables are included/excluded.
In a statistical test, we calculate a test statistic and compare it with a reference distribution to determine the corresponding p-value. The reference distribution often is a nice mathematical distribution, such as normal, t, chi-squared, etc. The validity of using these distributions is based on large sample theory (asymptotic theory), in which we can prove that, assuming the data meet some assumptions, as the sample size increases, the distribution of the test statistic approaches one of these distributions. Hence, for an asymptotic distribution to be used as a good approximation, two conditions need to be satisfied: the assumptions are not strongly violated, and the sample size is not too small. If either condition fails, the approximation may not be good, especially in the tails, which is where we care the most. Permutation is an alternative way of generating a reference distribution to determine the p-value. It often has weaker assumptions and thus often is more reliable.
Permutation tests were invented to avoid relying on asymptotics. They cannot do away with the problems associated with multiple comparisons. As long as you carry out more than one permutation test, your results are subject to the issues of multiple comparisons.
A fundamental problem in multiple comparisons is that we carry out multiple tests and report test results individually. We can view this whole process as a protocol, and we need to evaluate the probability of this protocol leading to one (any one) small p-value by chance alone. In some simple situations, a permutation test can be used to achieve this. For example, in a case-control study, we carry out multiple disease-marker association analyses, one for each marker. We can pick the smallest p-value and treat it as a statistic. Then, we can permute the affection status of the subjects and, for each permuted data set, carry out the disease-marker association analyses and pick the smallest p-value. The smallest p-value from the real data can then be compared with the distribution of the smallest p-values generated through permutations, and a single p-value can be derived. Here, we essentially carried out a single test. We could carry out a permutation test because the situation was simple: all the original tests are testing for association between disease status and variables (markers in this example), in which the grouping variable (cases or controls) is the same for all tests. When you carry out multiple tests using more than one grouping variable, permutation cannot be used. For example, suppose you have multiple candidate genes and multiple outcome variables. If you carry out association analyses for all gene-outcome pairs, permutation cannot bail you out of the problems of multiple comparisons.

When we treated age group as a categorical variable, the interaction model with three extra parameters only increased 2x(log-likelihood) by 5.27 over the non-interaction model, with corresponding p-value 0.153. When we treated age group as a continuous variable, the interaction model with only one extra parameter increased 2x(log-likelihood) by 6.46 over the non-interaction model (a different non-interaction model), with corresponding p-value 0.011.
We might then have different interpretations depending on which test we chose to carry out.
This is a good example to demonstrate the fact that results should always be interpreted in context. In fact, these two "interactions" are different interactions! If we don't know which interaction we are looking for, we may do both tests and correct for multiple comparisons, which may hurt power. If we want to avoid multiple comparisons and choose only one test, the test with fewer degrees of freedom often (but not always) is the more powerful one.
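Returning to the min-p permutation protocol described a few paragraphs above, here is a minimal sketch. The simulated data and the use of a maximum frequency difference in place of the smallest p-value are my assumptions for illustration (a larger difference plays the role of a smaller p-value):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated case-control data: 50 cases, 50 controls, 10 binary markers,
# none truly associated (assumption, to illustrate the null case).
n_case, n_ctrl, m = 50, 50, 10
status = np.r_[np.ones(n_case), np.zeros(n_ctrl)]
markers = rng.integers(0, 2, size=(n_case + n_ctrl, m))

def max_assoc(status, markers):
    """Largest absolute case-control difference in marker frequency,
    standing in for the smallest p-value (larger = more extreme)."""
    diff = markers[status == 1].mean(axis=0) - markers[status == 0].mean(axis=0)
    return np.abs(diff).max()

observed = max_assoc(status, markers)

# Permute the affection status and recompute the same max statistic each time.
B = 1000
perm = np.array([max_assoc(rng.permutation(status), markers) for _ in range(B)])

# A single overall p-value for the whole multi-marker protocol.
p_overall = (1 + np.sum(perm >= observed)) / (1 + B)
print(p_overall)
```

The single permutation of affection status keeps the entire marker matrix intact, which is why one permutation run accounts for all ten tests at once.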
In a regression analysis, we try to connect the outcome with a few other variables called input variables. Ideally, the outcome should be determined by the input variables in a deterministic way. In reality, it is almost impossible to find all the relevant variables, and we have to think of the outcome as a random observation, but with some unobserved parameter underlying it. What we want to do is to connect that unobserved parameter with the input variables in a deterministic way.
For a continuous outcome variable, the unobserved parameter is the average outcome for subjects with the same values of the input variables. We can equate this unobserved parameter with a linear combination of the input variables. This leads to a linear regression.
For a binary outcome variable, the unobserved parameter is the probability for subjects with the same values of the input variables. However, when we consider a linear combination of the input variables, the range of the potential values is unlimited, while the probability is bounded between 0 and 1. We cannot connect them by just equating them; if we did, it would cause lots of problems in calculation and interpretation. We have to free up the probability by transforming it to a scale that also is unlimited. (Or, equivalently, we have to transform the linear combination of the input variables to a scale that is bounded between 0 and 1.) There are many such transformations, among which the logit function f(p) = log[p/(1 − p)] is a popular choice because of the easy interpretation of the parameters in the linear combination of the input variables. This leads to a logistic regression. Note that by using the logit transformation, we impose a structural constraint: the rate of risk increase at p is as fast as the rate of risk decrease at 1 − p. This constraint may not be good in some situations, in which case we need to use other transformations, such as the complementary log-log function f(p) = log(−log(1 − p)).
For a count outcome variable, the unobserved parameter is the underlying rate for subjects with the same values of the input variables. As above, the rate is bounded to be positive, and thus we cannot just equate the rate with a linear combination of the input variables. The logarithm transformation f(λ) = log(λ) is often used to transform the rate, and we can equate log(rate) with a linear combination of the input variables. This leads to a Poisson regression (sometimes called a log-linear model). Again, there is a structural constraint, but here the constraint is mainly due to the assumption of the Poisson distribution, for which the mean and variance are equal. In some situations, the count data have much higher variance than mean, and we need to use other distributions such as the negative binomial, resulting in a negative binomial regression.
All the transformations are called link functions. In linear regression, we appeared not to use any transformation; in fact, we used a special transformation f(x) = x. This identity function also can be viewed as a link function. A regression model with a link function is called a generalized linear model (GLM). All regressions above are special types of generalized linear models.
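The link functions mentioned above are easy to inspect directly. The specific probabilities below are just illustrative values:

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def cloglog(p):
    """Complementary log-log link: also maps (0, 1) onto the real line,
    but without the logit's symmetry."""
    return math.log(-math.log(1 - p))

# The logit is symmetric: logit(p) = -logit(1 - p).
print(logit(0.2), -logit(0.8))        # equal

# The complementary log-log does not have this symmetry.
print(cloglog(0.2), -cloglog(0.8))    # not equal

# The log link frees a positive rate onto the whole real line.
print(math.log(0.5))                  # negative values are allowed
```

The symmetry check makes the structural constraint of the logit concrete: whatever the model says about risk near p is mirrored near 1 − p, which the complementary log-log deliberately avoids.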
For survival outcome variables, the unobserved parameter is the underlying survival function for subjects with the same values of the input variables. Note that a survival function is not a single number; instead, it is a function over time t indicating the probability of "survival" (i.e. having not experienced the event of interest) by time t. A survival function corresponds to a hazard function, which is a function of time t indicating the rate of the event of interest at time t for a subject who has not experienced the event before time t (i.e. who has "survived" till time t). We may think there is a baseline hazard function, and for each subject, there will be a ratio of the subject's hazard function over the baseline hazard function. This ratio also is a function of time. In Cox's proportional hazards regression, we equate the logarithm of this hazard ratio with a linear combination of the input variables. In this situation, some input variables may themselves be functions of time. When none of the input variables are functions of time, a subject's hazard ratio will be a constant, and the hazard functions for different subjects will be proportional to each other.
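A tiny numerical sketch of the proportionality claim; the baseline hazard and the coefficient are made-up numbers:

```python
import math

# A hypothetical baseline hazard that changes over time (assumption).
def baseline_hazard(t):
    return 0.01 + 0.005 * t

beta, x = 0.7, 1.0   # one time-constant input variable (assumed values)

# Under the Cox model, log hazard ratio = beta * x, so the subject's
# hazard is the baseline hazard times exp(beta * x) at every time t.
def subject_hazard(t):
    return baseline_hazard(t) * math.exp(beta * x)

# The ratio is the same constant at all times: the hazards are proportional.
ratios = [subject_hazard(t) / baseline_hazard(t) for t in (1, 5, 10, 20)]
print(ratios)
```

Had x been a function of time, exp(beta * x(t)) would vary with t and the ratio would no longer be constant, which is exactly the distinction made in the last sentence above.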
Data collection determines the kinds of information that will be available for analysis and that will serve as the basis for interpretation. In most situations, interpretation of analysis results depends more on how data are collected than on how data are analyzed. Because of this, data collection is more important than analysis. Data collection really determines the quality of a study and how much information the data will provide to answer the research questions. Data analysis just tries to extract that information and won't compensate for a poorly designed study, no matter how sophisticated the method is.
A good design of data collection is heavily dependent on the subject matter and traditionally it is not covered in an introductory course. But its importance should be appreciated. (And statistics teachers need to put more emphasis on this aspect in their teaching.)
Although I rank data analysis as the least important aspect compared to data collection and interpretation, it obviously is indispensable. Data analysis is a necessary step to bridge data collection and interpretation. In order to have a valid and efficient analysis, we need to choose methods according to how the data are collected, the nature of data (e.g. variable types), and the goal of analysis. Because there are many types of data and many analysis goals, there are so many different statistical methods. The existence of many methods makes Statistics appear like a box of tools and sometimes a bag of tricks. But all they provide are efficient ways of analyzing data so that the interpretation will have a valid basis.
The choice of an analysis method has a big impact on interpretation. In addition to how data are collected, interpretation depends on the assumptions of an analysis method and sometimes on the analysis procedure itself. You don't want your analysis result to be more of a function of assumptions than a function of data. You also don't want the result to be more of a function of the analysis procedure itself, as happens with over-fitting.
Among the three aspects of Statistics, data analysis is the most technically challenging, and thus it has historically attracted statisticians' attention. Unfortunately, because of this, some academic statisticians have developed a narrow view of Statistics and value technicality over all the other aspects. Avoid such technophiles. But that doesn't mean technicality is unimportant, or that you can use a wrong method or an over-fitting procedure and dismiss critiques of the technical validity of your approach.
As I said earlier, interpretation is the most important aspect of Statistics. However, when you are ready to interpret your results, the results have already been influenced by what data were collected, how the data were collected, and how the data were analyzed. It often is underappreciated that the ways in which we process data, analyze data, and quantify effects (e.g. through p-values) can lead to many artifacts and caveats. In addition, many people interpret results on the basis of half-baked, face-value understanding of the analysis methods involved. As a result, misinterpretations, wrong interpretations, and wishful interpretations are so prevalent in current research that they are like an epidemic.
Because there are so many artifacts and caveats in data analysis and interpretations, I spend a significant portion of the course time on interpretation of results. I hope my course will help you develop correct views of Statistics and avoid becoming future abusers of Statistics.
PS. Historically, Statistics was limited by available computational capability. The current statistical toolbox still reflects that historical limitation.
PS2. Statistical thinking may appear to be different from your daily thinking. If you feel this way, it is because your daily thinking is not clear and logical enough. Learning statistics should make your daily thinking clearer and more logical.
I allocated 50% of your grade to homework and active participation in discussion, and 50% to midterm and final exam performance. This allocation is a common practice that you, as a graduate student, should already be familiar with. If you didn't do well in the midterm, you still have chances: more than half of the grade (57% = 25% final + 20% active participation + 12% homework) still is up for grabs. In addition, if you do much better in the final exam than in the midterm, I will bump your grade up one level (e.g. B to B+, B+ to A-, etc.).
The answer is no. The small p-value in the simple linear regression of SBP on age tells us that when age is considered alone (i.e. without adjusting for any other variables), its effect on SBP is significant. It has nothing to do with combining groups or not, and the analysis doesn't have any component that reflects the effect of combining groups. But if we had labeled this analysis as "combining groups", the results might be interpreted as if they offered information on the appropriateness of combining groups. We could equally have labeled the analysis as "combining vegetarian and non-vegetarian groups" or as "combining Democrats and Republicans". Can we claim that the small p-value also suggests that diet or party affiliation has an effect on SBP? These factors may have effects on SBP, but the result from that simple linear regression has no logical connection with these factors.
Example 1: Suppose there are three cards. One is red on both sides; another is white on both sides; the third is red on one side and white on the other. Put them under a hat, pull out one card, and look at only one side. If the color you see is red, what is the probability that the other side also is red? [Many people think the answer is 1/2, which is wrong.]
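A quick Monte Carlo check of Example 1; the simulation setup is mine, not part of the original puzzle statement:

```python
import random

random.seed(0)
# The three cards, as (side1, side2).
cards = [("red", "red"), ("white", "white"), ("red", "white")]

red_seen = red_other = 0
for _ in range(100_000):
    card = random.choice(cards)
    side = random.randrange(2)          # which side happens to face up
    if card[side] == "red":
        red_seen += 1
        if card[1 - side] == "red":
            red_other += 1

print(red_other / red_seen)   # close to 2/3, not 1/2
```

The intuition for 1/2 ignores that the red/red card gives two ways to see a red side, so conditioning on "I saw red" favors that card.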
Example 2: Suppose prisoners A, B, and C were on death row. The King decided to pardon one of them. He randomly chose a prisoner to pardon, told the warden of his choice, and asked the warden to keep it secret. Prisoner C wanted to know if he would be freed. Without any information, his chance was 1/3. He knew the warden wouldn't tell him whether he would be freed. So, he asked the warden who between A and B would be killed. The warden reasoned this way: either A or B or both would be killed; I can just pick one who will be killed and tell C, without releasing any information about C's fate. He then told C that A would be killed. C then reasoned this way: given that A will be killed, either B or I will be freed, so my chance of being freed increases to 1/2. Who do you think was correct? [It was the warden.]
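Example 2 can be checked the same way. The warden's tie-breaking rule when C is the one pardoned is modeled as a fair coin flip, which is an assumption consistent with the story:

```python
import random

random.seed(0)
trials = freed_c = 0
for _ in range(100_000):
    pardoned = random.choice("ABC")
    # The warden names someone between A and B who will be killed,
    # never revealing C's own fate.
    if pardoned == "A":
        named = "B"
    elif pardoned == "B":
        named = "A"
    else:                        # C pardoned: warden picks A or B at random
        named = random.choice("AB")
    if named == "A":             # condition on what C actually heard
        trials += 1
        if pardoned == "C":
            freed_c += 1

print(freed_c / trials)   # stays close to 1/3, as the warden reasoned
```

The warden's answer is uninformative about C precisely because the warden can always name someone other than C, no matter whom the King pardoned.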
One colleague of mine once said this: "For what it is worth, most small studies really really really suck. They're touted as 'pilot studies', which is really secret code for 'we just wanted to run an experiment and didn't want to take the time to do it right'. As such, they end up at best only giving very sloppy estimates of variance that aren't even applicable because a real study can't be run the same way. At worst, they provide no information or, even worse, their sloppy conduct and poor planning lead to biased estimates that point people the wrong way. Sometimes it takes years to get the ship turned around. Small, poorly-designed, 'pilot' studies are foolhardy, haphazard, ignorant, arrogant, wasteful, possibly unethical, and sometimes even dangerous."
Researchers often are not aware of the serious long-term impacts that a poor study design will lead to. In my opinion, the article The scandal of poor medical research should be read by all researchers and, if necessary, re-read every year.

In a likelihood ratio test, we often compare the test statistic with a chi-squared distribution with a certain DF. In this situation, we essentially are comparing two models, one full model and one reduced model, to see if the full model provides significantly better fit to the data. The full model always has one or a few more parameters (or variables) than the reduced model, and the likelihood ratio test can be viewed as a test for significant contribution of these additional parameters given the parameters that have already been included in the reduced model. The number of these additional parameters often is the DF of the test.
Some other tests also follow this thought process. For example, in the test for departure from linear trend on a 2xk table, we calculate the difference between Pearson's chi-squared statistic and the trend test statistic to see if there is significant departure from linear trend. Pearson's test statistic is essentially divided into two portions, one being the trend test statistic. The former test has k − 1 DF while the latter has 1 DF. Thus, the difference has DF = k − 2. Another example is ANOVA, in which we also think of dividing the total sum of squares into a few portions.
All permuted data sets have the same numbers of cases and controls as the real data have; in other words, all marginal counts are fixed. Thus, the permutations are the possibilities in the hyper-geometric distribution given the fixed marginal counts. If the statistic of interest and the probability in the hyper-geometric distribution lead to the same ranking of the permutations, then the permutation test is effectively the same as Fisher's exact test.
In reality, we often cannot enumerate all possible permutations of a data set. Instead, we sample random permutations and build a reference distribution accordingly. Then, the p-value of the statistic on the real data can be estimated using this reference distribution. In this situation, because the p-value is estimated, it is not exact. Only when we can enumerate all permutations can we calculate the exact p-value. Even in this latter situation, the exactness relies on the assumption that the permutation space is the sample space, which often is not correct. [In general, estimation through random sampling is called Monte Carlo.]
Permutation tests can be used in situations beyond the 2x2 table, in which there are two groups of subjects (cases and controls) and the outcome is binary. For two groups of subjects with a categorical or continuous outcome, we can generate permutations by permuting the grouping status while keeping the outcome intact. In principle, this can be extended to data from three or more groups; we can always permute the grouping status. Another extension is in testing for correlation between two continuous variables; we can permute the values of one variable while keeping the values of the other variable intact. In non-parametric statistics, there are many rank-based tests, in which we compare the ranking of the current data with all possible rankings; they are essentially permutation tests on the rank scale.
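A minimal two-group permutation test along these lines; the simulated data and the choice of the mean difference as the statistic are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two groups with a modest true difference in means (assumed data).
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.8, 1.0, size=30)

observed = a.mean() - b.mean()
pooled = np.r_[a, b]

# Permute the grouping status while keeping the outcomes intact:
# every shuffle reassigns which 30 values count as "group a".
B = 5000
perm = np.empty(B)
for i in range(B):
    shuffled = rng.permutation(pooled)
    perm[i] = shuffled[:30].mean() - shuffled[30:].mean()

# Two-sided p-value from the permutation reference distribution.
p = (1 + np.sum(np.abs(perm) >= abs(observed))) / (1 + B)
print(p)
```

Because we sample 5000 random permutations rather than enumerating all of them, this p-value is a Monte Carlo estimate, as discussed above.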
If we replace the statistic of interest by the parameter of interest, we can generate a permutation distribution of the parameter of interest, and construct a 95% confidence interval by using the 2.5% and 97.5% percentiles of this distribution.
Fisher's exact test is a famous test. Fisher introduced this test using data from a tea-tasting experiment. A lady claimed to be able to tell if tea was poured into milk or milk was poured into tea. Fisher designed an experiment in which there were eight cups, four with tea poured in first and the other four with milk poured in first. He also told the lady that there were four cups of each kind. Note that because the lady knew this fact, her answers would be constrained: she would claim four cups as one group and the rest as the other group. If we put her answers into a 2x2 table (with the real grouping and her grouping as row and column variables), her potential answers would have fixed marginal counts on both the rows and the columns. The distribution of data with such a constraint is called a hyper-geometric distribution. In this context, Fisher developed his exact test. Unfortunately, only in this context is the test exact.
In reality, we rarely see situations with all marginal counts fixed. In the tea-tasting experiment, if the lady didn't know there were four cups of each kind, she could have claimed five cups as one group and the rest as the other group.
For a 2x2 table, we often want to test for association between the row and column variables. A 2x2 table of two variables may result from population sampling without controlling for marginal distributions of the variables; then the data essentially came from a multinomial distribution with four categories. A 2x2 table also can result from a cohort study or a case-control study. In a cohort study, we may prospectively follow up a certain number of subjects in the exposure group and a certain number of subjects in the non-exposure group. In a case-control study, we may choose a certain number of cases and a certain number of controls and gather exposure data retrospectively. In either situation, the data essentially came from two binomial distributions. In all three scenarios, asymptotic theory has shown that when the sample size is large, under no association, Pearson's chi-squared test statistic will be well approximated by the chi-squared distribution with one degree of freedom. For all three scenarios, Fisher's exact test is only an approximation.
These two tests may give you different results, but because they are correlated, their results sometimes are similar enough to lead to the same conclusion or decision. In general, Fisher's exact test won't inflate the false positive rate but sometimes is quite conservative; Pearson's chi-squared test sometimes is less conservative than Fisher's exact test and thus is more accurate, but sometimes it may inflate the false positive rate. A commonly used criterion is this: when the overall total is n > 40, or 20 ≤ n ≤ 40 and the smallest expected cell count is ≥5, we use Pearson's chi-squared test; when the overall total is n < 20, or 20 ≤ n ≤ 40 and the smallest expected cell count is <5, the approximation in Pearson's chi-squared test may not be good and Fisher's exact test is a good alternative.
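A self-contained sketch of both tests and the rule of thumb above. The table counts are made up, and the hand-rolled implementations are illustrative rather than production code:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
    sum the hyper-geometric probabilities of all tables (with the same
    margins) that are no more probable than the observed one."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def prob(k):
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs + 1e-12)

def pearson_chi2_2x2(a, b, c, d):
    """Pearson's chi-squared statistic for a 2x2 table."""
    n = a + b + c + d
    # Expected counts under no association, from the observed margins.
    exp = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
           [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    obs = [[a, b], [c, d]]
    return sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def min_expected(a, b, c, d):
    """Smallest expected cell count under no association."""
    n = a + b + c + d
    return min((a + b) * (a + c), (a + b) * (b + d),
               (c + d) * (a + c), (c + d) * (b + d)) / n

# Applying the rule of thumb from the text to a small made-up table:
a, b, c, d = 3, 7, 8, 2
n = a + b + c + d
if n > 40 or (20 <= n <= 40 and min_expected(a, b, c, d) >= 5):
    print("Pearson chi-squared statistic:", pearson_chi2_2x2(a, b, c, d))
else:
    print("Fisher's exact p-value:", fisher_exact_2x2(a, b, c, d))
```

For this table (n = 20, smallest expected count 4.5), the rule routes us to Fisher's exact test, which is exactly the regime where the chi-squared approximation is suspect.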
If you said yes, let's try this. A human being randomly picked from the world is almost surely not the president of the United States. George W. Bush is the president of the United States; therefore, he is almost surely not a human being.
In fact, only when Pr(not B | A) = 1 can we say Pr(not A | B) = 1. When Pr(not B | A) is not 1, no matter how close it is to 1, Pr(not A | B) can be any number. For example, suppose we have two fair coins and we toss each of them 20 times, separately. Let B = {coin 2 showed the same face in all 20 tosses} and let A be some pattern in the coin 1 results. Then, due to independence between the two coins, Pr(not B | A) = Pr(not B) is close to 1, but Pr(not A | B) = Pr(not A) can be anywhere between 0 and 1 depending on how we define A.
Many newspaper reports fall into similar traps. Examples: "Boys more at risk on bicycles", "Soccer most dangerous sport", etc. Take news reports with a grain of salt.
To do statistical inference, we often construct a model for the data with some unknown parameters, and treat all possible sets of parameter values as candidates. For each set of parameter values, we can calculate the probability of observing our data. (You see, statistics relies on probability.) Such a probability changes as the set of parameter values changes; thus, it can be viewed as a function of the parameters and is called a likelihood function. We often look for the set of parameters that maximizes the likelihood of observing our data and a range of parameter sets that lead to likelihood not too small compared to the maximum likelihood. The former is called a point estimate and the latter is called a confidence region (or confidence interval if we have a single parameter).
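A small worked example of this thought process, with assumed data (7 successes in 20 binomial trials) and a simple grid search in place of calculus:

```python
import math

# Suppose we observed 7 successes in 20 trials (assumed data) and model
# the count as binomial with unknown success probability p.
n, k = 20, 7

def log_likelihood(p):
    """Log of the probability of observing our data, as a function of p."""
    return (math.log(math.comb(n, k))
            + k * math.log(p) + (n - k) * math.log(1 - p))

# Treat a grid of parameter values as the candidates; the maximizer
# is the point estimate (maximum likelihood estimate).
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=log_likelihood)
print(mle)   # close to 7/20 = 0.35

# Parameter values whose likelihood is "not too small" relative to the
# maximum form an interval; a drop of at most 1.92 log units (half the
# chi-squared 1-df 95% quantile, 3.84) gives roughly a 95% interval.
inside = [p for p in grid if log_likelihood(mle) - log_likelihood(p) <= 1.92]
print(min(inside), max(inside))
```

The grid search stands in for the maximization and interval inversion that software does analytically or numerically; the logic (compare every candidate's likelihood to the maximum) is the same.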
For categorical variables with only a few categories, we may have power to study all possible interactions. For example, for two variables each with 3 levels, the main effects need 4 parameters (2 for each variable), and the full interaction requires an additional 4 parameters. When the categorical variables are ordinal variables (e.g. genotypes), we may have coded them using numbers such as 0, 1, and 2. Unfortunately, careless investigators may blindly carry out an analysis on interaction by including the products of the variables. For genotypes at two bi-allelic markers with coding 2/1/0 for AA/Aa/aa and BB/Bb/bb, using the product term is equivalent to assuming 4 levels of interaction effect: {AA/BB}, {AA/Bb, Aa/BB}, {Aa/Bb}, and {aa/__, __/bb}, with the effect of AA/BB twice as much as that of AA/Bb or Aa/BB and 4 times as much as that of Aa/Bb, and with the effect of the other 5 genotype combinations being zero. Of course this is not a full interaction, and the resulting analysis won't have power to detect interaction patterns that are far from this one. It is unfortunate that this careless analysis has led to criticisms of regression analysis for not being powerful enough to detect interactions.
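The coding pattern just described is easy to verify directly:

```python
# Genotype codings 2/1/0 for AA/Aa/aa and BB/Bb/bb, as in the text.
codes = {"AA": 2, "Aa": 1, "aa": 0, "BB": 2, "Bb": 1, "bb": 0}

# The product term used as the "interaction" variable for each genotype pair.
for g1 in ("AA", "Aa", "aa"):
    for g2 in ("BB", "Bb", "bb"):
        print(g1, g2, codes[g1] * codes[g2])
```

The printout shows AA/BB at 4, AA/Bb and Aa/BB at 2, Aa/Bb at 1, and every combination involving aa or bb at 0: the rigid effect pattern the product term silently assumes.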
PS. The word "interaction" means statistical interaction, which may or may not translate to physical interaction.
In this situation, the software often isn't smart enough to remember that the interaction terms are functions of existing variables; it will treat the new variables as if they were separate variables. Consequently, if you include x3 and ignore x1 (or x2, or both), the software will still generate results. However, when you include an interaction effect but not the corresponding main effects in the analysis, you are imposing strong and often unrealistic structural assumptions, making it hard to interpret the results (see next paragraph). Thus, even if you think there is no main effect of x1 but an interaction effect exists between x1 and x2, you should include x1 in the analysis. This leads to a general guideline: as long as you put the interaction term in the analysis, you should also include the corresponding main effects.
Now I show why removing the main effect term(s) while keeping the interaction effect term often will lead to unrealistic structural assumptions. Suppose you fit a model with the right hand side being β0 + β2x2 + β3x1x2, and suppose x1 is binary with values 0 and 1. Then this model is equivalent to the following: for subjects with x1 = 0, the right hand side is β0 + β2x2; for subjects with x1 = 1, it is β0 + (β2 + β3)x2. That is, the two groups are allowed different slopes for x2 but are forced to share exactly the same intercept, a strong structural assumption that usually has no substantive justification.

The requirement of including main effect terms whenever related interaction terms are included is purely technical and is for the purpose of correct interpretation of results. It has nothing to do with whether or not there is a main effect in the data. The validity and power of the analysis doesn't depend on the existence of main effects. Unfortunately, some people misunderstand this requirement and wrongly criticize regression methods by saying they require main effects to exist and thus are not suitable when there are no main effects in the data or in reality. Such criticisms are wrong.
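The structural assumption can be seen numerically. In this least-squares sketch with simulated data (all numbers assumed), dropping the x1 main effect forces both groups to share one intercept, while including it recovers the two different intercepts:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data in which the two x1 groups truly differ in intercept.
n = 200
x1 = rng.integers(0, 2, size=n)          # binary grouping variable
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Model WITHOUT the x1 main effect: y ~ b0 + b2*x2 + b3*x1*x2.
Z = np.column_stack([np.ones(n), x2, x1 * x2])
b = np.linalg.lstsq(Z, y, rcond=None)[0]
# Both groups are forced to use the single intercept b[0]:
print("shared intercept (structural assumption):", b[0])

# Model WITH the x1 main effect: y ~ b0 + b1*x1 + b2*x2 + b3*x1*x2.
Zfull = np.column_stack([np.ones(n), x1, x2, x1 * x2])
bf = np.linalg.lstsq(Zfull, y, rcond=None)[0]
print("group intercepts with main effect:", bf[0], bf[0] + bf[1])
```

The full model's two intercepts land near the true values 1 and 3, while the reduced model has no way to express that difference regardless of how strong it is in the data.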
When you fit a regression with an interaction term, it often happens that one main effect is no longer significant. It doesn't mean the main effect doesn't exist or that you can remove the main effect term from the model. It just means that the main effect column doesn't provide significantly more power to explain the outcome after the other main effect columns and the interaction effect columns have been taken into account. (Remember that interpretation of t-test results always is conditional.) The variable still explains the outcome, maybe mainly through the interaction effect columns. If you want correct interpretation of results, you need to keep it in the model.
In some situations, a naïve calculation of the p-value may be way off. In genetics, people routinely test for Hardy-Weinberg equilibrium (HWE). The traditional test statistic often is well approximated by the chi-squared distribution with 1 degree of freedom. I once derived a new test for HWE, with a test statistic formula similar to that of the traditional one. I was tempted to use the same chi-squared distribution to calculate the p-value. However, the new statistic had a considerably wider variation than the traditional one. As a result, the true probability of observing the data was much higher than what the chi-squared distribution would tell me. In other words, the p-value calculated from the chi-squared distribution was much smaller than the true p-value. For example, one data set had p-value 0.02 when calculated from the chi-squared distribution, while its true p-value was >0.1 when the variance inflation was taken into account. I am sure these two numbers would give you quite different levels of evidence against the null hypothesis.
In many situations, the p-value is designed to reflect the probability of observing data as extreme as or more extreme than your current data GIVEN that both the model structure and the variables under consideration are pre-specified. If you use a screening method to search for the best list of variables or the best model structure, the p-value often doesn't reflect the additional variation brought in by the screening process. In other words, the true distribution of the test statistic has a wider variation than what the p-value can reflect, and the p-value may give you a false sense of strong evidence against the null hypothesis.
The p-value is supposed to give you a sense of the level of departure of your data from the null hypothesis, and you place belief in the null or alternative hypothesis on the basis of the p-value. However, if the p-value you get is a lot different from the true p-value, you will have a false sense of the level of evidence in your data. Watch out.
Study design has a few components: (i) subject ascertainment, (ii) variable collection, (iii) allocation of resources, and (iv) logistics and data management. We will address these issues in the following paragraphs.
For studies on human subjects, we need to determine what demographic composition is the best to answer our questions and to determine the criteria for inclusion and exclusion of subjects. If nested ascertainment is needed, we need to select the best nested sampling procedure. In addition, we carry out sample size calculation to determine the number of subjects needed to have enough power to detect the effect that we intend to demonstrate. Moreover, we want to ensure that the subjects are representative of the population that we want to study, that potential confounding factors are taken into consideration through individual or distributional matching across different groups, that potential selection biases are avoided, and that randomization is carried out if subjects will be assigned to different groups. Without steps to address these issues, misinterpretation of analysis results may occur.
We also need to determine a list of variables to collect. Make sure the list includes variables that potentially influence the outcome and factors that are potentially confounded with the other variables. This will allow us to adjust for the effects of these variables and to carry out stratified analyses if necessary.
All studies are constrained by funding and resources. Hence, optimal allocation of available resources is extremely important. There is a tendency for investigators to allocate resources to cover many multi-factor combinations, resulting in small numbers of subjects in each combination. This often dilutes the resources and makes it difficult to have any conclusive results when the data are analyzed. We want to have optimal allocation of resources so that we are guaranteed to have enough power to conclusively answer our questions.
Data management is also important to the success of a study. Data analysis and interpretation will depend on the quality of the data. However, data may be recorded incorrectly on paper or mistyped when entered into computers, and may be recorded in different formats or units. Data may also be merged inappropriately when patient records are extracted from multiple databases. We need to implement procedures to minimize the chances of these things happening and to check for potential errors. Data checking involves checking for logical inconsistencies and variable ranges, and detecting unexpected patterns and missing-data patterns through exploratory data analysis. In addition, work with IT folks to determine database management and access policies that ensure data quality.
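The range and logical-consistency checks mentioned above can be as simple as a few rules run over every record. A minimal sketch in Python, with hypothetical field names and thresholds chosen purely for illustration:

```python
# Hypothetical patient records; field names and values are invented.
records = [
    {"id": 1, "age": 54, "sex": "F", "weight_kg": 70.2, "pregnant": False},
    {"id": 2, "age": -3, "sex": "M", "weight_kg": 81.0, "pregnant": False},
    {"id": 3, "age": 41, "sex": "M", "weight_kg": 6400, "pregnant": True},
]

def check_record(r):
    """Flag range violations and logical inconsistencies in one record."""
    problems = []
    if not (0 <= r["age"] <= 120):
        problems.append("age out of range")
    if not (2 <= r["weight_kg"] <= 400):
        problems.append("weight out of range (wrong units?)")
    if r["pregnant"] and r["sex"] == "M":
        problems.append("pregnant male: logical inconsistency")
    return problems

for r in records:
    for p in check_record(r):
        print(f"record {r['id']}: {p}")
```

Running simple rules like these before any analysis catches the unit mix-ups and impossible values that otherwise surface mid-analysis, or worse, never.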
Study design is much more than sample size calculation. Involving statisticians in the design stage will pay off tremendously; if that is impossible, at least consult them. You will be glad that you did. Sir Ronald Fisher once said, "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
Measurement error happens everywhere, and we need to know its impact on analysis results. One impact is that it gives you less accurate measurements and thus lowers the signal-to-noise ratio in your data, leading to loss of power. Another impact is attenuation: if there is measurement error in an explanatory variable, the parameter estimate for that variable tends to attenuate (i.e. to be biased towards zero, or no effect), leading to underestimation of the effect and loss of power.
Should we do repeated measurements on a subject to bring down the measurement error? It depends. In general, spending resources on one measurement per subject is more efficient than on multiple measurements per subject. However, if the magnitude of the measurement error is big compared to the range of the variable in your data set, the impact may be big enough to warrant special effort to bring down the measurement error. This is because the magnitude of attenuation depends on the ratio of within-subject variation (a.k.a. measurement error) to between-subject variation. You need to have a good idea of how big the within-subject variation is.
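The attenuation described in these two paragraphs is easy to see in a simulation. The sketch below (plain Python, all numbers invented for illustration) fits an ordinary least-squares slope twice: once on the true explanatory variable and once on a noisy measurement of it. With between-subject and within-subject variances both equal to 1, the slope should attenuate by a factor of about 1/(1+1) = 1/2:

```python
import random

random.seed(1)

def fit_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

n = 20000
true_x = [random.gauss(0, 1) for _ in range(n)]      # between-subject sd = 1
y = [2 * t + random.gauss(0, 1) for t in true_x]     # true slope = 2
noisy_x = [t + random.gauss(0, 1) for t in true_x]   # measurement-error sd = 1

print(round(fit_slope(true_x, y), 2))    # close to the true slope, 2
print(round(fit_slope(noisy_x, y), 2))   # attenuated toward 2 * 1/(1+1) = 1
```

Averaging several measurements per subject shrinks the within-subject variance and so pulls the attenuation factor back toward 1.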
PS. Not surprisingly, many studies follow the opposite track: analyze the data to death to find any "signals" and "patterns".
PS2. Graphs often are easier to understand than tables. The speaker was good in this aspect.
Interpretation of results really depends on how the data were collected.
Therefore, t-tests in any regression analysis should always be interpreted as conditional tests, and should not be used to judge the absolute usefulness of a variable to explain the outcome except when the variable is independent of all the other variables.
Interpretation of results really depends on what you included in the analyses.
The logic behind hypothesis testing is similar to "innocent until proven guilty" in the legal system. You cast innocence as the null hypothesis and guilt as the alternative hypothesis. Examples of the innocent include no difference, no effect, no change of effect, equilibrium, etc. The data have to show strong departure from the notion of innocence for you to reject the null hypothesis. However, there is a big difference between hypothesis testing and the legal system. In hypothesis testing, failing to prove guilt does not mean you acquit the null. In other words, this is not a binary decision. Because of this, we only say we "don't reject" the null hypothesis and avoid saying we "accept" it.
When you test for an effect, you need a way to capture the effect based on your sample. The result often is a number, called a statistic. [It is possible to have a vector of numbers capture the effect instead of a single number.] The statistic indicates some level of departure from the null. Once you have calculated your statistic, you need to know how likely you are to see such a departure, or a larger one, if the null hypothesis is true. To do this, you compare your statistic with a reference distribution, which is the distribution of the statistic when the null is true. The fraction of the reference distribution that is as extreme as or more extreme than your statistic is the p-value. Classical statistical methods often have well defined reference distributions (normal, t, F, χ2, etc.) or asymptotic reference distributions. Many recent statistical methods rely on computers to generate reference distributions through permutations or simulations.

Lack of power may be due to small sample size, but it also may be due to the inefficiency of your statistic at capturing the effect, or both. For example, suppose you have two groups of measurements and want to compare them. You do a t-test and the p-value is large. However, the two groups may be quite different in their distributions; they may just happen to have similar averages. In this situation, the test statistic in the t-test is not efficient at capturing the difference.
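A minimal sketch of how a computer-generated reference distribution works, using a permutation test for a difference in two group means (pure Python; the function name and simulated data are made up for illustration):

```python
import random

random.seed(0)

def perm_test_mean_diff(a, b, n_perm=5000):
    """Two-sample permutation test for a difference in means.
    The reference distribution is built by shuffling group labels."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[: len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_perm   # fraction of the reference as extreme as observed

x = [random.gauss(0.0, 1) for _ in range(30)]
y = [random.gauss(1.5, 1) for _ in range(30)]   # a real 1.5-sd shift
print(perm_test_mean_diff(x, y))   # small p-value: the shift is easy to detect
```

The shuffling enforces the null ("group labels don't matter"), so the collected differences are exactly the reference distribution the p-value is measured against.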
The p-value is not only a function of the effect you are testing for, but also a function of the sample size. Thus, the interpretation of a p-value has to take both effect and sample size into account. In a large study, a significant p-value may not mean there is a biologically significant effect; it may have been driven by the large sample size. In a small to moderate study, a non-significant p-value may not mean the effect is biologically insignificant; the sample size may be too small to provide enough power.
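A small simulation makes the sample-size point concrete. The sketch below (plain Python, illustrative numbers only) tests the same tiny shift of 0.03 standard deviations with two very different sample sizes; only the large sample reliably flags it as "significant":

```python
import math
import random
from statistics import NormalDist

random.seed(0)

def z_test_p(sample, mu0=0.0):
    """Two-sided z-test p-value for the mean (sd treated as known, = 1)."""
    z = (sum(sample) / len(sample) - mu0) * math.sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))

effect = 0.03                                   # tiny, arguably unimportant shift
small = [random.gauss(effect, 1) for _ in range(100)]
large = [random.gauss(effect, 1) for _ in range(100_000)]

print(z_test_p(small))   # typically far from significant
print(z_test_p(large))   # highly significant, driven by the huge sample size
```

Same effect, very different p-values: the p-value alone does not tell you whether the effect matters.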
PS. A p-value of 0.03 cannot be interpreted as meaning the probability that the null hypothesis is true is 0.03. We do place personal beliefs on the possibility of the null being true or false, but the p-value is not designed to quantify that belief, even though small p-values often go with a strong belief that the null is false.
Although this phrase is correct and will be understood by people who understand it (of course), its psychological impact will be different from "we cannot reject the notion of no difference". "There is no statistical difference" is so similar to "There is no difference" that many people may unconsciously perceive them as the same.
Another problem in that paper was they mentioned false discovery rate (FDR) and referred to Benjamini and Hochberg (1995). Looks good so far. But what they really meant was just false positive rate, a.k.a. type I error rate. The reason might be that FDR sounds newer and fancier than type I error, which is so 19th century.
Sounds sophisticated and good, right? Well, I did a simulation study by randomly generating such data with affection status randomly assigned. Because the data were randomly generated, the markers couldn't survive Bonferroni correction. Nonetheless, I just picked the top 37 performers, calculated the first principal component, and saw how well it would predict the affection status. It turned out that on average I could reach 92% correct prediction, and 40% of my simulation replicates had 93% or better prediction. So, I had a 40% chance to achieve their feat or beat them on prediction performance. In other words, their prediction performance is well within the variation expected by chance alone.
In general, when there are many variables, if the outcome is involved to guide the variable selection process, then the selected variables tend to correlate with the outcome and "predict" the outcome well, by chance alone.
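This selection effect can be demonstrated directly. The sketch below (plain Python; the subject count, variable count, and voting rule are all invented for illustration, not the simulation described above) assigns a completely random outcome to 50 subjects, generates 1,000 pure-noise variables, keeps the 10 most "associated" with the outcome, and lets them vote on each subject's label. The in-sample "prediction" looks good even though every variable is noise:

```python
import random

random.seed(0)

n, m, k = 50, 1000, 10    # 50 subjects, 1000 noise variables, keep the top 10
outcome = [random.choice([0, 1]) for _ in range(n)]         # random labels
variables = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

def group_means(v):
    g1 = [x for x, y in zip(v, outcome) if y == 1]
    g0 = [x for x, y in zip(v, outcome) if y == 0]
    return sum(g1) / len(g1), sum(g0) / len(g0)

def score(v):
    """Absolute mean difference between outcome groups (crude association)."""
    m1, m0 = group_means(v)
    return abs(m1 - m0)

# Outcome-guided selection: keep the variables most associated with the labels.
top = sorted(variables, key=score, reverse=True)[:k]

def predict(i):
    """Majority vote of the selected variables, each oriented to agree
    with the outcome as well as possible (the source of the optimism)."""
    votes = 0.0
    for v in top:
        m1, m0 = group_means(v)
        votes += (1 if m1 > m0 else -1) * v[i]
    return 1 if votes > 0 else 0

accuracy = sum(predict(i) == outcome[i] for i in range(n)) / n
print(accuracy)   # well above the 50% expected for truly random labels
```

The optimism comes entirely from letting the outcome guide the selection; on a fresh sample these variables would predict at chance level.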
No laughter, please.
Often times there are some others who have the same questions. They will admire your courage instead of thinking you are stupid. If you don't ask questions, you are blocking yourself from accessing knowledge. (Wow, am I knowledge?) Anyway, you will differentiate yourself from the others further, in a bad way. Need more elaboration? Then you may be right that you might be stupid.

Well, think again. To give you an analogy: Suppose you propose to collect two groups of people (men and women) and measure their heights. Association apparently exists between gender and height, and it probably is enough to have 100 men and 100 women to show the groups have significantly different heights. But whatever prediction model you build based on height alone will surely have many misclassifications. Being able to detect a difference in a variable and being able to predict well using that variable are different things.
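The height analogy can be checked numerically. The sketch below (plain Python, with made-up but plausible height distributions) shows a wildly significant group difference alongside far-from-perfect prediction from height alone:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

# Hypothetical heights in cm; the group difference is real and easy to detect.
men = [random.gauss(175, 7) for _ in range(100)]
women = [random.gauss(162, 7) for _ in range(100)]

# Two-sample test (normal approximation): overwhelmingly significant.
se = (stdev(men) ** 2 / 100 + stdev(women) ** 2 / 100) ** 0.5
z = (mean(men) - mean(women)) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(p)          # tiny p-value: the difference is unmistakable

# Classify "man if taller than the midpoint": far from perfect prediction.
cut = (mean(men) + mean(women)) / 2
correct = sum(h > cut for h in men) + sum(h <= cut for h in women)
print(correct / 200)   # decent accuracy, but well short of 100%
```

The distributions overlap substantially, so no classifier based on height alone can come close to perfect prediction, no matter how small the p-value gets.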
In quantitative sciences, the word "frequency" often is used interchangeably with "count" or "number", except in genetics, where it is used for proportion or probability (due to historical reasons). "Relative frequency" is used for proportion. Unfortunately, in daily language, "frequency" may mean both "rate" and "count".
The word "probability" is a relatively abstract concept. But it can be thought of as rate or proportion. Fortunately, no one uses it for counts.
Technical terms tend to have meanings that differ from their daily usage. As in many other fields, terminology in statistics can mislead people. Examples include: parametric, valid, significant, exact, bias, interaction, model, regression, normal, confidence. Although these words come from daily language, they often have narrower or totally different meanings in statistics. Don't just interpret terms at face value.
Now, my confession: As a non-native speaker, I have limited exposure to the various possible usages of a word. This has led to two problems. (1) I tend to have a narrower understanding of a word. For example, "rate" has always been proportion or pace to me. (2) I also misunderstand words a lot. For example, for a long time, I thought "differentiate" was a math term and had trouble understanding what people were talking about when they used "differentiate" in a context apparently without any function involved. Another example is "linkage", which I learned through genetics. People often use this word in contexts that have nothing to do with genetics. A more confusing problem is that some people do use "linkage" in a genetics context but with its daily meaning (i.e. association or connection) instead of its narrower genetics meaning.
In the long run, your sample will be representative of the population most of the time; however, for the sample in your current study, you have no idea whether it is representative or whether you had the bad luck of drawing an unrepresentative one. As a result, statistical inference is like gambling. You may think of all possible samples as 95% representative and 5% unlucky, and calculate a 95% confidence interval (CI) based on your sample. The interpretation of this 95% CI is this: if your sample belongs to the 95% majority, the true value should fall inside the interval; if your sample belongs to the 5% minority, the true value is outside the interval. Even though in the long run, 95% of the time your CI will cover the true value, for your current sample, your mindset (at least mine) will be like a gambler's.
If you don't want the unlucky probability to be that high, you can change it to a lower value, say 1%. Then you shall classify all possible samples as 99% representative and 1% unlucky and calculate a 99% CI (which will be wider than a 95% CI) based on your sample. If your sample belongs to the 99% majority, your CI should contain the real value; if not, you are just unlucky. Because the 99% CI is wider, you have less accuracy, and this is the price you pay for having additional confidence.
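The long-run reading of a confidence interval is easy to verify by simulation. The sketch below (plain Python, using a known-standard-deviation interval for simplicity, with invented parameter values) draws many samples and counts how often the 95% CI covers the true mean:

```python
import random

random.seed(0)

true_mean, sd, n = 10.0, 2.0, 25
trials = 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = sum(sample) / n
    half = 1.96 * sd / n ** 0.5          # known-sd 95% interval for simplicity
    if m - half <= true_mean <= m + half:
        covered += 1

print(covered / trials)   # close to 0.95 in the long run
```

Replacing 1.96 with 2.576 gives the wider 99% interval: higher coverage, less accuracy, exactly the trade-off described above.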
One caveat in using exploratory analysis is that the data effectively are used twice: they tell you the pattern in the data and also contribute to the later analyses you formalize based on the identified pattern. For example, you look at a scatter plot of two variables, find it quite linear, and then fit a model assuming a linear relationship between the two variables. However, this linearity assumption was not pre-specified and is not independent of the data. This may lead to over-fitting, because most analysis procedures quantify results (e.g. through p-values) assuming the model structure is pre-specified. Also note that the more extensive the exploratory analyses you do, the more likely you are to find "patterns".
Unless you are doing a pure replication study, the advantages of exploratory analyses outweigh the disadvantage described above. So, as a general guideline, always look at the data. Don't just blindly believe the numbers coming out of an analysis procedure without looking at the data.
Categorization is very rigid. For example, BMI quartiles allow only four possible levels of BMI effects, while a simple linear or quadratic function of BMI allows many more possible levels of BMI effects. BMI quartiles take 4 degrees of freedom away from data to estimate the associated effects. These 4 degrees of freedom will be better spent if we use a restricted cubic spline on BMI with 4 knots.
Some people perceive effects in terms of odds ratios. Because traditional educational materials on odds ratios tend to focus on categories, they feel it necessary to categorize variables so that they can understand the results. In fact, odds ratios can always be obtained from analyses with continuous variables.
If you laughed, good. Then try not to do this when you are an investigator in the future.
For example, an investigator had money for only 50 mice and she wanted to know (1) if the effect of a drug would differ for two different mouse strains, (2) if the response difference between the strains would vary at different dosage levels, and (3) for each strain, if the response would differ at different dosage levels. She tried 2 strains and 5 dosage levels, allocating only 5 mice to each strain-dosage combination. Because the resources were diluted over so many different combinations and the variation of responses among the 5 mice for any single combination was quite high, we could only answer the first question in a conclusive way. She might have thought of three questions, but they translate to more than a dozen statistical tests that relied on only 50 data points! Actually, one test did show significance (without correcting for multiple comparisons). But when you do 12 tests, the probability that at least one appears significant at level 0.05, by chance alone, is 0.46, almost 50%. In other words, if you repeated this experiment twice, a "significant" result in one experiment or the other would be quite likely, just by chance.
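The 0.46 figure comes from the standard formula for the chance of at least one false positive among independent tests, which is a one-liner:

```python
# Probability of at least one "significant" result among m independent
# tests at level alpha, when every null hypothesis is true.
def any_significant(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

print(round(any_significant(12), 2))   # 0.46, matching the mouse example
```

The formula assumes the tests are independent; with correlated tests the exact value differs, but the qualitative lesson is the same.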
Basically speaking, Simpson's paradox describes the seemingly contradictory effects of variable 1 on variable 2, depending on whether we ignore variable 3 or take it into account. This often occurs because the subsets of data stratified by variable 3 are not comparable (i.e. they are heterogeneous) with respect to variables 1 and/or 2. Thus, pooling the heterogeneous subsets while ignoring variable 3 often leads to misleading results.
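A small numeric illustration (counts invented purely to exhibit the reversal): treatment A beats B within each severity stratum, yet B looks better when the strata are pooled, because A was given mostly to severe cases:

```python
# Hypothetical counts: (treatment, severity) -> (recoveries, patients).
data = {
    ("A", "mild"):   (18, 20),
    ("A", "severe"): (32, 80),
    ("B", "mild"):   (64, 80),
    ("B", "severe"): (6, 20),
}

def rate(treatment, severity=None):
    """Recovery rate for a treatment, within one stratum or pooled."""
    cells = [v for (t, s), v in data.items()
             if t == treatment and severity in (None, s)]
    rec = sum(r for r, _ in cells)
    tot = sum(n for _, n in cells)
    return rec / tot

print(rate("A", "mild"), rate("B", "mild"))       # 0.9 vs 0.8: A wins
print(rate("A", "severe"), rate("B", "severe"))   # 0.4 vs 0.3: A wins
print(rate("A"), rate("B"))                       # 0.5 vs 0.7: B "wins" pooled
```

Here variable 1 is treatment, variable 2 is recovery, and variable 3 is severity; pooling hides the fact that A's caseload was much more severe than B's.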
Students: If you are interested, identify the three variables in each of the five examples in the above linked article.