Here was the question:
Do you know of any good stats software packages that our department can purchase to use for some of our research projects? We are looking for something that is fairly straightforward/easy to use, and provides tools to perform survival analysis in addition to other basic analysis sets. We also would like this software to make publication quality tables/graphs. If such a creature exists, I figured you might know what it is. Can you help?
Here was Rafe's response:
You know, I was just thinking of writing you a letter. Our department is in need of some simple orthopedic tools, you know, fancy hammers, drills, titanium pins, bone saws, that sort of thing, for some of our projects here, especially for some of our more athletic members who tend to keep tearing menisci and ACLs and stuff like that. Do you know where we could get some of those tools? Maybe as a kit, with all of them matching? We would also like to be able to open and close wounds, so we might need scalpels and sutures and stuff, to make good quality incisions and close them up all neat and pretty. Oh, and those neat little scope things would be great too. If such a tool kit exists, I figured you might be able to help me find one. Can you help?

No, not really, but I hope you understand the point.
There are lots of great stat software packages (SAS, SPSS, R, S-Plus, STATA, etc.). R is even free. They all do survival analysis. Most of them can be used to spin the world backwards in the hands of a talented statistician/programmer, but therein lies the rub. Even if you gave me all the tools that you guys have, I wouldn't be able to do a decent job at what you do. I might be able to tie one of those fancy knots in a suture after a couple tries, but I certainly wouldn't be able to do what I would need to do.
About the software products, though:
I believe none of them is easy to use for someone who is not completely sure of what he or she is doing with an analysis. In that sense, all of them can be brutally difficult to use for someone who isn't a statistician.
Rafe
Based on the news article, the analysis corrected for age, smoking, and other potential risk factors for coronary heart disease. In such an analysis, the results for the coffee-drinking variables reflect their additional effects after the other factors have been taken into account. The results show that there are no additional effects in the data, NOT that there are no effects. Real effects may still exist, but since coffee drinking is associated with smoking, some of them could be explained away by the smoking variable.
The estimates of risk ratios are all around one (i.e. no difference from the baseline, which presumably is no coffee drinking at all), with 95% CIs including one. This means we can't tell if they differ from one. But the news article picked the values that happened to have point estimates smaller than one, ignoring the precision associated with the estimates, and claimed: "In fact, men and women who drank six or more cups of coffee a day for up to 20 years had a slightly lower relative risk". This statement will mislead people into thinking "if I want to drink coffee, then drink a lot". There was only one 95% CI that didn't include one (and only slightly off from one), but given that several CIs were reported, it is quite possible that by chance alone one of them fails to cover the true value.
The same principle applies to other model-comparison methods. For example, when using cross-validation to compare models, we estimate prediction performance of the models. When missing data is present, models may have been evaluated based on different subsets of data, and more complex models tend to be based on fewer subjects.
Suppose we fit a regression with a categorical variable h with four categories {h1, h2, h3, h4}, a continuous variable x, and their interaction effects. As we explained in the blog item Significance of a category in a categorical variable, we effectively fit a model allowing the four categories to have their own intercepts and slopes. The main effects are reflected in the intercepts and the interaction effects are reflected in the slopes. Again, an intercept and a slope clearly are not comparable.
Suppose we fit a regression with two categorical variables and their full interaction terms. Again, the coefficients of main and interaction effects are not comparable. Let me explain it in the simplest case: both input variables x1 and x2 are binary and are coded as 0 and 1. The model β0 + β1I{x1=1} + β2I{x2=1} + β3I{x1=1}I{x2=1} is equivalent to specifying the four cell means:

x1=0, x2=0: β0
x1=1, x2=0: β0 + β1
x1=0, x2=1: β0 + β2
x1=1, x2=1: β0 + β1 + β2 + β3

The main-effect coefficients are differences between cell means, while β3 is the extra, non-additive adjustment for a single cell; the two kinds of coefficients do not play comparable roles.

Suppose we study the effects of two variables on an outcome. Variable h is categorical with four categories {h1, h2, h3, h4} and variable x is binary (coded as 0 and 1). An investigator may be interested in a specific category, say h1, and wonder whether the effect of h1 is the same for any value of x; in other words, whether there is an interaction between h1 and x. If you carry out a regression analysis, you might include the interaction terms between x and all the categories of h and look at the z-test (or t-test) corresponding to the interaction term between h1 and x. In this situation, what you are doing effectively is this: Given that the effects of the combinations of x=1 and all categories but h1 are allowed the flexibility to be non-additive (in the scale of the right-hand side), I am testing whether allowing such flexibility for the combination of x=1 and h1 will significantly improve the model fit.

Alternatively, if you include only the interaction term between x=1 and category h1 in your regression analysis and look at its corresponding z-test (or t-test), what you are doing effectively is this: Given that the effects of the combinations of x=1 and all categories of h are constrained to be additive, I am testing whether allowing the effect of only the combination of x=1 and h1 to be non-additive will significantly improve the model fit.

Some other ways of quantifying the original question exist, although they may not be as justifiable as the two above. In fact, there is no single optimal way of quantifying the original question.
In this situation, it is prudent to carry out analyses under the different potential quantifications and see if they lead to the same conclusion; in other words, carry out sensitivity analyses to see whether the results are sensitive to how you quantify the question.
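Returning to the binary-by-binary model: the interaction model is just a reparameterization of the four cell means, which can be checked numerically. Here is a sketch in Python (the simulated data, the seed, and the coefficient values 1.0, 0.5, 2.0, 1.5 are all invented for illustration):

```python
import random

# Simulate a 2x2 design: outcome depends on binary x1, x2 and their interaction.
rng = random.Random(1)
data = []
for _ in range(400):
    x1, x2 = rng.randint(0, 1), rng.randint(0, 1)
    y = 1.0 + 0.5 * x1 + 2.0 * x2 + 1.5 * x1 * x2 + rng.gauss(0, 0.3)
    data.append((x1, x2, y))

def cell_mean(a, b):
    vals = [y for x1, x2, y in data if (x1, x2) == (a, b)]
    return sum(vals) / len(vals)

m00, m10, m01, m11 = cell_mean(0, 0), cell_mean(1, 0), cell_mean(0, 1), cell_mean(1, 1)

# For a saturated model, least squares reproduces the cell means exactly, so the
# coefficient estimates are just differences of cell means:
b0 = m00                          # intercept: the (0,0) cell
b1 = m10 - m00                    # main effect of x1 (within x2 = 0)
b2 = m01 - m00                    # main effect of x2 (within x1 = 0)
b3 = (m11 - m01) - (m10 - m00)    # extra, non-additive adjustment for the (1,1) cell
```

The interaction estimate b3 is a difference of differences, while b1 and b2 are single differences; they live on different footings, which is why comparing their magnitudes directly is not meaningful.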
An investigator once asked if calculations could be done to show the power of an approach to detect an interaction effect between two variables. When asked to quantify the target magnitude of the interaction effect, he said this: We want to see the power to detect interaction when the interaction explains half of the main effect. This goal sounds noble but it is hard to quantify. It is quantifiable in some artificial ways, but I doubt it is quantifiable in any really meaningful way, because the coefficients for main and interaction effects are not comparable (see my other blog item).
Signed rank test: When we have one sample, such as the differences from a paired data set, we rank the absolute values of all the numbers from the lowest to the highest. [The textbook we use excludes zero from the ranking while some software, like Stata, includes zero; they will lead to different statistics and different reference distributions under the null, but they should lead to the same p-values.] Then we sum the ranks of the negative values and the ranks of the positive values, and pick the smaller of the two sums as the statistic. The null hypothesis is that the median of the underlying distribution is zero and the distribution is symmetric about the median (the second part often is missed in most textbooks). If the null is true, then the two sets of ranks in the positive and negative groups of a random sample should behave as if both the ranks and the signs had been randomly assigned, with the group sizes varying as a result (although the sum of the group sizes is fixed). For each random assignment of ranks and signs, we can calculate the sums of ranks for the resulting two groups and pick the smaller sum. After enumerating all possible assignments, we have a reference distribution, and the statistic of the real data can then be compared with this reference distribution to obtain a p-value (the proportion of values in the reference distribution that are smaller than or equal to the real statistic). Since we always pick the smaller sum regardless of whether it comes from the positive or the negative group, the p-value is two-sided.
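The procedure just described can be sketched in Python. The eight paired differences below are hypothetical (with no zeros and no ties, to keep the ranking simple), chosen small enough that all 2^8 sign assignments can be enumerated exactly:

```python
from itertools import product

# Hypothetical paired differences (small sample so we can enumerate exactly).
diffs = [1.2, -0.4, 2.1, 0.7, -0.3, 1.5, 0.9, 1.8]

# Rank the absolute values; no ties here, so the ranks are simply 1..n.
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0] * len(diffs)
for r, i in enumerate(order):
    ranks[i] = r + 1

def statistic(signs, ranks):
    # The smaller of the two rank sums (positive group vs. negative group).
    pos = sum(r for s, r in zip(signs, ranks) if s > 0)
    neg = sum(r for s, r in zip(signs, ranks) if s < 0)
    return min(pos, neg)

observed = statistic([1 if d > 0 else -1 for d in diffs], ranks)

# Under the null (symmetry about zero), each sign is +/- with probability 1/2:
# enumerate all 2^n sign assignments to build the reference distribution.
ref = [statistic(signs, ranks) for signs in product([1, -1], repeat=len(diffs))]
p_value = sum(stat <= observed for stat in ref) / len(ref)
```

Because the statistic is the smaller sum regardless of sign, this p-value is two-sided, matching the description above.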
Note that the symmetry part of the hypothesis is important and is the basis of our treating the ranks equally regardless of the signs. When we think the underlying distribution is far from symmetric, neither the t-test nor the signed rank test is suitable for detecting whether the central location of the distribution differs from zero. Thus, the choice of the signed rank test over the t-test is mainly driven by concern about outliers in the data.
Rank sum test: When we have two samples to compare (without pairing in the design), we rank all the values (not absolute values) from the lowest to the highest, then sum the ranks within each sample, and pick the sum for the smaller group. The null hypothesis is that the two distributions underlying the two samples are the same. If the null is true, then the two sets of ranks should behave as if the ranks had been randomly assigned. For each random assignment of ranks, we can calculate the sums of ranks for the two groups and pick the sum for the smaller group (always the same group). After enumerating all possible assignments, we have a reference distribution, and the statistic of the real data can then be compared with this reference distribution to obtain a p-value. A one-sided p-value is the proportion of values in the reference distribution that are equal to or more extreme (i.e. farther from the center) than the real statistic; a two-sided p-value is twice the one-sided p-value. Unless there is a strong a priori preference, the two-sided p-value should be reported.
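Here is a matching sketch in Python with two hypothetical samples. The samples are small enough that, instead of random permutations, all C(7, 3) assignments of ranks to the smaller group can be enumerated:

```python
from itertools import combinations

# Two hypothetical samples, no pairing; group a is the smaller group.
a = [3.1, 4.5, 2.8]
b = [5.0, 6.2, 4.9, 7.1]

pooled = sorted(a + b)
rank = {v: i + 1 for i, v in enumerate(pooled)}  # no ties in this toy data
observed = sum(rank[v] for v in a)               # rank sum of the smaller group

n, k = len(pooled), len(a)
# Under the null (identical distributions), every subset of k ranks is equally
# likely to belong to the smaller group: enumerate all C(n, k) assignments.
ref = [sum(s) for s in combinations(range(1, n + 1), k)]

center = k * (n + 1) / 2   # mean of the reference distribution
if observed <= center:
    one_sided = sum(s <= observed for s in ref) / len(ref)
else:
    one_sided = sum(s >= observed for s in ref) / len(ref)
two_sided = min(1.0, 2 * one_sided)
```

With larger samples, exact enumeration becomes infeasible and one would sample random assignments instead; the logic stays the same.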
Note that there is no symmetry requirement in the null hypothesis. Thus, the choice of the rank sum test over the t-test may be driven by concern about either outliers in the data or an asymmetric underlying distribution.
The textbook describes these tests as comparing medians, but now you can see there is little about the median in either test. In the signed rank test, the median and the mean are the same because of the symmetry requirement on the distribution. The rank sum test is about whether the two distributions underlying the two groups of observations are the same or not.
The repair shop wants to know if the new procedure worked. One way of looking at the data is that there is a change of the fraction of cars with good engines from 500/900 to 580/900. Another way of looking at the data is a 2x2 table with initial engine status (good/problem) as one factor and engine repair status (yes/no) as the other factor; the four cell counts will then be 0, 500, 80, and 320. However, if you naively jump at an analysis by carrying out a test of independence, you are wrong. The repair shop's question is not answerable based on the data, because the data are collected in a way that will always show improvement. Moreover, when two variables are dependent on each other by design, a test of independence doesn't provide any information, whether the result is significant or not. In fact, the repair shop's question is even ambiguous, with the comparison groups not clearly defined; once the comparison group is well defined, the need for collecting more data will be apparent.
Here is another example. An investigator was relatively savvy at statistics; she could run some analyses herself. Once she asked me to help analyze her data, which had an ordinal outcome variable (like disease stages and many measures in psychiatric diseases). She would do ANOVA, which treated the outcome as if it were continuous and thus was inappropriate. I told her the proportional odds model might be suitable, and she wondered if she could fit it in her favorite software. Later, she e-mailed me the computer output she generated while trying out her software and asked if it was correct. Even if the output appeared to be okay, I hesitated to say yes, because I wasn't sure she understood what the model did and what assumptions it made. She might have viewed statistics as a field that lives by generating results.
This view of statistics is quite common. Some investigators want to send their post-docs to workshops to learn data analysis techniques, with the hope that the post-docs will replace the statisticians, who often have higher salaries. Well, the newly trained post-docs will replace the statisticians when it comes to generating computer output, but probably not when it comes to correctly formulating the problems and correctly interpreting the output. [Another advantage of having post-docs do the work is that they are more obedient, unlike statisticians, who tend to hold back and not endorse the investigator's new discoveries through analysis, a.k.a. data-dredging.]
Sometimes, it can. Interactions among the input variables may be a reason for such a phenomenon. But this phenomenon also can be observed when the input variables are independent of each other and there is no interaction effect between the input variables. A more fundamental reason is precision.
Suppose the outcome is really determined by two independent input variables x and y, linearly and without interactions. If you put both variables into the analysis, great. If you put only x into the analysis, the signal due to y has to be shouldered by both x and the error term. If x and y are not correlated, that signal will mainly be attributed to the error term. This makes the variance estimate larger and leads to higher variation of the coefficient estimate. In some situations, the coefficient estimates from individual simple regressions may both swing towards zero, appearing to give weaker effects.
In Stata, you can run the following code a few times and observe the coefficient estimates and the associated standard error estimates.

clear
set obs 20
gen x1 = invnormal(uniform())
gen x2 = invnormal(uniform())
gen z = x1 + x2 + 0.1*invnormal(uniform())
regress z x1 x2
regress z x1
regress z x2
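For readers without Stata, here is an equivalent sketch in Python (pure standard library, so least squares is solved by hand via the normal equations; the seed and sample size are arbitrary choices):

```python
import random

# Mimic the Stata simulation: z = x1 + x2 + small noise, n = 20.
rng = random.Random(7)
n = 20
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
z = [a + b + 0.1 * rng.gauss(0, 1) for a, b in zip(x1, x2)]

def ols(X, y):
    """Least squares via the normal equations; X rows include an intercept column."""
    m, p = len(y), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(m)) for k in range(p)] for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(m)) for j in range(p)]
    for col in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv], c[col], c[piv] = A[piv], A[col], c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (c[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return beta

def rss(X, y, beta):
    return sum((yi - sum(b * xj for b, xj in zip(beta, xi))) ** 2 for xi, yi in zip(X, y))

X_full = [[1.0, a, b] for a, b in zip(x1, x2)]
X_x1 = [[1.0, a] for a in x1]
beta_full = ols(X_full, z)  # both slope estimates land near 1
beta_x1 = ols(X_x1, z)      # with x2 dropped, its signal is pushed into the error term
```

Dropping x2 can never reduce the residual sum of squares, so the reduced model's variance estimate is inflated, which is exactly the mechanism described above.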
(1) The significance level is the false positive rate we are willing to tolerate. When all the other factors are fixed, the higher the significance level, the higher the power. But that higher power is achieved by sacrificing the false positive rate; thus, in power calculations, we often fix the significance level. (2) Obviously, when all the other factors are fixed, the larger the sample size, the more information we have and the higher the power. (3) In a t-test, the target effect is the presumed difference in the means; in a regression, the target effect may be the absolute value of the coefficient for an input variable; in a case-control study, the target effect may be the odds ratio or the difference in the proportions of exposure between the case and control groups. When all the other factors are fixed, the larger the target effect, the easier it is for a method to detect, and thus the higher the power. (4) The variance of the outcome can be viewed as the noise level. In a t-test, it is the within-group sample variance; in a regression, it is the true residual variance. The lower the noise level, the easier it is for a method to detect the signal, and thus the higher the power. Often the variance can be estimated from past results or a pilot study. If not, we have to speculate or consider multiple potential values for the variance. Sometimes there is no need to specify the variance because it is determined by the underlying unknown parameters; for example, in binomial sampling, the variance is determined by the underlying success probability. (5) An example of the type of the target effect is the choice among dominant, additive, and recessive effects of an allele in genetic studies. An example of the baseline of the target effect is the allele frequency in the control group in genetic association studies.
A full statement about the power of a method should be: When the variance and the other relevant factors are given in aa, the power of the method to detect an effect of bb at sample size cc is dd at significance level ee. The power is a function of multiple factors: dd = f(aa, bb, cc, ee). For example: When the effect of the risk allele in a biallelic polymorphism is dominant (aa) and the risk allele frequency is 0.2 (aa again), the power of the method to detect an odds ratio of 1.5 (bb) with 200 cases and 200 controls (cc) is 0.85 (dd) at significance level 0.05 (ee).

Many people think the power is a function of solely the sample size. This is true only when all the other factors, especially the magnitude of the target effect, are fixed. Power also is about a method. Some people even forget this and routinely ask, "What is the power of the study?" I don't know what the question means unless the study employs a single data analysis method and all the other factors are fixed. This comes back to a common theme: all concepts have context and should be talked about with their context. Power also has context.
In sample size calculation, we want to determine the sample size necessary to reach a certain level of power. As described above, the power is a function of multiple factors: dd = f(aa, bb, cc, ee). In sample size calculation, we want to find, given the other factors, the sample size cc such that the resulting power achieves some desired level. Only in very simple situations can we derive a closed-form formula (as in EMS Table 35.1) for the sample size as a function of the other factors: cc = g(aa, bb, dd, ee). The formula may be exact or approximate. In more complicated situations, we have to use simulations to carry out the task.

There is another BIG assumption when talking about power or sample size: We often assume the sample is homogeneous, or that the admixture structure of the sample is fully specified and taken into account in the power calculation. In reality, this turns out to be a big assumption and is almost always wrong. As a result, sample size calculations almost always over-simplify the reality and lead to under-powered studies.
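As an illustration of estimating power by simulation, here is a sketch for a two-sample z-test with known variance (all the numbers are hypothetical; a real calculation would use the test actually planned for the analysis):

```python
import math
import random

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def simulated_power(delta, n_per_group, sigma=1.0, alpha=0.05, reps=1000, seed=0):
    """Estimate power of a two-sample z-test (sigma known) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        m0 = sum(rng.gauss(0.0, sigma) for _ in range(n_per_group)) / n_per_group
        m1 = sum(rng.gauss(delta, sigma) for _ in range(n_per_group)) / n_per_group
        z = (m1 - m0) / (sigma * math.sqrt(2.0 / n_per_group))
        p = 2 * (1 - normal_cdf(abs(z)))
        hits += p < alpha          # count rejections at level alpha (ee)
    return hits / reps

# Same sample size (cc) and noise level (aa), different target effects (bb):
power_big_effect = simulated_power(delta=0.5, n_per_group=100)
power_small_effect = simulated_power(delta=0.1, n_per_group=100)
```

The two calls differ only in the target effect, and the estimated powers differ dramatically, which is the point of the "dd = f(aa, bb, cc, ee)" notation above.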
All data analysis methods have assumptions. This is true even when you calculate a simple average, because by adding up numbers you are assuming the numbers are comparable.
We can evaluate a model structure by estimating its prediction performance. In a ten-fold cross-validation, we randomly divide the dataset into 10 exclusive parts with equal size. For each part, we leave it out and fit the model using the other nine parts, and then use the fitted model to predict the outcome for the observations in the part we have left out. We then compare the predicted and the observed outcomes and summarize the prediction performance (e.g. mean squared errors for continuous outcomes and misclassification rate for binary outcomes). The model structure with the best performance will then be selected as the best possible model and will be used to fit all the data to obtain a final model fit.
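The ten-fold procedure can be sketched in Python for a simple linear regression (hypothetical data; in this toy example the model structure is correct, so the cross-validated mean squared error should land near the true noise variance of 0.25):

```python
import random

# Hypothetical data: one predictor with a linear effect plus noise (sd 0.5).
rng = random.Random(3)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]
data = list(zip(xs, ys))

def fit_simple(train):
    # Closed-form simple linear regression: returns (intercept, slope).
    tx = [x for x, _ in train]
    ty = [y for _, y in train]
    mx, my = sum(tx) / len(tx), sum(ty) / len(ty)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in tx)
    return my - slope * mx, slope

rng.shuffle(data)
folds = [data[i::10] for i in range(10)]   # ten (nearly) equal parts

squared_errors = []
for k in range(10):
    # Fit on the other nine parts, predict the held-out part.
    train = [pt for j, fold in enumerate(folds) if j != k for pt in fold]
    intercept, slope = fit_simple(train)
    squared_errors += [(y - (intercept + slope * x)) ** 2 for x, y in folds[k]]

cv_mse = sum(squared_errors) / len(squared_errors)   # estimated prediction error
```

Each observation is predicted exactly once, by a model that never saw it during fitting; that is what makes the error estimate honest for a single, pre-specified model structure.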
Some people may have the perception that once cross-validation is built into a model selection procedure, over-fitting is no longer a problem. This is wrong, and here is why. Suppose a new outcome variable has been generated randomly, with no connection to any of the other variables. For each model structure, we still can use cross-validation to evaluate its performance in predicting this newly generated outcome. The measure of performance is a statistic and has its own variation. Thus, by chance alone, some model structure will appear to have good prediction performance. If the set of variable combinations is large, some variable combination will appear to have great performance, again by chance alone. If you claim it as your winner, you are over-fitting the data, because the outcome had nothing to do with any of the other variables.
I was involved in a debate on this, and one investigator even stated that sometimes he saw some tendency of correlation between the two observations from an animal, but he would do a statistical test to see if the correlation was significantly different from zero, and if not, he would treat the observations as if they were independent. This is dangerous. If you suspect there is correlation, take it into account in your analysis. The correlation may not be strong enough, or the sample size may not be big enough, to yield a statistically significant correlation. But absence of evidence is not evidence of absence. If correlation exists but is not treated accordingly, the analysis often gives you inflated confidence in the results, because it tends to yield a smaller variance estimate than reality and thus leads to a higher false positive rate than what you expected to tolerate.
The same principle applies to other organs in the body. Correlated data should be treated differently from what we have covered in an introductory course. Seek help from statisticians if you have correlated data.
One drawback of treating variables as continuous is the linearity assumption, which may be too strong in some situations. One solution is to use restricted cubic splines, which are much more flexible than assuming just four or five flat levels for the variable.
Similarly, tables are useful tools to display data and results. But the need for using tables to display data and analysis results shouldn't dictate that our analysis has to be based on tabulated numbers (i.e. categorized variables) instead of the original, more informative observations.
Don't strip results of their context when trying to interpret them. Context includes structural and distributional assumptions (e.g. linearity, the logit link, normality, exchangeability, etc.), the units of variables, the ranges of variables, transformations, the other variables included in the analysis, how the data were collected, etc.
This dependence on a variables unit is not limited to logistic regressions. It is everywhere.
You can be fooled if you are not careful. Suppose you carry out simulations to evaluate the performance of a method to detect genetic, environmental, and interaction effects, and you choose BMI and a genetic marker as input variables (BMI ranging from 20 to 35 and genotype coded as 0, 1, 2). To generate simulated data, you need to assign effects to the input variables. You may set the coefficients of the two variables to the same number, say 1.1, and think you have assigned equal effects to the variables. Well, yes and no. The effects corresponding to the units of the variables appear to be equal. But since BMI has a numerically much larger range than genotype does, you have simulated a much stronger effect for BMI than for genotype, and your method probably will appear more powerful in detecting the BMI effect than the genetic effect.
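A quick back-of-the-envelope check of the example above (the coefficient 1.1 and the variable ranges come from the text; the "effect span" is just the coefficient times the range of the variable):

```python
# "Equal" coefficients do not mean equal effects when variable ranges differ.
coef = 1.1
bmi_span = 35 - 20       # BMI ranges over 15 units
genotype_span = 2 - 0    # genotype coded 0, 1, 2 ranges over 2 units

bmi_effect = coef * bmi_span            # outcome change across the full BMI range
genotype_effect = coef * genotype_span  # outcome change across the genotype range
# 16.5 vs. 2.2: the simulated BMI effect is about 7.5 times as strong.
```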
A small number doesn't mean a small effect, and a large number doesn't mean a large effect.
A transformation may totally change the unit of a variable and may make result interpretation difficult. While it is true that all data analysis methods have assumptions, keep in mind that in an analysis using a transformed variable, we are making the relevant assumptions in the transformed scale, and you need to think about whether it is reasonable to do so.
If this still isn't clear, here is an analogy. Suppose you have 50 coins. You may carefully examine them, pick the one coin that looks problematic, flip it 10 times, and see 9 heads and 1 tail. The corresponding p-value is 0.021. Alternatively, you may flip each of the 50 coins 10 times and see 9 heads and 1 tail for the coin that was suspected to be problematic. But now the p-value is different, because the chance of seeing any coin with 9 heads and 1 tail or a more extreme result is much bigger. Even if all coins were fair, the probability of seeing at least one coin with 9 heads and 1 tail, 9 tails and 1 head, 10 heads, or 10 tails would be 66%; that is, the p-value now is 0.66.
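Both coin probabilities can be verified directly (0.021 for a single pre-selected coin, about 0.66 for the most extreme-looking of 50 coins):

```python
from math import comb

# One pre-selected coin: P(>= 9 of either face in 10 flips of a fair coin).
p_single = 2 * (comb(10, 9) + comb(10, 10)) / 2 ** 10   # 22/1024, about 0.021

# Fifty coins flipped 10 times each: P(at least one coin looks that extreme).
p_any_of_50 = 1 - (1 - p_single) ** 50                  # about 0.66
```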
All these tell us that pre-specification, if possible, is extremely important. Once you prioritize your goals and act on them faithfully, you won't suffer the corrections for the many relatively less important tests, and you won't feel guilty for searching for a result without correcting for multiple comparisons. In any situation, including large-scale analyses, parallel data analysis with equal treatment of the variables or tests should be your last resort. Don't let the machines replace your mind.
You may think of a data set as consisting of both signal and noise, in unknown proportions. We try to extract the signal from the data. If a procedure extracts more "signal" than is embedded in the data, it is over-fitting. Although we don't know exactly how much of the extracted signal is real, there are principles to follow to guard against over-fitting.
Over-fitting is mainly due to searching. Suppose you have an outcome variable and 20 potential explanatory variables and you think two of these variables should explain the outcome. You may be tempted to look at all pairs of variables and see which pair gives you the best fit to the data. The resulting model, with the pair of variables selected to have the best fit, will probably over-fit the data. Similarly, step-wise variable selection almost always leads to over-fitting.
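Here is a sketch of searching-driven over-fitting in Python. For brevity it searches over single variables rather than pairs; the outcome is generated as pure noise, yet the selected "best" variable looks far better than a typical one:

```python
import random

# An outcome generated as pure noise, unrelated to any of 20 candidate predictors.
rng = random.Random(11)
n = 30
y = [rng.gauss(0, 1) for _ in range(n)]
candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(20)]

def r_squared(x, y):
    # R-squared of a simple linear regression = squared sample correlation.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

r2s = sorted(r_squared(x, y) for x in candidates)
best, typical = r2s[-1], r2s[len(r2s) // 2]
# The selected "winner" beats a typical candidate by chance alone, even though
# no variable has any real relationship with the outcome.
```

Searching over all pairs of variables only makes this worse, because the number of candidate models grows from 20 to 190.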
Over-parameterization also can cause over-fitting. If you have data from 100 subjects and you want to fit a model with 15 parameters, you are on the edge of over-fitting. This is because you won't have enough accuracy to estimate those parameters, and the parameter estimates will tend to have large variances and tailor themselves to the 100 data points at hand.
In machine learning, there seems to be a common perception that once cross-validation is built into a model/variable selection process, over-fitting is no longer a concern. This is wrong. Cross-validation does help curb against over-fitting, but it cannot fully protect you from it.
Now, suppose there are two ordinal input variables (e.g. genotypes at two markers). One may follow the same logic and think it is great to treat all two-way combinations separately and estimate combination-specific risks and use them as the basis of his inference. In this situation, all the perceived advantages of being model-free, being free from all the linearity assumptions, and being non-parametric are again misperceptions. A saturated logistic regression, in which the input variables are treated as categorical and the interaction terms are full interactions (not product interaction), will capture all the information he is capturing and can do all he wants to do in his own approach. This fact also holds for multiple input variables. You can make up a test dataset and see this equivalence yourself.
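To see the equivalence, consider a saturated logistic regression on a hypothetical 2x3 table (the counts below are invented for illustration). A saturated model has one parameter per cell, so its fitted probabilities are exactly the observed cell proportions, and the coefficients are just differences of observed cell logits:

```python
from math import exp, log

# Hypothetical counts of (cases, controls) for binary x and genotype g in {0, 1, 2}.
counts = {
    (0, 0): (10, 40), (0, 1): (20, 30), (0, 2): (25, 25),
    (1, 0): (15, 35), (1, 1): (30, 20), (1, 2): (40, 10),
}

# Observed log odds (logit) in each cell.
logit = {cell: log(a / b) for cell, (a, b) in counts.items()}

# Coefficients of the saturated logistic regression, reference cell (x=0, g=0):
b0 = logit[(0, 0)]
bx = logit[(1, 0)] - b0
bg1 = logit[(0, 1)] - b0
bg2 = logit[(0, 2)] - b0
bxg1 = logit[(1, 1)] - b0 - bx - bg1   # full interaction terms
bxg2 = logit[(1, 2)] - b0 - bx - bg2

# Reconstruct the fitted risk for the (x=1, g=2) cell from the coefficients:
eta = b0 + bx + bg2 + bxg2
p_fit = 1 / (1 + exp(-eta))
p_obs = 40 / (40 + 10)   # observed proportion in that cell
```

The regression carries exactly the cell-specific information of the "model-free" approach, while also offering the option to move toward a more parsimonious structure when the data are thin.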
The power of regression analysis is that it is a unifying framework. It can be pushed to one extreme to be saturated and to the other extreme to be parsimonious, which often is needed when we don't have enough data to achieve both complexity and robustness.
Note that we have seen this phenomenon in homework 3 problem 3, in which we did analyses on lung capacity using age and cadmium exposure as input variables. When cadmium exposure was considered alone, it was significant; when both cadmium exposure and age were considered but without interaction, cadmium exposure was no longer significant; when interaction was included in analysis, the "main effect" for cadmium exposure was significant again. In the first analysis, we only looked at the marginal effects of cadmium exposure; in other words, we forced the slopes with respect to age for both exposure groups to be zero. In the second analysis, we forced the slopes to be the same, but not necessarily zero. In the third analysis, we relaxed the constraint on the slopes and allowed them to be any values. You can see that assumptions on one part of the analysis (i.e. slope) may have big effects on another part of the analysis (i.e. intercept).
This is because counties have vastly different population sizes serving as denominators in the calculation of incidence rates. As a result, the precision of the estimates differs across counties. Counties with fewer people tend to show extremely high or extremely low incidence rates because of the higher variation in their incidence rate estimates. For example, in a U.S. mortality map for kidney cancer, among the 10% of counties with the highest kidney cancer mortality, counties with small populations, such as those in the mountains, tend to be included, while some of their neighboring counties (also with small populations) were among the 10% of counties with the lowest kidney cancer mortality. Large metropolitan areas and coastal areas rarely belong to either group. This is because the variation in the ranking of incidence rates increases as the population size decreases. As an extreme example, suppose there were a county with only one person; that county would end up with either the highest or the lowest mortality rate.
Such a phenomenon also happens when we want to compare schools or hospitals on their performance. For example, there were 118 ophthalmology residency training programs in the U.S. between 1999 and 2003. After the trainees graduated, they would take and pass exams to be certified. If we simply rank the programs by failure rate, small programs tend to be overly rewarded when they happen to have few or no failures, or overly punished when they happen to have a few failures. After all, a program with 10 trainees can easily move up or down 10 percentage points in failure rate on the strength of one trainee's exam result.
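A small simulation illustrates the phenomenon (the populations, the number of counties, and the common true rate are all invented). Every county shares the same underlying rate, yet small counties dominate the extreme observed rates:

```python
import math
import random

def poisson(lam, rng):
    # Knuth's method; adequate for the modest rates used here.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(5)
true_rate = 1e-4   # the SAME underlying rate in every county
counties = []
for _ in range(2000):
    pop = rng.choice([1000, 10000, 100000, 1000000])
    deaths = poisson(pop * true_rate, rng)
    counties.append((deaths / pop, pop))   # (observed rate, population)

counties.sort(reverse=True)                # sort by observed rate, highest first
top_decile = counties[:200]
mean_pop_top = sum(p for _, p in top_decile) / len(top_decile)
mean_pop_all = sum(p for _, p in counties) / len(counties)
# Counties in the top decile of observed rates are, on average, much smaller
# than a typical county, even though no county truly differs from any other.
```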
Colored/shaded maps also can mislead people, as demonstrated in the book How to Lie with Maps by Mark Monmonier.

A correct thought process is this: I have a short list of suspected effects that I want to see if my data (1) support strongly (then I can write it up), or (2) show some consistent trends for (then I am encouraged to follow it up), or (3) suggest no or opposite effects to the extent that there is little chance the result is just due to an unlucky sample (then I need to think why, and may consider giving it up). In the last situation, I may still write up a report if many people have the same suspicion of the effects as I did.
The range of a variable should be considered when interpreting results. We have seen this when I cautioned on the interpretation of correlation coefficient in a simple linear regression.
When we want to infer the effect of a variable for a value outside the range of our data, we are extrapolating. Extrapolation often is dangerous because of lack of data support around the value at which we are making the inference. It heavily depends on how the fitted curve extends outside the data range, and thus it heavily depends on the model structure we chose to fit or ended up having.
When we want to infer the effect of a variable at a value inside the range of our data, we are interpolating. If we have data for subjects around 10 and 20 years old but very few or no subjects around 15 years old, an inference about the age effect at 15 may still be problematic, because it also depends on the model structure we chose to fit or ended up having. But in general, interpolation is less of a problem than extrapolation.
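A sketch of how extrapolation depends on the model structure: two models that agree closely inside the data range (hypothetical data following a square-root curve over x = 1..20) diverge sharply outside it:

```python
import math

# Hypothetical curved relationship observed only for x in 1..20.
xs = list(range(1, 21))
ys = [math.sqrt(x) for x in xs]

def fit_line(us, ys):
    # Closed-form simple least squares of y on u: returns (intercept, slope).
    mu, my = sum(us) / len(us), sum(ys) / len(ys)
    b = sum((u - mu) * (y - my) for u, y in zip(us, ys)) / sum((u - mu) ** 2 for u in us)
    return my - b * mu, b

a1, b1 = fit_line(xs, ys)                          # model 1: linear in x
a2, b2 = fit_line([math.sqrt(x) for x in xs], ys)  # model 2: linear in sqrt(x)

def pred1(x): return a1 + b1 * x
def pred2(x): return a2 + b2 * math.sqrt(x)

inside_gap = abs(pred1(10) - pred2(10))     # the models nearly agree in the data range
outside_gap = abs(pred1(200) - pred2(200))  # their extrapolations diverge sharply
```

Nothing in the data at hand can arbitrate between the two models at x = 200; the prediction there is driven by the structural choice, not by evidence.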
The reason is similar to the relationship between interaction effect and main effect. When you include an interaction effect but not the corresponding main effects in an analysis, you are imposing strong and often unrealistic structural assumptions, making it hard to interpret the results. Similarly, when you include a quadratic term but not the linear term, you are imposing a strong structural assumption.
One reason for such a failure is that we looked at so many results before reaching a final, best model. No mechanism is built in to account for such searching. Thus, the final model and its parameter estimates and standard error estimates tend to be too optimistic. To evaluate the variation of a step-wise variable selection result, you can generate multiple bootstrap data sets, apply the same selection protocol to each data set, and see how often the results across data sets agree.
Another reason is similar to Simpson's paradox. Variables may appear to be significant or non-significant depending on what other variables have been included. Thus, the result of adding and dropping variables sometimes depends on the order in which variables are considered.
When you see a paper reporting a model that resulted from step-wise variable selection, be wary. Some papers report a model in which all variables are significant; this often is a sign that the model was derived through step-wise selection and is too good to be true.
If you have to use step-wise variable selection, use your common sense. Think what variables should always be included, or what variables should be preferable to some others, or what variables should be logically included/excluded as long as some other variables are included/excluded.
In a statistical test, we calculate a test statistic and compare it with a reference distribution to determine the corresponding p-value. The reference distribution often is a nice mathematical distribution, such as normal, t, chi-squared, etc. The validity of using these distributions is based on large sample theory (asymptotic theory), in which we can prove that, assuming the data meet some assumptions, as the sample size increases, the distribution of the test statistic approaches one of these distributions. Hence, for an asymptotic distribution to be used as a good approximation, two conditions need to be satisfied: the assumptions are not strongly violated, and the sample size is not too small. If either condition fails, the approximation may not be good, especially in the tails, which is where we care the most. Permutation is an alternative way of generating a reference distribution to determine the p-value. It often has weaker assumptions and thus often is more reliable.
Permutation tests were invented to avoid relying on asymptotics. They cannot do away with the problems associated with multiple comparisons. As long as you carry out more than one permutation test, your results are subject to the issues of multiple comparisons.
A fundamental problem in multiple comparisons is that we carry out multiple tests and report test results individually. We can view this whole process as a protocol, and we need to evaluate the probability of this protocol leading to one (any one) small p-value by chance alone. In some simple situations, a permutation test can be used to achieve this. For example, in a case-control study, we carry out multiple disease-marker association analyses, one for each marker. We can pick the smallest p-value and treat it as a statistic. Then, we can permute the affection status of the subjects and, for each permuted data set, carry out the disease-marker association analyses and pick the smallest p-value. The smallest p-value from the real data can then be compared with the distribution of the smallest p-values generated through permutations, and a single p-value can be derived. Here, we essentially carried out a single test. We could carry out a permutation test because the situation was simple: all the original tests are testing for association between disease status and variables (markers in this example), in which the grouping variable (cases or controls) is the same for all tests. When you carry out multiple tests using more than one grouping variable, permutation cannot be used. For example, suppose you have multiple candidate genes and multiple outcome variables. If you carry out association analyses for all gene-outcome pairs, permutation cannot bail you out of the problems of multiple comparisons.

When we treated age group as a categorical variable, the interaction model with three extra parameters only increased 2x(log-likelihood) by 5.27 over the non-interaction model, with corresponding p-value 0.153. When we treated age group as a continuous variable, the interaction model with only one extra parameter increased 2x(log-likelihood) by 6.46 over the non-interaction model (a different non-interaction model), with corresponding p-value 0.011.
We might then have different interpretations depending on which test we chose to carry out.
This is a good example to demonstrate the fact that results should always be interpreted in context. In fact, these two "interactions" are different interactions! If we don't know which interaction we are looking for, we may do both tests and correct for multiple comparisons, which may hurt power. If we want to avoid multiple comparisons and choose only one test, the test with fewer degrees of freedom often (but not always) is the more powerful one.
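Returning to the min-p permutation protocol described a few paragraphs above, here is a minimal sketch. The simulated data and the use of a maximum frequency difference in place of the smallest p-value are my assumptions for illustration (a larger difference plays the role of a smaller p-value):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated case-control data: 50 cases, 50 controls, 10 binary markers,
# none truly associated (assumption, to illustrate the null case).
n_case, n_ctrl, m = 50, 50, 10
status = np.r_[np.ones(n_case), np.zeros(n_ctrl)]
markers = rng.integers(0, 2, size=(n_case + n_ctrl, m))

def max_assoc(status, markers):
    """Largest absolute case-control difference in marker frequency,
    standing in for the smallest p-value (larger = more extreme)."""
    diff = markers[status == 1].mean(axis=0) - markers[status == 0].mean(axis=0)
    return np.abs(diff).max()

observed = max_assoc(status, markers)

# Permute the affection status and recompute the same max statistic each time.
B = 1000
perm = np.array([max_assoc(rng.permutation(status), markers) for _ in range(B)])

# A single overall p-value for the whole multi-marker protocol.
p_overall = (1 + np.sum(perm >= observed)) / (1 + B)
print(p_overall)
```

The single permutation of affection status keeps the entire marker matrix intact, which is why one permutation run accounts for all ten tests at once.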
In a regression analysis, we try to connect the outcome with a few other variables called input variables. Ideally, the outcome should be determined by the input variables in a deterministic way. In reality, it is almost impossible to find all the relevant variables, and we have to think of the outcome as a random observation, but with some unobserved parameter underlying it. What we want to do is to connect that unobserved parameter with the input variables in a deterministic way.
For a continuous outcome variable, the unobserved parameter is the average outcome for subjects with the same values of the input variables. We can equate this unobserved parameter with a linear combination of the input variables. This leads to a linear regression.
For a binary outcome variable, the unobserved parameter is the probability for subjects with the same values of the input variables. However, when we consider a linear combination of the input variables, the range of the potential values is unlimited, while the probability is bounded between 0 and 1. We cannot connect them by just equating them; if we did, it would cause lots of problems in calculation and interpretation. We have to free up the probability by transforming it to a scale that also is unlimited. (Or, equivalently, we have to transform the linear combination of the input variables to a scale that is bounded between 0 and 1.) There are many such transformations, among which the logit function f(p) = log[p/(1 − p)] is a popular choice because of the easy interpretation of the parameters in the linear combination of the input variables. This leads to a logistic regression. Note that by using the logit transformation, we impose a structural constraint: the rate of risk increase at p is as fast as the rate of risk decrease at 1 − p. This constraint may not be good in some situations, in which case we need to use other transformations, such as the complementary log-log function f(p) = log(−log(1 − p)).
For a count outcome variable, the unobserved parameter is the underlying rate for subjects with the same values of the input variables. As above, the rate is bounded to be positive, and thus we cannot just equate the rate with a linear combination of the input variables. The logarithm transformation f(λ) = log(λ) is often used to transform the rate, and we can equate log(rate) with a linear combination of the input variables. This leads to a Poisson regression (sometimes called a log-linear model). Again, there is a structural constraint, but here the constraint is mainly due to the assumption of the Poisson distribution, for which the mean and variance are equal. In some situations, the count data have much higher variance than mean, and we need to use other distributions such as the negative binomial, resulting in a negative binomial regression.
All the transformations are called link functions. In linear regression, we appeared not to use any transformation; in fact, we used a special transformation f(x) = x. This identity function also can be viewed as a link function. A regression model with a link function is called a generalized linear model (GLM). All regressions above are special types of generalized linear models.
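The link functions mentioned above are easy to inspect directly. The specific probabilities below are just illustrative values:

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def cloglog(p):
    """Complementary log-log link: also maps (0, 1) onto the real line,
    but without the logit's symmetry."""
    return math.log(-math.log(1 - p))

# The logit is symmetric: logit(p) = -logit(1 - p).
print(logit(0.2), -logit(0.8))        # equal

# The complementary log-log does not have this symmetry.
print(cloglog(0.2), -cloglog(0.8))    # not equal

# The log link frees a positive rate onto the whole real line.
print(math.log(0.5))                  # negative values are allowed
```

The symmetry check makes the structural constraint of the logit concrete: whatever the model says about risk near p is mirrored near 1 − p, which the complementary log-log deliberately avoids.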
For survival outcome variables, the unobserved parameter is the underlying survival function for subjects with the same values of the input variables. Note that a survival function is not a single number; instead, it is a function over time t indicating the probability of "survival" (i.e. having not experienced the event of interest) by time t. A survival function corresponds to a hazard function, which is a function of time t indicating the rate of the event of interest at time t for a subject who has not experienced the event before time t (i.e. who has "survived" till time t). We may think there is a baseline hazard function, and for each subject, there will be a ratio of the subject's hazard function over the baseline hazard function. This ratio also is a function of time. In Cox's proportional hazards regression, we equate the logarithm of this hazard ratio with a linear combination of the input variables. In this situation, some input variables may themselves be functions of time. When none of the input variables are functions of time, a subject's hazard ratio will be a constant, and the hazard functions for different subjects will be proportional to each other.
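A tiny numerical sketch of the proportionality claim; the baseline hazard and the coefficient are made-up numbers:

```python
import math

# A hypothetical baseline hazard that changes over time (assumption).
def baseline_hazard(t):
    return 0.01 + 0.005 * t

beta, x = 0.7, 1.0   # one time-constant input variable (assumed values)

# Under the Cox model, log hazard ratio = beta * x, so the subject's
# hazard is the baseline hazard times exp(beta * x) at every time t.
def subject_hazard(t):
    return baseline_hazard(t) * math.exp(beta * x)

# The ratio is the same constant at all times: the hazards are proportional.
ratios = [subject_hazard(t) / baseline_hazard(t) for t in (1, 5, 10, 20)]
print(ratios)
```

Had x been a function of time, exp(beta * x(t)) would vary with t and the ratio would no longer be constant, which is exactly the distinction made in the last sentence above.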
Data collection determines the kinds of information that will be available for analysis and that will serve as the basis for interpretation. In most situations, interpretation of analysis results depends more on how data are collected than on how data are analyzed. Because of this, data collection is more important than analysis. Data collection really determines the quality of a study and how much information the data will provide to answer the research questions. Data analysis just tries to extract that information and won't compensate for a poorly designed study, no matter how sophisticated the method is.
A good design of data collection is heavily dependent on the subject matter and traditionally it is not covered in an introductory course. But its importance should be appreciated. (And statistics teachers need to put more emphasis on this aspect in their teaching.)
Although I rank data analysis as the least important aspect compared to data collection and interpretation, it obviously is indispensable. Data analysis is a necessary step to bridge data collection and interpretation. In order to have a valid and efficient analysis, we need to choose methods according to how the data are collected, the nature of data (e.g. variable types), and the goal of analysis. Because there are many types of data and many analysis goals, there are so many different statistical methods. The existence of many methods makes Statistics appear like a box of tools and sometimes a bag of tricks. But all they provide are efficient ways of analyzing data so that the interpretation will have a valid basis.
The choice of an analysis method has a big impact on interpretation. In addition to how data are collected, interpretation depends on the assumptions of an analysis method and sometimes on the analysis procedure itself. You don't want your analysis result to be more of a function of assumptions than a function of data. You also don't want the result to be more of a function of the analysis procedure itself, as happens with over-fitting.
Among the three aspects of Statistics, data analysis is the most technically challenging, and thus it has historically attracted statisticians' attention. Unfortunately, because of this, some academic statisticians have developed a narrow view of Statistics and value technicality over all the other aspects. Avoid such technophiles. But that doesn't mean technicality is unimportant, or that you can use a wrong method or an over-fitting procedure and dismiss critiques of the technical validity of your approach.
As I said earlier, interpretation is the most important aspect of Statistics. However, when you are ready to interpret your results, the results have already been influenced by what data were collected, how the data were collected, and how the data were analyzed. It often is underappreciated that the ways in which we process data, analyze data, and quantify effects (e.g. through p-values) can lead to many artifacts and caveats. In addition, many people interpret results on the basis of half-baked, face-value understanding of the analysis methods involved. As a result, misinterpretations, wrong interpretations, and wishful interpretations are so prevalent in current research that they are like an epidemic.
Because there are so many artifacts and caveats in data analysis and interpretations, I spend a significant portion of the course time on interpretation of results. I hope my course will help you develop correct views of Statistics and avoid becoming future abusers of Statistics.
PS. Historically, Statistics was limited by available computational capability. The current statistical toolbox still reflects that historical limitation.
PS2. Statistical thinking may appear to be different from your daily thinking. If you feel this way, it is because your daily thinking is not clear and logical enough. Learning statistics should make your daily thinking clearer and more logical.
I allocated 50% of your grade to homework and active participation in discussion, and 50% to midterm and final exam performance. This allocation is a common practice that you, as a graduate student, should already be familiar with. If you didn't do well in the midterm, you still have chances: more than half of the grade (57% = 25% final + 20% active participation + 12% homework) still is up for grabs. In addition, if you do much better in the final exam than in the midterm, I will bump your grade up one level (e.g. B to B+, B+ to A-, etc.).
The answer is no. The small p-value in the simple linear regression of SBP on age tells us that when age is considered alone (i.e. without adjusting for any other variables), its effect on SBP is significant. It has nothing to do with combining groups or not, and the analysis doesn't have any component that reflects the effect of combining groups. But if we had labeled this analysis as "combining groups", the results might be interpreted as if they offered information on the appropriateness of combining groups. We could equally have labeled the analysis as "combining vegetarian and non-vegetarian groups" or as "combining Democrats and Republicans". Can we claim that the small p-value also suggests that diet or party affiliation has an effect on SBP? These factors may have effects on SBP, but the result from that simple linear regression has no logical connection with these factors.
Example 1: Suppose there are three cards. One is red on both sides; another is white on both sides; the third is red on one side and white on the other. Put them under a hat, pull out one card, and look at only one side. If the color you see is red, what is the probability that the other side also is red? [Many people think the answer is 1/2, which is wrong.]
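A quick Monte Carlo check of Example 1; the simulation setup is mine, not part of the original puzzle statement:

```python
import random

random.seed(0)
# The three cards, as (side1, side2).
cards = [("red", "red"), ("white", "white"), ("red", "white")]

red_seen = red_other = 0
for _ in range(100_000):
    card = random.choice(cards)
    side = random.randrange(2)          # which side happens to face up
    if card[side] == "red":
        red_seen += 1
        if card[1 - side] == "red":
            red_other += 1

print(red_other / red_seen)   # close to 2/3, not 1/2
```

The intuition for 1/2 ignores that the red/red card gives two ways to see a red side, so conditioning on "I saw red" favors that card.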
Example 2: Suppose prisoners A, B, and C were on death row. The King decided to pardon one of them. He randomly chose a prisoner to pardon, told the warden of his choice, and asked the warden to keep it secret. Prisoner C wanted to know if he would be freed. Without any information, his chance was 1/3. He knew the warden wouldn't tell him whether he would be freed. So, he asked the warden who between A and B would be killed. The warden reasoned this way: either A or B or both would be killed; I can just pick one who will be killed and tell C, without releasing any information about C's fate. He then told C that A would be killed. C then reasoned this way: given that A will be killed, either B or I will be freed, so my chance of being freed increases to 1/2. Who do you think was correct? [It was the warden.]
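Example 2 can be checked the same way. The warden's tie-breaking rule when C is the one pardoned is modeled as a fair coin flip, which is an assumption consistent with the story:

```python
import random

random.seed(0)
trials = freed_c = 0
for _ in range(100_000):
    pardoned = random.choice("ABC")
    # The warden names someone between A and B who will be killed,
    # never revealing C's own fate.
    if pardoned == "A":
        named = "B"
    elif pardoned == "B":
        named = "A"
    else:                        # C pardoned: warden picks A or B at random
        named = random.choice("AB")
    if named == "A":             # condition on what C actually heard
        trials += 1
        if pardoned == "C":
            freed_c += 1

print(freed_c / trials)   # stays close to 1/3, as the warden reasoned
```

The warden's answer is uninformative about C precisely because the warden can always name someone other than C, no matter whom the King pardoned.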
One colleague of mine once said this: "For what it is worth, most small studies really really really suck. They're touted as 'pilot studies', which is really secret code for 'we just wanted to run an experiment and didn't want to take the time to do it right'. As such, they end up at best only giving very sloppy estimates of variance that aren't even applicable because a real study can't be run the same way. At worst, they provide no information or, even worse, their sloppy conduct and poor planning lead to biased estimates that point people the wrong way. Sometimes it takes years to get the ship turned around. Small, poorly-designed, 'pilot' studies are foolhardy, haphazard, ignorant, arrogant, wasteful, possibly unethical, and sometimes even dangerous."
Researchers often are not aware of the serious long-term impacts that a poor study design will lead to. In my opinion, the article The scandal of poor medical research should be read by all researchers and, if necessary, re-read every year.

In a likelihood ratio test, we often compare the test statistic with a chi-squared distribution with a certain DF. In this situation, we essentially are comparing two models, one full model and one reduced model, to see if the full model provides significantly better fit to the data. The full model always has one or a few more parameters (or variables) than the reduced model, and the likelihood ratio test can be viewed as a test for significant contribution of these additional parameters given the parameters that have already been included in the reduced model. The number of these additional parameters often is the DF of the test.
Some other tests also follow this thought process. For example, in the test for departure from linear trend on a 2xk table, we calculate the difference between Pearson's chi-squared statistic and the trend test statistic to see if there is significant departure from linear trend. Pearson's test statistic is essentially divided into two portions, one being the trend test statistic. The former test has k − 1 DF while the latter has 1 DF. Thus, the difference has DF = k − 2. Another example is ANOVA, in which we also think of dividing the total sum of squares into a few portions.
All permuted data sets have the same numbers of cases and controls as the real data have; in other words, all marginal counts are fixed. Thus, the permutations are the possibilities in the hyper-geometric distribution given the fixed marginal counts. If the statistic of interest and the probability in the hyper-geometric distribution lead to the same ranking of the permutations, then the permutation test is effectively the same as Fisher's exact test.
In reality, we often cannot enumerate all possible permutations of a data set. Instead, we sample random permutations and build a reference distribution accordingly. Then, the p-value of the statistic on the real data can be estimated using this reference distribution. In this situation, because the p-value is estimated, it is not exact. Only when we can enumerate all permutations can we calculate the exact p-value. Even in this latter situation, the exactness relies on the assumption that the permutation space is the sample space, which often is not correct. [In general, estimation through random sampling is called Monte Carlo.]
Permutation tests can be used in situations beyond the 2x2 table, in which there are two groups of subjects (cases and controls) and the outcome is binary. For two groups of subjects with a categorical or continuous outcome, we can generate permutations by permuting the grouping status while keeping the outcome intact. In principle, this can be extended to data from three or more groups; we can always permute the grouping status. Another extension is in testing for correlation between two continuous variables; we can permute the values of one variable while keeping the values of the other variable intact. In non-parametric statistics, there are many rank-based tests, in which we compare the ranking of the current data with all possible rankings; they are essentially permutation tests on the rank scale.
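A minimal two-group permutation test along these lines; the simulated data and the choice of the mean difference as the statistic are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two groups with a modest true difference in means (assumed data).
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.8, 1.0, size=30)

observed = a.mean() - b.mean()
pooled = np.r_[a, b]

# Permute the grouping status while keeping the outcomes intact:
# every shuffle reassigns which 30 values count as "group a".
B = 5000
perm = np.empty(B)
for i in range(B):
    shuffled = rng.permutation(pooled)
    perm[i] = shuffled[:30].mean() - shuffled[30:].mean()

# Two-sided p-value from the permutation reference distribution.
p = (1 + np.sum(np.abs(perm) >= abs(observed))) / (1 + B)
print(p)
```

Because we sample 5000 random permutations rather than enumerating all of them, this p-value is a Monte Carlo estimate, as discussed above.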
If we replace the statistic of interest by the parameter of interest, we can generate a permutation distribution of the parameter of interest, and construct a 95% confidence interval by using the 2.5% and 97.5% percentiles of this distribution.
Fisher's exact test is a famous test. Fisher introduced this test using data from a tea-tasting experiment. A lady claimed to be able to tell if tea was poured into milk or milk was poured into tea. Fisher designed an experiment in which there were eight cups, four with tea poured in first and the other four with milk poured in first. He also told the lady that there were four cups of each kind. Note that because the lady knew this fact, her answers would be constrained: she would claim four cups as one group and the rest as the other group. If we put her answers into a 2x2 table (with the real grouping and her grouping as row and column variables), her potential answers would have fixed marginal counts on both the rows and the columns. The distribution of data with such a constraint is called a hyper-geometric distribution. In this context, Fisher developed his exact test. Unfortunately, only in this context is the test exact.
In reality, we rarely see situations with all marginal counts fixed. In the tea-tasting experiment, if the lady didn't know there were four cups of each kind, she could have claimed five cups as one group and the rest as the other group.
For a 2x2 table, we often want to test for association between the row and column variables. A 2x2 table of two variables may result from population sampling without controlling for marginal distributions of the variables; then the data essentially came from a multinomial distribution with four categories. A 2x2 table also can result from a cohort study or a case-control study. In a cohort study, we may prospectively follow up a certain number of subjects in the exposure group and a certain number of subjects in the non-exposure group. In a case-control study, we may choose a certain number of cases and a certain number of controls and gather exposure data retrospectively. In either situation, the data essentially came from two binomial distributions. In all three scenarios, asymptotic theory has shown that when the sample size is large, under no association, Pearson's chi-squared test statistic will be well approximated by the chi-squared distribution with one degree of freedom. For all three scenarios, Fisher's exact test is only an approximation.
These two tests may give you different results, but because they are correlated, their results sometimes are similar enough to lead to the same conclusion or decision. In general, Fisher's exact test won't inflate the false positive rate but sometimes is quite conservative; Pearson's chi-squared test sometimes is less conservative than Fisher's exact test and thus is more accurate, but sometimes it may inflate the false positive rate. A commonly used criterion is this: when the overall total is n > 40, or 20 ≤ n ≤ 40 and the smallest expected cell count is ≥5, we use Pearson's chi-squared test; when the overall total is n < 20, or 20 ≤ n ≤ 40 and the smallest expected cell count is <5, the approximation in Pearson's chi-squared test may not be good and Fisher's exact test is a good alternative.
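A self-contained sketch of both tests and the rule of thumb above. The table counts are made up, and the hand-rolled implementations are illustrative rather than production code:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]:
    sum the hyper-geometric probabilities of all tables (with the same
    margins) that are no more probable than the observed one."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def prob(k):
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs + 1e-12)

def pearson_chi2_2x2(a, b, c, d):
    """Pearson's chi-squared statistic for a 2x2 table."""
    n = a + b + c + d
    # Expected counts under no association, from the observed margins.
    exp = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
           [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    obs = [[a, b], [c, d]]
    return sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def min_expected(a, b, c, d):
    """Smallest expected cell count under no association."""
    n = a + b + c + d
    return min((a + b) * (a + c), (a + b) * (b + d),
               (c + d) * (a + c), (c + d) * (b + d)) / n

# Applying the rule of thumb from the text to a small made-up table:
a, b, c, d = 3, 7, 8, 2
n = a + b + c + d
if n > 40 or (20 <= n <= 40 and min_expected(a, b, c, d) >= 5):
    print("Pearson chi-squared statistic:", pearson_chi2_2x2(a, b, c, d))
else:
    print("Fisher's exact p-value:", fisher_exact_2x2(a, b, c, d))
```

For this table (n = 20, smallest expected count 4.5), the rule routes us to Fisher's exact test, which is exactly the regime where the chi-squared approximation is suspect.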
If you said yes, let's try this. A human being randomly picked from the world is almost surely not the president of the United States. George W. Bush is the president of the United States; therefore, he is almost surely not a human being.
In fact, only when Pr(not B | A) = 1 can we say Pr(not A | B) = 1. When Pr(not B | A) is not 1, no matter how close it is to 1, Pr(not A | B) can be any number. For example, suppose we have two fair coins and we toss each of them 20 times, separately. Let B = {coin 2 showed the same face in all 20 tosses} and let A be some pattern in the coin 1 results. Then, due to independence between the two coins, Pr(not B | A) = Pr(not B) is close to 1, but Pr(not A | B) = Pr(not A) can be anywhere between 0 and 1 depending on how we define A.
Many newspaper reports fall into similar traps. Examples: "Boys more at risk on bicycles", "Soccer most dangerous sport", etc. Take news reports with a grain of salt.
To do statistical inference, we often construct a model for the data with some unknown parameters, and treat all possible sets of parameter values as candidates. For each set of parameter values, we can calculate the probability of observing our data. (You see, statistics relies on probability.) Such a probability changes as the set of parameter values changes; thus, it can be viewed as a function of the parameters and is called a likelihood function. We often look for the set of parameters that maximizes the likelihood of observing our data and a range of parameter sets that lead to likelihood not too small compared to the maximum likelihood. The former is called a point estimate and the latter is called a confidence region (or confidence interval if we have a single parameter).
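A small worked example of this thought process, with assumed data (7 successes in 20 binomial trials) and a simple grid search in place of calculus:

```python
import math

# Suppose we observed 7 successes in 20 trials (assumed data) and model
# the count as binomial with unknown success probability p.
n, k = 20, 7

def log_likelihood(p):
    """Log of the probability of observing our data, as a function of p."""
    return (math.log(math.comb(n, k))
            + k * math.log(p) + (n - k) * math.log(1 - p))

# Treat a grid of parameter values as the candidates; the maximizer
# is the point estimate (maximum likelihood estimate).
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=log_likelihood)
print(mle)   # close to 7/20 = 0.35

# Parameter values whose likelihood is "not too small" relative to the
# maximum form an interval; a drop of at most 1.92 log units (half the
# chi-squared 1-df 95% quantile, 3.84) gives roughly a 95% interval.
inside = [p for p in grid if log_likelihood(mle) - log_likelihood(p) <= 1.92]
print(min(inside), max(inside))
```

The grid search stands in for the maximization and interval inversion that software does analytically or numerically; the logic (compare every candidate's likelihood to the maximum) is the same.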
For categorical variables with only a few categories, we may have power to study all possible interactions. For example, for two variables each with 3 levels, the main effects need 4 parameters (2 for each variable), and the full interaction requires an additional 4 parameters. When the categorical variables are ordinal variables (e.g. genotypes), we may have coded them using numbers such as 0, 1, and 2. Unfortunately, careless investigators may blindly carry out an analysis on interaction by including the products of the variables. For genotypes at two bi-allelic markers with coding 2/1/0 for AA/Aa/aa and BB/Bb/bb, using the product term is equivalent to assuming 4 levels of interaction effect: {AA/BB}, {AA/Bb, Aa/BB}, {Aa/Bb}, and {aa/__, __/bb}, with the effect of AA/BB twice as much as that of AA/Bb or Aa/BB and 4 times as much as that of Aa/Bb, and with the effect of the other 5 genotype combinations being zero. Of course this is not a full interaction, and the resulting analysis won't have power to detect interaction patterns that are far from this one. It is unfortunate that this careless analysis has led to criticisms of regression analysis for not being powerful enough to detect interactions.
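The coding pattern just described is easy to verify directly:

```python
# Genotype codings 2/1/0 for AA/Aa/aa and BB/Bb/bb, as in the text.
codes = {"AA": 2, "Aa": 1, "aa": 0, "BB": 2, "Bb": 1, "bb": 0}

# The product term used as the "interaction" variable for each genotype pair.
for g1 in ("AA", "Aa", "aa"):
    for g2 in ("BB", "Bb", "bb"):
        print(g1, g2, codes[g1] * codes[g2])
```

The printout shows AA/BB at 4, AA/Bb and Aa/BB at 2, Aa/Bb at 1, and every combination involving aa or bb at 0: the rigid effect pattern the product term silently assumes.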
PS. The word "interaction" means statistical interaction, which may or may not translate to physical interaction.
In this situation, the software often isn't smart enough to remember that the interaction terms are functions of existing variables; it will treat the new variables as if they were separate variables. Consequently, if you include x3 and ignore x1 (or x2, or both), the software will still generate results. However, when you include an interaction effect but not the corresponding main effects in the analysis, you are imposing strong and often unrealistic structural assumptions, making it hard to interpret the results (see next paragraph). Thus, even if you think there is no main effect of x1 but an interaction effect exists between x1 and x2, you should include x1 in the analysis. This leads to a general guideline: as long as you put the interaction term in the analysis, you should also include the corresponding main effects.
Now I show why removing the main effect term(s) while keeping the interaction effect term often will lead to unrealistic structural assumptions. Suppose you fit a model with the right hand side being β0 + β2x2 + β3x1x2, and suppose x1 is binary with values 0 and 1. Then this model is equivalent to the following: for subjects with x1 = 0, the right hand side is β0 + β2x2; for subjects with x1 = 1, it is β0 + (β2 + β3)x2. That is, the two groups are allowed different slopes for x2 but are forced to share exactly the same intercept, a strong structural assumption that usually has no substantive justification.

The requirement of including main effect terms whenever related interaction terms are included is purely technical and is for the purpose of correct interpretation of results. It has nothing to do with whether or not there is a main effect in the data. The validity and power of the analysis doesn't depend on the existence of main effects. Unfortunately, some people misunderstand this requirement and wrongly criticize regression methods by saying they require main effects to exist and thus are not suitable when there are no main effects in the data or in reality. Such criticisms are wrong.
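The structural assumption can be seen numerically. In this least-squares sketch with simulated data (all numbers assumed), dropping the x1 main effect forces both groups to share one intercept, while including it recovers the two different intercepts:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data in which the two x1 groups truly differ in intercept.
n = 200
x1 = rng.integers(0, 2, size=n)          # binary grouping variable
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Model WITHOUT the x1 main effect: y ~ b0 + b2*x2 + b3*x1*x2.
Z = np.column_stack([np.ones(n), x2, x1 * x2])
b = np.linalg.lstsq(Z, y, rcond=None)[0]
# Both groups are forced to use the single intercept b[0]:
print("shared intercept (structural assumption):", b[0])

# Model WITH the x1 main effect: y ~ b0 + b1*x1 + b2*x2 + b3*x1*x2.
Zfull = np.column_stack([np.ones(n), x1, x2, x1 * x2])
bf = np.linalg.lstsq(Zfull, y, rcond=None)[0]
print("group intercepts with main effect:", bf[0], bf[0] + bf[1])
```

The full model's two intercepts land near the true values 1 and 3, while the reduced model has no way to express that difference regardless of how strong it is in the data.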
When you fit a regression with an interaction term, it often happens that one main effect is no longer significant. It doesn't mean the main effect doesn't exist or that you can remove the main effect term from the model. It just means that the main effect column doesn't provide significantly more power to explain the outcome after the other main effect columns and the interaction effect columns have been taken into account. (Remember that interpretation of t-test results always is conditional.) The variable still explains the outcome, maybe mainly through the interaction effect columns. If you want correct interpretation of results, you need to keep it in the model.
In some situations, a naïve calculation of the p-value may be way off. In genetics, people routinely test for Hardy-Weinberg equilibrium (HWE). The traditional test statistic often is well approximated by the chi-squared distribution with 1 degree of freedom. I once derived a new test for HWE, with a test statistic formula similar to that of the traditional one. I was tempted to use the same chi-squared distribution to calculate the p-value. However, the new statistic had a considerably wider variation than the traditional one. As a result, the true probability of observing the data was much higher than what the chi-squared distribution would tell me. In other words, the p-value calculated from the chi-squared distribution was much smaller than the true p-value. For example, one data set had p-value 0.02 when calculated from the chi-squared distribution, while its true p-value was >0.1 when the variance inflation was taken into account. I am sure these two numbers would give you quite different levels of evidence against the null hypothesis.
In many situations, the p-value is designed to reflect the probability of observing data as extreme as or more extreme than your current data GIVEN that both the model structure and the variables under consideration are pre-specified. If you use a screening method to search for the best list of variables or the best model structure, the p-value often doesn't reflect the additional variation brought in by the screening process. In other words, the true distribution of the test statistic has a wider variation than what the p-value can reflect, and the p-value may give you a false sense of strong evidence against the null hypothesis.
The p-value is supposed to give you a sense of the level of departure of your data from the null hypothesis, and you place belief in the null or alternative hypothesis on the basis of the p-value. However, if the p-value you get is a lot different from the true p-value, you will have a false sense of the level of evidence in your data. Watch out.
Study design has a few components: (i) subject ascertainment, (ii) variable collection, (iii) allocation of resources, and (iv) logistics and data management. We will address these issues in the following paragraphs.
For studies on human subjects, we need to determine what demographic composition is the best to answer our questions and to determine the criteria for inclusion and exclusion of subjects. If nested ascertainment is needed, we need to select the best nested sampling procedure. In addition, we carry out sample size calculation to determine the number of subjects needed to have enough power to detect the effect that we intend to demonstrate. Moreover, we want to ensure that the subjects are representative of the population that we want to study, that potential confounding factors are taken into consideration through individual or distributional matching across different groups, that potential selection biases are avoided, and that randomization is carried out if subjects will be assigned to different groups. Without steps to address these issues, misinterpretation of analysis results may occur.
We also need to determine a list of variables to collect. Make sure the list includes variables that potentially influence the outcome and factors that are potentially confounded with the other variables. This will allow us to adjust for the effects of these variables and to carry out stratified analyses if necessary.
All studies are constrained by funding and resources. Hence, optimal allocation of available resources is extremely important. There is a tendency for investigators to allocate resources to cover many multi-factor combinations, resulting in small numbers of subjects in each combination. This often dilutes the resources and makes it difficult to have any conclusive results when the data are analyzed. We want to have optimal allocation of resources so that we are guaranteed to have enough power to conclusively answer our questions.
Data management is also important to the success of a study. Data analysis and interpretation will depend on the quality of the data. However, data may be recorded incorrectly on paper or mistyped when entered into computers, and may be recorded in different formats or units. Data may also be merged inappropriately when patient records are extracted from multiple databases. We need to implement procedures to minimize the chances of these things happening and to check for potential errors. Data checking involves checking for logical inconsistencies and variable ranges, and detecting unexpected patterns and missing-data patterns through exploratory data analysis. In addition, work with IT folks to determine database management and access policies that ensure data quality.
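The range and logical-consistency checks mentioned above can be as simple as a few rules run over every record. A minimal sketch in Python, with hypothetical field names and thresholds chosen purely for illustration:

```python
# Hypothetical patient records; field names and values are invented.
records = [
    {"id": 1, "age": 54, "sex": "F", "weight_kg": 70.2, "pregnant": False},
    {"id": 2, "age": -3, "sex": "M", "weight_kg": 81.0, "pregnant": False},
    {"id": 3, "age": 41, "sex": "M", "weight_kg": 6400, "pregnant": True},
]

def check_record(r):
    """Flag range violations and logical inconsistencies in one record."""
    problems = []
    if not (0 <= r["age"] <= 120):
        problems.append("age out of range")
    if not (2 <= r["weight_kg"] <= 400):
        problems.append("weight out of range (wrong units?)")
    if r["pregnant"] and r["sex"] == "M":
        problems.append("pregnant male: logical inconsistency")
    return problems

for r in records:
    for p in check_record(r):
        print(f"record {r['id']}: {p}")
```

Running simple rules like these before any analysis catches the unit mix-ups and impossible values that otherwise surface mid-analysis, or worse, never.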
Study design is much more than sample size calculation. Involving statisticians in the design stage will pay off tremendously; if that is impossible, at least consult them. You will be glad that you did. Sir Ronald Fisher once said, "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
Measurement error happens everywhere, and we need to know its impact on analysis results. One impact is that it gives you less accurate measurements and thus lowers the signal-to-noise ratio in your data, leading to loss of power. Another impact is attenuation: if there is measurement error in an explanatory variable, the parameter estimate for that variable tends to attenuate (i.e. to be biased towards zero, or no effect), leading to underestimation of the effect and loss of power.
Should we do repeated measurements on a subject to bring down the measurement error? It depends. In general, spending resources on one measurement per subject is more efficient than on multiple measurements per subject. However, if the magnitude of the measurement error is big compared to the range of the variable in your data set, the impact may be big enough to warrant special effort to bring down the measurement error. This is because the magnitude of attenuation depends on the ratio of within-subject variation (a.k.a. measurement error) to between-subject variation. You need to have a good idea of how big the within-subject variation is.
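The attenuation described in these two paragraphs is easy to see in a simulation. The sketch below (plain Python, all numbers invented for illustration) fits an ordinary least-squares slope twice: once on the true explanatory variable and once on a noisy measurement of it. With between-subject and within-subject variances both equal to 1, the slope should attenuate by a factor of about 1/(1+1) = 1/2:

```python
import random

random.seed(1)

def fit_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

n = 20000
true_x = [random.gauss(0, 1) for _ in range(n)]      # between-subject sd = 1
y = [2 * t + random.gauss(0, 1) for t in true_x]     # true slope = 2
noisy_x = [t + random.gauss(0, 1) for t in true_x]   # measurement-error sd = 1

print(round(fit_slope(true_x, y), 2))    # close to the true slope, 2
print(round(fit_slope(noisy_x, y), 2))   # attenuated toward 2 * 1/(1+1) = 1
```

Averaging several measurements per subject shrinks the within-subject variance and so pulls the attenuation factor back toward 1.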
PS. Not surprisingly, many studies follow the opposite track: analyze the data to death to find any "signals" and "patterns".
PS2. Graphs often are easier to understand than tables. The speaker was good in this aspect.
Interpretation of results really depends on how the data were collected.
Therefore, t-tests in any regression analysis should always be interpreted as conditional tests, and should not be used to judge the absolute usefulness of a variable to explain the outcome except when the variable is independent of all the other variables.
Interpretation of results really depends on what you included in the analyses.
The logic behind hypothesis testing is similar to "innocent until proven guilty" in the legal system. You cast innocence as the null hypothesis and guilt as the alternative hypothesis. Examples of the innocent include no difference, no effect, no change of effect, equilibrium, etc. The data have to show strong departure from the notion of innocence for you to reject the null hypothesis. However, there is a big difference between hypothesis testing and the legal system. In hypothesis testing, failing to prove guilt does not mean you acquit the null. In other words, this is not a binary decision. Because of this, we only say we "don't reject" the null hypothesis and avoid saying we "accept" it.
When you test for an effect, you need a way to capture the effect based on your sample. The result often is a number, called a statistic. [It is possible to have a vector of numbers capture the effect instead of a single number.] The statistic indicates some level of departure from the null. Once you have calculated your statistic, you need to know how likely you are to see such a departure, or a larger one, if the null hypothesis is true. To do this, you compare your statistic with a reference distribution, which is the distribution of the statistic when the null is true. The fraction of the reference distribution that is as extreme as or more extreme than your statistic is the p-value. Classical statistical methods often have well defined reference distributions (normal, t, F, χ2, etc.) or asymptotic reference distributions. Many recent statistical methods rely on computers to generate reference distributions through permutations or simulations.

Lack of power may be due to small sample size, but it also may be due to the inefficiency of your statistic at capturing the effect, or both. For example, suppose you have two groups of measurements and want to compare them. You do a t-test and the p-value is large. However, the two groups may be quite different in their distributions; they may just happen to have similar averages. In this situation, the test statistic in the t-test is not efficient at capturing the difference.
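A minimal sketch of how a computer-generated reference distribution works, using a permutation test for a difference in two group means (pure Python; the function name and simulated data are made up for illustration):

```python
import random

random.seed(0)

def perm_test_mean_diff(a, b, n_perm=5000):
    """Two-sample permutation test for a difference in means.
    The reference distribution is built by shuffling group labels."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[: len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_perm   # fraction of the reference as extreme as observed

x = [random.gauss(0.0, 1) for _ in range(30)]
y = [random.gauss(1.5, 1) for _ in range(30)]   # a real 1.5-sd shift
print(perm_test_mean_diff(x, y))   # small p-value: the shift is easy to detect
```

The shuffling enforces the null ("group labels don't matter"), so the collected differences are exactly the reference distribution the p-value is measured against.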
The p-value is not only a function of the effect you are testing for, but also a function of the sample size. Thus, the interpretation of a p-value has to take both effect and sample size into account. In a large study, a significant p-value may not mean there is a biologically significant effect; it may have been driven by the large sample size. In a small to moderate study, a non-significant p-value may not mean the effect is biologically insignificant; the sample size may be too small to provide enough power.
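A small simulation makes the sample-size point concrete. The sketch below (plain Python, illustrative numbers only) tests the same tiny shift of 0.03 standard deviations with two very different sample sizes; only the large sample reliably flags it as "significant":

```python
import math
import random
from statistics import NormalDist

random.seed(0)

def z_test_p(sample, mu0=0.0):
    """Two-sided z-test p-value for the mean (sd treated as known, = 1)."""
    z = (sum(sample) / len(sample) - mu0) * math.sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))

effect = 0.03                                   # tiny, arguably unimportant shift
small = [random.gauss(effect, 1) for _ in range(100)]
large = [random.gauss(effect, 1) for _ in range(100_000)]

print(z_test_p(small))   # typically far from significant
print(z_test_p(large))   # highly significant, driven by the huge sample size
```

Same effect, very different p-values: the p-value alone does not tell you whether the effect matters.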
PS. A p-value of 0.03 cannot be interpreted as meaning the probability that the null hypothesis is true is 0.03. We do place personal beliefs on the possibility of the null being true or false, but the p-value is not designed to quantify that belief, even though small p-values often go with a strong belief that the null is false.
Although this phrase is correct and will be understood by people who understand it (of course), its psychological impact will be different from "we cannot reject the notion of no difference". "There is no statistical difference" is so similar to "There is no difference" that many people may unconsciously perceive them as the same.
Another problem in that paper was they mentioned false discovery rate (FDR) and referred to Benjamini and Hochberg (1995). Looks good so far. But what they really meant was just false positive rate, a.k.a. type I error rate. The reason might be that FDR sounds newer and fancier than type I error, which is so 19th century.
Sounds sophisticated and good, right? Well, I did a simulation study by randomly generating such data with affection status randomly assigned. Because the data were randomly generated, the markers couldn't survive Bonferroni correction. Nonetheless, I just picked the top 37 performers, calculated the first principal component, and saw how well it would predict the affection status. It turned out that on average I could reach 92% correct prediction, and 40% of my simulation replicates had 93% or better prediction. So, I had a 40% chance to achieve their feat or beat them on prediction performance. In other words, their prediction performance is well within the variation expected by chance alone.
In general, when there are many variables, if the outcome is involved to guide the variable selection process, then the selected variables tend to correlate with the outcome and "predict" the outcome well, by chance alone.
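This selection effect can be demonstrated directly. The sketch below (plain Python; the subject count, variable count, and voting rule are all invented for illustration, not the simulation described above) assigns a completely random outcome to 50 subjects, generates 1,000 pure-noise variables, keeps the 10 most "associated" with the outcome, and lets them vote on each subject's label. The in-sample "prediction" looks good even though every variable is noise:

```python
import random

random.seed(0)

n, m, k = 50, 1000, 10    # 50 subjects, 1000 noise variables, keep the top 10
outcome = [random.choice([0, 1]) for _ in range(n)]         # random labels
variables = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

def group_means(v):
    g1 = [x for x, y in zip(v, outcome) if y == 1]
    g0 = [x for x, y in zip(v, outcome) if y == 0]
    return sum(g1) / len(g1), sum(g0) / len(g0)

def score(v):
    """Absolute mean difference between outcome groups (crude association)."""
    m1, m0 = group_means(v)
    return abs(m1 - m0)

# Outcome-guided selection: keep the variables most associated with the labels.
top = sorted(variables, key=score, reverse=True)[:k]

def predict(i):
    """Majority vote of the selected variables, each oriented to agree
    with the outcome as well as possible (the source of the optimism)."""
    votes = 0.0
    for v in top:
        m1, m0 = group_means(v)
        votes += (1 if m1 > m0 else -1) * v[i]
    return 1 if votes > 0 else 0

accuracy = sum(predict(i) == outcome[i] for i in range(n)) / n
print(accuracy)   # well above the 50% expected for truly random labels
```

The optimism comes entirely from letting the outcome guide the selection; on a fresh sample these variables would predict at chance level.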
No laughter, please.
Often times there are some others who have the same questions. They will admire your courage instead of thinking you are stupid. If you don't ask questions, you are blocking yourself from accessing knowledge. (Wow, am I knowledge?) Anyway, you will differentiate yourself from the others further, in a bad way. Need more elaboration? Then you may be right that you might be stupid.

Well, think again. To give you an analogy: Suppose you propose to collect two groups of people (men and women) and measure their heights. Association apparently exists between gender and height, and it probably is enough to have 100 men and 100 women to show the groups have significantly different heights. But whatever prediction model you build based on height alone will surely have many misclassifications. Being able to detect a difference in a variable and being able to predict well using that variable are different things.
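The height analogy can be checked numerically. The sketch below (plain Python, with made-up but plausible height distributions) shows a wildly significant group difference alongside far-from-perfect prediction from height alone:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

# Hypothetical heights in cm; the group difference is real and easy to detect.
men = [random.gauss(175, 7) for _ in range(100)]
women = [random.gauss(162, 7) for _ in range(100)]

# Two-sample test (normal approximation): overwhelmingly significant.
se = (stdev(men) ** 2 / 100 + stdev(women) ** 2 / 100) ** 0.5
z = (mean(men) - mean(women)) / se
p = 2 * (1 - NormalDist().cdf(abs(z)))
print(p)          # tiny p-value: the difference is unmistakable

# Classify "man if taller than the midpoint": far from perfect prediction.
cut = (mean(men) + mean(women)) / 2
correct = sum(h > cut for h in men) + sum(h <= cut for h in women)
print(correct / 200)   # decent accuracy, but well short of 100%
```

The distributions overlap substantially, so no classifier based on height alone can come close to perfect prediction, no matter how small the p-value gets.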
In quantitative sciences, the word "frequency" often is used interchangeably with "count" or "number", except in genetics, where it is used for proportion or probability (due to historical reasons). "Relative frequency" is used for proportion. Unfortunately, in daily language, "frequency" may mean both "rate" and "count".
The word "probability" is a relatively abstract concept. But it can be thought of as rate or proportion. Fortunately, no one uses it for counts.
Technical terms tend to have meanings that differ from their daily usage. As in many other fields, terminology in statistics can mislead people. Examples include: parametric, valid, significant, exact, bias, interaction, model, regression, normal, confidence. Although these words come from daily language, they often have narrower or totally different meanings in statistics. Don't just interpret terms at face value.
Now, my confession: As a non-native speaker, I have limited exposure to the various possible usages of a word. This has led to two problems. (1) I tend to have a narrower understanding of a word. For example, "rate" has always been proportion or pace to me. (2) I also misunderstand words a lot. For example, for a long time, I thought "differentiate" was a math term and had trouble understanding what people were talking about when they used "differentiate" in a context apparently without any function involved. Another example is "linkage", which I learned through genetics. People often use this word in contexts that have nothing to do with genetics. A more confusing problem is that some people do use "linkage" in a genetics context but with its daily meaning (i.e. association or connection) instead of its narrower genetics meaning.
In the long run, your sample will be representative of the population most of the time; however, for the sample in your current study, you have no idea whether it is representative or whether you had the bad luck of drawing an unrepresentative one. As a result, statistical inference is like gambling. You may think of all possible samples as 95% representative and 5% unlucky, and calculate a 95% confidence interval (CI) based on your sample. The interpretation of this 95% CI is this: if your sample belongs to the 95% majority, the true value should fall inside the interval; if your sample belongs to the 5% minority, the true value is outside the interval. Even though in the long run, 95% of the time your CI will cover the true value, for your current sample, your mindset (at least mine) will be like a gambler's.
If you don't want the unlucky probability to be that high, you can change it to a lower value, say 1%. Then you shall classify all possible samples as 99% representative and 1% unlucky and calculate a 99% CI (which will be wider than a 95% CI) based on your sample. If your sample belongs to the 99% majority, your CI should contain the real value; if not, you are just unlucky. Because the 99% CI is wider, you have less accuracy, and this is the price you pay for having additional confidence.
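The long-run reading of a confidence interval is easy to verify by simulation. The sketch below (plain Python, using a known-standard-deviation interval for simplicity, with invented parameter values) draws many samples and counts how often the 95% CI covers the true mean:

```python
import random

random.seed(0)

true_mean, sd, n = 10.0, 2.0, 25
trials = 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = sum(sample) / n
    half = 1.96 * sd / n ** 0.5          # known-sd 95% interval for simplicity
    if m - half <= true_mean <= m + half:
        covered += 1

print(covered / trials)   # close to 0.95 in the long run
```

Replacing 1.96 with 2.576 gives the wider 99% interval: higher coverage, less accuracy, exactly the trade-off described above.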
One caveat in using exploratory analysis is that the data effectively are used twice: they tell you the pattern in the data and also contribute to the later analyses you formalize based on the identified pattern. For example, you look at a scatter plot of two variables, find it quite linear, and then fit a model assuming a linear relationship between the two variables. However, this linearity assumption was not pre-specified and is not independent of the data. This may lead to over-fitting, because most analysis procedures quantify results (e.g. through p-values) assuming the model structure is pre-specified. Also note that the more extensive the exploratory analyses you do, the more likely you are to find "patterns".
Unless you are doing a pure replication study, the advantages of exploratory analyses outweigh the disadvantage described above. So, as a general guideline, always look at the data. Don't just blindly believe the numbers coming out of an analysis procedure without looking at the data.
Categorization is very rigid. For example, BMI quartiles allow only four possible levels of BMI effects, while a simple linear or quadratic function of BMI allows many more possible levels of BMI effects. BMI quartiles take 4 degrees of freedom away from data to estimate the associated effects. These 4 degrees of freedom will be better spent if we use a restricted cubic spline on BMI with 4 knots.
Some people perceive effects in terms of odds ratios. Because traditional educational materials on odds ratios tend to focus on categories, they feel it necessary to categorize variables so that they can understand the results. In fact, odds ratios can always be obtained from analyses with continuous variables.
If you laughed, good. Then try not to do this when you are an investigator in the future.
For example, an investigator had money for only 50 mice and she wanted to know (1) if the effect of a drug would differ for two different mouse strains, (2) if the response difference between the strains would vary at different dosage levels, and (3) for each strain, if the response would differ at different dosage levels. She tried 2 strains and 5 dosage levels, allocating only 5 mice to each strain-dosage combination. Because the resources were diluted over so many different combinations and the variation of responses among the 5 mice for any single combination was quite high, we could only answer the first question in a conclusive way. She might have thought of three questions, but they translate to more than a dozen statistical tests that relied on only 50 data points! Actually, one test did show significance (without correcting for multiple comparisons). But when you do 12 tests, the probability that at least one appears significant at level 0.05, by chance alone, is 0.46, almost 50%. In other words, if you repeated this experiment twice, a "significant" result in one experiment or the other would be quite likely, just by chance.
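The 0.46 figure comes from the standard formula for the chance of at least one false positive among independent tests, which is a one-liner:

```python
# Probability of at least one "significant" result among m independent
# tests at level alpha, when every null hypothesis is true.
def any_significant(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

print(round(any_significant(12), 2))   # 0.46, matching the mouse example
```

The formula assumes the tests are independent; with correlated tests the exact value differs, but the qualitative lesson is the same.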
Basically speaking, Simpson's paradox describes the seemingly contradictory effects of variable 1 on variable 2, depending on whether we ignore variable 3 or take it into account. This often occurs because the subsets of data stratified by variable 3 are not comparable (i.e. they are heterogeneous) with respect to variables 1 and/or 2. Thus, pooling the heterogeneous subsets while ignoring variable 3 often leads to misleading results.
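A small numeric illustration (counts invented purely to exhibit the reversal): treatment A beats B within each severity stratum, yet B looks better when the strata are pooled, because A was given mostly to severe cases:

```python
# Hypothetical counts: (treatment, severity) -> (recoveries, patients).
data = {
    ("A", "mild"):   (18, 20),
    ("A", "severe"): (32, 80),
    ("B", "mild"):   (64, 80),
    ("B", "severe"): (6, 20),
}

def rate(treatment, severity=None):
    """Recovery rate for a treatment, within one stratum or pooled."""
    cells = [v for (t, s), v in data.items()
             if t == treatment and severity in (None, s)]
    rec = sum(r for r, _ in cells)
    tot = sum(n for _, n in cells)
    return rec / tot

print(rate("A", "mild"), rate("B", "mild"))       # 0.9 vs 0.8: A wins
print(rate("A", "severe"), rate("B", "severe"))   # 0.4 vs 0.3: A wins
print(rate("A"), rate("B"))                       # 0.5 vs 0.7: B "wins" pooled
```

Here variable 1 is treatment, variable 2 is recovery, and variable 3 is severity; pooling hides the fact that A's caseload was much more severe than B's.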
Students: If you are interested, identify the three variables in each of the five examples in the above linked article.