A teaser to Ryan Martin’s “A Statistical Inference Course Based on p-Values”

This post is a teaser for Ryan Martin’s “A statistical inference course based on p-values” (Martin, 2017).

Have you ever seen the authors of a science paper report p = 0.000? Is that possible? And what would it mean? Well, if it were possible, it would mean that there’s a zero percent chance we’d see the experiment’s results if the null hypothesis were true; that with 100% certainty, the investigators believe they would never get a false positive when re-testing their hypothesis. But such a claim is irrational, as nothing in life is so certain–even if we had the time and resources to survey every subject in a population. At best, these zero-p authors meant that the p-value was small enough to round to zero at the reported precision (e.g., p < 0.001), and certainly below a conventional significance level like 0.05. If that is the case, then we can recite the standard textbook interpretation of p-values, which goes something like:

If the p-value is less than our significance level of 0.05, we have enough evidence to reject our null hypothesis.

Since there are multiple interpretations of the p-value, we define it here for a toy example as follows: suppose we collect some observation y* that measures an unknown $\theta$ of interest. Perhaps this elusive $\theta$ represents the average number of credit cards an American owns. Under this experiment, y is available to us–in that we can select a representative group of Americans, check their credit reports for the number of credit accounts opened, and calculate an average–but $\theta$ itself is only available through y. If we had all the time and money in the world, we’d collect as many ys as possible, because that would give us a way to form the density function for y, upon which we can do inference on $\theta$. In this case, doing inference on $\theta$ might entail figuring out where the majority of our ys lie. But since we don’t have all the time and money in the world to collect every possible y, we use what’s available to us in our statistical toolbox, and that sometimes arrives in the form of a hypothesis test.
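To make the “all the time and money in the world” idea concrete, here is a minimal simulation sketch. The numbers are entirely hypothetical: we pretend the true average (the $\theta$ we never observe directly) is 7 and that individual measurements scatter around it; with a large enough sample of ys, the sample mean homes in on $\theta$.

```python
import random
import statistics

random.seed(1)

# Hypothetical ground truth we never observe directly: the true
# average number of credit cards, theta. Each "respondent" yields
# one noisy measurement y around it.
theta_true = 7.0

# With unlimited resources we could collect many ys; their sample
# mean would home in on theta. Here, a large simulated survey:
ys = [random.gauss(theta_true, 1.0) for _ in range(10_000)]
print(statistics.mean(ys))  # close to 7
```

In practice we only get a modest sample, which is exactly why we fall back on tools like the hypothesis test described next.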

In order to carry out a proper hypothesis test, we must set up a null and alternative hypothesis. Let’s say the status quo is to take $\theta_0 = 7$ credit cards, which we’ll define as our null hypothesis. For simplicity, we’ll take our alternative as $\theta \neq 7$. Given some data point y* = 5, and some assumptions on how we believe our ys are distributed under the null (for example, normal with a mean of 7 and a standard deviation of 1), we can construct the figure below, which illustrates the p-value (shaded in red): the proportion of the null distribution falling at values as extreme as, or more extreme than, the observed y*.
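For this toy setup the shaded area can be computed directly. The sketch below assumes the two-sided test described above (null N(7, 1), observed y* = 5) and uses the standard normal tail probability via the complementary error function:

```python
import math

# Two-sided p-value for the toy example: null theta0 = 7, sd = 1,
# observed y* = 5. Under the null, Y ~ N(7, 1), so the p-value is
# the probability of landing at least |y* - theta0| = 2 sds from 7.
theta0, sigma, y_star = 7.0, 1.0, 5.0
z = abs(y_star - theta0) / sigma

# P(|Z| >= z) for standard normal Z = erfc(z / sqrt(2)).
p_value = math.erfc(z / math.sqrt(2))
print(round(p_value, 4))  # about 0.0455
```

So y* = 5 lands just inside the conventional 5% rejection region–a “significant” result by the textbook recipe.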

While this example might seem overly simplified, we chose it because it illustrates the original intention of the p-value: simply a measure of the “statistical position” of the data given the null hypothesis (Fraser and Reid, 2016). Yet despite this simplicity, given the computational resources available when hypothesis testing first gained popularity, calculating this p-value was challenging. As a result, reference tables were created in which p-values were evaluated relative to standard values, like 1%, 5%, 10% and 20%, so that the end result of an investigation was simply some approximation of the p-value, leading to conclusions like “not significant at the 5% level” or “significant at the 1% level”. This unintentionally pushed p-values to take on more meaning than just that of a measure of “statistical position”: they became a way to make a decision for, or against, some null model (Fraser and Reid, 2016).

As it caught on that p-values below 0.05 were meaningful enough to make a stand against a null hypothesis, researchers, in order to ensure an experiment leads to interesting results, began to game the data pipeline: in the selection of subjects, in the disclosure of results, and in methodological development (Leek and Peng, 2015). Our use and abuse of p-values have fueled controversies around the inability to reproduce experimental results, to the extent that some journals have banned the use of p-values altogether. Yet statisticians remain undeterred by such moves and continue to rely on this statistic, understanding its strength in constructing tests with a desired frequentist Type I error rate (the chance of a false positive).

Given this reliance on p-values, Martin and Liu have proposed a reassessment of the p-value (Martin and Liu, 2013). They’ve designed, instead, a framework for arriving at the plausibility of a specific hypothesis, which allows for the more intuitive approach of interpreting it as a quantitative index for the “truth” of some hypothesis about $\theta$ given the data y*. They recognize that while it’s difficult to prove whether some $\theta$ is true given the data, it’s possible to ascribe some level of plausibility to $\theta$. Their framework thus allows for the construction of plausibility functions, which investigators can use to arrive at a conclusion about a hypothesis by choosing some cutoff c: if the plausibility is less than c, the hypothesis is sufficiently implausible. In mathematical terms, letting $pl_D(A)$ represent the plausibility of some hypothesis A given data D, we can make subjective conclusions based on whether: $pl_D(A) \leq c$

From a practical point of view, this plausibility function, at least applied to our toy problem above, results in the same “p-value function,” or “confidence distribution function,” as explored in Xie and Singh (2013), among others. This p-value function is an extension of the p-value itself: instead of calculating a single p-value associated with one null hypothesis $\theta$, imagine calculating it for a whole range of possible $\theta$s, tracing out a curve of p-values over $\theta$. We thus avoid singling out one alternative, leaving it up to domain experts to choose the appropriate cutoff c, where $\theta$ values that fall below this plausibility threshold c are sufficiently implausible. Valid inference is then achieved when we connect such subjective opinions to a sound model. In other words, inferences are meaningful when they can generalize beyond a single study, so a sound model must be designed to capture the variability in the observed data, as well as in any other sample that would be attained under similar conditions. If we can arrive at such a model $P_\theta$, indexed by some parameter $\theta$, valid inference is achieved if: $P_\theta (pl_D(A) \leq c) \leq \alpha$
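The p-value function for the toy example can be traced out with a few lines. This is a sketch under the same assumptions as before (y* = 5, sd = 1, two-sided test); the grid bounds and cutoff c = 0.05 are arbitrary choices for illustration:

```python
import math

def plausibility(theta, y_star=5.0, sigma=1.0):
    # Two-sided p-value of the point hypothesis "theta", viewed as
    # a function of theta: the p-value / plausibility curve.
    z = abs(y_star - theta) / sigma
    return math.erfc(z / math.sqrt(2))

# Sweep a grid of candidate thetas and keep those whose
# plausibility exceeds the chosen cutoff c.
c = 0.05
grid = [round(3.0 + 0.01 * i, 2) for i in range(401)]  # 3.00 .. 7.00
plausible = [t for t in grid if plausibility(t) > c]
print(plausible[0], plausible[-1])  # roughly 5 - 1.96 and 5 + 1.96
```

The curve peaks at the observed y* (where the plausibility is 1), and the set of $\theta$s surviving the cutoff coincides with the familiar 95% confidence interval–one way to see why the plausibility function reproduces the confidence distribution in this example.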

where $\alpha$ is some small value, so that the probability of arriving at our subjective opinion that A is sufficiently implausible is no greater than $\alpha$ under our proposed model $P_\theta$. We thus control the Type I error, guarding ourselves against saying the data supports a particular alternative when in reality it is false. If this condition is satisfied, then the subjective opinions based on our plausibility function are valid. While the p-value function arises as a special case of this new inferential framework, plausibility functions are in fact more general: they extend past the hypothesis testing context and steer clear of the need for any asymptotic justification through the use of predictive random sets (the technical differences can be explored further at https://arxiv.org/abs/1206.4091 and https://arxiv.org/abs/1606.02352). Beyond these advantages, though, the potency of this new inferential line of thought lies in its detachment from exactitude–from using p = 0.05 or p = 0.005–from setting some arbitrary standard that determines how publishable or worthy an experiment is. It focuses, instead, on the development of plausibility functions that can be easily interpreted as quantitative indexes of “truth,” which researchers can use to guide their decisions–underscoring the subjectivity involved in doing statistics, and the fact that we cannot free ourselves from the responsibility of using our own judgement.
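The validity condition can be checked by simulation for the toy example. The sketch below (all numbers hypothetical, same normal model as before) generates data from $P_\theta$ with $\theta = 7$, so the hypothesis “$\theta = 7$” is in fact true, and records how often it would be declared implausible at cutoff c = $\alpha$ = 0.05:

```python
import math
import random

random.seed(0)

def plausibility(theta, y, sigma=1.0):
    # Two-sided p-value of "theta" given one observation y ~ N(theta, sigma).
    return math.erfc(abs(y - theta) / (sigma * math.sqrt(2)))

# Monte Carlo check of validity: under the true model, the chance of
# (wrongly) calling the true theta implausible should not exceed alpha.
theta, alpha, n_sims = 7.0, 0.05, 100_000
rejections = sum(
    plausibility(theta, random.gauss(theta, 1.0)) <= alpha
    for _ in range(n_sims)
)
print(rejections / n_sims)  # close to 0.05
```

The empirical rejection rate hovers around 0.05, illustrating the Type I error control that the displayed inequality demands.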

References

Crane, H. and R. Martin (2018), “Is statistics meeting the needs of science?” PsyArXiv.

Fraser, D. and N. Reid (2016), "Crisis in science? or crisis in statistics!"

Leek, J., and R. Peng (2015), “Statistics: P values are just the tip of iceberg,” Nature, 520, 612.

Lilienfeld, S. O. et al. (2015), “Fifty psychological and psychiatric terms to avoid: a list of inaccurate, misleading, misused, ambiguous, and logically confused words and phrases,” Frontiers in Psychology, 6, 1110.

Martin, R. (2017), “A statistical inference course based on p-values,” The American Statistician, 71 (2), 128-136.

Martin, R., and C. Liu (2013). “Inferential models: a framework for prior-free posterior probabilistic inference.” Journal of the American Statistical Association, 108, 301–313.

Xie, M., and K. Singh (2013), “Confidence distribution, the frequentist distribution estimator of a parameter: a review with discussion,” International Statistical Review, 81, 3-39.