Research into the performance of tests for diagnosis and treatment selection

This is my proposed introduction to the final chapter of the forthcoming 4th edition of the Oxford Handbook of Clinical Diagnosis:

Very little of the transparent diagnostic reasoning process described in the Oxford Handbook of Clinical Diagnosis is currently supported by research evidence. In order to address this problem, the final chapter proposes new concepts. It is directed at medical professionals, their students and medical scientists, especially those who still recall what they were taught about mathematics and statistics, and also at professional statisticians, whose training ensures that they have this knowledge.

The chapter begins by addressing the connection between observed proportions and probabilities, how a probability based on a proportion of 1 out of 2 (i.e. 0.5) is different to 1000 out of 2000 (also 0.5) and how such probabilities might be expressed differently. There is then a discussion about the possible proportions that might be discovered eventually if an infinite number of observations are made in an identical way and also if the study were to be repeated only with the same number of observations.
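One way to make the difference between 1 out of 2 and 1000 out of 2000 concrete is to attach an interval to each proportion. The sketch below uses the Wilson score interval, which is my choice for illustration and not necessarily the method used in the chapter; the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Same observed proportion (0.5), very different amounts of evidence.
small = wilson_interval(1, 2)
large = wilson_interval(1000, 2000)
print("1/2:      ", small)
print("1000/2000:", large)
```

Both intervals are centred on 0.5, but the one based on 2 observations spans most of the range from 0 to 1, while the one based on 2000 observations is narrow.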

Bayes rule is shown to be a special case of an ‘inverse P circuit rule’ and closely related to the syllogism. This is followed by various representations and interpretations of Bayes rule, both those that rely heavily on conditional independence (e.g. by Bayesians, AI and machine learning advocates) and those that do not. Differential diagnostic reasoning by probabilistic elimination assumes conditional dependence (not independence) and applies a theorem based on the extended version of Bayes rule. Instead of overall likelihood ratios based on false positive rates divided by sensitivities, reasoning by probabilistic elimination uses ratios of sensitivities and false negative rates. Specificity and false positive rates are useful in population studies but not in clinical reasoning at the bedside or clinic. Tests which are very useful for reasoning by probabilistic elimination may appear useless if assessed using overall likelihood ratios based on specificities and false positive rates.
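As a rough illustration of why a ratio of sensitivities can be informative when comparing two candidate diagnoses, the sketch below applies the ratio form of Bayes rule to a single finding. It is not the chapter's theorem; the function name and all numbers are hypothetical.

```python
def posterior_ratio(prior_a, prior_b, sens_a, sens_b):
    """Ratio p(A | finding) / p(B | finding) for two candidate diagnoses.

    prior_a, prior_b: probabilities of A and B before the finding is known.
    sens_a, sens_b:   p(finding | A) and p(finding | B), i.e. sensitivities.
    The common denominator p(finding) cancels, so it is not needed.
    """
    return (prior_a * sens_a) / (prior_b * sens_b)

# Hypothetical numbers: two diagnoses that start out equally probable,
# and a finding that is common in A but rare in B.
ratio = posterior_ratio(prior_a=0.3, prior_b=0.3, sens_a=0.9, sens_b=0.05)
print(f"A is now about {ratio:.0f} times as probable as B")
```

Only the frequency of the finding in each of the two diagnoses enters the calculation; its frequency in people without either diagnosis does not.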

Instead of dichotomising results into high, normal or low, numerical results can be interpreted directly (e.g. the differential diagnosis of abdominal pain in a person aged 15 years instead of in ‘children under 15 years’ and ‘adults of 15 years and over’). Instead of assessing the average efficacy of treatment in subjects with a test result within some fixed range (e.g. an albumin excretion rate (AER) of 20mcg/min or above), the efficacy can be estimated at each possible value of the test result (which reflects the way that experienced clinicians would apply RCT results by assessing the degree of severity of an illness in an individual patient). This can be modelled by fitting calibrated logistic regression functions or some other modelling function separately to the placebo and treatment data. This shows that it is practically as well as mathematically inappropriate to assume that risk ratios or odds ratios are constant for all baseline risks, especially those estimated on populations not assessed during an RCT. Applying logistic regression also allows 95% confidence intervals to be placed on the estimated probabilities. These can be interpreted as a probability (or confidence) of 0.95 of replicating an estimated probability within the 95% confidence limits after an infinite number of observations are made (by analogy with 95% predictive intervals).
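The point about non-constant risk ratios and odds ratios can be sketched with two logistic curves, one per trial arm. The coefficients below are invented for illustration (they stand in for curves that would be fitted separately to placebo and treatment data), and modelling on log AER is my assumption.

```python
import math

def logistic(x, intercept, slope):
    """Probability of the outcome from a logistic curve."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))

# Hypothetical (intercept, slope) pairs on log(AER); not from any real trial.
PLACEBO = (-6.0, 1.2)
TREATMENT = (-6.0, 0.9)

def compare(aer):
    """Return placebo risk, treated risk, risk ratio and odds ratio at one AER."""
    x = math.log(aer)
    p0 = logistic(x, *PLACEBO)
    p1 = logistic(x, *TREATMENT)
    risk_ratio = p1 / p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    return p0, p1, risk_ratio, odds_ratio

for aer in (20, 67, 200):
    p0, p1, rr, oratio = compare(aer)
    print(f"AER {aer:>3}: placebo {p0:.3f}, treated {p1:.3f}, "
          f"RR {rr:.2f}, OR {oratio:.2f}, risk reduction {p0 - p1:.3f}")
```

Because the two arms have different slopes, the odds ratio, risk ratio and absolute risk reduction all change with the individual test result, which is what a single average effect for ‘AER of 20 or above’ would hide.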

Interpreting individual test results allows thresholds to be established for diagnoses and for offering treatments that avoid over-diagnosis and over-treatment. The latter are often due to basing thresholds arbitrarily on two standard deviations above or below the mean of a test's results in some population. On the other hand, interpreting individual test results (e.g. an AER of 67mcg/min as opposed to an AER of 20mcg/min or over) allows more accurate probabilities to be used during shared decision making with a patient, perhaps by applying Decision Analysis. When doing this, it is also important to take into account the variation in the results of tests performed on individual patients by estimating the standard deviation of those test results.

My aim is to try to improve understanding and collaboration on research into diagnostic tests between the statistical, AI and medical communities. How do you think that they will respond to this?


This is definitely a big improvement over the way it has been taught until now. I think the students would be even better served by teaching them probabilistic diagnosis through logistic regression. This handles pre-test probabilities, non-binary diagnoses and continuous test outputs, and is consistent with the typical type of study used to build diagnostic models: cohort studies. [Sens and spec are only consistent with case-control studies.] This chapter covers this: Biostatistics for Biomedical Research - 19 Diagnosis. Bayes’ rule is used in all medical school diagnostic curricula, but this is the wrong time to introduce Bayes’ rule if you already have a cohort study.


“Applying logistic regression also allows 95% confidence intervals to be placed on the estimated probabilities. These can be interpreted as a probability (or confidence) of 0.95 of replicating an estimated probability within the 95% confidence limits after an infinite number of observations are made (by analogy with 95% predictive intervals).”

I may be misunderstanding, but I believe that a frequentist CI, once generated, is not probabilistic and this includes that it’s not a “replication” probability. The confidence level is the long-run accuracy of the method to “capture” the true parameter value, but not probabilistic regarding any particular interval. This would fall more in the realm of a Bayesian Credible Interval, correct?
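The “long-run accuracy of the method” can be illustrated by simulation: repeat a study many times, build a 95% interval each time, and count how often the interval captures the true value. This is a minimal sketch assuming normally distributed data; the function name, the parameter values and the sample size of 30 are all hypothetical.

```python
import random
import statistics

N = 30            # observations per simulated study
T_975_DF29 = 2.045  # 0.975 quantile of Student's t with N - 1 = 29 df

def coverage(true_mean=10.0, sd=2.0, repeats=2000, seed=1):
    """Fraction of simulated 95% confidence intervals that contain true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(N)]
        mean = statistics.fmean(sample)
        sem = statistics.stdev(sample) / N ** 0.5
        half_width = T_975_DF29 * sem
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / repeats

rate = coverage()
print(f"Proportion of intervals capturing the true mean: {rate:.3f}")
```

The proportion should come out close to 0.95. It describes the procedure across repeated studies; it does not by itself assign a probability to any one interval already computed.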

Correct. The language is unclear.

Thank you for your comments @med_stat and @f2harrell. I accept the point that I did not explain myself. However, I did describe my reasoning in another post: The Higgs Boson and the relationship between P values and the probability of replication.

The Bayesian posterior probability that a true result will lie within a credible interval relies on a prior probability distribution of possible true results that is estimated informally. The data arising from a real study are used to estimate a likelihood distribution of the observed study mean or proportion conditional on a range of possible true results. However, the patients studied to estimate this likelihood distribution are not a subset of the set of patients imagined in order to arrive at the prior distribution, so this violates a condition of Bayes rule. The rule only holds when U is a ‘universal set’ such that X⊆U and Y⊆U, giving p(Y|X) = p(Y|U)*p(X|Y)/p(X|U). In other words, the informal Bayesian prior has to be equal to the result of a previous posterior probability obtained from a prior probability distribution conditional on some universal set and an informally estimated likelihood distribution.
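The identity p(Y|X) = p(Y|U)*p(X|Y)/p(X|U) can be checked by counting on small finite sets. The sets below are arbitrary and chosen only for illustration; `V` is a hypothetical second reference set standing in for a prior that was obtained from somewhere other than U.

```python
from fractions import Fraction

U = set(range(1, 21))               # the universal set
X = {n for n in U if n % 2 == 0}    # X is a subset of U
Y = {n for n in U if n % 3 == 0}    # Y is a subset of U

def p(a, given):
    """Proportion of `given` that also lies in `a`, as an exact fraction."""
    return Fraction(len(a & given), len(given))

lhs = p(Y, X)
rhs = p(Y, U) * p(X, Y) / p(X, U)
print(lhs, rhs)  # both sides agree when every term is conditional on U

# Taking the 'prior' for Y from a different reference set V breaks the equality.
V = set(range(1, 31))
Y_in_V = {n for n in V if n % 3 == 0}
rhs_mixed = p(Y_in_V, V) * p(X, Y) / p(X, U)
print(lhs, rhs_mixed)
```

With every term conditional on the same U the two sides match exactly; substituting a prior from V gives a different answer.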

If the prior probability distribution conditional on some universal set is uniform, the informally estimated Bayesian prior distribution can also be regarded as an estimated likelihood distribution of an imagined mean or proportion conditional on a range of possible true results. However, this raises the question of what the universal set is. The universal set could be the scale of possible values to be used on the horizontal axis. As each of these values is unique, their prior probabilities would be uniform. This information about the scale would be known prior to the knowledge used to estimate the informal Bayesian distribution and the result of the real study. This is the only logical model that I can think of as being consistent with a universal set in this situation. If the imagined and real study generated distributions are regarded as the result of random selections from a single population with a true mean or proportion, then their joint distributions can be modelled by an assumption of conditional independence.

In the situation that I describe in my original post, there is no informally estimated prior distribution, so I did not think I could call it a ‘Bayesian’ posterior probability of a result falling within a credible interval. The 95% confidence interval summarises the likelihood distribution of the observed mean or proportion conditional on a range of possible true means or proportions. By adopting a universal set of unique values with uniform priors, the posterior distribution is identical to the likelihood distribution, so that the 95% posterior probability of replication interval is identical to the 95% confidence interval. We use the same reasoning that includes uniform prior probabilities in clinical settings to calculate 95% posterior probability prediction intervals from standard deviations (as opposed to SEMs). I follow this approach to maintain consistency between my clinical reasoning and statistical reasoning. It does not affect the results of calculating Bayesian posterior probability distributions by incorporating a distribution that has been estimated informally.
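The numerical agreement between the two intervals can be sketched on a grid of possible true means. The example assumes a normal likelihood with a known SEM and a uniform prior over the grid; the function name, the observed mean of 5.0 and the SEM of 0.5 are hypothetical.

```python
import math

def posterior_interval(obs_mean, sem, level=0.95, grid_points=20001, span=8.0):
    """Central posterior interval for the true mean under a uniform grid prior."""
    lo = obs_mean - span * sem
    step = 2 * span * sem / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    # Likelihood of the observed mean at each candidate true mean.
    like = [math.exp(-0.5 * ((obs_mean - m) / sem) ** 2) for m in grid]
    total = sum(like)
    post = [v / total for v in like]  # uniform prior cancels on normalising
    tail = (1 - level) / 2
    cum, lower, upper = 0.0, None, None
    for m, w in zip(grid, post):
        cum += w
        if lower is None and cum >= tail:
            lower = m
        if cum >= 1 - tail:
            upper = m
            break
    return lower, upper

obs_mean, sem = 5.0, 0.5
post_lo, post_hi = posterior_interval(obs_mean, sem)
ci_lo, ci_hi = obs_mean - 1.96 * sem, obs_mean + 1.96 * sem
print("posterior interval: ", (round(post_lo, 3), round(post_hi, 3)))
print("confidence interval:", (round(ci_lo, 3), round(ci_hi, 3)))
```

Under these assumptions the two sets of limits coincide up to the resolution of the grid. Whether the shared limits should be read as a probability of replication is the interpretive question discussed in this thread, which the code does not settle.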
