I've been thinking about mass screenings for low-prevalence problems, and I built a small interactive simulator to help people visualize the implications of programs with this structure.
A few thoughts (take them with a grain of salt; I'm no expert in this area). The setup you present is the textbook correction for interpreting diagnostic test results – great for a general audience and stats classes, but not all that interesting to the research community. For that purpose, I think the app works just fine (I didn't check the math).
For a more research-focused audience, as you might find here, there are a few issues with the standard diagnostic test setup worth exploring. For example, what if:
The user is allowed to set costs/benefits for false/true positives/negatives? What if a forced binary choice is not necessary?
Thanks very much for taking the time to look and for these thoughtful suggestions.
A few clarifications on the tool's scope and aims: the tool is meant to help people correct for base rate bias by visualizing the full 2×2 outcome spread that Bayes' rule implies for a binary test. One of the persistent problems in public discussions of tests and screenings with this structure is that people collapse the test into a single accuracy number. I was surprised that such a tool didn't already seem to exist, given how standard the math is. The goal is intuition-building for non-technical audiences, not comprehensive decision support.
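To make that concrete, here's a minimal sketch of the arithmetic the tool visualizes (illustrative numbers, not the tool's actual code):

```python
def outcome_spread(prevalence, sensitivity, specificity, population):
    """Counts of true/false positives/negatives for a binary screen."""
    sick = population * prevalence
    healthy = population - sick
    tp = sick * sensitivity           # cases correctly flagged
    fn = sick - tp                    # cases missed
    fp = healthy * (1 - specificity)  # non-cases wrongly flagged
    tn = healthy - fp                 # non-cases correctly cleared
    return tp, fn, fp, tn

# Illustrative values: 0.1% prevalence, "95% accurate" test, 1M people.
tp, fn, fp, tn = outcome_spread(0.001, 0.95, 0.95, 1_000_000)
ppv = tp / (tp + fp)  # P(case | positive), i.e. Bayes' rule
print(f"TP={tp:.0f} FN={fn:.0f} FP={fp:.0f} TN={tn:.0f} PPV={ppv:.1%}")
# -> PPV is ~1.9%: a positive result is still ~98% likely a false alarm.
```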
On your suggestions:
Costs/benefits/utility differences.
Yes: that moves toward decision analysis rather than pure classification. This is partly addressed in the "Important considerations" text at the bottom right of the tool. Future incarnations should incorporate mechanisms beyond the test-classification pathway. Fienberg's and many other analyses omit them, but strategic behavior, information effects, and resource allocation contribute to many such programs' real-world effects.
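As a rough illustration of what that extension would involve, here's a sketch attaching hypothetical per-outcome costs to the same 2×2 counts (every cost value is made up for the example):

```python
# Same illustrative parameters as the sketch above.
prevalence, sens, spec, n = 0.001, 0.95, 0.95, 1_000_000
tp = n * prevalence * sens
fn = n * prevalence - tp
fp = n * (1 - prevalence) * (1 - spec)
tn = n * (1 - prevalence) - fp

cost = {"tp": 10.0, "fn": 100.0, "fp": 20.0, "tn": 0.0}  # made-up values
screen = tp * cost["tp"] + fn * cost["fn"] + fp * cost["fp"] + tn * cost["tn"]
no_screen = (tp + fn) * cost["fn"]  # with no screening, every case is missed
print(f"expected cost: screen {screen:,.0f} vs. no screen {no_screen:,.0f}")
# Under these made-up costs the false positives dominate the comparison.
```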
Uncertain sensitivity/specificity.
Thank you for the Gelman and Carpenter reference; it's an excellent paper, and I've now added it to the limitations. The tool assumes parameters are known and fixed, which is pedagogically cleaner but obviously unrealistic. Adding uncertainty bands would be valuable for a research tool, but might obscure the core lesson (correcting for base rate bias) for the general audience the tool envisions.
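For what it's worth, a rough sketch of what propagating that uncertainty could look like, drawing sensitivity and specificity from Beta distributions fit to hypothetical validation counts:

```python
import random

prevalence = 0.001
ppvs = []
for _ in range(10_000):
    sens = random.betavariate(95 + 1, 5 + 1)    # e.g. 95/100 correct on cases
    spec = random.betavariate(190 + 1, 10 + 1)  # e.g. 190/200 on non-cases
    ppvs.append(prevalence * sens /
                (prevalence * sens + (1 - prevalence) * (1 - spec)))

ppvs.sort()
print(f"PPV 95% interval: {ppvs[250]:.2%} to {ppvs[9750]:.2%}")
```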
Covariate-dependent performance.
This could be valuable for future extensions, though it introduces significant complexity around causal structure and statistical fairness trade-offs (see my posts on Simpson's paradox and fairness concepts for why I'm cautious here).
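A small sketch of why I'm cautious: even before letting sensitivity and specificity themselves vary by covariate, differences in subgroup base rates alone (hypothetical numbers here) pull PPV far apart, which is where the fairness trade-offs start to bite:

```python
sens, spec = 0.95, 0.95  # identical test characteristics in both groups
for group, prev in [("low-risk", 0.0005), ("high-risk", 0.05)]:
    ppv = prev * sens / (prev * sens + (1 - prev) * (1 - spec))
    print(f"{group}: prevalence {prev:.2%} -> PPV {ppv:.1%}")
# -> roughly 0.9% vs. 50%: same test, very different meaning of "positive".
```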
Continuous markers and thresholds.
Yes: many tests are thresholds on continuous data, and arbitrary dichotomization is its own methodological challenge. For the purposes of this tool, though, I've intentionally restricted the domain to screenings that are deployed and experienced as binary decisions (positive/negative), because that's the form in which the downstream harms of mass screening actually occur.
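For readers curious about what's being abstracted away, here's a toy sketch with a hypothetical Gaussian marker, where the sensitivity/specificity pair the tool takes as input is just a function of where the cutoff sits:

```python
from statistics import NormalDist

non_cases = NormalDist(mu=0.0, sigma=1.0)  # marker among non-cases
cases = NormalDist(mu=2.0, sigma=1.0)      # marker among cases

for cutoff in (0.5, 1.0, 1.5):
    sensitivity = 1 - cases.cdf(cutoff)  # cases scoring above the cutoff
    specificity = non_cases.cdf(cutoff)  # non-cases scoring below it
    print(f"cutoff {cutoff}: sens {sensitivity:.2f}, spec {specificity:.2f}")
```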
The core question: Should this remain a simple educational tool, or evolve into something more sophisticated? I lean toward keeping it simple but being more explicit about what it doesn’t model. What do you think?
I’ve updated the tool’s limitations section to acknowledge these issues more explicitly, including the G&C reference. Thanks again for the thoughtful comments!