I have seen the dozens of postings on the multiple-testing subject. I have also done quite a lot of Stan-based analyses and had to defend them against p-value-hungry authors and reviewers. I am also a co-signatory of the Nature paper on the subject.
But let's discuss a concrete example of a recent relapse into p-picking:
A medical researcher who always asks for “gold-standard” p-values as a reviewer got a new machine that extracts 43 (yikes!) parameters from each patient’s recording. Some are highly dependent (e.g. AUC and maximal value), some are reasonably normally distributed, most are highly skewed or hurdle-distributed. The parameters were the outcomes measured at one visit 3 years after one of two types of surgery. Let’s assume patients were randomized (they were not, but that’s another fight), and no preop values are available (the machine is new, you know).
His question was: which metrics changed most with surgery? There was no composite hypothesis; the individual metrics were the question (the machine vendor wants a selling point, you know). So he asked for non-parametric tests between surgery types, and 4 metrics came out “significant” (“please add 1-2-3 asterisks”). I got nervous.
When he decided to pick the 4 significant ones and publish only these, my blood pressure went ballistic. I revised the report and put a Hommel p.adjust on it as a penalty - I did not feel very comfortable, dependencies and all, but my blood pressure went down.
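For the record, the equivalent of R's `p.adjust(p, method = "hommel")` in Python goes through statsmodels; the p-values below are invented purely for illustration, not from the actual 43 metrics:

```python
# Sketch: Hommel adjustment of raw p-values, as done by R's p.adjust.
# The raw p-values here are made up for illustration only.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.039, 0.041, 0.120, 0.470, 0.740, 0.950]

# multipletests returns (reject flags, adjusted p-values, and two
# corrected alpha levels we ignore here)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="hommel")

for p, q, r in zip(raw_p, adj_p, reject):
    print(f"raw={p:.3f}  hommel={q:.3f}  significant={r}")
```

Note that Hommel, like Hochberg, formally assumes a certain positive dependence among the tests, which is exactly why I felt uneasy applying it to 43 correlated machine outputs.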
After the gun smoke had dispersed, I suggested removing the Hommel adjustment and asked him to publish all parameters with Hodges-Lehmann intervals, leaving the decision to the reader. I remember that 25 years ago Altman asked me to do exactly this for a BMJ manuscript.
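A Hodges-Lehmann shift estimate with its distribution-free confidence interval (what R's `wilcox.test(x, y, conf.int = TRUE)` reports) can be sketched as below; the data are simulated, and the normal approximation for the CI rank is one common convention among several:

```python
# Sketch: Hodges-Lehmann two-sample shift estimate with a Moses-type
# distribution-free confidence interval. Data are simulated.
import numpy as np
from scipy.stats import norm

def hodges_lehmann_ci(x, y, alpha=0.05):
    """Point estimate: median of all pairwise differences x_i - y_j.

    CI endpoints: order statistics of those differences, with the rank
    chosen by a normal approximation to the Mann-Whitney null count.
    """
    d = np.sort(np.subtract.outer(x, y).ravel())
    hl = np.median(d)
    m, n = len(x), len(y)
    z = norm.ppf(1 - alpha / 2)
    k = int(np.floor(m * n / 2 - z * np.sqrt(m * n * (m + n + 1) / 12)))
    k = max(k, 0)
    return hl, d[k], d[m * n - k - 1]

# Simulated skewed outcomes; surgery B shifted upward on the log scale
rng = np.random.default_rng(1)
a = rng.lognormal(0.0, 0.5, size=40)
b = rng.lognormal(0.3, 0.5, size=40)
hl, lo, hi = hodges_lehmann_ci(b, a)
print(f"HL shift = {hl:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Reporting all 43 of these intervals, unadjusted, shows the reader both the effect sizes and their precision, instead of hiding 39 metrics behind a significance filter.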