I have developed a machine learning model to predict a quantitative output for medical diagnosis (low bone density). I want to convert the model output to a binary outcome and compare it to the gold standard. The validation set (used for model selection) and the test set have similar distributions. I have done ROC analysis on both sets. Which set is the most appropriate for selecting a threshold for future use in the real world? If I use the test set to select the threshold, is it no longer a true test set, since we have changed the way the system operates based on its performance on this set?

There is only one way that works. Develop a utility function that specifies the consequences of various decisions. Then compute the expected utility and make the decision that optimizes that. Any attempt at thresholding that does not do that will arrive at a non-reproducible threshold.
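To make this concrete, here is a minimal sketch of an expected-utility decision for a single patient. The two actions (`refer`, `do_not_refer`) and every number in `UTILITIES` are hypothetical placeholders chosen only for illustration; real values would have to come from the clinical decision makers. The helper names `expected_utility`, `best_action`, and `break_even_probability` are likewise made up for this example.

```python
# Hypothetical utilities for each (action, true state) pair.
# These numbers are illustrative assumptions, not clinical values.
UTILITIES = {
    "refer":        {"disease": 0.0,   "healthy": -1.0},
    "do_not_refer": {"disease": -10.0, "healthy": 0.0},
}


def expected_utility(p_disease, utilities):
    """Expected utility of each action, given the predicted probability of disease."""
    return {
        action: p_disease * u["disease"] + (1.0 - p_disease) * u["healthy"]
        for action, u in utilities.items()
    }


def best_action(p_disease, utilities):
    """Action with the highest expected utility for this patient."""
    eu = expected_utility(p_disease, utilities)
    return max(eu, key=eu.get)


def break_even_probability(utilities):
    """Probability at which the two actions have equal expected utility."""
    gain_if_disease = utilities["refer"]["disease"] - utilities["do_not_refer"]["disease"]
    loss_if_healthy = utilities["do_not_refer"]["healthy"] - utilities["refer"]["healthy"]
    return loss_if_healthy / (gain_if_disease + loss_if_healthy)


if __name__ == "__main__":
    for p in (0.05, 0.20):
        print(p, best_action(p, UTILITIES))
    print(break_even_probability(UTILITIES))
```

Under these assumed utilities the break-even probability is 1/11, so a probability cutoff falls out of the utilities themselves rather than out of an ROC curve on either data set. Change the utilities and the cutoff moves with them.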

Thanks for your response, Frank. Part of the issue is that not all the information needed to make the decision (i.e. clinical risk factors) is available at the time our model is used, so we have to make assumptions. But I agree that this can be done. However, should this utility function be constructed based on the validation set or the test set?

When no utility function is available, the clear choice is to estimate probabilities as well as possible and give those probabilities to the decision makers. If you apply thresholds before the decision point, you are making utility-devoid decisions on the decision makers' behalf. Give them the inputs to their decisions.