Suppose you predict the full conditional distribution of a non-zero continuous outcome as a complex non-linear function of predictors, and the predicted distribution does not have a tractable functional form. Suppose further that you want a human-interpretable evaluation of the final model's performance in predicting full distributions across a large number of predictor combinations. To measure moderate calibration, I propose plotting the actual distribution percentiles against the predicted percentiles and summarizing that relationship with either a natural cubic spline or a simple linear regression. Alternatively, one could compute the difference between high-resolution (say, percentile-grain) histograms of the actual and predicted distributions. Are these valid calibration methods? Are there other human-interpretable methods for evaluating the moderate calibration of the final model that you would suggest? Thank you.
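For concreteness, here is a minimal sketch in Python of the percentile-vs-percentile check described above, assuming you can draw Monte Carlo samples from each predicted distribution (the gamma data below are purely illustrative placeholders, not part of the original setup). It computes probability-integral-transform (PIT) values u_i = F_i(y_i), plots observed coverage against the nominal predicted percentiles, summarizes the curve with both a simple linear fit and a smoothing spline (used here as a stand-in for a natural cubic regression spline), and forms the percentile-grain histogram difference:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Illustrative stand-ins: y are "actual" outcomes, pred_samples are Monte
# Carlo draws from each observation's predicted distribution (both gamma
# here only so the sketch runs end to end; plug in your own data/model).
n = 5000
y = rng.gamma(shape=2.0, scale=1.0, size=n)
pred_samples = rng.gamma(shape=2.2, scale=0.95, size=(n, 1000))

# PIT values: u_i = F_i(y_i), estimated as the share of predictive draws
# at or below y_i. Under perfect calibration, u ~ Uniform(0, 1).
pit = (pred_samples <= y[:, None]).mean(axis=1)

# Observed coverage at each nominal predicted percentile (a P-P plot:
# "actual percentile" on the y-axis against "predicted percentile" on x).
p_grid = np.linspace(0.01, 0.99, 99)
coverage = np.array([(pit <= p).mean() for p in p_grid])

# Summaries: a simple linear fit (slope ~ 1, intercept ~ 0 if calibrated)
# and a smoothing spline as a flexible stand-in for a natural cubic spline.
slope, intercept = np.polyfit(p_grid, coverage, deg=1)
spline = UnivariateSpline(p_grid, coverage, k=3, s=1e-4)

# Percentile-grain histogram difference: observed PIT mass per 1% bin
# minus the 0.01 expected under uniformity.
bin_counts, _ = np.histogram(pit, bins=100, range=(0.0, 1.0))
hist_diff = bin_counts / n - 0.01

fig, ax = plt.subplots()
ax.plot(p_grid, coverage, label="observed coverage")
ax.plot(p_grid, spline(p_grid), "--", label="spline summary")
ax.plot([0, 1], [0, 1], ":", label="perfect calibration")
ax.set_xlabel("predicted percentile")
ax.set_ylabel("actual (observed) percentile")
ax.legend()
plt.show()

print(f"linear fit: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"max |histogram difference| per 1% bin: {np.abs(hist_diff).max():.4f}")
```

Any departure of the coverage curve from the diagonal reads directly as miscalibration at that percentile, which keeps the diagnostic human-interpretable even when the predicted distributions themselves are intractable.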
P.S. I should mention that the predicted distributions are estimated from a selected sample of data, and the distribution predictions might additionally be adjusted through post-stratification. The actual distributions would come from a population-level survey.
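If the comparison has to respect that sampling design, one hedged extension of the sketch above is to weight the observed coverage by the survey (post-stratification) weights, so the "actual percentiles" target the population rather than the selected sample. The weights below are illustrative placeholders, not a claim about your survey:

```python
import numpy as np

rng = np.random.default_rng(1)
pit = rng.uniform(size=5000)           # replace with real PIT values from above
weights = rng.uniform(0.5, 2.0, 5000)  # replace with real survey weights

p_grid = np.linspace(0.01, 0.99, 99)
# Weighted observed coverage: each unit contributes its survey weight,
# so the empirical percentiles estimate the population distribution.
w_total = weights.sum()
coverage_w = np.array([(weights * (pit <= p)).sum() / w_total for p in p_grid])
```

The same linear/spline summaries and the percentile-grain histogram difference apply unchanged to `coverage_w`.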