Descriptive Statistics vs Statistical Models

Sudhi_Upadhyaya · September 14, 2022, 5:18pm

Most of us know when to use simple Statistical Models and when Machine Learning may be suitable.

I am curious to know whether there are situations where descriptive statistics alone are good enough and the use of statistical models is unnecessary or unjustified.

cwatson · September 14, 2022, 9:00pm

I think it depends on what you mean by “good enough” and on your ultimate goal, though I wouldn’t say that statistical modeling could ever be rendered “unnecessary”.

The most obvious situation, to me, is when you are not interested in extrapolating from the data you have. For example, in my work (deep learning applications in radiology) we ran a small pilot study to give us an idea of how much our product improves users’ “productivity” (really, we measured how long it took to complete certain tasks). A colleague in marketing asked if they could use the summary statistics (e.g., mean, median, average between-condition difference), but I said they only apply to the participants and certainly we couldn’t claim that all potential users would see the same effect. We also used the data to understand better how the product performed with different types of input data; e.g., if we saw that we did poorly with CT data we’ll have to focus our efforts on that, as opposed to X-ray. Although the data were from a single source (hospital system), it wasn’t unreasonable for us to assume it might be a problem with other hospitals’ data.

Sudhi_Upadhyaya · September 15, 2022, 3:17am

cwatson, I agree with you. However, the article by Lesko, Fox, and Edwards, “A framework for descriptive epidemiology” (PubMed), suggests

I think the authors provide various examples of why adjusted results are not appropriate. I disagree with their examples because my understanding is that if the competing events are rare, then adjusting should not cause an issue. So I am very skeptical about their arguments and their conclusion that descriptive summaries are more relevant than adjusted results in certain situations.

kiwiskiNZ · September 15, 2022, 10:47pm

Good question!
When you have all the data and are merely reporting on what was, without trying to infer what will be, no p-values or CIs from statistical models are (logically) necessary.

If you don’t have all the data, then models are necessary to make inferences about subjects not in the data or about future subjects. However, if you think the “missing subjects” are not missing at random, then inference with models would not be justified. (I think)

Sudhi_Upadhyaya · September 16, 2022, 12:26am

Thanks, kiwiskiNZ. I agree that most methods, specifically those that deal with missing data, are based on the assumption of MAR (missing at random). If the missingness is MCAR (missing completely at random), then it makes sense to stick to descriptive summaries, but it is hard to prove MCAR.
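
To make the distinction concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are simulated, and the 30% dropout rate and the cutoff of 55 are arbitrary. It compares the mean of the observed values when dropout is unrelated to the value (MCAR) with the mean when dropout depends on the value itself (not missing at random).

```python
import random
from statistics import mean

# Simulated "complete" data; in a real study the missing values are never seen.
rng = random.Random(0)
full = [rng.gauss(50, 10) for _ in range(10_000)]

# MCAR: each subject is dropped with probability 0.3, regardless of their value.
observed_mcar = [x for x in full if rng.random() > 0.3]

# Not missing at random: subjects with values above 55 are never observed.
observed_mnar = [x for x in full if x <= 55]

full_mean = mean(full)
mcar_mean = mean(observed_mcar)
mnar_mean = mean(observed_mnar)

print(f"complete data mean: {full_mean:.2f}")
print(f"observed mean, MCAR dropout: {mcar_mean:.2f}")
print(f"observed mean, value-dependent dropout: {mnar_mean:.2f}")
```

Under the MCAR pattern the descriptive summary of the observed subjects stays close to the complete-data value, while under value-dependent dropout it is pulled downward, and nothing in the observed data alone reveals that.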

James_Stanley · September 16, 2022, 4:46am

Replying to your slightly earlier post: I think the recommendations have a bit more nuance than implied in those short quotations. For example, the in-depth discussion of adjustment comes in the middle of a section that cautions against broad covariate adjustment in a descriptive study (where in causal language that set of covariates might well include a mix of confounders, mediators, and moderators) which runs as follows:

Adjustment implies an intervention on the data and a distortion of reality, e.g., “Would Black people still have lower prevalence of viral suppression if they had the same distribution of HIV acquisition risk factors as white people?” Inappropriate adjustment may understate the magnitude of disparities (45) and adjusted statistics are prone to be interpreted causally, which could lead to inappropriate recommendations (9). We endorse reporting and primary interpretation of unadjusted results for descriptive studies and clear justification and proper interpretation in cases where adjustments are made.

I’m quite sympathetic to this longer statement (though I would suggest that also reporting some basic adjustment, e.g. for age, is almost always of use in descriptive epi studies).

In a way this distinguishes between “facts on the ground” (group A has higher risk of disease, without adjustment); slightly more complex scenarios where there are somewhat mechanical explanations for those initial differences (disparity is larger/smaller when we account for different age structures); and more complex modelling that risks overinterpretation, or bias e.g. through adjusting for mediators.
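
As a rough illustration of the "different age structures" case, the sketch below compares crude rates with directly age-standardized rates. The counts are invented for illustration and come from no study, and direct standardization to the pooled population is just one simple choice of adjustment.

```python
# Hypothetical counts, made up for illustration only.
# Each group maps an age band to (cases, population).
group_a = {"young": (10, 1000), "old": (90, 3000)}
group_b = {"young": (30, 3000), "old": (30, 1000)}


def crude_rate(group):
    """Total cases divided by total population, ignoring age."""
    cases = sum(c for c, _ in group.values())
    population = sum(n for _, n in group.values())
    return cases / population


def standardized_rate(group, standard_pop):
    """Direct standardization: age-specific rates weighted by a common age structure."""
    total = sum(standard_pop.values())
    return sum((c / n) * standard_pop[band] / total for band, (c, n) in group.items())


# Use the pooled population of both groups as the standard age structure.
standard = {band: group_a[band][1] + group_b[band][1] for band in group_a}

for name, group in (("A", group_a), ("B", group_b)):
    print(f"group {name}: crude {crude_rate(group):.3f}, "
          f"age-standardized {standardized_rate(group, standard):.3f}")
```

With these made-up numbers the crude rates differ (0.025 vs 0.015) while the age-standardized rates coincide (0.020 each), because the age-specific rates are identical and only the age mix differs. Both summaries are true statements about the data. They answer different questions.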

Overinterpretation is definitely an issue: it is quite common to see conclusions along the lines of “there was no difference in disease outcome when adjusting for X, Y, Z” when a better summary might be “differences in disease outcomes were substantively explained by differences in the distribution of X, Y, Z”.

f2harrell · September 16, 2022, 12:02pm

In general I can’t think of many studies where there’s not at least one variable (sometimes age) to adjust for. Descriptive statistics can’t handle missing independent variables and usually can’t handle censoring or lower limits of detection. Hence descriptive statistics are far less useful than they appear.

Sudhi_Upadhyaya · September 16, 2022, 3:30pm

Frank, you are right: we always start with LOD/LOQ and urinary-dilution-adjusted values, and never directly report anything from unadjusted raw values. Pardon my ignorance of survival analysis. In the case of censoring, we report incidence rates (events / person-time at risk), do we not?
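
For concreteness, a minimal sketch of that calculation, using invented follow-up records (the tuples and variable names are hypothetical):

```python
# Hypothetical follow-up records, invented for illustration:
# (years at risk, event observed?). False means the subject was censored.
follow_up = [(2.0, True), (5.0, False), (3.5, False), (1.5, True), (4.0, False)]


def incidence_rate(records):
    """Events divided by total person-time at risk."""
    events = sum(1 for _, had_event in records if had_event)
    person_time = sum(years for years, _ in records)
    return events / person_time


rate = incidence_rate(follow_up)

# A naive proportion ignores how long each subject was actually observed.
naive_proportion = sum(1 for _, had_event in follow_up if had_event) / len(follow_up)

print(f"incidence rate: {rate:.3f} events per person-year")
print(f"naive proportion with an event: {naive_proportion:.2f}")
```

Here 2 events over 16 person-years give 0.125 events per person-year, whereas the naive proportion is 0.40. Censored subjects still contribute their time at risk to the denominator, though a single rate is a fair summary only if the event rate is roughly constant over follow-up.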

f2harrell · September 16, 2022, 10:37pm

To be clear, I meant regression adjustments, not pre-specified mathematical adjustments.

Sudhi_Upadhyaya · September 17, 2022, 4:41am

Perfect, got it. Thank you.