Methods for analyzing (primary) percentage data

I wanted to ask about the best method for analyzing, as a response variable, percentages that do not represent an underlying count or binary variable. In this case, the variable (% wound healing) is expressed naturally as a percentage and is therefore bounded at 0 and 1. The independent variable is categorical with 4 levels.
I have read that beta regression with post hoc corrections is the best method. However, my colleagues have sent me an analysis using the Kruskal-Wallis test. I have read several articles explaining why ANOVA is not correct, but I am not sure where nonparametric tests stand. Is the Kruskal-Wallis analysis I received correct? On the other hand, if I apply beta regression, I assume post hoc analyses pose no particular problem, as with any other model; is there anything more advisable here?

1 Like

I presume that you do not have the underlying numerators/denominators for the derived percentages relative to your observational units? That is, the total number of wounds and the number that healed for each unit? If you do, you could use logistic regression directly.

If you have actual 0s and 1s in the dependent variable values, you cannot use beta regression, as it works on the open interval (0, 1). Some people will add or subtract very small values to any 0s and 1s, but there are caveats to doing so, as you might expect.
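If you do go the route of nudging boundary values inward, a widely used version is the transformation of Smithson & Verkuijlen (2006), which compresses [0, 1] into the open interval. A minimal sketch (the data here are invented for illustration):

```python
import numpy as np

def squeeze_to_open_interval(y, n=None):
    """Map [0, 1] data into (0, 1) via (y * (n - 1) + 0.5) / n,
    the Smithson & Verkuijlen (2006) transformation."""
    y = np.asarray(y, dtype=float)
    if n is None:
        n = len(y)  # conventionally the sample size
    return (y * (n - 1) + 0.5) / n

# Hypothetical % healing values, including exact 0 and 1
y = np.array([0.0, 0.25, 0.80, 1.0])
y_sq = squeeze_to_open_interval(y)
# every value now lies strictly inside (0, 1), so beta regression applies
```

As the caveat above suggests, results near the boundaries can be sensitive to this choice, so it is worth checking how much the squeezed fit depends on it.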

Another option would be to use a fractional logit model and use robust standard errors.

Alternatively, use a log transform on the percentages to see if you can then use a regular ANOVA.
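A sketch of that option with invented data (note the log transform requires strictly positive values and does nothing about the upper bound, so it is worth checking residuals afterwards):

```python
import numpy as np
from scipy import stats

# Hypothetical % healing in four groups; kept strictly above 0 so log() is defined
rng = np.random.default_rng(5)
groups = [rng.uniform(0.05, 1, 15) for _ in range(4)]
log_groups = [np.log(g) for g in groups]

# One-way ANOVA on the log scale
f, p = stats.f_oneway(*log_groups)
print(f"F = {f:.3f}, p = {p:.3f}")
```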

The Kruskal-Wallis test is a generalization of the Wilcoxon rank-sum test to more than 2 groups, and is effectively a nonparametric ANOVA that can be used when the data fail the assumptions underlying the normal ANOVA.
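For reference, the test itself is a one-liner in scipy (data invented here):

```python
import numpy as np
from scipy import stats

# Hypothetical % healing in four treatment groups of 15 units each
rng = np.random.default_rng(1)
groups = [rng.uniform(0, 1, 15) for _ in range(4)]

# Kruskal-Wallis H test across the 4 groups
stat, p = stats.kruskal(*groups)
print(f"H = {stat:.3f}, p = {p:.3f}")
```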

So there are a few options depending upon what makes sense for your context and hypotheses.

2 Likes

It just occurred to me that your percentage wound healing is the percent by which each individual wound healed, presumably based upon a reduction in the surface area of the wound, as opposed to the subset of a total number of wounds that healed.

Thus, disregard my initial comment about numerators and denominators above.

Time for more coffee apparently…

2 Likes

I have read several articles explaining why ANOVA is not correct, but I am not sure where nonparametric tests stand. Is the Kruskal-Wallis analysis I received correct?

Great question. If you use the Kruskal-Wallis test then you are essentially treating the outcome (dependent) variable as ordinal, which I think is not unreasonable and has the advantage of being transformation-invariant (unlike ANOVA). It may be even better to go straight for a proportional odds model, of which Kruskal-Wallis can be considered a special case, so that you can fit covariates and obtain various other useful estimates such as exceedance probabilities. @f2harrell is the expert on this and discusses it at length, e.g., in BBR sections 7.6 onwards.

4 Likes

Yes, an ordinal analysis will be robust. Using the proportional odds model instead of its special case, the Wilcoxon-Kruskal-Wallis test, will allow for more accurate handling of ties in Y.

3 Likes

Thank you for the answers. In this case the response variable, wound healing (%), comes from an in vitro assay that measures the degree of repopulation of a culture of tumor cells. We have used it to measure cell migration as a function of various treatments.
I think we will fit a PO model, then. One last detail: is there any specific recommendation for the post hoc tests?
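One common post hoc strategy, sketched here with hypothetical data, is to run all pairwise group comparisons and apply a multiplicity correction such as Holm's; the pairwise Wilcoxon (Mann-Whitney) test below is just one nonparametric choice, and contrasts from the fitted model would be an alternative:

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical % healing per treatment group
rng = np.random.default_rng(3)
groups = {g: rng.uniform(0, 1, 15) for g in ["A", "B", "C", "D"]}

# All 6 pairwise Mann-Whitney comparisons
pairs, pvals = [], []
for g1, g2 in combinations(groups, 2):
    _, p = stats.mannwhitneyu(groups[g1], groups[g2])
    pairs.append((g1, g2))
    pvals.append(p)

# Holm correction across the six comparisons
reject, p_adj, _, _ = multipletests(pvals, method="holm")
for (g1, g2), p in zip(pairs, p_adj):
    print(f"{g1} vs {g2}: adjusted p = {p:.3f}")
```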

It is an in vitro experiment with cell lines. A scratch is made in a cell culture plate, leaving a cell-free surface. After a while, the percentage of repopulation is measured visually.

1 Like

Because the endpoint is % wound healing, you might also check compositional data analysis; there is an example at https://www.iti.org/iti-academy-consensus/CC4_Group4_2.pdf

I like the PO model idea as well.

What are your thoughts on a GLM (possibly mixed if repeats) using a logit link and normal random component?

The proportional odds (PO) and other semiparametric ordinal regression models seem to have an advantage here because the distribution of the percentage at a given covariate setting can be anything with these models, which include clumping at zero and bimodality.

Maybe this new model is of interest to you, @albertoca @f2harrell. The related R package uses brms under the hood.

https://osf.io/preprints/socarxiv/2sx6y/

I propose a new model, ordered beta regression, for continuous distributions with both lower and upper bounds, such as data arising from survey slider scales, visual analog scales, and dose-response relationships. This model employs the cutpoint technique popularized by ordered logit to fit a single linear model to both continuous (0,1) and degenerate [0,1] responses. The model can be estimated with or without observations at the bounds, and as such is a general solution for this type of data. Employing a Monte Carlo simulation, I show that the model is noticeably more efficient than ordinary least squares regression, zero-and-one-inflated beta regression, re-scaled beta regression and fractional logit while fully capturing nuances in the outcome. I apply the model to a replication of the Aidt and Jensen (2012) study of suffrage extensions in Europe. The model can be fit with the R package ordbetareg to facilitate hierarchical, dynamic and multivariate modeling.

I propose a new model, ordered beta regression, for continuous distributions with both lower and upper bounds, such as data arising from survey slider scales, visual analog scales, and dose-response relationships.

Not seeing any advantage over the logistic model.

This model employs the cutpoint technique popularized by ordered logit to fit a single linear model to both continuous (0,1) and degenerate [0,1] responses.

The emphasis on cut points and linear models already makes me skeptical.

Note that the simulations failed to include a logistic comparison.

Yes, and an obvious choice would be a simpler ordinal logistic model with no binning.