Without having experience with that type of experiment, I’ll just say that the best approach in general is to have variables/parameters in a grand model that capture things like strain of microorganism.
In the video, we talked about how converting a continuous variable into categories is a bad thing to do. But sometimes you may want to do that if you are using a linear model and you believe that the relationship between the dependent variable and the covariate is non-linear. Another example: if you have state as a covariate and you are using a random forest, you cannot plug state directly into the model because it has too many categories. How do we deal with this sort of discrepancy?
You could relax the linearity assumption and go with something like a spline regression. Your selection of knots would help characterize the “categories” of interest. As for too many categories, I believe the general rule when it comes to knot selection is 3 or 4, but this depends on the data.
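To make the knot-selection idea concrete, here is a minimal Python sketch that places knots at fixed quantiles of the covariate rather than at hand-picked cut points. The function names (`quantile`, `quantile_knots`) are made up for illustration. The outer quantiles (0.10/0.90 for 3 knots, 0.05/0.95 for 4 to 6 knots, equally spaced in between) are an assumption based on my recollection of the defaults in R's Hmisc `rcspline.eval`, so check them against your own software before relying on them.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of already-sorted data."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def quantile_knots(values, n_knots):
    """Place n_knots knots at equally spaced quantiles of the data.

    Assumed convention (hypothetical helper, not a library API):
    outer quantiles 0.10/0.90 for 3 knots, 0.05/0.95 for 4 to 6 knots.
    """
    if not 3 <= n_knots <= 6:
        raise ValueError("this sketch only handles 3 to 6 knots")
    outer = 0.10 if n_knots == 3 else 0.05
    step = (1.0 - 2.0 * outer) / (n_knots - 1)
    probs = [outer + i * step for i in range(n_knots)]
    data = sorted(values)
    return [quantile(data, p) for p in probs]


# Example: a covariate observed at 0, 1, ..., 100
ages = list(range(101))
knots3 = quantile_knots(ages, 3)
knots4 = quantile_knots(ages, 4)
```

Placing knots at quantiles keeps roughly equal amounts of data between knots, so the flexibility goes where the observations are rather than where a cut point happens to look natural.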
During the first lecture, someone requested some texts on study design.
Here are a few that I have used and found useful:
- Design and Analysis of Experiments by Douglas Montgomery
- Design and Analysis of Experiments with R by John Lawson
- Applied Linear Statistical Models by Kutner, Nachtsheim, and Neter (although not a design book, Part 4 makes some excellent reading on design of experiments)
I would love to hear other people’s suggestions as well.
As daszlosek points out, there are good methods that allow you to model nonlinearity. Along with splines, there are also fractional polynomials. The rms package in R has a great restricted cubic spline function (rcs). Apart from the problems with categorization already mentioned, another problem is that if you create enough categories to capture nonlinearity, you’re using up lots of degrees of freedom. The restrictions in the rcs function (the fit is constrained to be linear beyond the outermost knots) allow you to model nonlinearity more efficiently, generally using fewer degrees of freedom.
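For anyone who wants to see what a restricted cubic spline basis looks like outside of R, here is a minimal pure-Python sketch of the truncated-power form commonly given for it. It is an illustration under stated assumptions, not the rms implementation: `rcs_basis` is a hypothetical helper, the knots are arbitrary, and dividing by the squared knot range is a scaling choice I am assuming for numerical convenience.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline columns for one value x.

    With k knots this returns k - 1 columns: x itself plus k - 2
    nonlinear terms that are zero below the first knot and linear
    above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    last_gap = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # assumed normalisation, keeps columns on the scale of x

    columns = [float(x)]
    for j in range(k - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / last_gap
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / last_gap
        )
        columns.append(term / scale)
    return columns


# Example with three illustrative knots: two columns, i.e. 2 degrees of freedom
example_knots = [10, 50, 90]
row = rcs_basis(30, example_knots)
```

The degrees-of-freedom point can be read straight off the column count: k knots cost k - 1 parameters, while cutting the same covariate into c categories costs c - 1 parameters and still gives only a step function.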
I’ve just updated a Shiny app to help visualize a range of non-linear relationships and the benefits of using restricted cubic splines in logistic regression. I hope you find it helpful.
R code available at https://github.com/drjgauthier/spliny/blob/master/app.R
See also this short editorial on the topic:
I have not heard the term ‘compositional’ before, and it suggests to me the question: is ‘change from baseline’ a ‘compositional’ variable? Perhaps this question will be germane to another thread as well. Thanks!!!