What would be the most truthful way to draw inferences in a multilevel cell biology dataset?


I’m a PhD candidate in cell biology who (in retrospect) has never received extensive statistical training in how best to analyse the most common experimental designs in my field. Most of what I am asking here is inspired by self-study of literature by Lazic, Harrell, Gelman, Lakens, and McElreath, but I feel the need to clarify some questions. I hope I am in an appropriate place here.


Very often, cell biology experiments are influenced by variations spanning multiple levels.
But we commonly analyse (I learned to analyse) variability in these experiments using t-tests or similar methods at only one of these levels.

As an example, a typical experiment for us uses neurons from mouse brains (level 1) that we grow in separate multi-well dishes (level 2) where we can apply separate treatments to each well. The readouts of these experiments often are quantified for individual cells (level 3). Usually the variability between cells (level 3) is much larger than the variability between animals (or culture dishes).

  • Common practice for a long time was to estimate treatment effects only at level 3 (the level with largest variability but also most observations) while ignoring variabilities introduced by the other levels.
  • As this analysis, however, violates the independence assumption (several cells from level 3 are in the same well of a multi-well plate when the treatment is applied), a “safe suggestion” was to average treatment effects over cells and treat the individual mouse brains (level 1) as the experimental unit (e.g. proposed by *What exactly is ‘N’ in cell culture and animal experiments?).

None of these approaches, however, specifically estimates the variance of the treatment. Analysing at level 3 (cells) merges between-cell variability with treatment variability; analysing at level 1 (animals) ignores cell-to-cell variability and is strongly influenced by animal-to-animal variability.

My questions

I assume treatment effects and variability to be the most relevant parameters for most research questions. But is this assumption fair?

  • Is it, on the contrary, more interesting (or robust) to always estimate the treatment variance in connection with the variability between the cells or the experimental system the treatment is applied to, in order to strengthen conclusions? I am also thinking about replicability in further experiments here, which might be influenced by between-experiment variability as well as by treatment variability.
  • If treatment variability is indeed most relevant for robust quantification of scientific experiments, shouldn’t we focus on methodologies that model the other relevant sources of variability to improve the precision of our treatment estimate? How can we do that without pseudoreplicating (see the following block)?

Furthermore, regarding methodologies to accurately estimate the various variances:

  • If in such an experimental design a ‘readout per cell’ would be modeled with a mixed GLM with fixed factor ‘treatment’ and random factors ‘culture dish’ and ‘mouse brain’ - would this accurately isolate a treatment effect or would it still pseudoreplicate the experimental units (culture dish)?
  • Would it alternatively be possible to include cell-to-cell variability as a random factor (each cell usually only gives one readout) or would that not be required?
  • Or would it be best after all to summarize the readouts to the experimental unit and address animal-to-animal variability by normalization (e.g. by subtracting the mean(readout) per animal)?

I hope you can clarify some of my questions or point me to explanations of these topics. Cell-biology-related references would be especially useful, as I am constantly searching for material to forward to fellow researchers with similar questions.

Best, Joachim

PS: If you actually read this far, thank you for your interest :slight_smile: This somehow turned out much longer than I intended… partial answers would also help me a lot


You have laid this out beautifully. I hope that some basic-science experimental design experts will join the discussion. I think you have asked all the right questions. Ignoring cell-level responses by averaging over cells creates a missed opportunity to better understand variation, and results in some loss of efficiency. A hierarchical random-effects structure seems to fit the situation, but I’m unclear on the realized effective sample size produced by such a model.


Joachim this is a good question.

I encountered the same problems while working with very sensitive mass spectrometry instruments for proteomics and metabolomics, where even variation in the batch of solvent is seen by the instrument.

You can try to control for as many variables (experimentally) as you want, but the best results rely on the classic advice given for robust statistics: randomize, randomize, and randomize.

In your example I would not complicate the analysis with levels, covariates, and mind-clogging procedures. I would try to mix and randomize the most I could.

If you have mice from which the cells are taken, I would mix the cells from different mice and then seed them. For treatment, I would also randomize the wells within plates (not one plate for controls and one for treatment).
Don’t forget reagents, either. Use the same ones for everything and make good controls: e.g. if you give drugs, even 200 ul in 500 ml of saline matters in the sham medium, because it is not only the saline but also the plastic of the pipette that you have to control for.

In conclusion, I think you can forget about levels and keep the statistics simple if you treat and think about your experiments like well-designed RCTs.


Not an “expert” per se but I’ll lend my two cents.

I think @lucavd makes some good points and if you can “mix and randomize” then I think most of what was mentioned was worth following (though random effects for brain/mouse would still be worth looking at IMHO).

First, if you haven’t already, I would check out the articles I have linked below. They are all very good introductions to multilevel models. They aren’t a panacea, but these types of models can be useful.

  1. A solution to dependency: using multilevel analysis to accommodate nested data | Nature Neuroscience
  2. Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives | BMC Neuroscience | Full Text
  3. Understanding Mixed Effects through Simulation
  4. Intro to Linear Mixed Models in R

From the brief overview of your experimental design, it looks like you have cells nested within “wells” nested within “brains”, with each observation coming at the cell level. Now, cells in this case wouldn’t be included as a random effect because you only have 1 observation per cell. Also, if I am understanding you correctly, each well is given a different treatment while each brain (overall) still receives every treatment. I would think then that you would want an explicit nesting structure, with random slopes for treatment per brain and random intercepts per brain and well.

A possible model could be fit using the lmer function from lme4:

library(lme4)

# Vary slope by brain only
mod_1 = lmer(y ~ treat + (1 + treat | brain) + (1 | brain:well), data = df_brains)
# Or, vary slope by brain and well
mod_2 = lmer(y ~ treat + (1 + treat | brain/well), data = df_brains)

For the variance components the mixedup R package may also be helpful.
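To see what those variance components look like without any extra packages, lme4’s own VarCorr() works directly. A minimal sketch on simulated data (the layout, effect sizes, and column names here are all hypothetical, chosen only to match the models above):

```r
library(lme4)

set.seed(1)
# hypothetical balanced layout: 6 brains x 8 wells x 4 cells; half the wells per brain treated
df_brains <- expand.grid(cell = 1:4, well = factor(1:8), brain = factor(1:6))
df_brains$treat <- df_brains$well %in% as.character(1:4)
well_id <- interaction(df_brains$brain, df_brains$well)    # 48 distinct wells
df_brains$y <- rnorm(nrow(df_brains)) +
  rnorm(6, sd = 0.5)[as.integer(df_brains$brain)] +        # brain-level noise
  rnorm(48, sd = 0.3)[as.integer(well_id)]                 # well-level noise

mod_1 <- lmer(y ~ treat + (1 + treat | brain) + (1 | brain:well),
              data = df_brains)  # may warn of a singular fit: no true slope variance here
VarCorr(mod_1)  # variance components for brain (intercept and treat slope) and well
```

With real data the singular-fit warning is itself informative: it suggests the random-slope term is not supported by the data.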


I carried out a dummy analysis of variance using Genstat for what I am guessing might be a possible design. The analysis uses the following code
BLOCKSTRUCTURE Brain/Well/Cell
TREATMENTSTRUCTURE Treatment
ANOVA
Nothing else is required.
The block structure declares that cells are nested within wells, which are nested within brains. The design assumes that treatments are randomised to wells. (So treatment is varied at the intermediate level of wells.)
As might be expected, it shows that the degrees of freedom for the analysis are generated at the Brain/Well stratum, and that what happens at the higher Brain and lower Brain/Well/Cell strata is not relevant for that purpose. The analysis can be found here: http://www.senns.uk/Brain_well_cell.rtf


These are data from an example experiment, together with the analysis that Genstat suggests is appropriate, to illustrate Nelder’s approach to a multilevel experiment.


Observation	 Brain	 Well	 Cell	 Treatment
1	 1	 1	 1	 A
2	 1	 1	 2	 A
3	 1	 1	 3	 A
4	 1	 1	 4	 A
5	 1	 2	 1	 D
6	 1	 2	 2	 D
7	 1	 2	 3	 D
8	 1	 2	 4	 D
9	 1	 3	 1	 A
10	 1	 3	 2	 A
11	 1	 3	 3	 A
12	 1	 3	 4	 A
13	 1	 4	 1	 C
14	 1	 4	 2	 C
15	 1	 4	 3	 C
16	 1	 4	 4	 C
17	 1	 5	 1	 C
18	 1	 5	 2	 C
19	 1	 5	 3	 C
20	 1	 5	 4	 C
21	 1	 6	 1	 D
22	 1	 6	 2	 D
23	 1	 6	 3	 D
24	 1	 6	 4	 D
25	 1	 7	 1	 B
26	 1	 7	 2	 B
27	 1	 7	 3	 B
28	 1	 7	 4	 B
29	 1	 8	 1	 B
30	 1	 8	 2	 B
31	 1	 8	 3	 B
32	 1	 8	 4	 B
33	 2	 1	 1	 A
34	 2	 1	 2	 A
35	 2	 1	 3	 A
36	 2	 1	 4	 A
37	 2	 2	 1	 C
38	 2	 2	 2	 C
39	 2	 2	 3	 C
40	 2	 2	 4	 C
41	 2	 3	 1	 D
42	 2	 3	 2	 D
43	 2	 3	 3	 D
44	 2	 3	 4	 D
45	 2	 4	 1	 C
46	 2	 4	 2	 C
47	 2	 4	 3	 C
48	 2	 4	 4	 C
49	 2	 5	 1	 A
50	 2	 5	 2	 A
51	 2	 5	 3	 A
52	 2	 5	 4	 A
53	 2	 6	 1	 B
54	 2	 6	 2	 B
55	 2	 6	 3	 B
56	 2	 6	 4	 B
57	 2	 7	 1	 B
58	 2	 7	 2	 B
59	 2	 7	 3	 B
60	 2	 7	 4	 B
61	 2	 8	 1	 D
62	 2	 8	 2	 D
63	 2	 8	 3	 D
64	 2	 8	 4	 D

Treatments by well and mouse brain

Treatment		A	B	C	D
Brain	Well
1	1	4	0	0	0
	2	0	0	0	4
	3	4	0	0	0
	4	0	0	4	0
	5	0	0	4	0
	6	0	0	0	4
	7	0	4	0	0
	8	0	4	0	0
2	1	4	0	0	0
	2	0	0	4	0
	3	0	0	0	4
	4	0	0	4	0
	5	4	0	0	0
	6	0	4	0	0
	7	0	4	0	0
	8	0	0	0	4


Analysis of variance

Source of variation d.f.

Brain stratum 1

Brain.Well stratum
Treatment 3
Residual 11

Brain.Well.Cell stratum 48

Total 63


Hmm, I am not sure that would be the exact design. I will post later but I assumed that within each brain there would be multiple wells for each treatment (again this is just guessing though).

However @JoFuchs, if you want to do what @Stephen has suggested above (and don’t have access to Genstat), you can do this in R with the following code. Note: I used the data structure provided above, saved it as a csv, and then imported it into R.

library(readr)    # read_csv
library(dplyr)    # %>% and mutate
library(emmeans)  # estimated marginal means

# Set correct file path!
brain <- read_csv("Documents/brain.csv") %>%
  janitor::clean_names() %>%
  mutate(across(c(brain, well, treatment), factor))  # aov needs factors

# Reproducibility
set.seed(123)

# generate data
brain$y = rnorm(nrow(brain))

# fit using base R
fit = aov(y ~ treatment + Error(brain / well), # nesting structure
          data = brain)
# ANOVA (dfs match Senn)
summary(fit)

# estimated marginal means
## Note: warning on output
emmeans(fit, pairwise ~ treatment)
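Alternatively, the same layout can be generated directly in base R without the csv round-trip. A sketch (the treatment-to-well assignment is randomised here rather than copied from the table above):

```r
set.seed(42)
# 2 brains x 8 wells x 4 cells = 64 observations
brain <- expand.grid(cell = 1:4, well = 1:8, brain = 1:2)

# randomise the four treatments to the eight wells within each brain (two wells each)
well_trt <- c(sample(rep(LETTERS[1:4], each = 2)),
              sample(rep(LETTERS[1:4], each = 2)))   # one entry per brain:well
brain$treatment <- well_trt[(brain$brain - 1) * 8 + brain$well]

brain[] <- lapply(brain, factor)  # aov wants factors, not numeric codes
table(brain$treatment)            # 16 cells per treatment
```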

The design I generated has two wells for each of four treatments within each brain. There are four cells per well. So the numbers are (2 brains) x (2 wells per treatment within brains) x (4 treatments) x (4 cells per well) = 64 observations.
Of course, in principle you can do in R, SAS, Stata, etc. what you can do in Genstat. However, the code in Genstat is designed to guide you in what to do, and IMO R is inferior in this respect.
The Genstat code is simply
ANOVA
for a dummy analysis and
ANOVA Outcome
(if Outcome holds the results) for the real thing. It’s both simple and principled.


Hi Joachim,

Good question! A few comments:

The model that Prof Senn described is what I would use. The alternative is a mixed effect model as suggested by @arcaldwell49, but if there are few mice (2 in this case) it’s hard to get a good estimate of the between-mouse variance.

“Would it alternatively be possible to include cell-to-cell variability as a random factor (each cell usually only gives one readout) or would that not be required?”

There is no need to include “cell” as a variable in the model, as the cell-to-cell variability is the same as the residual variability. In other words, if you include “cell” as a variable, then you’ve accounted for all the variation. However, if repeated measurements are taken on each cell, then it makes sense to include “cell” as a variable in the model, and cell-to-cell variation can be estimated.

I wouldn’t recommend mixing (or pooling) cells from different animals because between-mouse variation is now mixed up with between-cell variation and can’t be estimated separately (although it can make the analysis easier). If you’re interested in assessing the sources of variance, then it’s best to keep them all separate.

Other interesting analyses include: (1) assessing if the within-well variability (variation between cells within a well) differs by treatment. The easiest way to assess this is by calculating a well-level summary statistic such as the SD, MAD, or IQR. (2) Assessing if the treatment effect differs between animals, especially if the animals are run on different days as this assesses the stability of the results across different experimental runs.
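For point (1), a minimal base-R sketch of what that could look like (simulated data with hypothetical column names; with your data you would replace y with the cellular readout): collapse each well to a spread statistic, then compare those at the well level.

```r
set.seed(7)
# simulated stand-in for a cell-level data frame: 2 brains x 8 wells x 4 cells
d <- expand.grid(cell = 1:4, well = 1:8, brain = 1:2)
d$treatment <- rep(rep(c("A", "B", "C", "D"), each = 4), times = 4)  # one treatment per well
d$y <- rnorm(64, sd = ifelse(d$treatment == "A", 1.5, 1))  # toy: treatment A more variable

# one spread statistic per well: SD here (MAD or IQR would work the same way)
well_sd <- aggregate(y ~ brain + well + treatment, data = d, FUN = sd)

# crude well-level comparison of within-well spread across treatments
summary(aov(y ~ treatment + Error(factor(brain)), data = well_sd))
```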


Regardless of how the data are analyzed, I think all the commenters here agree that having a balanced experimental design that accounts for each stratum is highly recommended (if not essential).

So @JoFuchs, I am curious about a few things regarding your study (I also assume data collection hasn’t occurred yet). Your first post indicated you don’t have much statistical training so I don’t want to leave you stranded not knowing how to execute what has been suggested on this forum.

  1. How many brains do you plan on sampling from? How many wells can be dedicated to each brain? How many treatments are there? How many observations can be made from each well (is it fixed? Or what is the min/max)?
  2. What is the nature of what you are trying to measure?
  3. What statistical software do you have access to and feel comfortable using?

Thank you very much everyone! This has turned out much better than I ever hoped for!

Maybe as a clarification upfront: these questions were more of a philosophical/educational nature on my side, arising after I spent some time reading more about statistical methods. While I am planning and executing such experiments at the moment, I am not looking for a specific answer to one experimental setting.
It is just that the outlined structure of experiments (with several levels and treatments applied somewhere in the middle) has been recurring in, if not dominating, my research so far, and I think I am not alone in this in cell biology. I have heard and read so many different opinions on how best to deal with this situation that I felt a need to clarify some things. I’ll try to provide a brief summary of typical experimental settings similar to what I described, below, to make this less abstract. (*)

What I am taking from this discussion (please correct me if I am wrong):

Experimental design

  • What I have described seems best characterised as a nested design (even though treatments are not applied at the lowest level). Having such a design balanced is critical.
  • Controlling for variability experimentally (e.g., by randomizing) can enable the use of simpler statistical tools, but does not allow one to distinguish the potentially relevant sources of variability at play.

Preferred analysis strategies

  • specifying the proper nesting structure avoids pseudoreplication and allows all data to be used efficiently, without pre-averaging to the treatment level
  • cells (levels below the level at which the treatment is applied) would only be included if multiple measurements are taken per cell
  • Either a nested ANOVA with treatment as a fixed factor and brain & well as error terms in their proper nesting structure…
  • … or a mixed-effects model with brain & well as nested random factors is suitable for efficiently analysing such a setting.

What I am implicitly also taking from this discussion is that isolating treatment variability indeed seems to be a good idea when thinking about biological effects. I assume it is nevertheless critical to report and visualize all the variability at play when presenting such data. I am very fond of the idea of SuperPlots in this respect, although using them for more than two levels might require some creativity. Do you have other/better recommendations for reporting results from such a setting?

Some last-minute questions:

  • Is averaging data of cellular readouts per well and just accepting between-mouse variability a safely recommendable way to go for less advanced data analysts?
  • I am also asking because, practically, one could easily scale up at the treatment level and treat dozens of wells (perhaps analysing fewer cells per well) while still only looking at data from 2-3 mice. While this would probably capture treatment variability in these mice very well, it still does not seem like a very robust description of the biology to me. Is it “just” a question of generalizability to include enough entities at the levels above the treatment level (e.g. mice)? Putting it more practically: am I pseudoreplicating if I describe a treatment effect by analysing 20 wells per condition from 2 mice, or would I “just” be overgeneralizing when making strong claims about this treatment effect? I fear that analysing at the treatment level (i.e. using the suggested more efficient/powerful statistical tools) might provide some false security even if applied correctly.
  • Re-reading @stanlazic’s second suggestion for further interesting analyses now, it seems like the answer goes in the direction of “analyse and report both”?

Thank you again for all your input and the great references and practical code! :slight_smile: I have been using R for my analyses so far, but Genstat’s style of specifying the experimental design and analysis does indeed look very intuitive. I will definitely investigate further and recommend it to others. Also thank you @arcaldwell49 for translating the Genstat code to R; I certainly would have asked for that! As info for my colleagues: is GraphPad Prism capable of executing these analyses?

(*) More specific info to experimental settings I encountered with this nesting structure

  • Effects on the morphology of cells: cells from (typically fewer than 10) mice are grown in wells where treatments are applied. The readout (by microscopy) is at most one value per cell.
  • Changes in the morphology of cells: same nesting structure, except that cells now indeed have multiple measurements spaced over time, adding another level of complexity.
  • Effects on the amount of a protein: cells (not necessarily from mice, but there are other ‘higher-level’ effects such as ‘passage’/age) are treated in wells; proteins are visualized either per cell (microscopy-based) or quantified in so-called western blots, which quantify at the well level (but could in principle be repeated to reduce technical artifacts).

I think it depends on whether you are interested in causal or predictive analysis. If you want to show the effect that treatments can have, tight local control is very helpful, even at the expense of possible generalisability. If you are interested in prediction, more variable material may be useful. This was all addressed a long time ago (more than 80 years) by Yates and Cochran, who refer to scientific research (causal) as being easier than technical research (predictive of what will happen in practice). See Yates F, Cochran WG. The analysis of groups of experiments. Journal of Agricultural Science. 1938;28(4):556-580.


Fantastic thread. I used to work with animal research, and always found this subject troublesome.

I can imagine “raincloud plots” being useful in your case: a single-colored density plot, but depicting the different clusters with gghalves::geom_half_point (inspired by SuperPlots). There is a great tutorial for R here.


SuperPlots don’t scale to very large datasets. Spike histograms, with additional bold tick marks for selected quantiles, work for all N.


True, though I find a very large sample size highly unlikely in his research field.

Again, I agree with everything above. GraphPad is capable of some of these analyses but, as with most GUI based software, there are limitations.

Now let’s go point by point through some questions.

  1. Is averaging data of cellular readouts per well and just accepting between-mouse variability a safely recommendable way to go for less advanced data analysts?

No, I would go back and check the first two articles I posted. The only scenario in which I would do this is if the within-well ICC is very high (in which case you would likely get similar results anyway). GraphPad links out to this site, which provides links to further evidence on why “pooling” is not a good approach. My big concern is that you would be leaving valuable information out of your analysis.
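If you want to check that ICC before deciding, a rough base-R sketch using the one-way ANOVA estimator (simulated data here; in practice, use your own cell-level readout per well):

```r
set.seed(3)
# 16 wells with 4 cells each; inject some genuine between-well variation
d <- data.frame(well = factor(rep(1:16, each = 4)))
d$y <- rnorm(64) + rnorm(16, sd = 0.8)[as.integer(d$well)]

ms <- anova(lm(y ~ well, data = d))[["Mean Sq"]]    # between-well, within-well mean squares
k <- 4                                              # cells per well
icc <- (ms[1] - ms[2]) / (ms[1] + (k - 1) * ms[2])  # one-way ANOVA estimator of the ICC
icc
```

This assumes equal numbers of cells per well; for unbalanced data, a mixed-model estimate of the variance components is the safer route.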

  2. On overgeneralizing/pseudoreplicating:

I defer to Dr. Senn’s judgement here and would also recommend reading Yates & Cochran (I can send a copy if needed). However, if you are interested in the treatment-by-brain response (perhaps similar to what Dr. Senn describes for replicate crossover trials), you may want to include more brains at that level to measure that component of variance. For that design I would suggest having many different brains to sample from, where each brain is exposed to every treatment in multiple wells.

I have a gist with a simulated data set and some analysis options in R if you want to check them out: Simulate multilevel brain data · GitHub


Amazing! Thank you so much again. I am learning lots and I’m in the middle of rewiring my stats-brain.

I really need to wrap my head around this. Reading more now about the distinction between causation and prediction opens a new world (or can of worms). Somehow, to me, causation was always tightly connected with replicability (which I probably even merged with some aspects of generalizability), but maybe it is not that simple after all… very interesting concepts!

Thank you for all the great reading material. (I did find a copy of the Yates & Cochran after some digging, though it proved more difficult than expected to get my hands on it.) I will probably spend some time reading this and the other references now. In my search I also found “Using Biological Insight and Pragmatism When Thinking about Pseudoreplication” (Colegrave & Ruxton 2017), which, although only loosely connected, I found an interesting read as a non-statistician.

Thank you especially for this. I have been trying to simulate some data myself lately, this is perfect timing! And with the analysis options it is gold!! I am very much a hands-on learner…
This forum is such a helpful place! Thank you!

I’m late to the discussion, having just learned about it from the summary notice in my e-mail. I’m hoping to expand the discussion a bit to be more general. Senn suggested an example with both treatment replication within block (two wells per brain:treatment combination) and with subsampling (4 cells per well). I want to ask, what would be the models if we didn’t have replication at one level, the other, or neither? And this raises a question that I have about Senn’s model.

  1. RCBD with no subsampling – no treatment replication within brain and no subsampling within well. This is the classic RCBD: an additive two-way ANOVA with brain as a random factor. An lmm for this would be

lmm1: y ~ treatment + (1 | brain)

  2. GRCBD with no subsampling – replication within brain but no subsampling. This would be a two-way ANOVA with brain and the brain:treatment interaction as random factors. An lmm version of this is

lmm2: y ~ treatment + (1 | brain) + (1 | brain:treatment)

  3. RCBD with subsampling – subsampling within well. Hmm. QUESTION 1: Would this also be a two-way ANOVA with brain and the brain-by-treatment interaction as random factors? In #2, the correlated error (in addition to that in #1) comes from the same brain x treatment combination being used in multiple wells, so there is variance due to well. Here the additional correlated error comes from multiple measures of a single brain x treatment combination (so any variance due to well isn’t there). So is this no different from #2 in how we statistically model the design?

lmm2: y ~ treatment + (1 | brain) + (1 | brain:treatment)

  4. GRCBD with subsampling (Senn’s example) – treatment replication within brain and subsampling within well. Senn suggested cell nested within well nested within brain. The lmm equivalent is

lmm3: y ~ treatment + (1 | brain/well)

or using the : notation (which I prefer)

lmm3: y ~ treatment + (1 | brain) + (1 | brain:well)

Note that aov(y ~ treatment + Error(brain / well)), which looks like lmm3, doesn’t work given how the variable well is coded in the Senn data (well takes the values 1-8 within each brain). To get this to work, we’d need to make a new “well by brain” column (e.g. interaction(brain, well)) and use that.

My suggested model for #4 is

lmm4: y ~ treatment + (1 | brain) + (1 | brain:treatment) + (1 | brain:treatment:well)

Note that (1 | brain:treatment:well) is the same as (1 | brain:well), and so lmm4 is the same as Senn’s model (lmm3) except that lmm4 adds an intercept for each brain-by-treatment combination. This addition in lmm4 models the interaction between brain and treatment, which is what is done in lmm2. QUESTION 2: If we model the interaction in a two-way ANOVA with replication of the random factor, why wouldn’t we do so when we have this plus subsampling within the replicates?
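For anyone who wants to poke at QUESTION 2 numerically, both models can be fit and compared with lme4. A sketch on simulated balanced data (layout and effect sizes are hypothetical; with few brains and no true interaction, the extra variance component may well be estimated as zero):

```r
library(lme4)

set.seed(11)
# balanced GRCBD with subsampling: 4 brains, 2 wells per brain:treatment, 4 cells per well
d <- expand.grid(cell = 1:4, rep = 1:2, treatment = LETTERS[1:4], brain = factor(1:4))
d$well <- interaction(d$brain, d$treatment, d$rep)  # uniquely labelled wells (32 of them),
                                                    # so (1 | well) matches (1 | brain:well)
d$y <- rnorm(nrow(d)) +
  rnorm(4, sd = 0.5)[as.integer(d$brain)] +         # brain effect
  rnorm(32, sd = 0.3)[as.integer(d$well)]           # well effect

# Senn's model (lmm3) vs the version adding a brain:treatment intercept (lmm4)
lmm3 <- lmer(y ~ treatment + (1 | brain) + (1 | well), data = d)
lmm4 <- lmer(y ~ treatment + (1 | brain) + (1 | brain:treatment) + (1 | well),
             data = d)  # may report a singular fit: no true brain:treatment variance here
anova(lmm3, lmm4, refit = FALSE)  # REML comparison of the two random-effects structures
```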

Update – Joachim Fuchs also asked “Is averaging data of cellular readouts per well and just accepting between-mouse variability a safely recommendable way to go for less advanced data analysts?”

If the design is perfectly balanced (including same number of subsamples in every well), then averaging the subsampled measures within a well and fitting a reduced model gives the exact same inference as fitting the nested model on the full data. For Senn’s example, these are the same

aov2 = aov(y ~ treatment + Error(brain / well_id),
          data = senn)

aov2b = aov(y ~ treatment + Error(brain),
          data = senn_agg)

where “well_id” is a column for the well-by-brain combination and senn_agg is the aggregated data (the four measures in each well_id are averaged).
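The claimed equivalence is easy to verify in base R. A self-contained sketch (the layout and treatment assignment follow Senn’s table above; the outcome is simulated), which prints TRUE:

```r
set.seed(1)
# Senn's balanced layout: 2 brains x 8 wells x 4 cells
senn <- expand.grid(cell = 1:4, well = 1:8, brain = 1:2)
senn$treatment <- rep(c("A","D","A","C","C","D","B","B",   # brain 1, wells 1-8
                        "A","C","D","C","A","B","B","D"),  # brain 2, wells 1-8
                      each = 4)
senn$brain   <- factor(senn$brain)
senn$well_id <- interaction(senn$brain, senn$well)  # well uniquely labelled across brains
senn$y <- rnorm(nrow(senn))

# full nested model on the cell-level data
aov2 <- aov(y ~ treatment + Error(brain / well_id), data = senn)

# aggregate the four cells per well, then fit the reduced model
senn_agg <- aggregate(y ~ treatment + brain + well_id, data = senn, FUN = mean)
aov2b <- aov(y ~ treatment + Error(brain), data = senn_agg)

# in this balanced case the treatment F statistics are identical
f_full <- summary(aov2)[["Error: brain:well_id"]][[1]][1, "F value"]
f_agg  <- summary(aov2b)[["Error: Within"]][[1]][1, "F value"]
all.equal(f_full, f_agg)
```

The treatment and residual degrees of freedom (3 and 11) also match Senn’s table.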


It’s no use asking me, since I think that the approach in R is fundamentally misguided and I have no interest in it. I would only use it if forced, and as long as I have a Genstat license, I am not forced to do so. The Genstat approach is simple and general. The code looks like this:
ANOVA Outcome
What happens next depends on how the declared TREATMENTSTRUCTURE maps onto the BLOCKSTRUCTURE.
Does the treatment vary between brains, between wells or between cells? The algorithm “sees” this from the design matrix. There is no need to build a model.
This powerful approach, based on Nelder’s work in 1965, deserves to be better known. I tried to explain what it does in this blog: S. Senn: To infinity and beyond: how big are your data, really? (guest post) | Error Statistics Philosophy


My two questions are very general and framed in the language of ANOVA. The R code is there to show an implementation of the ANOVA in R; this may help some readers here. In the example you gave, it looks like an F test for Treatment would have 3 and 11 df, with the residual MS in the denominator. My question 2 is about your model, not about R: why is there no stratum for Treatment.Brain? I would think the F test for treatment should have the MS of Treatment:Brain in the denominator, with 3 df. But maybe I am misunderstanding the Genstat output.