I am working with a dataset where the goal is to evaluate the effect of Mother’s reported income (A, exposure) on Child’s Test Score (y, outcome).
The scientists want the data to be one row per subject. The subject here is a Mother-Child pair, or dyad.
There is more than one observation per mother. In some cases the mother reported her income more than once, in different years, while her child was tested only once. In other cases the mother reported her income once but the child was tested twice, on two different dates. So there are many observations per Mother-Child dyad.
The goal is to reduce the dataset, which has multiple observations per Mother-Child dyad, to a single observation per dyad.
What is the right thing to do? Should I take the average of income if income is reported more than once, or the average of test scores if the child is tested more than once? Or should I take the income and test observations that are closest to each other? I am leaning towards retaining the mother-child rows that are closest to each other.
Re ‘closest to each other’, I guess you mean in time. But the mother’s income should come before the test score if it is to influence it. How the multiple obs are handled might depend on how many mothers have income recorded more than once. If it’s only a few you could do it on a case-by-case basis? E.g., if there is a large salary increase but it occurs shortly before the test, it is doubtful the new salary is as relevant as the older salary. If the salary increase is very large you may have to wonder whether it’s a data error, etc.
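A minimal sketch of the "closest in time, but income first" rule, in plain Python so it runs without dependencies. The record layout, dyad ids, and values are all made up for illustration. For each test it keeps the most recent income report dated on or before the test date, and leaves income missing when no report precedes the test.

```python
from datetime import date

# Hypothetical long-format records: (dyad_id, date, value). Values are invented.
incomes = [
    ("d1", date(2015, 6, 1), 30000),
    ("d1", date(2017, 6, 1), 42000),
    ("d1", date(2019, 6, 1), 45000),
    ("d2", date(2018, 3, 1), 28000),
]
tests = [
    ("d1", date(2018, 1, 15), 88),
    ("d2", date(2017, 9, 1), 75),  # tested before any income report
]

def latest_prior_income(dyad, test_date, incomes):
    """Return (report_date, income) for the most recent report on or
    before test_date for this dyad, or None if there is none."""
    prior = [(d, inc) for (i, d, inc) in incomes if i == dyad and d <= test_date]
    return max(prior, key=lambda p: p[0]) if prior else None

# One row per test: (dyad_id, test_date, score, income_before_test)
paired = []
for dyad, test_date, score in tests:
    match = latest_prior_income(dyad, test_date, incomes)
    paired.append((dyad, test_date, score, match[1] if match else None))
```

With pandas, `merge_asof` with `direction="backward"` would be a natural way to do the same thing on a real dataset. Note that this still leaves two rows for a child tested twice, so it is only one piece of the reduction.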
If I were to guess, they are asking you to convert from “long” to “wide” format. To do this, you should create one variable/column for income in year t, another column for income in year t+1, etc. This results in a single observation of multiple variables, and retains all the information from the original dataset.
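As a rough sketch of the long-to-wide conversion described above, assuming the long data are (dyad, year, income) records with made-up values, and using hypothetical column names of the form `income_<year>`:

```python
from collections import defaultdict

# Hypothetical long-format income records: (dyad_id, year, income).
long_rows = [
    ("d1", 2015, 30000),
    ("d1", 2017, 42000),
    ("d2", 2017, 28000),
]

# Every year that appears anywhere becomes one column.
years = sorted({year for _, year, _ in long_rows})
columns = [f"income_{y}" for y in years]

# Collect each dyad's incomes keyed by column name.
by_dyad = defaultdict(dict)
for dyad, year, income in long_rows:
    by_dyad[dyad][f"income_{year}"] = income

# One row per dyad; years without a report are left as None (missing).
wide = [
    {"dyad": dyad, **{c: vals.get(c) for c in columns}}
    for dyad, vals in sorted(by_dyad.items())
]
```

Here `wide` has one dict per dyad, and a dyad with no report in a given year gets `None` in that column.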
I guess it is possible that they also want you to create a single variable representing some summary of all the income variables, but unless they told you specifically how to do this, I think it is unlikely that this is what they intended. If they are interested in using a variable in their model that represents “average income” or “most recent measure of income at the time test score was measured”, they can very easily construct those variables themselves after you convert it to wide format.
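To illustrate how a summary such as "average income" could be derived once the data are wide, here is a small sketch. The `income_<year>` column names and the values are hypothetical, and missing years are assumed to be stored as `None`.

```python
from statistics import mean

# Hypothetical wide-format rows, one per dyad; None marks a missing year.
wide = [
    {"dyad": "d1", "income_2015": 30000, "income_2017": 42000},
    {"dyad": "d2", "income_2015": None, "income_2017": 28000},
]

# Average the reported incomes for each dyad, skipping missing years.
summaries = []
for row in wide:
    reported = [
        v for k, v in row.items()
        if k.startswith("income_") and v is not None
    ]
    summaries.append({
        "dyad": row["dyad"],
        "income_mean": mean(reported) if reported else None,
    })
```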
Yes, Sudhi needs to clarify whether the desire is for a summary statistic approach or a multiple variable “wide dataset” approach. The latter is hard to do when time intervals vary across participants. In general, analysis of the raw data will provide the highest statistical information, but sometimes summary variable approaches, when intervals are almost uniform, don’t lose a great deal of power.
Thanks pmbrown, what you mentioned is relevant, and I have addressed all those issues during preprocessing of my dataset.
Andres, they just wanted to use one exposure and one outcome per mother-child dyad, not long to wide. I met with the scientists today and convinced them not to take the average and to use the entire dataset instead. I am going to fit a mixed effects model, just like what @f2harrell Frank and others suggested.
Frank, you are right. I met with the scientists today and we are going to use the entire dataset with repeat observations, not taking the average.
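For reference, the simplest form the mixed effects model mentioned above could take is a random intercept per dyad. This is only a sketch of one possible specification: the indices $i$ (dyad) and $j$ (observation within dyad), the linear income effect, and the normality assumptions are illustrative and not taken from the discussion.

$$
y_{ij} = \beta_0 + \beta_1 A_{ij} + u_i + \varepsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

Here $y_{ij}$ is the test score and $A_{ij}$ the reported income for observation $j$ of dyad $i$, and $u_i$ captures the correlation among repeat observations from the same dyad.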