When analyzing assay data, the steps are as follows:
- First, urine specimens are collected from subjects.
- Second, the urine samples are shipped to labs, which run tests to measure the concentrations of various chemicals in each subject’s sample. If the chemical of interest, say BPA, is not detected, an entry is made to indicate that BPA was below the limit of detection (LOD) for that subject. A constant value, say 0.01/2, is substituted for values below the LOD.
- Third, the chemical concentration is adjusted for urinary dilution (UD). Some subjects have highly diluted urine and others much less so, so the detected concentration (here, BPA) is adjusted for the dilution of the urine sample.
The question is: does it make sense to perform the UD adjustment for values that are below the LOD? Curious to know your thoughts. Thanks in advance.
Based on the above, my answer is no. The value of 0.01 is not a real measurement; it is an indicator of left-censoring. No “math” should be performed on that value. How the data are handled downstream depends on the purpose of the study, but it would be wise to maintain some indicator of left-censoring for those records (e.g., by creating a new column with an indicator variable).
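To make the suggestion concrete, here is a minimal sketch of flagging left-censored records and applying the dilution adjustment only to detected values. The thread does not say which adjustment method is used; dividing by urinary creatinine is an assumption made here for illustration, and the field names (`bpa`, `creatinine`, `below_lod`, `bpa_adjusted`) are hypothetical.

```python
LOD = 0.01  # ng/mL, the example LOD value from the thread


def flag_and_adjust(records, lod=LOD):
    """Add a left-censoring indicator and adjust only detected values.

    Each record is a dict with 'bpa' (ng/mL, or None if not detected)
    and 'creatinine' (g/L). Creatinine correction is an assumed method,
    not something the thread specifies.
    """
    out = []
    for rec in records:
        below = rec["bpa"] is None or rec["bpa"] < lod
        if below:
            adjusted = None  # no arithmetic on a censored value
        else:
            # ng/mL divided by g/L gives ug per g creatinine
            adjusted = rec["bpa"] / rec["creatinine"]
        out.append({**rec, "below_lod": below, "bpa_adjusted": adjusted})
    return out


# Made-up example records, for illustration only
records = [
    {"id": 1, "bpa": None, "creatinine": 1.2},
    {"id": 2, "bpa": 0.5, "creatinine": 2.0},
]
rows = flag_and_adjust(records)
```

Keeping `below_lod` as its own column means any later choice about how to treat censored records can be made, and changed, without losing track of which values were never actually measured.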
Thanks Christopher, that makes sense. The 0.01 is 0.01 ng/mL (I think), which is the lowest concentration the machines can detect for BPA in urine samples. This varies from machine to machine; it is just an example. Thanks again for the suggestion.
The LoD is a property of the assay, not of the machine it is measured on per se. Some machines may report concentrations <LoD (perhaps down to the Limit of Blank, LoB). Normal hospital laboratory practice is to take any concentration the machine returns as <LoD and report to the clinicians only that the result is <LoD (rather than the numerical value returned). Sometimes the data I receive have values <LoD, and sometimes they are truncated at the LoD. It is worth checking which applies to your data.
To some extent the LoD is an arbitrary threshold, conventionally set 1.645 standard deviations (of replicate measurements of a low-concentration sample) above the LoB. If a machine is reporting values <LoD it is possible to use them; they are just less precise than measurements above the LoD. However, if the purpose is to analyse the data for something like risk stratification or prediction as would be used clinically, then I agree with Christopher: do not use concentrations <LoD, and maintain an indicator variable. One tricky thing remains. If, for example, you are running a logistic regression model, you can include the indicator variable, but a choice still needs to be made about what value to enter when the concentration is <LoD. I see you chose half the LoD. Other options are a value just below the LoD or a draw from some random distribution under the LoD. If these samples constitute a large proportion of the data set, these choices can make a difference.
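For reference, the three fill-in options mentioned above could be sketched as follows. The function name and strategy labels are hypothetical, "just below" is arbitrarily taken as 99% of the LoD, and a uniform distribution on (0, LoD) is only one possible choice of "random distribution under the LoD".

```python
import random


def fill_below_lod(lod, strategy, rng=None):
    """Return a substitute value for a concentration reported as <LoD.

    strategy: 'half'       -> LoD / 2
              'just_below' -> 0.99 * LoD (the 0.99 factor is an assumption)
              'uniform'    -> random draw between 0 and LoD (assumed distribution)
    """
    if strategy == "half":
        return lod / 2
    if strategy == "just_below":
        return lod * 0.99
    if strategy == "uniform":
        rng = rng or random.Random()
        return rng.uniform(0.0, lod)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Seeded generator so the random fill is reproducible
rng = random.Random(42)
half = fill_below_lod(0.01, "half")
just_below = fill_below_lod(0.01, "just_below")
drawn = fill_below_lod(0.01, "uniform", rng)
```

Running an analysis once per strategy and comparing the estimates is a simple way to see how sensitive the results are to this choice when many samples are <LoD.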
I think that the indicator variable makes the choice of fill-in value not relevant.
For linear models something different happens: the residual variance may need to be modeled with a separate parameter (and will have a higher variance) for those observations with the indicator equal to one.
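The claim that the indicator makes the fill-in value irrelevant can be checked numerically. The sketch below fits an ordinary least-squares model with an intercept, the filled-in concentration, and the below-LoD indicator, then compares fitted values under two different fill-in constants. The data are made up, and the argument assumes a single constant fill-in for all censored records (with several LoDs and LoD/2 as the fill-in, one indicator per LoD would be needed). It does not address the residual-variance point.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_fitted(rows, y):
    """Least-squares fitted values via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [sum(b * v for b, v in zip(beta, r)) for r in rows]


# Made-up data: None marks a concentration reported as <LoD
conc = [None, 0.5, None, 1.2, 0.8, None, 2.0, 0.3]
y = [1.0, 2.1, 0.7, 3.9, 2.8, 1.2, 5.1, 1.6]


def fitted_for(fill):
    """Fitted values from y ~ 1 + concentration + below-LoD indicator."""
    rows = [
        [1.0, fill if c is None else c, 1.0 if c is None else 0.0]
        for c in conc
    ]
    return ols_fitted(rows, y)


fit_half = fitted_for(0.005)         # fill-in = LoD / 2
fit_just_below = fitted_for(0.0099)  # fill-in = just below the LoD
max_diff = max(abs(a - b) for a, b in zip(fit_half, fit_just_below))
print(max_diff)  # should be zero up to floating-point rounding
```

The reason is that changing the fill-in constant only shifts the concentration column by a multiple of the indicator column, so the coefficient on the indicator absorbs the change. The same holds for the linear predictor of a logistic model.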
@kiwiskiNZ Thanks KiwiskiNZ, I thought the LOD threshold was an artifact of the machines/methods used to analyze these urine samples. The CDC in Georgia uses one lab method while the labs at Wadsworth, NY use a different one. For the same chemical (BPA), I am seeing 4 different threshold values depending on where the samples were sent for analysis: 0.01, 0.1, 0.15, and 0.2 ng/mL.
@f2harrell Thank you, Frank. I will dig more into modeling the residual variance.
This is tricky. You say you are seeing 4 different thresholds. Are these merely different reporting thresholds, but the same assay & measurement technique, or are they 4 different assays, probably from different manufacturers (with different LoDs)?
If it is the former and you want to be consistent, then perhaps use the highest threshold for your dummy variable.
If it is the latter, you need to get hold of the package inserts for the assays. It is likely that you won’t be able to combine the data from the different assays unless there is very high correlation and no bias between them (even then I would add a dummy variable for the assay and treat results with caution).
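For the first case (same assay, different reporting thresholds), re-censoring everything at the highest threshold might look like the sketch below. The function and field names are hypothetical and the values are made up; only the four thresholds come from the thread.

```python
def recensor_at_highest(results):
    """Apply one common threshold (the highest LoD) to all results.

    results: list of (value, lod) pairs in ng/mL; value is None when the
    lab reported the result as <LoD.
    """
    common_lod = max(lod for _, lod in results)
    recensored = []
    for value, _ in results:
        below = value is None or value < common_lod
        recensored.append({
            "value": None if below else value,
            "below_lod": below,
        })
    return common_lod, recensored


# Made-up results using the four thresholds mentioned in the thread
results = [
    (None, 0.01),
    (0.05, 0.01),
    (0.12, 0.1),
    (0.3, 0.15),
    (None, 0.2),
]
common_lod, recensored = recensor_at_highest(results)
```

The cost of this consistency is that detected values between a lab's own threshold and the common one (0.05 and 0.12 here) are treated as censored, so some information is discarded.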
I think it is the latter. I will dig deeper to figure out the exact scenario. Thanks for the suggestions, KiwiskiNZ.