Predictive model for drug-related problems (DRPs)

I would like to ask about a predictive model for drug-related problems (DRPs) at discharge. I want to develop such a model for four internal medicine departments at a secondary, 490-bed teaching hospital in Hadera, Israel.

I am planning to use logistic regression and random forest, with 10-20 candidate parameters. I plan to calculate the minimum sample size that satisfies the criteria recommended in "Stat Med. 2019 Mar 30; 38(7): 1276–1296." Approximately 16,000 patients are discharged a year from these departments. If I go with 10 candidate parameters, a shrinkage of 0.9, and R2 CS-adj = 0.1, I will need at least 900 to 1000 individuals. I plan to choose the individuals randomly, and in order to represent all departments and all seasons, I calculated that I should randomly select 21 patients per department per month (4 departments × 12 months × 21 = 1,008 patients).
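To make the arithmetic above easier to check, here is a minimal Python sketch. It covers only the shrinkage criterion from the cited paper, as I understand it: n = p / ((S - 1) ln(1 - R2_CS / S)). The paper's other criteria depend on the expected DRP prevalence, which is not given here, so they are left out and may well push the required size higher. All function names and the synthetic discharge list are hypothetical and exist only for illustration.

```python
import math
import random
from collections import defaultdict


def criterion1_sample_size(n_params, shrinkage, r2_cs):
    """Minimum n for a target shrinkage factor (shrinkage criterion only)."""
    denom = (shrinkage - 1) * math.log(1 - r2_cs / shrinkage)
    return math.ceil(n_params / denom)


def per_stratum_quota(n_total, n_departments, n_months=12):
    """Patients to draw from each department-month stratum."""
    return math.ceil(n_total / (n_departments * n_months))


def stratified_sample(discharges, quota, seed=0):
    """Randomly draw up to `quota` patients per (department, month).

    `discharges` is an iterable of (patient_id, department, month) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for patient_id, department, month in discharges:
        strata[(department, month)].append(patient_id)
    return {
        key: rng.sample(ids, min(quota, len(ids)))
        for key, ids in strata.items()
    }


# Values taken from the question: 10 parameters, shrinkage 0.9, R2_CS = 0.1.
n_min = criterion1_sample_size(10, 0.9, 0.1)
quota = per_stratum_quota(1000, n_departments=4)

# Synthetic stand-in for the real discharge list (made-up IDs, 30 per stratum).
discharges = [
    (f"{dept}-{month}-{i}", dept, month)
    for dept in ("A", "B", "C", "D")
    for month in range(1, 13)
    for i in range(30)
]
sample = stratified_sample(discharges, quota)
total_sampled = sum(len(ids) for ids in sample.values())
print(n_min, quota, total_sampled)
```

With the numbers from the question, the shrinkage criterion alone gives a somewhat lower minimum than 900, so the 900 to 1000 figure presumably reflects the other criteria or a safety margin. A target of 1000 spread over 48 department-month strata rounds up to 21 per stratum.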

I would like to ask whether this logic is sound and how you would advise handling this study.

If you’re saying that you have 16,000 patient records you could use, then you should use all of them for your modeling effort (split between train/test/validate). That’s nowhere near large enough that you’d want to downsize it for manageability and throw away more than 90% of your data.



Thanks for your reply.
I should have added more details. Discharge letters are not usually reviewed by a clinical pharmacist due to lack of time, so the purpose of the model is to make the list of discharge letters to review smaller and more manageable. For the study, two clinical pharmacists will review all the discharge letters in the sample, and where they disagree an internal medicine specialist will decide. So it is not practical to review 16,000 discharge letters. It would take years…

Ah, so that does make a considerable difference.

But … depending on how the discharge letters are generated … you may not have 16,000 unique sets of text, after ignoring patient name, address, etc. Some chunks may be generated by an automated process, some may be copy & paste replicates of prior letters.

So a process to identify duplicate letters may reduce the person-power needed to review the letters and/or let you apply the results of the letter review to a larger group of patient records, that you could then include in your analysis.
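As a rough illustration of that idea, here is a minimal Python sketch that groups letters whose text is identical once patient-specific terms are masked and whitespace and case are normalized. The function names, the `masks` structure, and the example letters are all made up for illustration. It only catches exact matches after normalization; near-duplicates (for example, a pasted paragraph with one dose changed) would need some form of fuzzy comparison instead.

```python
import hashlib
import re
from collections import defaultdict


def normalize_letter(text, mask_terms=()):
    """Lowercase, blank out patient-specific terms, collapse whitespace."""
    cleaned = text.lower()
    for term in mask_terms:
        cleaned = cleaned.replace(term.lower(), " ")
    return re.sub(r"\s+", " ", cleaned).strip()


def group_duplicate_letters(letters, masks=None):
    """Group letter IDs whose normalized text is identical.

    `letters` maps letter_id -> raw text.
    `masks` maps letter_id -> terms to ignore (name, address, ...).
    """
    masks = masks or {}
    groups = defaultdict(list)
    for letter_id, text in letters.items():
        normalized = normalize_letter(text, masks.get(letter_id, ()))
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[key].append(letter_id)
    return list(groups.values())


# Made-up example letters; real ones would come from the hospital system.
letters = {
    "A1": "Patient: Dana Levi\nContinue aspirin 100 mg daily.",
    "A2": "Patient: Yossi Cohen\nContinue  aspirin 100 mg daily.",
    "A3": "Patient: Noa Katz\nStart warfarin 5 mg daily.",
}
masks = {"A1": ["Dana Levi"], "A2": ["Yossi Cohen"], "A3": ["Noa Katz"]}

groups = group_duplicate_letters(letters, masks)
print(groups)
```

Each group would then need only one review, with the result applied to every letter in it. Digits are deliberately left in the text, since doses are likely to matter when judging DRPs.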