Estimates from binning should be ignored. You can easily manipulate the shape of the calibration plot just by changing the binning method (number of bins, cut points), and binning ignores risk heterogeneity within bins.
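A quick sketch of the problem, with simulated data of my own (not from your model): the binned calibration estimates shift when you change the number of quantile bins, while a smooth (here lowess) estimate involves no arbitrary cut points.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
n = 500
p_hat = rng.beta(2, 5, size=n)      # hypothetical predicted risks
y = rng.binomial(1, p_hat)          # outcomes generated under perfect calibration

def binned_calibration(p, y, n_bins):
    """Mean predicted vs. observed risk within quantile bins."""
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, n_bins - 1)
    return np.array([(p[idx == b].mean(), y[idx == b].mean())
                     for b in range(n_bins)])

print(binned_calibration(p_hat, y, 5))    # one apparent calibration shape...
print(binned_calibration(p_hat, y, 10))   # ...a different one with more bins

smooth = lowess(y, p_hat, frac=0.6)       # binning-free calibration curve
```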
The minimum sample size needed to estimate a probability with a margin of error of $\pm 0.1$ at the 0.95 confidence level is $n = 96$. So it is impossible to validate a model with n = 49, even if all the predictions were concentrated at a single risk value.
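The 96 comes from the standard normal-approximation margin-of-error formula, evaluated at the worst case $p = 0.5$:

$$
n = \left(\frac{z_{1-\alpha/2}}{m}\right)^{2} p(1-p) = \left(\frac{1.96}{0.1}\right)^{2} \times 0.25 \approx 96
$$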
It is not appropriate to split data into training and test sets unless n > 20,000, because of the luck (or bad luck) of the split; the simulation sketch below shows the problem. What were your original total sample size and number of events? And how many candidate predictors were there?
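Here is a self-contained illustration (again with simulated data, so the specific numbers are not from your study) of how much a single split's test-set estimate varies just by re-randomizing the split at a modest sample size:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 300, 5
X = rng.normal(size=(n, p))
logit = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

aucs = []
for seed in range(200):                     # 200 different random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    fit = LogisticRegression().fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, fit.predict_proba(X_te)[:, 1]))

# The spread across splits is the "luck of the split":
print(f"test AUC range over splits: {min(aucs):.2f} to {max(aucs):.2f}")
```

Resampling validation (bootstrap or repeated cross-validation) on the full sample avoids throwing that precision away on a single arbitrary split.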