I have a question concerning Somers' Dxy, or C: is there a lower bound on the number of data points required for Dxy (or C) to be a sensible measure? Specifically, I am investigating control items in an experimental study under review. The study uses a slider between 0 and 1 to categorize "good" or "bad" items, so that continuous values bounded in [0, 1] can be assigned. For each participant there are six control items, which are considered to be clearly good or bad. As it currently stands, the experimenters require that five out of six control items be classified correctly, which they take to hold if good items receive >= 0.5 and bad ones < 0.5. They thus use a fixed threshold, which I find problematic for known reasons.
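For concreteness, the 5-of-6 rule described above can be sketched as follows (the labels and slider responses here are made-up values, not data from the study):

```python
# Hypothetical control-item data: 1 = good item, 0 = bad item,
# and the participant's slider responses in [0, 1].
labels = [1, 1, 1, 0, 0, 0]
responses = [0.9, 0.7, 0.5, 0.4, 0.1, 0.3]

def n_correct(labels, responses, threshold=0.5):
    """Count control items classified correctly under a fixed threshold:
    good items need response >= threshold, bad items need response < threshold."""
    return sum(
        (r >= threshold) == (y == 1)
        for y, r in zip(labels, responses)
    )

# The filtering rule: keep a participant if at least 5 of 6 items are correct.
keep_participant = n_correct(labels, responses) >= 5
```

Note how a response of exactly 0.5 to a good item passes while 0.4999 fails, which is one way the fixed threshold discards information.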
I would suggest re-implementing the filtering of the control items using Hmisc::somers2, but it would apply to only six items per participant. I am curious to know whether Somers' Dxy or C can be applied to such a small set.
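Hmisc::somers2 is an R function; as a language-neutral illustration of what C measures on such a small set, here is a minimal Python sketch of the concordance index (the data are again invented). With three good and three bad items there are only nine pairs, so C can take only a few coarse values, which is part of why n = 6 is so fragile:

```python
from itertools import product

def c_index(labels, responses):
    """Concordance probability C: among all (good, bad) item pairs, the
    fraction in which the good item received the higher slider value,
    with ties counted as 1/2. Somers' Dxy is then 2*C - 1."""
    goods = [r for y, r in zip(labels, responses) if y == 1]
    bads  = [r for y, r in zip(labels, responses) if y == 0]
    pairs = list(product(goods, bads))
    score = sum(1.0 if g > b else 0.5 if g == b else 0.0 for g, b in pairs)
    return score / len(pairs)

# Six hypothetical control items: 3 good, 3 bad -> only 9 pairs in total.
labels = [1, 1, 1, 0, 0, 0]
responses = [0.9, 0.7, 0.5, 0.4, 0.1, 0.3]
c = c_index(labels, responses)
```

With 3 good and 3 bad items, C moves in steps of at least 1/18, so one shifted response changes the estimate drastically.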
I am looking forward to your answers. Thanks a lot.
Welcome to Datamethods Tibor!
It would be good to provide a few sentences about the ultimate goal of the analysis.
It takes about 400 observations to estimate a correlation coefficient well. Whether the resulting imprecision of the rank correlation estimates is a problem for you will depend on the intended use. Also, there may be a parametric summary measure for your situation that has more precision. I look forward to seeing more information.
Hi Frank,
thanks for pointing out datamethods.org to me. Let me elaborate a bit: there is a subfield of experimental linguistics concerned with acceptability judgments, based on the assumption that native speakers are able to discern acceptable from unacceptable sentences in their language (acceptable: A quick fox jumped over the lazy dog. unacceptable: Quick fox a jumpered over lazy dog the.). Experimental linguists carry out experiments in which they design examples to support or refute assumptions about grammar. Participants, however, may not be considered reliable for a variety of reasons (for a recent discussion see Pieper et al. 2025, https://doi.org/10.3138/jrds-2024-0002), so experimental studies usually include control items, which are deemed uncontroversially acceptable or unacceptable. If participants disagree with the expected judgments on these items, that usually flags a problem and will presumably lead to discarding those participants' results, depending on some threshold.
In the paper I am reviewing, acceptability/unacceptability is measured by a slider (closer to 1: acceptable; closer to 0: unacceptable). Hence the question arises of how responses to control items match their categorical status (0, 1). The authors have decided on a 0.5 threshold, so everything at or above 0.5 counts as acceptable and everything below as unacceptable. There are six control items for each participant.
I thought about replacing the fixed threshold with C, but suspected that six instances are too few, as you have confirmed. I have now come up with an alternative: determine the median and the median absolute deviation (MAD) for each of the six items across all participants. One could then take median + MAD as the threshold for bad examples and median - MAD for good examples.
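A minimal sketch of that median ± MAD proposal (the response vectors are hypothetical, and the unscaled MAD is used):

```python
import statistics

def mad(xs):
    """Median absolute deviation (unscaled, no consistency constant)."""
    m = statistics.median(xs)
    return statistics.median(abs(x - m) for x in xs)

# Hypothetical slider responses to ONE control item across all participants.
responses_bad_item  = [0.1, 0.2, 0.15, 0.3, 0.05, 0.25]   # a "bad" item
responses_good_item = [0.9, 0.85, 0.8, 0.95, 0.7, 0.88]   # a "good" item

# Per-item thresholds as proposed above: a response to a bad item should fall
# below median + MAD; a response to a good item should fall above median - MAD.
bad_cutoff  = statistics.median(responses_bad_item)  + mad(responses_bad_item)
good_cutoff = statistics.median(responses_good_item) - mad(responses_good_item)
```

This at least lets the data set its own per-item cutoffs rather than imposing 0.5 everywhere, though it still dichotomizes the continuous responses in the end.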
But perhaps there are other ideas around!
With n=6 per participant the data are too precious to even remotely consider using thresholds. Those authors are simply destroying a large proportion of the information. If you just want to summarize the six [0, 1] items, summaries based on simple means and variances will likely be optimal. That's because the data are "tied down", so there can be no problems with overly influential observations. For comparisons, I guess the difference in means between good and bad examples works best.
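A sketch of the mean-based summary suggested here, using invented responses for one participant (three good and three bad control items):

```python
from statistics import mean

# Hypothetical per-participant slider responses to the six control items.
good = [0.9, 0.7, 0.5]   # responses to the three "good" items
bad  = [0.4, 0.1, 0.3]   # responses to the three "bad" items

# Difference in means between good and bad items: a summary that uses all of
# the information in the bounded [0, 1] responses, with no threshold involved.
separation = mean(good) - mean(bad)
# Larger separation -> the participant distinguishes good from bad items more
# clearly; the study could screen on this continuous score instead of 5-of-6.
```

A participant answering at random would have separation near 0, while a perfectly discriminating participant approaches 1.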