Using Critical Success Index or Gilbert Skill Score as composite measures of positive predictive value and sensitivity in diagnostic accuracy studies: Weather forecasting informing epilepsy research

The Critical Success Index (CSI) and Gilbert Skill Score (GS) are verification measures that are commonly used to check the accuracy of weather forecasting. In this article, we propose that they can also be used to simplify the joint interpretation of positive predictive value (PPV) and sensitivity estimates across diagnostic accuracy studies of epilepsy data. This is because CSI and GS each provide a single measure that takes the weather forecasting equivalent of PPV and sensitivity into account. We have re‐analysed data from our recent systematic review of diagnostic accuracy studies of administrative epilepsy data using CSI and GS. We summarise the results and benefits of this approach.

We recently published a systematic review ascertaining the accuracy of using administrative data to identify epilepsy cases. 1 This showed that the commonest reported diagnostic accuracy measures across studies validating administrative epilepsy data (n = 30) are positive predictive value (PPV) and sensitivity (Sens); 28 studies reported PPV, of which 14 also reported Sens and two reported Sens alone. In contrast, negative predictive value (NPV) and specificity (Spec) were only reported in 14 studies, and one study reported Spec alone. It was difficult to identify the optimal diagnostic algorithms by NPV and Spec, as these were nearly 100% across most studies due to very high numbers of true negatives (TN), often far outnumbering true positives (TP), false positives (FP), and false negatives (FN). Instead, we identified the optimal diagnostic algorithms by ranking them in order of PPV and Sens and making a judgment on which had the best balance of both high PPV and Sens, selecting a threshold of >80% to represent accuracy. 1 In markedly imbalanced datasets like these, where mostly PPVs and Sens have been reported, it may be logical to apply a single diagnostic accuracy measure that encompasses both PPV and Sens to aid interpretation and ranking. We are not aware of any such single measures currently used in medical literature, 2 and we propose the use of the Critical Success Index (CSI) 3 or Gilbert Skill Score (GS) 4 for this purpose. Both are used in weather forecasting. 5

CSI (also known as threat score 5,6 or ratio of verification 4 ) eschews TN, such that 7 :

$$\mathrm{CSI} = \frac{TP}{TP + FP + FN}$$

In signal detection theory, CSI is defined as the ratio of hits to the sum of hits, false alarms, and misses. 8,9 CSI may also be expressed in terms of Sens and PPV 7 :

$$\mathrm{CSI} = \frac{1}{\dfrac{1}{\mathrm{PPV}} + \dfrac{1}{\mathrm{Sens}} - 1}$$

CSI values range from 0 to 1, interpreted as 0 = unable to forecast and 1 = perfect forecast. 5,10 CSI is not unbiased, because CSI = TP/(N − TN), where N is the sample size, giving lower scores for rarer events. 9 In such circumstances, GS (also known as equitable threat score 5,11 ) may be preferred, as it takes into account the number of hits due to chance (CH) 8 :

$$\mathrm{GS} = \frac{TP - CH}{TP + FP + FN - CH}, \qquad CH = \frac{(TP + FP)(TP + FN)}{N}$$

GS values range from −1/3 to 1. 5 As we previously demonstrated, 9 there is a monotonic relation between CSI and GS, where GS values are lower than CSI.
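The definitions of CSI and GS can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function names and the example counts are hypothetical, and the chance-hits term is assumed to take the standard equitable threat score form, CH = (TP + FP)(TP + FN)/N.

```python
def csi(tp, fp, fn):
    """Critical Success Index: hits / (hits + false alarms + misses)."""
    return tp / (tp + fp + fn)


def csi_from_ppv_sens(ppv, sens):
    """CSI rewritten in terms of PPV and Sens: 1 / (1/PPV + 1/Sens - 1)."""
    return 1.0 / (1.0 / ppv + 1.0 / sens - 1.0)


def chance_hits(tp, fp, fn, tn):
    """Hits expected by chance (assumed form): (TP + FP)(TP + FN) / N."""
    n = tp + fp + fn + tn
    return (tp + fp) * (tp + fn) / n


def gilbert_skill(tp, fp, fn, tn):
    """Gilbert Skill Score: CSI with chance hits removed from top and bottom."""
    ch = chance_hits(tp, fp, fn, tn)
    return (tp - ch) / (tp + fp + fn - ch)


# Hypothetical counts for one algorithm (not data from the review).
tp, fp, fn, tn = 80, 20, 20, 9880
ppv = tp / (tp + fp)
sens = tp / (tp + fn)

print(f"PPV={ppv:.3f}  Sens={sens:.3f}")
print(f"CSI (counts)     = {csi(tp, fp, fn):.4f}")
print(f"CSI (PPV, Sens)  = {csi_from_ppv_sens(ppv, sens):.4f}")
print(f"GS               = {gilbert_skill(tp, fp, fn, tn):.4f}")
```

With these hypothetical counts, the two CSI expressions agree, and GS falls slightly below CSI because only about one hit is expected by chance.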
We reanalyzed data from the systematic review 1 to calculate CSI and GS scores for the 91 algorithms from 10 studies where base data (TP, FP, FN, TN) were reported. CSI and GS scores were plotted alongside PPV and Sens for each algorithm, as percentages (Figure 1).
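A reanalysis of this kind can be sketched as follows, assuming each algorithm is described by its four base counts. The three algorithms and their counts are invented for illustration and are not taken from the review; the field names are likewise hypothetical, and CH is assumed to be (TP + FP)(TP + FN)/N.

```python
# Hypothetical base data (TP, FP, FN, TN) for three algorithms.
algorithms = [
    {"name": "A", "tp": 450, "fp": 50, "fn": 50, "tn": 99450},
    {"name": "B", "tp": 480, "fp": 320, "fn": 20, "tn": 99180},
    {"name": "C", "tp": 200, "fp": 10, "fn": 300, "tn": 99490},
]


def summarise(alg):
    """Return PPV, Sens, CSI and GS for one algorithm's base counts."""
    tp, fp, fn, tn = alg["tp"], alg["fp"], alg["fn"], alg["tn"]
    n = tp + fp + fn + tn
    ch = (tp + fp) * (tp + fn) / n  # hits expected by chance (assumed form)
    return {
        "name": alg["name"],
        "ppv": tp / (tp + fp),
        "sens": tp / (tp + fn),
        "csi": tp / (tp + fp + fn),
        "gs": (tp - ch) / (tp + fp + fn - ch),
    }


# Rank algorithms by CSI, highest first, and report values as percentages.
rows = sorted((summarise(a) for a in algorithms),
              key=lambda r: r["csi"], reverse=True)
for r in rows:
    print(f"{r['name']}: PPV={r['ppv']:.1%}  Sens={r['sens']:.1%}  "
          f"CSI={r['csi']:.1%}  GS={r['gs']:.1%}")
```

In this invented example, only algorithm A, with PPV and Sens both at 90%, reaches a CSI above 80%; B and C each have one high value but rank lower because the other is poor.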
The plot shows CSI and GS scores are conservative, always less than or equal to the lower of the corresponding PPV and Sens. For CSI scores ≥ .8, both PPV and Sens were ≥.8, whereas low CSI scores occurred when there was a large difference between PPV and Sens, even when one of these values was high (~.9). The monotonic relationship between CSI and GS was preserved, with GS scores remaining generally more conservative than CSI. However, epilepsy provides a real-world example of how any differences between CSI and GS become negligible in studies where large sample sizes (N) are driven by very high numbers of TN far outnumbering TP, FP, and FN, to the extent of markedly lowering CH (all studies to the right of Holden 2005 in Figure 1; raw data available here: www.bit.ly/3WzcsqP).
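Both observations can be illustrated numerically. The sketch below uses hypothetical counts chosen so that PPV is high (.9) and Sens is low (.3), then increases TN while holding TP, FP, and FN fixed. CH is assumed to be (TP + FP)(TP + FN)/N, and the function names are illustrative only.

```python
def csi(tp, fp, fn):
    """Critical Success Index; true negatives play no part."""
    return tp / (tp + fp + fn)


def gilbert_skill(tp, fp, fn, tn):
    """Gilbert Skill Score, with CH assumed to be (TP + FP)(TP + FN) / N."""
    n = tp + fp + fn + tn
    ch = (tp + fp) * (tp + fn) / n
    return (tp - ch) / (tp + fp + fn - ch)


# Hypothetical counts: PPV = 0.9 but Sens = 0.3.
tp, fp, fn = 90, 10, 210
ppv = tp / (tp + fp)
sens = tp / (tp + fn)
score = csi(tp, fp, fn)
print(f"PPV={ppv:.2f}  Sens={sens:.2f}  CSI={score:.3f}")

# Hold TP, FP and FN fixed and let TN grow: CH shrinks, so GS approaches CSI.
gaps = []
for tn in (100, 1_000, 10_000, 1_000_000):
    gs = gilbert_skill(tp, fp, fn, tn)
    gaps.append(score - gs)
    print(f"TN={tn:>9,}  GS={gs:.4f}  CSI - GS={score - gs:.4f}")
```

Here CSI sits below the lower of PPV and Sens despite the high PPV, and the gap between CSI and GS shrinks steadily as TN comes to dominate the sample.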
Although CSI and GS are established prediction metrics in meteorological literature, 3-6,8,10-12 few texts have translated them into medical literature. 9,13 We provide here the first translation of CSI and GS into epilepsy literature. We suggest CSI may be an appropriate measure to complement Sens, Spec, PPV, and NPV, particularly as it allows combined interpretation of PPV and Sens while also avoiding the inflation of NPV and Spec when there are many TN. GS may be a better metric when there are fewer TN and more CH. Based on the current findings, we suggest a CSI of ≥.8 would be a reasonable threshold score for achieving diagnostic accuracy. Optimal diagnostic thresholds for GS remain to be elucidated.