What is a good Cohen’s kappa?
I have recently started reading a bunch of papers about how “reliable” psychiatric diagnoses really are, i.e. how often two psychiatrists will come up with the same diagnosis for the same patient. Typically this is measured using Cohen’s kappa, a statistic calculated as follows:
$$\kappa = \frac{P_A - P_C}{1 - P_C}$$
where “$P_A$ is the proportion of pairwise agreements and $P_C = P^2 + P'^2$ and $P$ the proportion of all ratings that equal to 1” (Kraemer 2015). Note that $P' = 1 - P$; the prime mark indicates that it is the “complement” of $P$. “Similar to correlation coefficients, it can range from −1 to +1, where 0 represents the amount of agreement that can be expected from random chance, and 1 represents perfect agreement between the raters” (McHugh 2012).
Note that this is the formula for one of two commonly used forms of Cohen’s kappa, namely the “intraclass kappa”, “…which measures agreement (reliability) of multiple independent measures of the same nonordered categorical construct” (Kraemer 2015).
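To make the formula concrete, here is a minimal sketch of the intraclass kappa computation, following the definitions just quoted and assuming two raters assigning a binary (0/1) diagnosis to the same patients. The function name and the example data are made up for illustration.

```python
def intraclass_kappa(ratings_a, ratings_b):
    """Intraclass kappa for two raters and a binary (0/1) category."""
    n = len(ratings_a)
    # P_A: proportion of patients on whom the two raters agree
    p_agree = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # P: proportion of all ratings (both raters pooled) equal to 1; P' = 1 - P
    p = (sum(ratings_a) + sum(ratings_b)) / (2 * n)
    # P_C = P^2 + P'^2, the agreement expected by chance
    p_chance = p ** 2 + (1 - p) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Two hypothetical raters diagnosing ten patients (1 = disorder present)
rater_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
rater_2 = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(round(intraclass_kappa(rater_1, rater_2), 2))  # 0.6
```

Here the raters agree on 8 of 10 patients, but since half of that agreement is expected by chance alone, kappa comes out at 0.6 rather than 0.8.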
Anyway, what I want to explore here is the question: what is the threshold separating “good” from “bad” kappas? I.e., is there a number such that a kappa below it is “bad”, indicating low reliability, and a kappa above it is “good”? Obviously it’s not quite that simple, and everyone seems to agree that just slicing the range of possible values into categories labeled “good”, “fair”, etc. is inappropriate, since what a given value of a coefficient means depends on the exact study being analyzed. This is generally recognized for correlation coefficients, where statisticians advise against blindly assuming that a given value always represents an equally strong relationship regardless of context. For example, Schober et al. (2018) note,
Several approaches have been suggested to translate the correlation coefficient into descriptors like “weak,” “moderate,” or “strong” relationship (see the Table for an example). These cutoff points are arbitrary and inconsistent and should be used judiciously. While most researchers would probably agree that a coefficient of <0.1 indicates a negligible and >0.9 a very strong relationship, values in-between are disputable…Rather than using oversimplified rules, we suggest that a specific coefficient should be interpreted as a measure of the strength of the relationship in the context of the posed scientific question.
There are many other examples of statisticians cautioning against putting too much weight on a single number, and urging that factors beyond its value be taken into account, for other coefficients as well. Higgins et al. (2003), for instance, caution that I², often used to measure heterogeneity in meta-analyses, doesn’t fit into one-size-fits-all categories either: “A naive categorisation of values for I² would not be appropriate for all circumstances…Meta-analysts must also consider the clinical implications of the observed degree of inconsistency across studies.”
So we have a clear picture: nothing can really be captured perfectly by a single statistic, whether it’s kappa, Pearson’s r, or anything else.
Similarly, Kraemer (2015) notes:
A recurring issue is that of how large is large enough when intraclass kappa is used to measure the reliability of a measure. There is no one answer to this, for there is an upper limit on reliability imposed by the nature of the population, not the skill and care of the raters or the definitions of the categories. [Emphasis mine.]
But the lack of any “one answer” for how high kappa must be doesn’t stop Kraemer from giving us a basic outline of just such an answer:
One view of this problem for evaluation of DSM‐5 diagnostic reliabilities, based on comparisons with the reliabilities of binary medical diagnoses in general, suggests that 0.8 and above is nearly impossible to achieve, that 0.6–0.8 is very good, 0.4–0.6 is good, 0.2–0.4 is questionable and under 0.2 is unacceptable, standards set by examining the reliabilities of medical diagnoses. [Emphasis mine.]
This classification was originally proposed by Kraemer et al. (2012).
I will now seek to compare these categories with those described elsewhere. Cooper (2014) inspired this post by pointing out that definitions of “good”, “poor”, etc. kappa values have changed over time for what seem to be not-very-good reasons.
According to Mordal et al. (2010):
Cohen’s kappa values > 0.75 indicate excellent agreement; < 0.40 poor agreement and values between, fair to good agreement
This seems to be taken from a book by Fleiss, as cited in the paper referenced by Mordal et al. (namely, Shrout et al. 1987).
According to Brinkmann et al. (2019):
Landis and Koch’s (1977) guideline describes agreement as poor at a value of 0, as slight when κ = 0–.20, as fair when κ = .21–.40, as moderate when κ = .41–.60, as substantial when κ = .61–.80 and as almost perfect when κ = .81–1
According to Pies (2007),
Kappa values from 0.41 to 0.60 are usually considered moderate, and values above 0.60 are substantial.
Byrt (1996) also noted some inconsistencies across different schemes for classifying values of kappa. He reproduced the classification presented by Landis and Koch (1977), which has been widely cited since it was first published. He then noted:
The descriptions have been slightly modified by Altman: <0.20 (Poor), 0.21–0.40 (Fair), 0.41–0.60 (Moderate), 0.61–0.80 (Good), 0.81–1.00 (Very good). Alternatively, writers have used the suggestion of Fleiss: <0.40 (Poor agreement beyond chance), 0.40–0.75 (Fair to good agreement beyond chance), >0.75 (Excellent agreement beyond chance), which has been interpreted in at least one study to mean: <0.40 (Poor), 0.41–0.57 (Fair), 0.58–0.75 (Good), >0.75 (Excellent).
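To see how much the verbal labels can diverge, here is a small sketch (my own, not taken from any of the cited papers) that applies the Landis and Koch, Altman, and Fleiss cut-points quoted above to the same kappa value.

```python
def landis_koch(k):
    bands = [(0.00, "poor"), (0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for upper, label in bands if k <= upper)

def altman(k):
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "good"), (1.00, "very good")]
    return next(label for upper, label in bands if k <= upper)

def fleiss(k):
    if k < 0.40:
        return "poor agreement beyond chance"
    return "fair to good agreement beyond chance" if k <= 0.75 else "excellent agreement beyond chance"

# The same kappa of 0.70 earns three different descriptions:
print(landis_koch(0.70), "|", altman(0.70), "|", fleiss(0.70))
# substantial | good | fair to good agreement beyond chance
```

So a reliability study reporting a kappa of 0.70 can truthfully describe its agreement as “substantial”, “good”, or merely “fair to good” depending on which rule of thumb it picks.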
Cooper (2014) also highlighted other definitions:
Field trials for the DSM-III [in 1980] included reliability tests. In these a Cohen’s kappa of 0.7 continued to be the threshold for “good agreement”…When it came to the DSM-5, however, the goal posts seemed to shift. Prior to the results being available, members of the DSM-5 taskforce declared that a kappa of over 0.8 would “be almost miraculous,” a kappa between 0.6 and 0.8 would be “cause for celebration,” values between 0.4 and 0.6 were a “realistic goal”, and those between 0.2 and 0.4 would be acceptable.
Cooper notes that the newer standards for acceptable kappa values were inspired in large part by those considered acceptable in other medical fields (or at least those obtained in reliability studies in such fields), writing that they were based on “[d]ata from a motley assortment of other reliability studies in medicine”. Clearly this is psychiatry trying to break into the “club” of accepted, objective medical sciences with valid, reliable, objective diagnoses (did I say “objective” enough times?).
Cooper notes that not only are the standards for acceptable kappa values lower for assessing the DSM-5 results than those used for the DSM-III, but the kappa values themselves are also much lower in the DSM-5 validation studies than in those for the DSM-III. This seems like a way to sweep low kappa values under the rug by redefining what counts as an acceptably high value, don’t you think?
After comparing several different criteria for “good”/“fair”/etc. kappa values, Cooper concludes, uncontroversially, that
Clearly there are no universally agreed standards for what would count as a “good” Cohen’s kappa.
This is in keeping with Byrt (1996), who pointed out the following regarding different classification schemes of kappa values:
Clearly, these explanatory terms are arbitrary... It would be better if those who reported values of kappa and those who read the reports could manage without explanatory terms such as fair, good, etc, and develop a feeling for the coefficient itself, paying respect to the prevalence and bias. To do so, however, would be difficult for those who have little experience with kappa.
I like the snark in the last sentence. But I don’t really know what “develop a feeling” means — it sounds like he wants statisticians to “play” kappa like musicians play an instrument or something. But clearly kappa needs to be considered in combination with prevalence and bias of the outcome being studied. Specifically, as Stemler (2004) noted, “kappa values for different items or from different studies cannot be meaningfully compared unless the base rates are identical. Consequently, although the statistic gives some indication as to whether the agreement is better than that predicted by chance alone, it is difficult to apply the kinds of rules of thumb suggested by Landis and Koch for interpreting kappa across different circumstances.”
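Here is a quick numerical sketch (with made-up numbers) of that base-rate point, plugging the intraclass formula from the start of the post into two hypothetical studies that have identical raw agreement but different prevalences.

```python
def kappa_from_proportions(p_agree, prevalence):
    """Intraclass kappa from observed agreement and the pooled proportion
    of ratings equal to 1 (the base rate of the diagnosis)."""
    p_chance = prevalence ** 2 + (1 - prevalence) ** 2  # P_C = P^2 + P'^2
    return (p_agree - p_chance) / (1 - p_chance)

# Both hypothetical studies see the raters agree on 85% of patients...
print(round(kappa_from_proportions(0.85, prevalence=0.50), 2))  # 0.7
print(round(kappa_from_proportions(0.85, prevalence=0.90), 2))  # 0.17
```

The same 85% raw agreement yields a kappa of 0.7 when the diagnosis is given half the time, but a near-chance 0.17 when it is given 90% of the time, which is exactly why the same rule of thumb cannot be applied across studies with different base rates.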
But screw that, I just want a “magic number” to tell me if diagnoses are reliable or not! I don’t wanna look at crucial contextual information. I just want a way to go directly and consistently from a kappa value to “strong” or “weak” (or some other adjective) interrater agreement. I don’t want to have to be bothered by the fact that “The choice of such benchmarks…is inevitably arbitrary, and the effects of prevalence and bias on kappa must be considered when judging its magnitude” (Sim & Wright 2005). So below I have compiled different standards for classifying kappa values in a convenient table you can view here.