# Fleiss' kappa in SPSS Statistics

## Introduction

Fleiss' kappa, κ (Fleiss, 1971; Fleiss et al., 2003), is a measure of inter-rater agreement used to determine the level of agreement between two or more raters (also known as "judges" or "observers") when the method of assessment, known as the response variable, is measured on a categorical scale. In addition, Fleiss' kappa is used when: (a) the targets being rated (e.g., patients in a medical practice, learners taking a driving test, customers in a shopping mall/centre, burgers in a fast food chain, boxes delivered by a delivery company, chocolate bars from an assembly line) are randomly selected from the population of interest rather than being specifically chosen; and (b) the raters who assess these targets are non-unique and are randomly selected from a larger population of raters. We explain these three concepts – random selection of targets, random selection of raters and non-unique raters – as well as the use of Fleiss' kappa in the example below.

As an example of how Fleiss' kappa can be used, imagine that the head of a large medical practice wants to determine whether doctors at the practice agree on when to prescribe a patient antibiotics. Therefore, four doctors were randomly selected from the population of all doctors at the large medical practice to examine a patient complaining of an illness that might require antibiotics (i.e., the "four randomly selected doctors" are the non-unique raters and the "patients" are the targets being assessed). The four randomly selected doctors had to decide whether to "prescribe antibiotics", "request the patient come in for a follow-up appointment" or "not prescribe antibiotics" (i.e., where "prescribe", "follow-up" and "not prescribe" are three categories of the nominal response variable, antibiotics prescription decision). This process was repeated for 10 patients, where on each occasion, four doctors were randomly selected from all doctors at the large medical practice to examine one of the 10 patients. The 10 patients were also randomly selected from the population of patients at the large medical practice (i.e., the "population" of patients at the large medical practice refers to all patients at the large medical practice). The level of agreement between the four non-unique doctors for each patient is analysed using Fleiss' kappa. Since the results showed a very good strength of agreement between the four non-unique doctors, the head of the large medical practice feels somewhat confident that doctors are prescribing antibiotics to patients in a similar manner. Furthermore, an analysis of the individual kappas can highlight any differences in the level of agreement between the four non-unique doctors for each category of the nominal response variable. For example, the individual kappas could show that the doctors were in greater agreement when the decision was to "prescribe" or "not prescribe", but in much less agreement when the decision was to "follow-up". 
It is also worth noting that even if raters strongly agree, this does not mean that their decision is correct (e.g., the doctors could be misdiagnosing the patients, perhaps prescribing antibiotics too often when it is not necessary). This is something that you have to take into account when reporting your findings, but it cannot be measured using Fleiss' kappa.

In this introductory guide to Fleiss' kappa, we first describe the basic requirements and assumptions of Fleiss' kappa. These are not things that you will test for statistically using SPSS Statistics, but you must check that your study design meets these basic requirements/assumptions. If your study design does not meet these basic requirements/assumptions, Fleiss' kappa is the incorrect statistical test to analyse your data. However, there are often other statistical tests that can be used instead. Next, we set out the example we use to illustrate how to carry out Fleiss' kappa using SPSS Statistics. This is followed by the Procedure section, where we illustrate the simple 6-step Reliability Analysis... procedure that is used to carry out Fleiss' kappa in SPSS Statistics. Next, we explain how to interpret the main results of Fleiss' kappa, including the kappa value, statistical significance and 95% confidence interval, which can be used to assess the agreement between your two or more non-unique raters. We also discuss how you can assess the individual kappas, which indicate the level of agreement between your two or more non-unique raters for each of the categories of your response variable (e.g., indicating that doctors were in greater agreement when the decision was to "prescribe" or "not prescribe", but in much less agreement when the decision was to "follow-up", as per our example above). In the final section, Reporting, we explain the information you should include when reporting your results. A Bibliography and Referencing section is included at the end for further reading. To continue with this introductory guide, go to the next section.

## Basic requirements and assumptions of Fleiss' kappa

Fleiss' kappa is just one of many statistical tests that can be used to assess the inter-rater agreement between two or more raters when the method of assessment (i.e., the response variable) is measured on a categorical scale (e.g., Scott, 1955; Cohen, 1960; Fleiss, 1971; Landis and Koch, 1977; Gwet, 2014). Each of these different statistical tests has basic requirements and assumptions that must be met in order for the test to give a valid/correct result. Fleiss' kappa is no exception. Therefore, you must make sure that your study design meets the basic requirements/assumptions of Fleiss' kappa. If your study design does not meet these basic requirements/assumptions, Fleiss' kappa is the incorrect statistical test to analyse your data. However, there are often other statistical tests that can be used instead. In this section, we set out six basic requirements/assumptions of Fleiss' kappa.

• Requirement/Assumption #1: The response variable that is being assessed by your two or more raters is a categorical variable (i.e., you have an ordinal or nominal variable). Note that Fleiss' kappa does not take into account the ordered nature of an ordinal variable. Examples of nominal variables include gender (with two categories: "male" and "female"), ethnicity (with three categories: "African American", "Caucasian" and "Hispanic"), transport type (four categories: "cycle", "bus", "car" and "train"), and profession (five categories: "consultant", "doctor", "engineer", "pilot" and "scientist"). Examples of ordinal variables include educational level (e.g., with three categories: "high school", "college" and "university"), physical activity level (e.g., with four categories: "sedentary", "low", "moderate" and "high"), revision time (e.g., with five categories: "0-5 hours", "6-10 hours", "11-15 hours", "16-20 hours" and "21-25 hours"), Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 5-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot"). If these terms are unfamiliar to you, please see our guide on Types of Variable for further help.

For example, two raters could be assessing whether a patient's mole was "normal" or "suspicious" (i.e., two categories); four raters could be assessing whether the quality of service provided by a customer service agent was "above average", "average" or "below average" (i.e., three categories); or three raters could be assessing whether a person's physical activity level should be considered "sedentary", "low", "medium" or "high" (i.e., four categories).
• Requirement/Assumption #2: The two or more categories of the response variable that are being assessed by the raters must be mutually exclusive, which has two components. First, the two or more categories are mutually exclusive because no categories can overlap. For example, a rater, such as a dermatologist (i.e., a skin specialist), could only consider a patient's mole to be "normal" or "suspicious". The mole cannot be "normal" and "suspicious" at the same time. Second, the two or more categories are mutually exclusive because only one category can be selected for each response. For example, when assessing the patient's mole, the dermatologist must judge the mole to be either "normal" or "suspicious". The dermatologist cannot select more than one category for each patient.

Note: If you have a study design where the categories of your response variable are not mutually exclusive, Fleiss' kappa is not the correct statistical test. If you would like us to let you know when we can add a guide to the site to help with this scenario, please contact us.

• Requirement/Assumption #3: The response variable that is being assessed must have the same number of categories for each rater. In other words, all the raters must use the same rating scale. For example, if one rater was asked to assess whether the quality of service provided by a customer service agent was "above average", "average" or "below average" (i.e., three categories), a second rater cannot only be given two options: "above average" and "below average" (i.e., two categories).

Note: If you have a study design where each response variable does not have the same number of categories, Fleiss' kappa is not the correct statistical test. If you would like us to let you know when we can add a guide to the site to help with this scenario, please contact us.

• Requirement/Assumption #4: The two or more raters are non-unique. As Fleiss et al. (2003, pp. 610-611) state: "The raters responsible for rating one subject are not assumed to be the same as those responsible for rating another".

Note 1: As we mentioned above, Fleiss et al. (2003, pp. 610-611) stated that "the raters responsible for rating one subject are not assumed to be the same as those responsible for rating another". To illustrate, suppose that five radiographers are randomly sampled from all 50 radiographers at a large health organisation to rate each of 20 MRI slides. There is no assumption that the five radiographers who rate one MRI slide are the same radiographers who rate another MRI slide. However, because the five radiographers are randomly sampled from all 50 radiographers each time, it is possible that some of the radiographers will be selected to rate more than one of the 20 MRI slides.
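
The sampling scheme described in Note 1 can be sketched in a few lines of Python. This is only an illustration under assumed names: `draw_panels` is a hypothetical helper (not part of SPSS Statistics), radiographers are represented by the integers 0 to 49, and the fixed seed is there purely to make the draw repeatable.

```python
import random
from collections import Counter


def draw_panels(n_raters_total, panel_size, n_targets, seed=0):
    """Draw a fresh random panel of raters for every target.

    Raters are sampled without replacement within a panel, but the same
    rater can appear in several panels (i.e., raters are non-unique).
    """
    rng = random.Random(seed)  # seeded only so the illustration is repeatable
    pool = range(n_raters_total)
    return [sorted(rng.sample(pool, panel_size)) for _ in range(n_targets)]


# 5 radiographers drawn from 50 for each of 20 MRI slides
panels = draw_panels(n_raters_total=50, panel_size=5, n_targets=20)

# Count how often each radiographer was selected across all slides
appearances = Counter(rater for panel in panels for rater in panel)
repeat_raters = [rater for rater, k in appearances.items() if k > 1]

print(len(panels), "panels drawn")
print(len(repeat_raters), "radiographers rated more than one slide")
```

With 20 panels of five there are 100 rating slots but only 50 radiographers, so at least one radiographer must rate more than one slide, which is exactly the situation Note 1 describes.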

Note 2: If you have a study design where the two or more raters are not non-unique (i.e., they are unique), Fleiss' kappa is not the correct statistical test. If you would like us to let you know when we can add a guide to the site to help with this scenario, please contact us.

• Requirement/Assumption #5: The two or more raters are independent, which means that one rater's judgement does not affect another rater's judgement. For example, if the radiographers in the example above discuss their assessment of the MRI slides before recording their response or perhaps are simply in the same room when they make their assessment, this could influence the assessment they make. It is important that the potential for such bias is removed from the study design as much as possible.
• Requirement/Assumption #6: The targets being rated (e.g., patients in a medical practice, learners taking a driving test, customers in a shopping mall/centre, burgers in a fast food chain, boxes delivered by a delivery company, chocolate bars from an assembly line) are randomly selected from the population of interest rather than being specifically chosen.

For example, the randomly selected, non-unique radiographers in the example above rated 20 MRI slides. These 20 MRI slides were randomly selected from all MRI slides of patients' backs at the large health organisation (i.e., this is the total population of MRI slides from which the 20 MRI slides are randomly selected). The MRI slides from which 20 were selected were all of the same type. This is important because if some of the MRI slides were taken with the latest equipment, whilst other MRI slides were taken with old equipment where the image was less clear, this would introduce bias. As another example, consider our first example of four randomly selected doctors in a large medical practice who assessed whether 10 patients should be prescribed antibiotics. These 10 patients had to be randomly selected from the total population of patients at the large medical practice (i.e., the "population" of patients at the large medical practice refers to all patients at the large medical practice).

Note: If you have a study design where the targets being rated are not randomly selected, Fleiss' kappa is not the correct statistical test. If you would like us to let you know when we can add a guide to the site to help with this scenario, please contact us.

Therefore, before carrying out a Fleiss' kappa analysis, it is critical that you first check whether your study design meets these six basic requirements/assumptions. If your study design does not meet requirements/assumptions #1 (i.e., you have a categorical response variable), #2 (i.e., the two or more categories of this response variable are mutually exclusive), #3 (i.e., the same number of categories are assessed by each rater), #4 (i.e., the two or more raters are non-unique), #5 (i.e., the two or more raters are independent), and #6 (i.e., targets are randomly sampled from the population), Fleiss' kappa is the incorrect statistical test to analyse your data.

When you are confident that your study design has met all six basic requirements/assumptions described above, you can carry out a Fleiss' kappa analysis. In the sections that follow we show you how to do this using SPSS Statistics, based on the example we set out in the next section: Example used in this guide.

## Example used in this guide

A local police force wanted to determine whether police officers with a similar level of experience were able to detect whether the behaviour of people in a clothing retail store was "normal", "unusual, but not suspicious" or "suspicious". In particular, the police force wanted to know the extent to which its police officers agreed in their assessment of individuals' behaviour fitting into one of these three categories (i.e., where the three categories were "normal", "unusual, but not suspicious" or "suspicious" behaviour). In other words, the police force wanted to assess police officers' level of agreement.

To assess police officers' level of agreement, the police force conducted an experiment where three police officers were randomly selected from all available police officers at the local police force of approximately 100 police officers. These three police officers were asked to view a video clip of a person in a clothing retail store (i.e., the people being viewed in the clothing retail store are the targets that are being rated). This video clip captured the movement of just one individual from the moment that they entered the retail store to the moment they exited the store. At the end of the video clip, each of the three police officers was asked to record (i.e., rate) whether they considered the person's behaviour to be "normal", "unusual, but not suspicious" or "suspicious" (i.e., where these are three categories of the nominal response variable, behavioural_assessment). Since there must be independence of observations, which is one of the assumptions/basic requirements of Fleiss' kappa, as explained earlier, each police officer rated the video clip in a room where they could not influence the decision of the other police officers to avoid possible bias.

This process was repeated for a total of 23 video clips where: (a) each video clip was different; and (b) a new set of three police officers were randomly selected from all 100 police officers each time (i.e., three police officers were randomly selected to assess video clip #1, another three police officers were randomly selected to assess video clip #2, another three police officers were randomly selected to assess video clip #3, and so forth, until all 23 video clips had been rated). Therefore, the police officers were considered non-unique raters, which is one of the assumptions/basic requirements of Fleiss' kappa, as explained earlier. After all of the 23 video clips had been rated, Fleiss' kappa was used to compare the ratings of the police officers (i.e., to compare police officers' level of agreement).
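
To make the data layout concrete, the sketch below shows one way the ratings from such a study could be arranged: one row per video clip and one value per rater position. The three rows of ratings are made up purely for illustration (they are not the data from this guide), and `to_counts` is a hypothetical helper that converts each row into the number of raters choosing each category.

```python
from collections import Counter

CATEGORIES = ["normal", "unusual, but not suspicious", "suspicious"]

# Made-up ratings for three video clips (illustration only, not the study data).
# Each tuple holds the ratings of the three officers who viewed that clip.
ratings = [
    ("normal", "normal", "normal"),
    ("suspicious", "suspicious", "unusual, but not suspicious"),
    ("normal", "unusual, but not suspicious", "suspicious"),
]


def to_counts(rows, categories):
    """Convert one-row-per-target ratings into per-category counts."""
    table = []
    for row in rows:
        tally = Counter(row)
        unknown = set(tally) - set(categories)
        if unknown:
            # All raters must use the same set of categories
            raise ValueError(f"rating outside the agreed categories: {unknown}")
        table.append([tally[c] for c in categories])
    return table


counts = to_counts(ratings, CATEGORIES)
for row in counts:
    print(row)
```

Each row of `counts` sums to three because every clip is rated by exactly three officers; which officers they were is not recorded, which reflects the non-unique raters design.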

Note: Please note that this is a fictitious study being used to illustrate how to carry out and interpret Fleiss' kappa.

## SPSS Statistics procedure to carry out a Fleiss' kappa analysis

The procedure to carry out Fleiss' kappa, including individual kappas, is different depending on whether you have version 26 or the subscription version of SPSS Statistics or version 25 or earlier. If you are unsure which version of SPSS Statistics you are using, see our guide: Identifying your version of SPSS Statistics. In this section, we show you how to carry out Fleiss' kappa using the 6-step Reliability Analysis... procedure in SPSS Statistics, which is a "built-in" procedure that you can use if you have SPSS Statistics version 26 (or the subscription version of SPSS Statistics). If you have SPSS Statistics version 25 or earlier, please see the Note below:

Note: If you have SPSS Statistics version 25 or earlier, you cannot use the Reliability Analysis... procedure. However, you can use the FLEISS KAPPA procedure, which is a simple 3-step procedure. Unfortunately, FLEISS KAPPA is not a built-in procedure in SPSS Statistics, so you need to first download this program as an "extension" using the Extension Hub in SPSS Statistics. You can then run the FLEISS KAPPA procedure using SPSS Statistics.

Therefore, if you have SPSS Statistics version 25 or earlier, our enhanced guide on Fleiss' kappa in the members' section of Laerd Statistics includes a page dedicated to showing how to download the FLEISS KAPPA extension from the Extension Hub in SPSS Statistics and then carry out a Fleiss' kappa analysis using the FLEISS KAPPA procedure. You can access this enhanced guide by subscribing to Laerd Statistics.

1. Click Analyze > Scale > Reliability Analysis... on the top menu, as shown below:

Published with written permission from SPSS Statistics, IBM Corporation.

You will be presented with the following Reliability Analysis dialogue box:

2. Transfer your two or more variables, which in our example are non_unique_rater_1, non_unique_rater_2 and non_unique_rater_3, into the Ratings: box, using the bottom arrow button. You will end up with a screen similar to the one below:

3. Click on the Statistics... button. You will be presented with the Reliability Analysis: Statistics dialogue box, as shown below:

4. Select the Display agreement on individual categories option in the –Interrater Agreement: Fleiss' Kappa– area, as shown below:

5. Click on the Continue button. This will return you to the Reliability Analysis dialogue box.
6. Click on the OK button to generate the output for Fleiss' kappa.

Now that you have run the Reliability Analysis... procedure, we show you how to interpret the results from a Fleiss' kappa analysis in the next section.

## Interpreting the results from a Fleiss' kappa analysis

Fleiss' kappa (κ) is a statistic that was designed to take into account chance agreement. In terms of our example, even if the police officers were to guess randomly about each individual's behaviour, they would end up agreeing on some individuals' behaviour simply by chance. However, you do not want this chance agreement affecting your results (i.e., making agreement appear better than it actually is). Therefore, instead of measuring the overall proportion of agreement, Fleiss' kappa measures the proportion of agreement over and above the agreement expected by chance (i.e., over and above chance agreement).
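
The idea of "agreement over and above chance" can be sketched in Python using the standard Fleiss (1971) formula, κ = (P̄ − P̄e) / (1 − P̄e), where P̄ is the mean observed agreement and P̄e is the agreement expected by chance. The function name `fleiss_kappa` and the toy counts are assumptions made for illustration only; this is not the SPSS Statistics implementation and the toy data are not the data from this guide.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a targets-by-categories table of rater counts.

    counts[i][j] is the number of raters who placed target i in category j.
    Every target must be rated by the same number of raters (at least two).
    """
    n_targets = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each target needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: proportion of agreeing rater pairs for each target
    pairs = n_raters * (n_raters - 1)
    p_i = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]
    p_bar = sum(p_i) / n_targets

    # Chance agreement: based on the overall proportion of ratings per category
    total = n_targets * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        raise ValueError("kappa is undefined when only one category is ever used")

    # Agreement over and above chance, scaled by the maximum possible excess
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 4 targets, 3 raters, 3 categories (illustration only)
toy = [[3, 0, 0], [0, 0, 3], [2, 1, 0], [0, 1, 2]]
print(round(fleiss_kappa(toy), 3))  # 0.467
```

In the toy table the raw proportion of agreement is 2/3, but 3/8 of that would be expected by chance given how often each category was used, so kappa comes out lower at 7/15 (about .467).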

After carrying out the Reliability Analysis... procedure in the previous section, the following Overall Kappa table will be displayed in the IBM SPSS Statistics Viewer, which includes the value of Fleiss' kappa and other associated statistics:

The value of Fleiss' kappa is found under the "Kappa" column of the table, as highlighted below:

You can see that Fleiss' kappa is .557. This is the proportion of agreement over and above chance agreement. Fleiss' kappa can range from -1 to +1. A negative value for kappa (κ) indicates that agreement between the two or more raters was less than the agreement expected by chance, with -1 indicating that there was no observed agreement (i.e., the raters did not agree on anything), and 0 (zero) indicating that agreement was no better than chance. However, negative values rarely actually occur (Agresti, 2013). Alternatively, kappa values increasingly greater than 0 (zero) represent increasingly better-than-chance agreement for the two or more raters, to a maximum value of +1, which indicates perfect agreement (i.e., the raters agreed on everything).

There are no rules of thumb to assess how good our kappa value of .557 is (i.e., how strong the level of agreement is between the police officers). With that being said, the following classifications have been suggested for assessing how good the strength of agreement is when based on the value of Cohen's kappa coefficient. The guidelines below are from Altman (1999), and adapted from Landis and Koch (1977):

| Value of κ | Strength of agreement |
|---|---|
| < 0.20 | Poor |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Good |
| 0.81-1.00 | Very good |

Table: Classification of Cohen's kappa.

Using this classification scale, since Fleiss' kappa (κ)=.557, this represents a moderate strength of agreement. However, the value of kappa is heavily dependent on the marginal distributions, which are used to calculate the level (i.e., proportion) of chance agreement. As such, the value of kappa will differ depending on the marginal distributions. This is one of the greatest weaknesses of Fleiss' kappa. It means that you cannot compare one Fleiss' kappa to another unless the marginal distributions are the same.
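
The classification lookup can be sketched as a small Python function. The name `altman_strength` is hypothetical, and the handling of boundary values is an assumption: the published bands leave small gaps (e.g., a value of exactly 0.20, or 0.205), so this sketch simply treats each upper limit as inclusive.

```python
def altman_strength(kappa):
    """Map a kappa value to the Altman (1999) strength-of-agreement label.

    Assumption: each band's upper limit is treated as inclusive, because the
    published bands do not say how to classify values that fall between them.
    """
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    bands = [
        (0.20, "Poor"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Good"),
        (1.00, "Very good"),
    ]
    for upper_limit, label in bands:
        if kappa <= upper_limit:
            return label


print(altman_strength(0.557))  # Moderate
```

Such labels are only a rough guide, for the reason given above: the value of kappa depends on the marginal distributions.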

It is also good to report a 95% confidence interval for Fleiss' kappa. To do this, you need to consult the "Lower 95% Asymptotic CI Bound" and the "Upper 95% Asymptotic CI Bound" columns, as highlighted below:

You can see that the 95% confidence interval for Fleiss' kappa is .389 to .725. In other words, we can be 95% confident that the true population value of Fleiss' kappa is between .389 and .725.
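
The reported bounds are symmetric around .557 (each is .168 away), which is consistent with a normal-approximation interval of the form kappa ± z × standard error. The sketch below works under that assumption; it is not taken from the SPSS Statistics documentation, the helper names `wald_ci` and `implied_se` are hypothetical, and the standard error is backed out from the reported bounds rather than read from the output.

```python
from statistics import NormalDist


def wald_ci(estimate, std_error, level=0.95):
    """Symmetric normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate - z * std_error, estimate + z * std_error


def implied_se(lower, upper, level=0.95):
    """Standard error implied by a symmetric interval (assumes the form above)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


se = implied_se(0.389, 0.725)      # backed out from the reported bounds
lower, upper = wald_ci(0.557, se)  # reproduces the reported interval

print(f"implied SE = {se:.3f}")
print(f"95% CI = {lower:.3f} to {upper:.3f}")
```

If your output includes an asymptotic standard error, you can pass it to `wald_ci` directly and compare the result with the bounds that SPSS Statistics reports.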

We can also report whether Fleiss' kappa is statistically significant; that is, whether Fleiss' kappa is different from 0 (zero) in the population (sometimes described as being statistically significantly different from zero). These results can be found under the "Z" and "P Value" columns, as highlighted below:

You can see that the p-value is reported as .000, which means that p < .0005 (i.e., the p-value is less than .0005). If p < .05 (i.e., if the p-value is less than .05), you have a statistically significant result and your Fleiss' kappa coefficient is statistically significantly different from 0 (zero). If p > .05 (i.e., if the p-value is greater than .05), you do not have a statistically significant result and your Fleiss' kappa coefficient is not statistically significantly different from 0 (zero). In our example, the p-value is displayed as .000, which actually means p < .0005 (see the note below). Since a p-value less than .0005 is less than .05, our kappa (κ) coefficient is statistically significantly different from 0 (zero).

Note: If you see SPSS Statistics state that the "P Value" is ".000", this actually means that p < .0005; it does not mean that the significance level is actually zero. Where possible, it is preferable to state the actual p-value rather than a greater/less than p-value statement (e.g., p =.023 rather than p < .05, or p =.092 rather than p > .05). This way, you convey more information to the reader about the level of statistical significance of your result.
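
The reporting rule in the note above can be sketched as a small helper. The function name `format_p` is hypothetical; the .0005 threshold follows from the three-decimal display, since any p-value below .0005 rounds to ".000".

```python
def format_p(p):
    """Format a p-value for reporting, to three decimal places.

    Values that would display as ".000" are reported as "p < .0005" instead
    of being stated as zero; otherwise the actual value is given.
    """
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.0005:
        return "p < .0005"
    # Drop the leading zero, as is conventional for statistics bounded by 1
    return "p = " + f"{p:.3f}".lstrip("0")


print(format_p(0.00001))  # p < .0005
print(format_p(0.023))    # p = .023
print(format_p(0.092))    # p = .092
```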

However, it is important to mention that because agreement will rarely be only as good as chance agreement, the statistical significance of Fleiss' kappa is less important than reporting a 95% confidence interval.

Therefore, we know so far that there was moderate agreement between the officers' judgement, with a kappa value of .557 and a 95% confidence interval (CI) between .389 and .725. We also know that Fleiss' kappa coefficient was statistically significant. However, we can go one step further by interpreting the individual kappas.

The individual kappas are simply Fleiss' kappa calculated for each of the categories of the response variable separately against all other categories combined. In our example, the following comparisons would be made:

• A. The "Normal" behaviour category would be compared to the "Unusual, but not suspicious" behaviour category and the "Suspicious" behaviour category combined.
• B. The "Unusual, but not suspicious" behaviour category would be compared to the "Normal" behaviour category and the "Suspicious" behaviour category combined.
• C. The "Suspicious" behaviour category would be compared to the "Normal" behaviour category and the "Unusual, but not suspicious" behaviour category combined.
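
The three comparisons above can be sketched in Python. For each category, collapsing the table to "this category versus all others combined" and computing Fleiss' kappa reduces to the closed form used below (Fleiss, 1971). The function name `category_kappas` and the toy counts are assumptions for illustration only; they are not the SPSS Statistics implementation or the data from this guide.

```python
def category_kappas(counts, labels):
    """Kappa for each category against all other categories combined.

    counts[i][j] is the number of raters who placed target i in category j.
    Returns a dict mapping each label to its individual kappa.
    """
    n_targets = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each target needs the same number (>= 2) of ratings")

    kappas = {}
    for j, label in enumerate(labels):
        # Proportion of all ratings that fall in this category (and outside it)
        p = sum(row[j] for row in counts) / (n_targets * n_raters)
        q = 1 - p
        if p == 0 or q == 0:
            kappas[label] = float("nan")  # undefined: category never or always used
            continue
        # Disagreeing rater pairs when the table is collapsed to "j" vs "not j"
        disagreement = sum(row[j] * (n_raters - row[j]) for row in counts)
        kappas[label] = 1 - disagreement / (
            n_targets * n_raters * (n_raters - 1) * p * q
        )
    return kappas


labels = ["Normal", "Unusual, but not suspicious", "Suspicious"]
toy = [[3, 0, 0], [0, 0, 3], [2, 1, 0], [0, 1, 2]]  # illustration only
for label, kappa in category_kappas(toy, labels).items():
    print(f"{label}: {kappa:.3f}")
```

In this toy table the "Normal" and "Suspicious" categories each give a kappa of about .657, whereas "Unusual, but not suspicious" gives -.200, meaning that raters agreed on the middle category less often than would be expected by chance.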

We can use this information to assess police officers' level of agreement when rating each category of the response variable. For example, these individual kappas indicate that police officers are in better agreement when categorising individuals' behaviour as either normal or suspicious, but far less in agreement over who should be categorised as having unusual, but not suspicious behaviour. These individual kappa results are displayed in the Kappas for Individual Categories table, as shown below:

If you are unsure how to interpret the results in the Kappas for Individual Categories table, our enhanced guide on Fleiss' kappa in the members' section of Laerd Statistics includes a section dedicated to explaining how to interpret these individual kappas. You can access this enhanced guide by subscribing to Laerd Statistics. However, to continue with this introductory guide, go to the next section where we explain how to report the results from a Fleiss' kappa analysis.

## Reporting the results from a Fleiss' kappa analysis

When you report the results of a Fleiss' kappa analysis, it is good practice to include the following information:

• A. An introduction to the analysis you carried out, which includes: (a) the statistical test being used to analyse your data (i.e., Fleiss' kappa); (b) the raters whose level of agreement is being assessed (e.g., police officers in our example); (c) the targets who are being rated (e.g., individuals in a clothing retail store in our example); and (d) the categories of your response variable (e.g., the "Normal", "Unusual, but not suspicious", and "Suspicious" categories in our example), to highlight that the response variable is a categorical variable, as discussed in Requirement/Assumption #1.
• B. Information about your sample, including: (a) the number of non-unique raters and the population from which these were randomly selected; and (b) the number of targets and the population from which these were randomly selected. Including terms such as non-unique and randomly selected indicates to the reader that Fleiss' kappa has been used appropriately, as per Requirement/Assumption #4 and Requirement/Assumption #6 respectively.
• C. A statement to indicate how you helped to ensure independence of observations in order to reduce potential bias, as discussed in Requirement/Assumption #5.
• D. A statement to indicate that: (a) each rater was presented with the same number of categories; and (b) the categories were mutually exclusive, as per Requirement/Assumption #3 and Requirement/Assumption #2 respectively.
• E. The results from the Fleiss' kappa analysis, including: (a) the Fleiss' kappa coefficient, κ (i.e., shown under the "Kappa" column in the Overall Kappa table), together with the 95% confidence interval (CI) (i.e., shown under the "Lower 95% Asymptotic CI Bound" and the "Upper 95% Asymptotic CI Bound" columns); and (b) the p-value given for the test (i.e., shown under the "P Value" column). You can also consider including (c), the level of agreement in terms of a general guideline, such as the classifications of "poor", "fair", "moderate", "good" or "very good" agreement, suggested by Altman (1999) for the Cohen's kappa coefficient, and adapted from Landis and Koch (1977).
• F. The results from the individual kappa analysis, including: (a) the Fleiss' kappa coefficient, κ, for each category of the response variable separately against all other categories combined; and (b) a statement of the relative level of agreement between raters for each category.
• G. A table of your results, showing how the raters scored for each category of the response variable, assuming that data is anonymous and meets other relevant ethical standards of care. Providing such a table is important, where possible, because it allows others to: (a) check that you have carried out your analysis correctly; and (b) analyse your data using alternative methods of inter-rater agreement.

In the example below, we show how to report the results from your Fleiss' kappa analysis in line with five of the seven reporting guidelines above (i.e., A, B, C, D and E). If you are interested in understanding how to report your results in line with the two remaining reporting guidelines (i.e., F, in terms of individual kappas, and G, using a table), we show you how to do this in our enhanced guide on Fleiss' kappa in the members' section of Laerd Statistics. You can access this enhanced guide by subscribing to Laerd Statistics. However, if you are simply interested in reporting guidelines A to E, see the reporting example below:

• General

Fleiss' kappa was run to determine if there was agreement between police officers' judgement on whether 23 individuals in a clothing retail store were exhibiting either normal, unusual but not suspicious, or suspicious behaviour, based on a video clip showing each shopper's movement through the clothing retail store. Three non-unique police officers were chosen at random from a group of 100 police officers to rate each individual. Each police officer rated the video clip in a separate room so they could not influence the decision of the other police officers. When assessing an individual's behaviour in the clothing retail store, each police officer could select only one of the three categories: "normal", "unusual but not suspicious" or "suspicious". The 23 individuals were randomly selected from all shoppers visiting the clothing retail store during a one-week period. Fleiss' kappa showed that there was moderate agreement between the officers' judgements, κ=.557 (95% CI, .389 to .725), p < .0005.

Note: When you report your results, you may not always include all seven reporting guidelines mentioned above (i.e., A, B, C, D, E, F and G) in the "Results" section, whether this is for an assignment, dissertation/thesis or journal/clinical publication. Some of the seven reporting guidelines may be included in the "Results" section, whilst others may be included in the "Methods/Study Design" section. However, we would recommend that all seven are included in at least one of these sections.