Creating dummy variables in SPSS Statistics
Introduction
If you are analysing your data using multiple regression and any of your independent variables were measured on a nominal or ordinal scale, you need to know how to create dummy variables and interpret their results. This is because nominal and ordinal independent variables, more broadly known as categorical independent variables, cannot be directly entered into a multiple regression analysis. Instead, they need to be converted into dummy variables. The exception is ordinal independent variables that are entered into a multiple regression as continuous independent variables, which do not need to be converted into dummy variables. Therefore, in this guide we show you how to create dummy variables when you have categorical independent variables.
First, we set out the example we use to show how to create dummy variables in SPSS Statistics, before explaining how to set up your data in the Variable View and Data View windows of SPSS Statistics so that you can create dummy variables. If you are unfamiliar with the use of dummy variables, we recommend that you then read about some of the basic principles of dummy variables and dummy coding, including: (a) the number of dummy variables you need to create in your analysis; and (b) how to create dummy variables and dummy coding. In the Procedure section that follows, we set out the simple, 3-step Create Dummy Variables procedure in SPSS Statistics that can be used to create dummy variables. Finally, we explain the SPSS Statistics output after running the Create Dummy Variables procedure, including how your dummy variables will now be set up in the Variable View and Data View windows of SPSS Statistics.
Note 1: The data setup and procedure that follow are identical for SPSS Statistics versions 22 to 28, as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version, SPSS Statistics introduced a new look to their interface called "SPSS Light", replacing the previous look for versions 26 and earlier versions, which was called "SPSS Standard". Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the data setup and procedure are identical.
Note 2: If you find that the procedures in this guide do not cover the type of dummy variables you want to create, please contact us. We may be able to add another guide to the site to help.
SPSS Statistics
Example used in this guide
In this guide we will be using the example of 10 triathletes who were asked to select their favourite sport from the three sports they perform when doing a triathlon: swimming, cycling and running. Their answers were recorded in the nominal independent variable, favourite_sport, which has three categories: "swimming", "cycling" and "running". This nominal independent variable, favourite_sport, was to be included in a multiple regression analysis that also had a number of continuous independent variables. Since this independent variable was categorical (i.e., nominal variables and ordinal variables can be broadly classified as categorical variables), dummy variables had to be created before it could be entered into the multiple regression analysis.
Important: Notice that favourite_sport is a nominal variable, but you can also create dummy variables for an ordinal variable. Furthermore, the process for creating dummy variables is the same irrespective of whether you have an ordinal or nominal variable, with the exception of one small change you have to make when setting up your data, which is explained below.
Note 1: The "categories" of a categorical independent variable are also referred to as "groups" or "levels", but the term "levels" is usually reserved for categories that have an order (e.g., the ordinal independent variable, "fitness level", could have three levels: "low", "moderate" and "high"). However, these three terms – "categories", "groups" and "levels" – can be used interchangeably. In this guide, we will refer to them as categories, but you could refer to them as groups or levels if you prefer.
Note 2: The term "factors" is sometimes used instead of "categorical independent variables" (i.e., independent variables that are "ordinal" or "nominal"). However, these two terms – "categorical independent variables" and "factors" – can be used interchangeably. In this guide, we will refer to them as categorical independent variables and you will also see SPSS Statistics refer to them as independent variables rather than factors in its multiple regression procedure. However, you can refer to them as factors if you prefer.
Setting up your data in SPSS Statistics
When creating dummy variables, you will start with a single categorical independent variable (e.g., favourite_sport). To set up this categorical independent variable, SPSS Statistics has a Variable View where you define the types of variable you are analysing and a Data View where you enter your data for this variable. In this section, we first show you how to set up a categorical independent variable in the Variable View window of SPSS Statistics, before showing you how to enter your data into the Data View window. We do this using our categorical independent variable, favourite_sport, which has three categories: "swimming", "cycling" and "running".
The Variable View in SPSS Statistics
For a single categorical independent variable (e.g., favourite_sport), your Variable View window will look like the one below:
Note: You can access the Variable View window in SPSS Statistics by clicking on the Variable View tab in the bottom left-hand corner of the SPSS Statistics software.
Published with written permission from SPSS Statistics, IBM Corporation.
The name of your categorical independent variable should be entered in the cell under the Name column (e.g., "favourite_sport" in the first row, to represent our categorical independent variable, favourite_sport). There are certain "illegal" characters that cannot be entered into this cell. Therefore, if you get an error message and you would like us to add an SPSS Statistics guide to explain what these illegal characters are, please contact us.
Note: For your own clarity, you can also provide a label for your variables in the Label column. For example, the label we entered for "favourite_sport" was "Triathlete's favourite sport".
The cell under the Values column should contain the information about the categories of your categorical independent variable (e.g., "swimming", "cycling" and "running" for favourite_sport). To enter this information, click into the cell under the Values column for your independent variable. A button (shown as an ellipsis, "...") will appear in the cell. Click on this button and the Value Labels dialogue box will appear. You now need to give each category of your independent variable a "value", which you enter into the Value: box (e.g., "1"), as well as a "label", which you enter into the Label: box (e.g., "swimming"). By clicking the Add button, the coding will appear in the main box (e.g., 1.00 = "swimming" for favourite_sport). The setup for our categorical independent variable is shown in the Value Labels dialogue box below:
The cell under the Measure column should show Nominal if you have a nominal independent variable (e.g., favourite_sport, as in our example) or Ordinal if you have an ordinal independent variable (e.g., imagine an ordinal variable such as "Body Mass Index" (BMI), which has four levels: "Underweight", "Healthy/Normal Weight", "Overweight" and "Obese"). Finally, the cell under the Role column should show Input or None (see the note below).
Note: We suggest changing the cell under the Role column from Input to None, but you do not have to make this change. We suggest that you do because there are certain analyses in SPSS Statistics where the Input setting results in your variables being automatically transferred into certain fields of the dialogue boxes you are using. Since you may not want to transfer these variables, we suggest changing the setting to None so that this does not happen automatically.
You have now successfully entered all the information that SPSS Statistics needs to know about your categorical independent variable into the Variable View window. In the next section, we show you how to enter your data into the Data View window.
The Data View in SPSS Statistics
Based on the file setup for your categorical independent variable in the Variable View window above, the Data View window should look as follows:
Note: You can access the Data View window in SPSS Statistics by clicking on the Data View tab in the bottom left-hand corner of the SPSS Statistics software.
Your categorical independent variable will be displayed in the first column since this was the order we entered the variable into the Variable View window. In our example, the responses of the 10 triathletes are presented under the favourite_sport column. Now, you simply have to enter your data into the cells under this first column. Remember that each row represents one case (e.g., a case could be a single participant). Therefore, in the first row of our example, the first case represents a triathlete whose favourite sport was "swimming". Since these cells will initially be empty, you need to click into the cells to enter your data. You will notice that when you click into the cells under the favourite_sport column, SPSS Statistics will give you a drop-down option with your categories already populated.
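The Variable View and Data View setup described above can be mirrored outside SPSS Statistics as a small Python sketch. The names `value_labels` and `favourite_sport`, and the ten responses themselves, are hypothetical illustrations (the guide only states that the first triathlete chose swimming). The point is simply that the data are stored as numeric codes, and the value labels translate those codes into category names.

```python
# Value labels as defined in the Value Labels dialogue box: code -> label.
value_labels = {1: "swimming", 2: "cycling", 3: "running"}

# Hypothetical coded responses for 10 triathletes (one case per row).
# Only the first value (swimming) comes from the guide; the rest are made up.
favourite_sport = [1, 3, 2, 1, 2, 3, 3, 1, 2, 3]

# Translate each stored code into its label.
labelled = [value_labels[code] for code in favourite_sport]

for row, (code, label) in enumerate(zip(favourite_sport, labelled), start=1):
    print(f"row {row}: {code} = {label}")
```

Keeping the codes and the labels separate, as SPSS Statistics does, means a label can be reworded later without touching the data.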
Now that you have set up your data in the Variable View and Data View windows of SPSS Statistics, we recommend reading the next section: Understanding dummy variables and dummy coding, where we explain the basic principles of dummy variables and dummy coding. However, if you are already familiar with the fundamentals of dummy variables and dummy coding, you can skip this section and go straight to the Procedure section where we set out the Create Dummy Variables procedure in SPSS Statistics that is used to create dummy variables.
Understanding dummy variables and dummy coding
As we mentioned in the Introduction, if you are analysing your data using multiple regression and any of your independent variables were measured on a nominal or ordinal scale, you need to know how to create dummy variables and interpret their results. This is because categorical independent variables (i.e., nominal and ordinal independent variables) cannot be directly entered into a multiple regression. Instead, they need to be converted into dummy variables. The exception is ordinal independent variables that are entered into a multiple regression as continuous independent variables, which do not need to be converted into dummy variables. In the sections below, we explain: (a) the number of dummy variables you need to create; and (b) how to create dummy variables and dummy coding.
The number of dummy variables you need to create
The number of dummy variables you need to create will depend on how many categories your categorical independent variable has. As a general rule, you will create one less dummy variable than the number of categories in your categorical independent variable. For example, if you have a categorical independent variable with three categories (e.g., favourite_sport, with the following three categories: "swimming", "cycling" and "running"), you will create two dummy variables and select one category to act as a reference category (e.g., "swimming" and "cycling" become dummy variables and "running" becomes the reference category). We explain more about reference categories after the following table, which provides some examples of categorical independent variables and the number of dummy variables that need to be created:
# | Name of the categorical independent variable | Type of variable | Number of categories | Number of dummy variables | Reference category
---|---|---|---|---|---
1 | Gender | Nominal | Two (Males & Females) | One (Males) | Females
2 | Height | Ordinal | Two (Under 180cm & 180cm and above) | One (Under 180cm) | 180cm and above
3 | Ethnicity | Nominal | Three (African American, Caucasian & Hispanic) | Two (African American & Caucasian) | Hispanic
4 | Physical activity level | Ordinal | Three (Low, Moderate & High) | Two (Low & Moderate) | High
5 | Profession | Nominal | Four (Surgeon, Doctor, Nurse & Therapist) | Three (Surgeon, Doctor & Nurse) | Therapist
6 | Level of agreement | Ordinal | Four (Strongly agree, Agree, Disagree & Strongly disagree) | Three (Strongly agree, Agree & Disagree) | Strongly disagree
7 | Subject area | Nominal | Five (Business studies, Psychology, Biological sciences, Engineering & Law) | Four (Business studies, Psychology, Biological sciences & Engineering) | Law
8 | Age | Ordinal | Five (Under 18, 19-30, 31-40, 41-50 & 51-60) | Four (Under 18, 19-30, 31-40 & 41-50) | 51-60
Table: Examples of categorical independent variables and their respective dummy variables
As shown in the table above, you only need to create one less dummy variable than the number of categories in your categorical independent variable. This is because you only need to (and should) transfer this number of dummy variables into a multiple regression when you have a categorical independent variable. However, there are good reasons to create a dummy variable for every category of the categorical independent variable: (a) it is more flexible and (b) it allows multiple comparisons to be made (see the note below). In other words, if your categorical independent variable has three categories you would create three dummy variables, not just two.
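The counting rule above can be written as a short Python sketch. The function name `dummy_counts` is a hypothetical helper introduced only for illustration; the category lists are taken from the table and from the favourite_sport example.

```python
def dummy_counts(categories):
    """Return (dummies you may create, dummies to enter into the regression).

    One dummy variable can be created per category, but only one fewer
    than the number of categories is entered into a multiple regression.
    """
    k = len(categories)
    if k < 2:
        raise ValueError("a categorical variable needs at least two categories")
    return k, k - 1


examples = {
    "Gender": ["Males", "Females"],
    "Ethnicity": ["African American", "Caucasian", "Hispanic"],
    "Profession": ["Surgeon", "Doctor", "Nurse", "Therapist"],
    "favourite_sport": ["swimming", "cycling", "running"],
}

for name, categories in examples.items():
    created, entered = dummy_counts(categories)
    print(f"{name}: {len(categories)} categories, create {created}, enter {entered}")
```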
Fortunately, the Create Dummy Variables procedure in SPSS Statistics versions 22 to 28 (and the subscription version of SPSS Statistics) automatically creates a dummy variable for every category of your categorical independent variable. However, this is not the case for the Recode into Different Variables procedure in SPSS Statistics version 21 or earlier versions of SPSS Statistics. Therefore, under normal circumstances, you will have created the following setup in SPSS Statistics, depending on whether you have version 21 or earlier or version 22 and above:
Note: As mentioned above, creating a dummy variable for every category of the categorical independent variable is beneficial for two reasons: (a) it is more flexible and (b) it allows multiple comparisons to be made. We briefly touch on these benefits below:
It is more flexible:
When you have created a dummy variable for every category of your categorical independent variable, you can treat any category as the reference category. In our example, we chose the "running" category as the reference category, which means we would transfer the "swimming" and "cycling" dummy variables into the multiple regression equation. If we had created only these two dummy variables and later changed our mind about the reference category, we would have to run the dummy variable procedure again (this is not a problem in SPSS Statistics version 22 or above, where a dummy variable is created for every category). For example, let's assume we now wanted to use the "cycling" category as the reference category. We could simply transfer the "swimming" and "running" dummy variables into the multiple regression equation because we also have the "running" dummy variable.
It allows multiple comparisons to be made:
The coefficient of a dummy variable represents the difference between the category that dummy variable represents and the reference category. For example, with "running" as the reference category, the coefficient of the "swimming" dummy variable represents the difference in the dependent variable between the "swimming" and "running" categories. With a single reference category, however, not every comparison between categories is available (e.g., with "running" as the reference category, no coefficient directly compares "swimming" with "cycling"). This problem can be solved by re-running the analysis with a different reference category, which is possible if every category of the categorical variable has a dummy variable.
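To make the idea of comparisons against a reference category concrete, here is a minimal Python sketch. It assumes a model that contains only the dummy variables (no other independent variables), in which case each dummy coefficient equals the difference between that category's mean and the reference category's mean. The data, the dependent variable and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical data: (favourite sport, value of some dependent variable).
cases = [
    ("swimming", 10.0), ("swimming", 12.0),
    ("cycling", 15.0), ("cycling", 17.0),
    ("running", 20.0), ("running", 22.0),
]


def group_means(rows):
    """Mean of the dependent variable within each category."""
    groups = {}
    for category, value in rows:
        groups.setdefault(category, []).append(value)
    return {category: mean(values) for category, values in groups.items()}


def dummy_coefficients(rows, reference):
    """Difference between each category mean and the reference mean.

    Assumption: the regression contains only the dummy variables, so these
    differences are what the dummy coefficients estimate.
    """
    means = group_means(rows)
    return {c: m - means[reference] for c, m in means.items() if c != reference}


print(dummy_coefficients(cases, "running"))  # swimming and cycling vs running
print(dummy_coefficients(cases, "cycling"))  # swimming and running vs cycling
```

With "running" as the reference category, the swimming versus cycling comparison is not reported directly; switching the reference category to "cycling" provides it.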
How to create dummy variables and dummy coding
There are two steps to successfully set up dummy variables in a multiple regression: (1) create dummy variables that represent the categories of your categorical independent variable; and (2) enter values into these dummy variables – known as dummy coding – to represent the categories of the categorical independent variable. We explain this process below using the example we set out above.
Explanation: Dummy variables are simply new variables that act as "placeholders" for a particular coding scheme. They do not contain any data at all, per se. Instead, data/values need to be added to these dummy variables so that they can fulfil their purpose of representing the categories of your categorical independent variable. There are many different types of coding scheme that will dictate the values that are entered into dummy variables, but we use a very common coding scheme called dummy coding or, alternatively, indicator coding (N.B., do not get confused because dummy variables and dummy coding are not the same thing). Dummy coding works by using each dummy variable to identify a specific category of a categorical independent variable with the exception of a reference category, which we explain below.
Let's start by considering our example categorical independent variable, favourite_sport, which has three categories: "swimming", "cycling" and "running". Since there are three categories, there need to be two dummy variables representing two of the categories, and a reference category representing the third category.
Note: Remember from the discussion above that a multiple regression requires you to transfer one less dummy variable than the number of categories in your categorical independent variable (i.e., two in our example). However, you can create a dummy variable for every category of the categorical independent variable for the purposes of greater flexibility and the ability to make multiple comparisons. Nonetheless, in the discussion below we only highlight what is required for a multiple regression; that is, the creation of one less dummy variable than the number of categories in your categorical independent variable with the category that is not directly represented becoming the "reference category".
For example, let dummy variable #1 represent the "swimming" category and dummy variable #2 represent the "cycling" category. This leaves no dummy variable for the "running" category. This "missing" category is the reference category, and it does not need a dummy variable of its own. Furthermore, it is entirely your decision which category you want to use as the reference category. We could have just as easily chosen the "swimming" category as the reference category rather than the "running" category. The only reason we didn't is that by default SPSS Statistics uses the last category you have coded in the Variable View for your categorical independent variable as the reference category (see the note below).
Note: As explained in the Data Setup section earlier and as shown below in the Value Labels dialogue box, the third and final category of our categorical independent variable was "running" (i.e., 3="running").
There was no theoretical or statistical reason for us to make the "running" category the third and final category, which made it the reference category in SPSS Statistics by default. We simply did it this way because when triathletes take part in a triathlon, they first do the swim, then undertake a cycle, before finally running to the finish line. Therefore, it seemed logical to code our categorical independent variable this way. However, we could have coded it as 1=cycling, 2=running and 3=swimming; it would have made no difference except for the fact that as the third and final category, "swimming" would have become our reference category by default in SPSS Statistics.
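The choice of reference category can be sketched in Python as follows. The helper `split_reference` is hypothetical; when no reference category is named, it falls back on the convention described in this guide, where the category with the highest code is treated as the reference category.

```python
def split_reference(value_labels, reference=None):
    """Return (categories that get a dummy variable, reference category).

    If `reference` is None, the category with the highest code is used,
    following the convention described in this guide.
    """
    ordered = [value_labels[code] for code in sorted(value_labels)]
    if reference is None:
        reference = ordered[-1]
    if reference not in ordered:
        raise ValueError(f"unknown category: {reference}")
    return [c for c in ordered if c != reference], reference


# Coding used in this guide: running is last, so it becomes the reference.
print(split_reference({1: "swimming", 2: "cycling", 3: "running"}))

# Alternative coding from the text: swimming is last instead.
print(split_reference({1: "cycling", 2: "running", 3: "swimming"}))
```

Passing `reference="cycling"` explicitly shows that the coding order only sets a default; any category can be chosen.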
When you create dummy variables you should give them a meaningful name. Since each of our dummy variables represents a category of our categorical independent variable, it is customary to refer to each dummy variable by the name of the category it represents. Therefore, we have called dummy variable #1 "swimming" as it represents the swimming category. Similarly, we have called dummy variable #2 "cycling" as it represents the cycling category. By creating these two dummy variables, we will have two new columns in our data set in SPSS Statistics, as shown below:
Now that we have created two dummy variables and given them appropriate names, we need to enter values into these variables so that each dummy variable really does represent its category of the categorical independent variable. With dummy coding this is very simple. You enter a "1" to represent any case (e.g., a participant in your data set) that has the category and enter a "0" (zero) if they do not have the category. First, consider the "swimming" dummy variable, as shown below:
If one of the triathletes stated that "swimming" was their favourite sport, we would enter a "1" into the cell under the swimming dummy variable column for that triathlete. Alternatively, if one of the triathletes stated that "cycling" or "running" was their favourite sport, we would enter a "0" into the cell under the swimming dummy variable column, because swimming was not that triathlete's favourite sport. This is highlighted below for all 10 triathletes:
We repeat this process for the other dummy variable, "cycling", as shown below:
If one of the triathletes stated that "cycling" was their favourite sport, we would enter a "1" into the cell under the cycling dummy variable column for that triathlete. Alternatively, if one of the triathletes stated that "swimming" or "running" was their favourite sport, we would enter a "0" into the cell under the cycling dummy variable column, because cycling was not that triathlete's favourite sport. This is highlighted below for all 10 triathletes:
By entering "1"s and "0"s into your dummy variables in this manner, you will have created a set of dummy variables that you can enter into a multiple regression analysis. In the Procedure section that follows, we show you how to create these dummy variables using the Create Dummy Variables procedure.
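The dummy coding described in this section can be expressed as a minimal Python sketch. The ten responses are hypothetical (only the first, swimming, is stated in the guide), and `dummy_code` is an illustrative helper rather than anything provided by SPSS Statistics.

```python
# Hypothetical favourite sports of 10 triathletes (only the first is from the guide).
favourite_sport = [
    "swimming", "running", "cycling", "swimming", "cycling",
    "running", "running", "swimming", "cycling", "running",
]


def dummy_code(values, category):
    """Return 1 for cases in `category` and 0 for all other cases."""
    return [1 if value == category else 0 for value in values]


# Two dummy variables; "running" is the reference category and gets none.
swimming = dummy_code(favourite_sport, "swimming")
cycling = dummy_code(favourite_sport, "cycling")

print(f"{'favourite_sport':<16}{'swimming':>9}{'cycling':>9}")
for sport, s, c in zip(favourite_sport, swimming, cycling):
    print(f"{sport:<16}{s:>9}{c:>9}")
```

Cases in the reference category are identified by a "0" on both dummy variables, which is why no third column is needed for the regression.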
Procedure in SPSS Statistics to create dummy variables
There are two procedures in SPSS Statistics to create dummy variables: the Create Dummy Variables procedure and the Recode into Different Variables procedure. In this guide, we show you how to use the Create Dummy Variables procedure, which is a simple 3-step procedure. However, it is only available if you have SPSS Statistics version 22 or later, with version 28 and the subscription version of SPSS Statistics being the latest versions of SPSS Statistics. If you are unsure which version of SPSS Statistics you are using, see our guide: Identifying your version of SPSS Statistics. If you have SPSS Statistics version 21 or earlier or are interested in making multiple comparisons when carrying out your multiple regression analysis, please see the Note below:
Note: If you have SPSS Statistics version 21 or earlier, you cannot use the Create Dummy Variables procedure, but the Recode into Different Variables procedure still enables you to create dummy variables in SPSS Statistics. Whilst you can also use the Recode into Different Variables procedure to create dummy variables if you have SPSS Statistics version 22 or later, we set out the Create Dummy Variables procedure in this guide because it is dedicated to creating dummy variables and is a lot easier and quicker to use. For example, it requires just 3 steps to create dummy variables for the example used in this guide compared to 28 steps for the same example using the Recode into Different Variables procedure.
Therefore, if you have SPSS Statistics version 21 or earlier, our enhanced guide on Creating dummy variables in the members section on Laerd Statistics includes a page dedicated to showing how to carry out this 28-step Recode into Different Variables procedure. You can access this enhanced guide by subscribing to Laerd Statistics. Alternatively, if you have SPSS Statistics version 22 or later, you can simply use the Create Dummy Variables procedure below.
To create dummy variables when you have SPSS Statistics version 22 or later, follow the 3-step Create Dummy Variables procedure below:
- Click Transform > Create Dummy Variables on the main menu, as shown below:
You will be presented with the Create Dummy Variables dialogue box, as shown below:
- Transfer the categorical independent variable, favourite_sport, into the Create Dummy Variables for: box by selecting it (by clicking on it) and then clicking on the arrow button. Also, enter a "root" name that can represent all of the new dummy variables into the Root Names (One Per Selected Variable): box in the –Main Effect Dummy Variables– area. We entered the root name "fs" as an abbreviation for our categorical independent variable, "favourite_sport", as shown below:
Note: SPSS Statistics will add a sequential number (i.e., 1, 2, 3, 4, etc.) onto the end of the root name you choose to represent your categorical independent variable. A sequential number will be created for each of the dummy variables you want to create (e.g., if you have two dummy variables, a 1 and 2 will be added onto the end of the root name, but if you had six dummy variables, a 1, 2, 3, 4, 5 and 6 would be added onto the end of the root name). This is shown for our example in the Variable View window below:
Since our categorical independent variable, favourite_sport, had three categories (i.e., swimming, cycling and running), the Create Dummy Variables procedure creates three dummy variables (i.e., one for swimming, one for cycling and one for running). These three dummy variables are highlighted in the Name column above: "fs_1" (for swimming), "fs_2" (for cycling) and "fs_3" (for running). You can rename these later so that they make more sense. We are just highlighting this so that you know how the Root Names (One Per Selected Variable): box above works.
Also, the root name you enter into the Root Names (One Per Selected Variable): box cannot be the same as the name of your categorical independent variable, as shown below (i.e., where we have entered the root name, "favourite_sport", to illustrate what we could not call our root name):
If the root name you enter is the same as the name of your categorical independent variable, as shown above, when you click on the OK button, you will get the following warning:
- Click on the OK button.
After carrying out the 3-step Create Dummy Variables procedure above you will have created dummy variables for your categorical independent variable. In the next section, we highlight the output that is created in the Variable View and Data View of SPSS Statistics after running this Create Dummy Variables procedure.
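The result of the procedure above can be approximated in Python to show how the root name, the sequential numbering and the variable labels fit together. This is only a sketch of the behaviour described in this guide, not SPSS Statistics' own implementation; the function `create_dummy_variables` and the ten coded responses are hypothetical.

```python
def create_dummy_variables(variable, values, value_labels, root):
    """Build one dummy variable per category, named root_1, root_2, ...

    Mirrors the behaviour described in this guide: the root name must
    differ from the variable name, and each dummy variable is labelled
    "variable=category".
    """
    if root == variable:
        raise ValueError("the root name cannot be the same as the variable name")
    dummies = {}
    for number, code in enumerate(sorted(value_labels), start=1):
        dummies[f"{root}_{number}"] = {
            "label": f"{variable}={value_labels[code]}",
            "codes": [1 if value == code else 0 for value in values],
        }
    return dummies


value_labels = {1: "swimming", 2: "cycling", 3: "running"}
favourite_sport = [1, 3, 2, 1, 2, 3, 3, 1, 2, 3]  # hypothetical responses

result = create_dummy_variables("favourite_sport", favourite_sport, value_labels, root="fs")
for name, info in result.items():
    print(name, info["label"], info["codes"])
```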
Output and data setup in SPSS Statistics after creating dummy variables
After creating your dummy variables, SPSS Statistics produces the following Variable Creation table in its IBM SPSS Statistics Viewer:
The Variable Creation table confirms that you have successfully created dummy variables. There should be as many rows as there are new dummy variables. Since we created three dummy variables, there are three rows in the table, "fs_1", "fs_2" and "fs_3", which reflect the root name and sequential numbering entered in Step 2 of the Create Dummy Variables procedure in the previous section. For each of these dummy variables, a label is provided in the table to make it clear which category of the categorical independent variable each dummy variable represents. For example, the label, "favourite_sport=swimming", is provided for "fs_1", indicating that "fs_1" is the dummy variable for the "swimming" category of the categorical independent variable, favourite_sport.
Next, go to the Variable View window of SPSS Statistics by clicking on the Variable View tab. The three dummy variables will have been added, as shown below (i.e., the dummy variables, "fs_1", "fs_2" and "fs_3", in the Name column):
Note: You can change the names of the dummy variables in the Name column to make it clearer what these are. For example, we have changed "fs_1" to "swimming", "fs_2" to "cycling" and "fs_3" to "running", as shown below:
Finally, go to the Data View window of SPSS Statistics by clicking on the Data View tab. The dummy coding is shown under each of the dummy variables that have been created. For example, in the rows under the "fs_1" column, the category, "swimming", is coded as "1.00", whereas the categories, "cycling" and "running", are coded as ".00", as shown below. If you are unsure why these dummy variables are dummy coded in this way, see the section: Understanding dummy variables and dummy coding.
Note 1: Due to the default settings of SPSS Statistics, your dummy variables will be coded "1.00" or ".00" instead of "1" or "0", respectively. They are identical. However, you will often see dummy coding written in terms of 1's and 0's rather than including decimals.
Note 2: If you changed the names of the dummy variables in the Name column of the Variable View window above, these will also have been changed in the columns of the Data View window, as shown below (e.g., the "fs_1" column heading is now entitled "swimming"):
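As a final check on the dummy coding shown in the Data View, the short Python sketch below verifies two things: every value is a 1 or a 0 (with "1.00" and ".00" treated as identical to "1" and "0"), and every case has exactly one "1" across the full set of dummy variables. The helper `check_dummy_coding` and the four example cases are hypothetical.

```python
def check_dummy_coding(columns):
    """Return True if every value is 0 or 1 and each case has exactly one 1."""
    rows = list(zip(*columns.values()))
    only_zero_one = all(value in (0, 1) for row in rows for value in row)
    one_per_case = all(sum(row) == 1 for row in rows)
    return only_zero_one and one_per_case


# Hypothetical Data View contents for four cases, written with decimals
# as SPSS Statistics displays them by default.
dummies = {
    "swimming": [1.00, 0.00, 0.00, 1.00],
    "cycling": [0.00, 0.00, 1.00, 0.00],
    "running": [0.00, 1.00, 0.00, 0.00],
}

print(check_dummy_coding(dummies))
print(1.00 == 1, 0.00 == 0)
```

Because the full set of dummy variables always sums to 1 for each case, the set is redundant alongside the regression constant, which is one way to see why only one fewer than the number of categories is entered into the multiple regression.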