Pearson's Correlation using Stata

Introduction

The Pearson product-moment correlation coefficient, often shortened to Pearson correlation or Pearson's correlation, is a measure of the strength and direction of the linear association between two continuous variables. The Pearson correlation generates a coefficient called the Pearson correlation coefficient, denoted as r. A Pearson's correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away the data points are from this line of best fit (i.e., how well the data points fit this line). Its value can range from -1 for a perfect negative linear relationship to +1 for a perfect positive linear relationship. A value of 0 (zero) indicates no linear relationship between the two variables.

For example, you could use a Pearson's correlation to understand whether there is an association between exam performance and time spent revising (i.e., your two variables would be "exam performance", measured from 0-100 marks, and "revision time", measured in hours). If there was a moderate, positive association, we could say that more time spent revising was associated with better exam performance. Alternatively, you could use a Pearson's correlation to understand whether there is an association between length of unemployment and happiness (i.e., your two variables would be "length of unemployment", measured in days, and "happiness", measured using a continuous scale). If there was a strong, negative association, we could say that the longer the length of unemployment, the greater the unhappiness.

In this guide, we show you how to carry out a Pearson's correlation using Stata, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for a Pearson's correlation to give you a valid result. We discuss these assumptions next.

Assumptions

There are four "assumptions" that underpin a Pearson's correlation. If any of these four assumptions are not met, analysing your data using a Pearson's correlation might not lead to a valid result. Since assumption #1 relates to your choice of variables, it cannot be tested for using Stata. However, you should decide whether your study meets this assumption before moving on.

Fortunately, you can check assumptions #2, #3 and #4 using Stata. We suggest testing them in this order because, if a violation of an assumption cannot be corrected, you will no longer be able to use a Pearson's correlation. Do not be surprised if your data fails one or more of these assumptions; this is fairly typical when working with real-world data rather than textbook examples, which often only show you how to carry out a Pearson's correlation when everything goes well. However, even when your data fails certain assumptions, there is often a solution (e.g., transforming your data or using another statistical test instead). Just remember that if you do not check that your data meets these assumptions, or you do not test for them correctly, the results you get when running a Pearson's correlation might not be valid.

In practice, checking for assumptions #2, #3 and #4 will probably take up most of your time when carrying out a Pearson's correlation. However, it is not a difficult task, and Stata provides all the tools you need to do this.

In the section, Test Procedure in Stata, we illustrate the Stata procedure required to perform a Pearson's correlation assuming that no assumptions have been violated. First, we set out the example we use to explain the Pearson's correlation procedure in Stata.

Example

Studies show that exercising can help prevent heart disease. Within reasonable limits, the more you exercise, the lower your risk of suffering from heart disease. One way in which exercise reduces this risk is by reducing a fat in your blood called cholesterol. The more you exercise, the lower your cholesterol concentration. Furthermore, it has recently been shown that the amount of time you spend watching TV – an indicator of a sedentary lifestyle – might be a good predictor of heart disease (i.e., the more TV you watch, the greater your risk of heart disease).

Therefore, a researcher decided to determine if cholesterol concentration was related to time spent watching TV in otherwise healthy 45- to 65-year-old men (an at-risk category of people). For example, as people spent more time watching TV, did their cholesterol concentration also increase (a positive relationship), or did the opposite happen?

To carry out the analysis, the researcher recruited 100 healthy male participants between the ages of 45 and 65 years old. The amount of time spent watching TV (i.e., the variable, time_tv) and cholesterol concentration (i.e., the variable, cholesterol) were recorded for all 100 participants. Expressed in variable terms, the researcher wanted to correlate cholesterol and time_tv.

Note: The example and data used for this guide are fictitious. We have just created them for the purposes of this guide.

Setup in Stata

In Stata, we created two variables: (1) time_tv, which is the average daily time spent watching TV in minutes; and (2) cholesterol, which is the cholesterol concentration in mmol/L.

Note: It does not matter which variable you create first.

After creating these two variables – time_tv and cholesterol – we entered the scores for each into the two columns of the Data Editor (Edit) spreadsheet: the average daily time in minutes that participants watched TV in the left-hand column (i.e., time_tv), and participants' cholesterol concentration in mmol/L in the right-hand column (i.e., cholesterol), as shown below:

Data editor for the Pearson's correlation in Stata

Published with written permission from StataCorp LP.
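
Before running the analysis, it can help to confirm that both variables were entered as intended. The following is a minimal sketch, assuming the dataset is already open in Stata and the variables are named time_tv and cholesterol as in this guide:

* show the storage type and label of each variable
describe time_tv cholesterol
* show the number of observations, mean, standard deviation, minimum and maximum
summarize time_tv cholesterol
* print the first five rows as a quick visual check
list time_tv cholesterol in 1/5

In this example, summarize should report 100 observations for each variable. A smaller number would suggest missing values that are worth investigating before you continue.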

Test Procedure in Stata

In this section, we show you how to analyse your data using a Pearson's correlation in Stata when the four assumptions in the previous section, Assumptions, have not been violated. You can carry out a Pearson's correlation using code or Stata's graphical user interface (GUI). After you have carried out your analysis, we show you how to interpret your results. First, choose whether you want to use code or Stata's graphical user interface (GUI).


Code

The basic code to run a Pearson's correlation takes the form:

pwcorr VariableA VariableB

However, if you also want Stata to produce a p-value (i.e., the statistical significance level of your result), you need to add sig to the end of the code, as shown below:

pwcorr VariableA VariableB, sig

If you also want Stata to flag whether your result is statistically significant at a particular level (e.g., p < .05), add the star option with that level in parentheses (e.g., star(.05) for p < .05 or star(.01) for p < .01) after sig. Stata then places a star next to the correlation coefficient if the result is statistically significant at this level. The code would take the form:

pwcorr VariableA VariableB, sig star(.05)

Finally, if you want Stata to display the number of observations (i.e., your sample size, N), you can do this by adding obs to the end of the code, as shown below:

pwcorr VariableA VariableB, sig star(.05) obs

Whatever code you choose to include should be entered into the box below:

Command box in Stata

Using our example where one variable is cholesterol and the other variable is time_tv, the required code would be one of the following:

pwcorr cholesterol time_tv

pwcorr cholesterol time_tv, sig

pwcorr cholesterol time_tv, sig star(.05)

pwcorr cholesterol time_tv, sig star(.05) obs

Since we wanted to include (a) the correlation coefficient, (b) the p-value, (c) the sample size (i.e., the number of observations) and (d) a star indicating whether our result was statistically significant at the .05 level, we entered the code, pwcorr cholesterol time_tv, sig star(.05) obs, and pressed the "Return/Enter" button on our keyboard, as shown below:

Command box for the Pearson's correlation in Stata

The Stata output that this produces is shown in the section, Output of a Pearson's correlation in Stata.
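
The pwcorr command also leaves its results in memory, which is convenient if you want to reuse the coefficient rather than retype it. The following is a minimal sketch, assuming the pwcorr command for our example has just been run; r(rho) and r(N) are the stored results that Stata documents for pwcorr:

* list everything pwcorr stored after the last run
return list
* show the correlation coefficient to three decimal places
display "r = " %5.3f r(rho)
* show the number of observations used
display "N = " r(N)

These stored results are overwritten by the next command that stores its own results, so use them immediately after running pwcorr.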


Graphical User Interface (GUI)

The three steps required to carry out a Pearson's correlation in Stata 12 and 13 are shown below:

Output of a Pearson's correlation in Stata

If your data passed assumption #2 (i.e., there was a linear relationship between your two variables), assumption #3 (i.e., there were no significant outliers) and assumption #4 (i.e., your two variables were approximately normally distributed), which we explained earlier in the Assumptions section, you will only need to interpret the following Pearson's correlation output in Stata:

Output for a Pearson's correlation in Stata (including statistical significance and observations)

The output contains three important pieces of information: (1) the Pearson correlation coefficient; (2) the level of statistical significance; and (3) the sample size. These three pieces of information are explained in more detail below:

Note: We present the output from the Pearson's correlation above. However, since you should have tested your data for the assumptions we explained earlier in the Assumptions section, you will also need to interpret the Stata output that was produced when you tested for these assumptions. This includes: (a) the scatterplots you used to check if there was a linear relationship between your two variables (i.e., Assumption #2); (b) the same scatterplots that you will have used to check there were no significant outliers (i.e., Assumption #3); and (c) the Shapiro-Wilk test of normality to check whether your two variables were approximately normally distributed (i.e., Assumption #4). Also, remember that if your data failed any of these assumptions, the output that you get from the Pearson's correlation procedure (i.e., the output we discuss above) will no longer be relevant, and you may have to carry out a different statistical test to analyse your data.
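
The checks listed in this note can be sketched in code. This assumes the variable names from this guide; scatter and swilk are standard Stata commands, but how you judge linearity and outliers from the plot is a matter of inspection rather than something the code decides for you:

* Assumptions #2 and #3: scatterplot to inspect linearity and outliers
scatter cholesterol time_tv
* Assumption #4: Shapiro-Wilk test of normality for each variable
swilk cholesterol time_tv

In the swilk output, a p-value (shown as Prob>z) below your chosen significance level suggests that the variable departs from a normal distribution.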

Reporting the output of a Pearson's correlation

When you report the output of your Pearson's correlation, it is good practice to include:

Based on the results above, we could report the results of this study as follows:

A Pearson's product-moment correlation was run to assess the relationship between cholesterol concentration and daily time spent watching TV in 100 males aged 45 to 65 years. There was a moderate positive correlation between daily time spent watching TV and cholesterol concentration, r(98) = .371, p < .0005, with time spent watching TV explaining 14% of the variation in cholesterol concentration.
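
The figures in this report follow from the correlation coefficient and the sample size. As a sketch of the arithmetic, using the values reported above (r = .371, N = 100) and Stata's display command as a calculator:

* degrees of freedom reported in r(98): N - 2
display 100 - 2
* proportion of variance explained (r squared), about .138, reported as 14%
display .371^2
* t statistic for r with N - 2 degrees of freedom
display .371 * sqrt(98 / (1 - .371^2))
* two-tailed p-value for that t statistic
display 2 * ttail(98, .371 * sqrt(98 / (1 - .371^2)))

The last line should return a value below .0005, consistent with the p-value reported above. The t statistic here is the standard test of a correlation coefficient against zero, and the ttail function gives the upper-tail probability of the t distribution.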

In addition to reporting the results as above, a diagram can be used to visually present your results. For example, you could do this using a scatterplot. This can make it easier for others to understand your results and is easily produced in Stata.
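
One way to produce such a scatterplot, offered as a sketch rather than the only approach, is to combine a scatter layer with an lfit (linear fit) layer in a single twoway command. The axis titles are our own wording, based on the variable definitions in this guide:

twoway (scatter cholesterol time_tv) (lfit cholesterol time_tv), ytitle("Cholesterol concentration (mmol/L)") xtitle("Average daily time spent watching TV (minutes)") legend(off)

The command is written on one line so that it can be pasted directly into the command box.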