Independent-samples t-test using R, Excel and RStudio

Introduction

The independent-samples t-test, also known as the independent t-test, independent-measures t-test, between-subjects t-test or unpaired t-test, is used to determine whether there is a difference between two independent, unrelated groups (e.g., employed versus unemployed people, males versus females, low versus high anxiety students, etc.) in terms of the mean of a continuous dependent variable (e.g., salary, running speed, exam score, etc.). More specifically, the independent-samples t-test is used to determine whether the mean difference between these two groups is statistically significant.

For example, you could use an independent-samples t-test to understand whether the amount of time teenagers spend watching television each week differs based on gender (i.e., the dependent variable is "weekly TV time", measured in minutes, and the independent variable is "gender", which has two groups: "males" and "females"). Alternatively, you could use an independent-samples t-test to understand whether there is a difference in 10 km running performance between athletes consuming a carbohydrate drink compared to athletes consuming water (i.e., the dependent variable is "10 km running performance", measured in minutes and seconds, and the independent variable is "type of drink", which has two groups: "carbohydrate drink" and "water").

In this introductory guide to the independent-samples t-test, we first set out a couple of study designs where the independent-samples t-test is most often used. Next, we set out the assumptions of the independent-samples t-test. Making sure that your study design, variables and data pass these assumptions is critical because if they do not, the independent-samples t-test is likely to be the incorrect statistical test to use. On page 2 of this introductory guide we set out the example we use to illustrate how to carry out an independent-samples t-test using R, before showing how to set up your data using Microsoft Excel, R and RStudio. On page 3 we demonstrate the R code that can be used in RStudio to carry out an independent-samples t-test, including useful descriptive statistics. Finally, on page 4 of this introductory guide we explain how to interpret the main results of the independent-samples t-test where you will determine whether there is a statistically significant difference between your two independent, unrelated groups in terms of the mean of your dependent variable. To continue with this introductory guide, go to the next section.

Study Designs

An independent-samples t-test is most often used to analyse the results of three different types of study design: (a) determining if there is a mean difference between two independent groups; (b) determining if there is a mean difference between two interventions; and (c) determining if there is a mean difference between two change scores (also known as gain scores). To learn more about the first two of these three types of study design where the independent-samples t-test can be used, see the examples below:

Note: Whilst an independent-samples t-test can be used to determine if there is a mean difference between two change scores, a one-way ANCOVA is more commonly recommended.

Difference between two INDEPENDENT GROUPS

Some degree courses include mandatory 1-year internships (also known as placements), which are considered to help students’ job prospects after graduating. Therefore, imagine that a researcher wanted to determine whether students who enrolled in a 3-year degree course that included a mandatory 1-year internship got better graduate salaries than students who did not undertake an internship. The researcher was specifically interested in students who undertook a Finance degree.

A total of 60 first-year graduates who had undertaken a Finance degree were recruited to the study. Of these 60 graduates, 30 had undertaken a 3-year Finance degree that included a mandatory 1-year internship. This group of 30 graduates represented the "internship group". The other 30 had undertaken a 3-year Finance degree that did not include an internship. This group of 30 graduates represented the "no internship group". The first-year graduate salaries of all 60 graduates were recorded in US dollars.

Therefore, in this study the dependent variable was "salary", measured in US dollars, and the independent variable was "course type", which had two independent groups: "internship group" and "no internship group". The two groups were independent because no graduate could be in more than one group and the students in the two groups could not influence each other’s salaries.

The researcher analysed the data collected to determine whether salaries were greater (or smaller) in the internship group compared to the no internship group. An independent-samples t-test was used to determine whether there was a statistically significant difference in the salaries between the internship group and the no internship group.
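A minimal sketch of how data from a design like this might be analysed in R is shown below. The salaries are simulated purely for illustration (the means and standard deviation are arbitrary assumptions, not results from the study), and the names `grads`, `course_type` and `salary` are hypothetical.

```r
# Simulated data for illustration only: 30 graduates per group.
# The means and standard deviation below are arbitrary assumptions.
set.seed(1)
grads <- data.frame(
  course_type = factor(rep(c("internship", "no internship"), each = 30)),
  salary = c(rnorm(30, mean = 45000, sd = 5000),
             rnorm(30, mean = 42000, sd = 5000))
)

# Descriptive statistics: mean salary in each group
group_means <- tapply(grads$salary, grads$course_type, mean)
print(group_means)

# Independent-samples t-test assuming equal variances (Student's t-test).
# Note: t.test() runs the Welch test unless var.equal = TRUE is set.
result <- t.test(salary ~ course_type, data = grads, var.equal = TRUE)
print(result)
```

With 30 graduates in each group, the equal-variances test has 30 + 30 - 2 = 58 degrees of freedom, which you can confirm in the printed output.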

Difference between two TREATMENT/EXPERIMENTAL GROUPS

Some parents use financial rewards (i.e., money) as an incentive to encourage their children to get top marks in their exams (e.g., an "A" grade or what might be a score of 80 or more out of 100). Therefore, imagine that an educational psychologist wanted to determine whether financial rewards increased academic performance amongst school children.

A total of 26 students were randomly assigned to one of two groups. In one group, the school children were offered $500 if they got an "A" grade in their maths exam. This is called the "experimental group". In the other group, the school children were not offered anything, irrespective of how well they performed in the same maths exam. This is called the "control group". All 26 students undertook the same maths exam. After the students had taken the maths exam, their scores (between 0 and 100 marks) were recorded.

Therefore, in this study the dependent variable was "exam result", measured from 0 to 100 marks, and the independent variable was "financial reward", which had two independent groups: "experimental group" and "control group". The two groups were independent because no student could be in more than one group and the students in the two groups were unable to influence each other’s exam results.

The researcher analysed the data collected to determine whether the exam results were better (or worse) amongst students in the experimental group compared to the control group. An independent-samples t-test was used to determine whether there was a statistically significant difference in the exam results between the experimental group and control group.
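The random assignment step in a design like this could be sketched in R as follows. This is only an illustration: the source does not state how many students were in each group, so the equal split of 13 per group is an assumption, and the names `students`, `student_id` and `group` are hypothetical.

```r
# Hypothetical sketch: randomly assign 26 students to two groups.
# An equal split (13 per group) is assumed for illustration.
set.seed(42)  # fixed seed so the assignment is reproducible
students <- data.frame(student_id = 1:26)

# Shuffle a vector holding 13 copies of each label to get a random allocation
labels <- rep(c("experimental", "control"), each = 13)
students$group <- factor(sample(labels))

# Check the allocation: how many students ended up in each group
allocation <- table(students$group)
print(allocation)
```

Once the exam results have been recorded as a numeric column (say, `exam_result`), the comparison could be run with `t.test(exam_result ~ group, data = students)`.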

In this "quick start" guide we show you how to carry out an independent-samples t-test using R, with the help of Microsoft Excel (Excel) and RStudio. We also show you how to interpret and report the results from this test. However, before we show you how to carry out an independent-samples t-test using R, you need to understand the different assumptions that your data must meet for an independent-samples t-test to give you a valid result. We discuss these assumptions in the next section.

R and RStudio

Assumptions:
Can I use the independent-samples t-test?

The first and most important step in an independent-samples t-test analysis is to check whether it is appropriate to use this statistical test. After all, the independent-samples t-test will only give you valid/accurate results if your study design and data "pass" six assumptions that underpin the independent-samples t-test.

In many cases, the independent-samples t-test will be the incorrect statistical test to use because your data "violates" (i.e., does not meet) one or more of these assumptions. This is not uncommon when working with real-world data, which is often "messy", as opposed to textbook examples. However, there is often a solution, whether this involves using a different statistical test, or making adjustments to your data so that you can continue to use an independent-samples t-test.

Before discussing these options further, we briefly set out the six assumptions of the independent-samples t-test, three of which relate to your study design and how you measured your variables (i.e., Assumptions #1, #2 and #3 below), and three which relate to the characteristics of your data (i.e., Assumptions #4, #5 and #6 below):

Assumption #1: You have a dependent variable that is measured on a continuous scale (i.e., it is measured at the interval or ratio level). Examples of continuous variables include salary (measured in US dollars), height (measured in cm), test score (measured from 0 to 100), intelligence (measured using IQ score), age (measured in years), and so forth.
Assumption #2: You have an independent variable that consists of two categorical, independent groups (i.e., you have a dichotomous variable). A dichotomous variable can be either ordinal or nominal. Ordinal variables with two groups, also referred to as levels, include income level (two levels: "low income" and "high income"), exam result (two levels: "pass" and "fail"), intelligence (two levels: "below average IQ" and "above average IQ"), age group (two levels: "under 21 years old" and "21 years old and over"), educational level (two levels: "undergraduate" and "postgraduate"), and so forth. Nominal variables with two groups include gender (two groups: "male" and "female"), drug trial (two groups: "drug A" and "drug B"), choice of transport (two groups: "car" and "bus"), employment status (two groups: "employed" and "unemployed"), credit card application (two groups: "granted" and "denied"), presence of heart disease (two groups: "yes" and "no"), and so forth.
Note: You can learn more about the differences between dependent and independent variables, as well as continuous, ordinal, nominal and dichotomous variables in our guide: Types of variable.
Assumption #3: There should be independence of observations, which means that there is no relationship between: (a) the observations in each group; and (b) the groups themselves.
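As a rough illustration, the first two assumptions can be partly checked in R by inspecting how your variables are stored. The data frame `dat` and its values below are made up for this sketch. Note that `is.numeric()` only confirms the storage type, not that the variable is truly measured at the interval or ratio level, and that Assumption #3 cannot be checked with code because it depends on how the study was designed.

```r
# Hypothetical data frame with made-up values: one row per participant
dat <- data.frame(
  group = c("experimental", "control", "experimental", "control"),
  exam_result = c(72, 65, 80, 58)
)
dat$group <- factor(dat$group)

# Assumption #1: the dependent variable is stored as a number
dv_is_numeric <- is.numeric(dat$exam_result)

# Assumption #2: the independent variable has exactly two groups
iv_has_two_groups <- nlevels(dat$group) == 2

print(c(dv_is_numeric = dv_is_numeric, iv_has_two_groups = iv_has_two_groups))
```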

Since assumptions #1, #2 and #3 relate to your study design and how you measured your variables, if any of these three assumptions are not met (i.e., if any of these assumptions do not fit with your research), the independent-samples t-test is the incorrect statistical test to analyse your data. It is likely that there will be other statistical tests you can use instead, but the independent-samples t-test is not the correct test.

After checking if your study design and variables meet assumptions #1, #2 and #3, you should now check if your data also meets assumptions #4, #5 and #6 below:

Assumption #4: There should be no significant outliers in your data. An outlier is a single case/observation in your data set that does not follow the usual pattern. For example, imagine a study comparing income levels between male and female graduates in full-time employment in the United Kingdom in their first year after leaving university. Of the 60 graduates in the study, salaries ranged between £16,000 and £48,000, except for one graduate who earnt more than £1,500,000 (e.g., she had started a tech firm at university and sold a stake in this during her first year after graduation; or he was working for his father’s family business, which could afford to pay an extremely high salary that was not linked to his work at the business). In practice, you would be unlikely to know the reason why the graduate earnt more than £1,500,000, only that this salary does not fit the usual pattern of salaries amongst the sample of 60 graduates in the study (and most likely not the wider population of first-year graduates). When using an independent-samples t-test, this would be considered an outlier.

Outliers can be problematic because they can disproportionately influence the assumptions and result of the independent-samples t-test, and lead to invalid conclusions. Therefore, you need to detect if there are any significant outliers in your data before running an independent-samples t-test. Fortunately, there are several methods to detect outliers using R, as well as methods to deal with outliers when you have any in your data.
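One common way to flag potential outliers in R is the boxplot rule, sketched below using the base function `boxplot.stats()`. The salary values are made up to mirror the example above, and this is only one of several possible methods. In a real analysis you would apply the check separately to each group.

```r
# Made-up salaries (in GBP) for one group; the last value is an extreme case
salary <- c(16000, 21000, 24500, 27000, 30000, 33500, 38000, 48000, 1500000)

# boxplot.stats() returns, in $out, the values lying beyond the whiskers
# (by default, more than 1.5 times the interquartile range from the box)
outliers <- boxplot.stats(salary)$out
print(outliers)
```

Here only the salary of £1,500,000 is flagged. Whether to keep, adjust or remove such a value remains a judgement that depends on your research goals.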

Note: Outliers are not inherently "bad" (i.e., an outlier is not bad simply because it is an outlier). Therefore, when deciding how to deal with outliers in your data, you not only need to consider the statistical implications of any outliers, but also theoretical factors that relate to your research goals and study design.
Assumption #5: Your dependent variable should be approximately normally distributed for each category of your independent variable. In other words, the distribution of scores of your dependent variable should approximately follow a normal distribution in each category of your independent variable. Taking the example of male and female first-year graduate salaries above, the distribution of graduate salaries should be approximately normally distributed for "males" and approximately normally distributed for "females". Therefore, before you run an independent-samples t-test, you need to check whether these two groups are approximately normally distributed using a mix of numeric and graphical methods, all of which can be carried out using R. If your data is not normally distributed there are methods to deal with this (e.g., applying a transformation to your data), and after applying these methods, it may still be possible to use an independent-samples t-test. If your data is not normally distributed and no methods are able to "coax" your data towards normality, the independent-samples t-test may be the incorrect statistical test to analyse your data (although there are some exceptions to this).
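A minimal sketch of checking normality within each group is shown below, combining a numeric method (the Shapiro-Wilk test, `shapiro.test()`) with a graphical one (normal Q-Q plots). The data are simulated with arbitrary parameters, and the names `dat`, `gender` and `salary` are hypothetical.

```r
# Simulated data for illustration only (arbitrary means and SD)
set.seed(123)
dat <- data.frame(
  gender = factor(rep(c("male", "female"), each = 30)),
  salary = c(rnorm(30, mean = 30000, sd = 4000),
             rnorm(30, mean = 31000, sd = 4000))
)

# Numeric check: Shapiro-Wilk test run separately within each group
shapiro_by_group <- lapply(split(dat$salary, dat$gender), shapiro.test)
p_values <- sapply(shapiro_by_group, function(res) res$p.value)
print(p_values)

# Graphical check: one normal Q-Q plot per group, with a reference line
for (g in levels(dat$gender)) {
  x <- dat$salary[dat$gender == g]
  qqnorm(x, main = paste("Normal Q-Q plot:", g))
  qqline(x)
}
```

A small p-value from the Shapiro-Wilk test suggests a departure from normality in that group. Points that stay close to the reference line in the Q-Q plot suggest that the data are approximately normal.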

Note: Technically, it is the residuals that must be approximately normally distributed within each group rather than the data within each group, but in an independent-samples t-test, the results will be the same.
Assumption #6: There needs to be homogeneity of variances, which means that the (population) variance for each category of your independent variable is the same. You can test whether your data meets this assumption using R. If it does not, you can simply run a different t-test known as the Welch t-test that makes an adjustment for unequal variances. The Welch t-test can also be run using R.
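The sketch below illustrates one way to compare the two group variances and to run both versions of the t-test in base R. It uses the F test (`var.test()`) as a stand-in, which is our assumption rather than the guide's stated method; Levene's test is a common alternative, available from add-on packages such as car. The data are simulated, with deliberately unequal spreads.

```r
# Simulated data for illustration only; group B is given a larger spread
set.seed(7)
dat <- data.frame(
  group = factor(rep(c("A", "B"), each = 30)),
  score = c(rnorm(30, mean = 50, sd = 5), rnorm(30, mean = 53, sd = 12))
)

# F test comparing the two group variances (sensitive to non-normality)
variance_check <- var.test(score ~ group, data = dat)
print(variance_check$p.value)

# Student's t-test: assumes equal variances
student <- t.test(score ~ group, data = dat, var.equal = TRUE)

# Welch t-test: adjusts the degrees of freedom for unequal variances.
# This is what t.test() runs by default (var.equal = FALSE).
welch <- t.test(score ~ group, data = dat, var.equal = FALSE)

print(student$parameter)  # degrees of freedom: n1 + n2 - 2
print(welch$parameter)    # Welch-adjusted degrees of freedom
```

The Welch-adjusted degrees of freedom are never larger than those of the equal-variances test, which is how the Welch t-test accounts for unequal variances.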

Therefore, before running an independent-samples t-test it is critical that you first check whether your data meets assumptions #4, #5 and #6. In some cases, failure to meet one or more of these assumptions will make the independent-samples t-test the incorrect statistical test to use. In other cases, you may simply have to make some adjustments to your data before continuing to analyse it using an independent-samples t-test.

When you are confident that your data has met all six assumptions described above, you can carry out an independent-samples t-test to determine whether there is a difference between the two groups of your independent variable in terms of the mean of your dependent variable. In the sections that follow we show you how to do this using R (with Excel and RStudio), based on the example we set out on the next page.
