How to check correlation between two categorical variables in SAS

Often we use the Pearson Correlation Coefficient to calculate the correlation between continuous numerical variables.

However, we must use a different metric to calculate the correlation between categorical variables – that is, variables that take on names or labels such as:

  • Marital status (single, married, divorced)
  • Smoking status (smoker, non-smoker)
  • Eye color (blue, brown, green)

There are three metrics that are commonly used to calculate the correlation between categorical variables:

1. Tetrachoric Correlation: Used to calculate the correlation between binary categorical variables.

2. Polychoric Correlation: Used to calculate the correlation between ordinal categorical variables.

3. Cramer’s V: Used to calculate the correlation between nominal categorical variables.

The following sections provide an example of how to calculate each of these three metrics.

Metric 1: Tetrachoric Correlation

Tetrachoric correlation is used to calculate the correlation between binary categorical variables. Recall that binary variables are variables that can only take on one of two possible values.

The value for tetrachoric correlation ranges from -1 to 1 where -1 indicates a strong negative correlation, 0 indicates no correlation, and 1 indicates a strong positive correlation.

For example, suppose we want to know whether gender is associated with political party preference, so we take a simple random sample of 100 voters and survey them on their political party preference.

The following table shows the results of the survey:

We would use tetrachoric correlation in this scenario because each categorical variable is binary – that is, each variable can only take on two possible values.

We can use the following code in R to calculate the tetrachoric correlation between the two variables:

    library(psych)

    #create 2x2 table
    data = matrix(c(19, 12, 30, 39), nrow=2)

    #view table
    data

    #calculate tetrachoric correlation
    tetrachoric(data)

    tetrachoric correlation
    [1] 0.27

The tetrachoric correlation turns out to be 0.27. This value is fairly low, which indicates that there is a weak association (if any) between gender and political party preference.
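As a rough cross-check that does not depend on the psych package, the tetrachoric correlation of a 2x2 table can be approximated from its odds ratio. The Python sketch below uses the classical cosine-pi approximation, r ≈ cos(π / (1 + sqrt(ad / bc))). This is only an approximation and is not the estimation method that tetrachoric() uses; the function name tetrachoric_cos_pi is made up for this illustration, and the formula assumes that none of the four cells is zero.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation of the
    2x2 table [[a, b], [c, d]]. Assumes b and c are nonzero."""
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1 + math.sqrt(odds_ratio)))

# Same counts as the R matrix above, which is filled column by column:
# row 1 = (19, 30), row 2 = (12, 39)
r = tetrachoric_cos_pi(19, 30, 12, 39)
print(round(r, 2))
```

For these counts the approximation gives roughly 0.28, close to the 0.27 reported by R, which supports the conclusion that the association is weak.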

Metric 2: Polychoric Correlation

Polychoric correlation is used to calculate the correlation between ordinal categorical variables. Recall that ordinal variables are variables whose possible values have a natural order.

The value for polychoric correlation ranges from -1 to 1 where -1 indicates a strong negative correlation, 0 indicates no correlation, and 1 indicates a strong positive correlation.

For example, suppose we want to know whether the movie ratings from two different ratings agencies are highly correlated.

We ask each agency to rate 20 different movies on a scale of 1 to 3 with 1 indicating “bad”, 2 indicating “mediocre”, and 3 indicating “good.”

The following table shows the results:

We can use the following code in R to calculate the polychoric correlation between the ratings of the two agencies:

    library(polycor)

    #define movie ratings
    x <- c(1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 3, 2, 1, 2, 2, 1, 1, 1, 2, 2)
    y <- c(1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2, 2, 1, 2, 1, 3, 3)

    #calculate polychoric correlation between ratings
    polychor(x, y)

    [1] 0.7828328

The polychoric correlation turns out to be 0.78. This value is quite high, which indicates that there is a strong positive association between the ratings from each agency.
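The polychoric estimate depends only on how often each pair of ratings occurs, so it can help to look at the cross-tabulation of x and y before interpreting the coefficient. The following Python sketch tallies the same 20 pairs of ratings used in the R code; the variable names are chosen for this illustration and it does not compute the polychoric correlation itself.

```python
from collections import Counter

# Ratings copied from the R example above
x = [1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 3, 2, 1, 2, 2, 1, 1, 1, 2, 2]
y = [1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2, 2, 1, 2, 1, 3, 3]

levels = [1, 2, 3]
pair_counts = Counter(zip(x, y))

# Rows correspond to the x ratings, columns to the y ratings
table = [[pair_counts[(i, j)] for j in levels] for i in levels]

print("      y=1 y=2 y=3")
for level, row in zip(levels, table):
    print(f"x={level}  " + " ".join(f"{n:3d}" for n in row))

# Number of movies that received the same rating from both agencies
exact_agreement = sum(table[k][k] for k in range(len(levels)))
print("same rating:", exact_agreement, "of", len(x))
```

Most of the counts sit on or next to the diagonal and no movie is rated 1 by one agency and 3 by the other, which is the pattern we would expect behind a high positive polychoric correlation.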

Metric 3: Cramer’s V

Cramer’s V is used to calculate the correlation between nominal categorical variables. Recall that nominal variables are ones that take on category labels but have no natural ordering.

The value for Cramer’s V ranges from 0 to 1, with 0 indicating no association between the variables and 1 indicating a strong association between the variables.

For example, suppose we want to know if there is a correlation between eye color and gender so we survey 50 individuals and obtain the following results:

We can use the following code in R to calculate Cramer’s V for these two variables:

    library(rcompanion)

    #create table
    data = matrix(c(6, 9, 8, 5, 12, 10), nrow=2)

    #view table
    data

         [,1] [,2] [,3]
    [1,]    6    8   12
    [2,]    9    5   10

    #calculate Cramer's V
    cramerV(data)

    Cramer V
      0.1671

Cramer’s V turns out to be 0.1671. This value is quite low, which indicates that there is a weak association between gender and eye color.
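Cramer's V is a rescaled chi-square statistic: V = sqrt(chi-square / (n * min(r - 1, c - 1))), where n is the sample size and r and c are the numbers of rows and columns. To make that formula concrete, here is a minimal Python sketch that computes it by hand for the same table. It applies no bias or continuity correction, and the function name cramers_v is an illustrative choice rather than part of any library.

```python
import math

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# Same counts as the R matrix above
table = [[6, 8, 12],
         [9, 5, 10]]
v = cramers_v(table)
print(round(v, 4))
```

For this table the chi-square statistic is about 1.40 with n = 50, so V comes out at about 0.167, in line with the value from cramerV().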

When the two variables of interest are dichotomous or ordinal categorical variables, we may want to compute the tetrachoric or polychoric correlation between them. Both are computed under the assumption that each observed categorical variable is a coarsened version of an underlying, normally distributed continuous variable. When both variables are binary, the correlation is called the tetrachoric correlation; in the more general ordinal case it is called the polychoric correlation. In SAS, proc freq is used to obtain both.
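To see why the underlying-normal assumption matters, the short Python simulation below generates two correlated normal variables, keeps only whether each one is above zero, and computes the ordinary Pearson correlation on the resulting 0/1 data. The latent correlation of 0.6, the sample size, and the seed are arbitrary choices for the demonstration; the data are simulated and are not taken from hsb2.

```python
import math
import random

random.seed(1)

rho = 0.6      # latent correlation, chosen arbitrarily for the demo
n = 20000

xs, ys = [], []
for _ in range(n):
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    # Only the dichotomized versions are treated as observed
    xs.append(1 if z1 > 0 else 0)
    ys.append(1 if z2 > 0 else 0)

def pearson(u, v):
    """Plain Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

phi = pearson(xs, ys)
print("latent correlation:", rho)
print("Pearson on the 0/1 data:", round(phi, 3))
```

With both cut points at zero, the Pearson correlation of the 0/1 variables is (2/π)·arcsin(ρ) in theory, which is about 0.41 when ρ = 0.6. The tetrachoric correlation is meant to recover the latent 0.6 rather than this attenuated value.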

In the following examples, we will use data set hsb2.sas7bdat.

Example 1: Computing tetrachoric correlation between two dichotomous variables

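We specify the plcorr option in the tables statement to request the polychoric correlation; for a 2x2 table, SAS labels the result as the tetrachoric correlation. The two variables of interest are female and honors, an indicator equal to 1 when write >= 60, which is created in the data step below.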

    data hsb2;
      set ats.hsb2;
      honors = (write>=60);
    run;

    proc freq data = hsb2;
      tables honors*female /plcorr;
    run;

    Table of honors by female

    honors     female

    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |       0|       1|  Total
    ---------+--------+--------+
           0 |     73 |     74 |    147
             |  36.50 |  37.00 |  73.50
             |  49.66 |  50.34 |
             |  80.22 |  67.89 |
    ---------+--------+--------+
           1 |     18 |     35 |     53
             |   9.00 |  17.50 |  26.50
             |  33.96 |  66.04 |
             |  19.78 |  32.11 |
    ---------+--------+--------+
    Total          91      109      200
                45.50    54.50   100.00

    Statistic                              Value       ASE
    ------------------------------------------------------
    Gamma                                 0.3146    0.1503
    Kendall's Tau-b                       0.1391    0.0684
    Stuart's Tau-c                        0.1223    0.0607
    Somers' D C|R                         0.1570    0.0770
    Somers' D R|C                         0.1233    0.0612
    Pearson Correlation                   0.1391    0.0684
    Spearman Correlation                  0.1391    0.0684
    Tetrachoric Correlation               0.2362    0.1156
    Lambda Asymmetric C|R                 0.0000    0.0000
    Lambda Asymmetric R|C                 0.0000    0.0000
    Lambda Symmetric                      0.0000    0.0000
    Uncertainty Coefficient C|R           0.0143    0.0142
    Uncertainty Coefficient R|C           0.0170    0.0169
    Uncertainty Coefficient Symmetric     0.0155    0.0154

    Estimates of the Relative Risk (Row1/Row2)

    Type of Study                   Value       95% Confidence Limits
    -----------------------------------------------------------------
    Case-Control (Odds Ratio)      1.9182        0.9974        3.6890
    Cohort (Col1 Risk)             1.4622        0.9712        2.2015
    Cohort (Col2 Risk)             0.7623        0.5930        0.9799

    Sample Size = 200
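The tetrachoric value can be reproduced approximately from the four cell counts alone. The Python sketch below follows the latent-variable idea directly: it takes the two cut points from the marginal proportions, then searches for the correlation of a standard bivariate normal distribution whose lower-left quadrant probability matches the observed share of the (0, 0) cell. This is an illustration under those assumptions and not a description of how proc freq computes the statistic internally; the function names bvn_density, bvn_cdf and tetrachoric are invented for the example.

```python
import math
from statistics import NormalDist

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    one_minus_r2 = 1.0 - r * r
    expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * one_minus_r2)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(one_minus_r2))

def bvn_cdf(h, k, rho, steps=200):
    """P(Z1 <= h, Z2 <= k), using the identity
    Phi2(h, k; rho) = Phi(h) * Phi(k) + integral from 0 to rho of the
    density at (h, k), with the integral evaluated by Simpson's rule."""
    nd = NormalDist()
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return nd.cdf(h) * nd.cdf(k) + total * width / 3.0

def tetrachoric(n00, n01, n10, n11):
    """Tetrachoric correlation for the 2x2 table [[n00, n01], [n10, n11]]."""
    n = n00 + n01 + n10 + n11
    nd = NormalDist()
    h = nd.inv_cdf((n00 + n01) / n)   # cut point for the row variable
    k = nd.inv_cdf((n00 + n10) / n)   # cut point for the column variable
    target = n00 / n                  # observed share of the (0, 0) cell
    lo, hi = -0.999, 0.999
    for _ in range(60):               # bisection: the CDF increases with rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Counts from the honors by female table above
r_tet = tetrachoric(73, 74, 18, 35)
print(round(r_tet, 4))
```

For the honors by female table the result lands close to the 0.2362 in the SAS output. Bisection is used here because it is simple and robust; a production implementation would normally also report a standard error, which this sketch does not attempt.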

Example 2: Computing polychoric correlation among two or more ordinal categorical variables

We will use the SAS Output Delivery System (ODS) to write the polychoric correlations to a data set. ODS can turn the output tables of a procedure into data sets. The tetrachoric and polychoric correlations are stored in the table called measures, because SAS groups them with all the other measures of association. We can keep only the tetrachoric and polychoric correlations by adding a where= data set option when this data set is created.

    proc freq data = hsb2;
      tables (female ses honors)*(female ses honors) /plcorr;
      ods output measures=mycorr
          (where=(statistic="Tetrachoric Correlation" or
                  statistic="Polychoric Correlation")
           keep = statistic table value);
    run;

    proc print data = mycorr;
    run;

    Obs    Table                     Statistic                    Value
    1      Table female * female     Tetrachoric Correlation     1.0000
    2      Table ses * female        Polychoric Correlation     -0.1741
    3      Table honors * female     Tetrachoric Correlation     0.2362
    4      Table female * ses        Polychoric Correlation     -0.1741
    5      Table ses * ses           Polychoric Correlation      1.0000
    6      Table honors * ses        Polychoric Correlation      0.2769
    7      Table female * honors     Tetrachoric Correlation     0.2362
    8      Table ses * honors        Polychoric Correlation      0.2769
    9      Table honors * honors     Tetrachoric Correlation     1.0000

Example 3: Obtaining a polychoric correlation matrix for a group of variables

The example above shows how to obtain polychoric correlations for several variables at once. However, the output is not in matrix format, which can be a problem if the correlation matrix is needed for further analysis. In this example, we use a data step to reshape the output into a correlation matrix. In the data step below, we create three variables: group, x and y. Since there are three variables, the correlation matrix will have three rows and three columns, and group identifies the row of the matrix that each observation belongs to (every three consecutive observations form one row). Each correlation involves two variables; the name of the first is stored in x and the name of the second in y.

    proc freq data = hsb2;
      tables (female ses honors)*(female ses honors) /plcorr;
      ods output measures=mycorr
          (where=(statistic="Tetrachoric Correlation" or
                  statistic="Polychoric Correlation")
           keep = statistic table value);
    run;

    data mycorrt;
      set mycorr;
      group = floor((_n_ - 1)/3);
      x = scan(table, 2, " *");
      y = scan(table, 3, " *");
      keep group value table x y;
    run;

    proc print data = mycorrt;
    run;

    Obs    Table                       Value    group    x         y
    1      Table female * female      1.0000      0      female    female
    2      Table ses * female        -0.1741      0      ses       female
    3      Table honors * female      0.2362      0      honors    female
    4      Table female * ses        -0.1741      1      female    ses
    5      Table ses * ses            1.0000      1      ses       ses
    6      Table honors * ses         0.2769      1      honors    ses
    7      Table female * honors      0.2362      2      female    honors
    8      Table ses * honors         0.2769      2      ses       honors
    9      Table honors * honors      1.0000      2      honors    honors


Now we are ready to transpose the data set into matrix format.

    proc transpose data = mycorrt out=mymatrix (drop = _name_ group);
      id x;
      by group;
      var value;
    run;

    proc print data = mymatrix;
    run;

    Obs     female        ses     honors
    1       1.0000    -0.1741     0.2362
    2      -0.1741     1.0000     0.2769
    3       0.2362     0.2769     1.0000
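The reshaping logic is easy to lose in the SAS syntax, so here is the same long-to-wide step sketched in Python. It takes the nine (table label, value) pairs printed earlier, splits each label into its two variable names in the same spirit as the scan() calls, and fills a 3x3 matrix. The names rows, index and matrix are illustrative only.

```python
# (table label, value) pairs copied from the PROC PRINT output of mycorr
rows = [
    ("Table female * female", 1.0000),
    ("Table ses * female", -0.1741),
    ("Table honors * female", 0.2362),
    ("Table female * ses", -0.1741),
    ("Table ses * ses", 1.0000),
    ("Table honors * ses", 0.2769),
    ("Table female * honors", 0.2362),
    ("Table ses * honors", 0.2769),
    ("Table honors * honors", 1.0000),
]

names = ["female", "ses", "honors"]
index = {name: i for i, name in enumerate(names)}
matrix = [[None] * len(names) for _ in names]

for label, value in rows:
    # "Table ses * female" -> x = "ses", y = "female"
    _, x, y = label.replace("*", " ").split()
    # y selects the row (the role of group), x selects the column (the id variable)
    matrix[index[y]][index[x]] = value

print("        " + "".join(f"{name:>9}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<8}" + "".join(f"{v:9.4f}" for v in row))
```

Because a correlation matrix is symmetric, swapping the roles of x and y would give the same matrix; the choice here simply mirrors the by group and id x statements in the proc transpose step.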
