Best chart for two numeric variables

  • Slope Graphs
    • Tufte’s Slope Graph
    • Barley Mean Yields
    • Father-Son Heights
  • Scatter Plots
    • Barley Yields
    • Father and Son Heights
    • Old Faithful Eruptions

Slope Graphs

The most used graph for visualizing the relationship between two numeric variables is the scatter plot.

But there is one alternative that can be useful and is increasingly popular: the slope chart or slope graph.

Tufte’s Slope Graph

Two articles on slope graphs with examples:

  • http://charliepark.org/slopegraphs/
  • http://www.visualisingdata.com/2013/12/in-praise-of-slopegraphs/

Tufte showed this example in The Visual Display of Quantitative Information:

[Figure: Tufte's slope graph, img/tufteslope.gif]

Some features of the data that are easy to see:

  • order of the countries within each year;

  • how each country’s values changed;

  • how the rates of change compare;

  • the country (Britain) that does not fit the general pattern.

The chart uses no non-data ink.

The chart in this form is well suited for small data sets or summaries with modest numbers of categories.

Scalability in this full form is limited, but better if labels and values are dropped.

The idea can be extended to multiple periods, though two periods or levels is most common when labeling is used. Without labeling this becomes a parallel coordinates plot.

[Figure: multi-period slope graph example, img/cancer_survival_nash.gif]

Barley Mean Yields

A slope graph for the average yields at each experiment station for the two years 1931 and 1932:

library(lattice)    # provides the barley data
library(tidyverse)
library(ggrepel)
theme_set(theme_minimal() + theme(text = element_text(size = 16)))
barley_site_year <- group_by(barley, site, year) %>%
    summarize(avg_yield = mean(yield)) %>%
    mutate(year = fct_rev(year))
barley_site_year_1932 <- filter(barley_site_year, year == "1932")
ggplot(barley_site_year, aes(x = year, y = avg_yield, group = site)) +
    geom_line() +
    geom_text_repel(aes(label = site),
                    data = barley_site_year_1932,
                    hjust = "left", direction = "y") +
    scale_x_discrete(expand = expansion(mult = c(0.1, .25)),
                     position = "top") +
    labs(x = NULL, y = "Average Yield")

The anomalous result for Morris pops out very clearly.

This graph departs from the classic Tufte style:

  • it uses an axis instead of showing the numbers;
  • it shows labels on only one side.

This is similar to the style used at https://serialmentor.com/dataviz/visualizing-associations.html#associations-paired-data.

Creating the Graph

The first step is to compute the averages:

barley_site_year <- group_by(barley, site, year) %>%
    summarize(avg_yield = mean(yield))
head(barley_site_year, 2)

The year variable is a factor with the levels in the wrong order, so we need to fix that:

levels(barley_site_year$year)
barley_site_year <- mutate(barley_site_year, year = fct_rev(year))
levels(barley_site_year$year)
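
As a minimal sketch of what fct_rev does, the toy factor below (a made-up stand-in, not the barley data) stores its levels latest-year-first. Reversing the levels changes their order without touching the values:

```r
library(forcats)

# Toy factor with levels stored as "1932", "1931" (illustrative assumption)
year <- factor(c("1931", "1932", "1931"), levels = c("1932", "1931"))
levels(year)

# fct_rev() reverses the level order only
year_fixed <- fct_rev(year)
levels(year_fixed)

# The underlying values are unchanged
identical(as.character(year), as.character(year_fixed))
```

Because ggplot2 places the values of a discrete x variable in level order, the reversed factor puts 1931 on the left and 1932 on the right.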

Set the default theme to theme_minimal with larger text:

theme_set(theme_minimal() + theme(text = element_text(size = 16)))

The core of a slope graph is produced by

p <- ggplot(barley_site_year, aes(x = year, y = avg_yield, group = site)) +
    geom_line()
p

Adding the labels on the 1932 side can be done as

barley_site_year_1932 <- filter(barley_site_year, year == "1932")
p + geom_text(aes(label = site),
              data = barley_site_year_1932,
              hjust = "left")

The label positions could use further adjusting.

Using geom_text_repel from the ggrepel package handles this well:

library(ggrepel)
p <- p + geom_text_repel(aes(label = site),
                         data = barley_site_year_1932,
                         hjust = "left", direction = "y")
p

Adjust the x scale:

p <- p + scale_x_discrete(expand = expansion(mult = c(0.1, .25)),
                          position = "top")
p

Final theme adjustments:

p + labs(x = NULL, y = "Average Yield")

Father-Son Heights

The father.son data set has 1078 observations, which is too large for the labeled slope graph, but the basic representation is useful:

data(father.son, package = "UsingR")  # load the data without attaching UsingR
fs <- mutate(father.son, id = seq_len(nrow(father.son))) %>%
    pivot_longer(1:2, names_to = "which", values_to = "height")
ggplot(fs, aes(x = which, y = height)) +
    geom_line(aes(group = id), alpha = 0.1) +
    scale_x_discrete(expand = expansion(mult = c(.1, .1)),
                     labels = c("Father", "Son"),
                     position = "top") +
    labs(x = NULL, y = "Height (Inches)")

This very clearly shows the famous regression to the mean effect:

  • taller parents tend to be taller than their children;
  • shorter parents tend to be shorter than their children.

Conversely,

  • taller children tend to be taller than their parents;
  • shorter children tend to be shorter than their parents.
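
The effect can be checked numerically with a small simulation. This is only a sketch: the mean of 68 inches, the standard deviation of 2.7, and the correlation of 0.5 are assumed values chosen for illustration, not estimates from the father.son data.

```r
set.seed(1)
n   <- 10000
mu  <- 68     # assumed common mean height (inches)
s   <- 2.7    # assumed common standard deviation
rho <- 0.5    # assumed father-son correlation

father <- rnorm(n, mean = mu, sd = s)
# Son's height: partly determined by the father's deviation from the mean,
# plus independent noise scaled so that sons have the same sd as fathers
son <- mu + rho * (father - mu) + rnorm(n, sd = s * sqrt(1 - rho^2))

tall_father  <- father > mu + 3
short_father <- father < mu - 3
tall_son     <- son > mu + 3

mean(father[tall_father] - son[tall_father])    # positive
mean(father[short_father] - son[short_father])  # negative
mean(son[tall_son] - father[tall_son])          # positive
```

The first two averages match the statements about parents, and the third matches the converse statement about children: whichever member of the pair is selected for being extreme, the other tends to be closer to the mean.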

Creating the Graph

To make creating the graph easier we can convert the data frame into a longer form with variables

  • height, the height measurement
  • which, fheight or sheight
  • id, identifying the pair.

Add the id variable:

fs <- mutate(father.son, id = seq_len(nrow(father.son)))
head(fs)

Pivot to the longer format:

fs <- pivot_longer(fs, 1:2, names_to = "which", values_to = "height")
head(fs)

The basic plot is quite simple:

ggplot(fs, aes(x = which, y = height, group = id)) + geom_line()

With an alpha adjustment to reduce over-plotting:

p <- ggplot(fs, aes(x = which, y = height, group = id)) +
    geom_line(alpha = 0.1)
p

With an axis adjustment and using a reduced alpha level:

p + scale_x_discrete(expand = expansion(mult = c(.1, .1)),
                     labels = c("Father", "Son"),
                     position = "top") +
    labs(x = NULL, y = "Height (Inches)")

Scatter Plots

A scatter plot of two variables maps the values of one variable to the vertical axis and the other to the horizontal axis of a Cartesian coordinate system and places a mark for each observation at the resulting point.

Conventions:

  • Plot A versus/against B means A is mapped to the vertical, or \(y\), axis, and B to the horizontal, or \(x\), axis.

  • If we can think of variation in A as being partly explained by B then we usually plot A against B.

  • If we can think of B as helping to predict A, then we usually plot A against B.

Barley Yields

For a scatter plot of mean yield in 1932 against mean yield in 1931 for the different sites it is useful to have a data frame containing variables for each year.

This requires converting the data frame to a wider format.

wide_barley_site_year <- pivot_wider(barley_site_year,
                                     names_from = "year",
                                     names_prefix = "avg_yield_",
                                     values_from = "avg_yield")
head(wide_barley_site_year)
## # A tibble: 6 x 3
## # Groups:   site [6]
##   site            avg_yield_1932 avg_yield_1931
##   <fct>                    <dbl>          <dbl>
## 1 Grand Rapids              20.8           29.1
## 2 Duluth                    25.7           30.3
## 3 University Farm           29.5           35.8
## 4 Morris                    41.5           29.3
## 5 Crookston                 31.2           43.7
## 6 Waseca                    41.9           54.3
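
To see what the widening step does in isolation, here is a minimal sketch on a made-up long data frame. The sites A and B and their values are hypothetical; only the pivot_wider call mirrors the one used for the barley averages.

```r
library(tidyr)

# Hypothetical long-format data: one row per site and year
long <- data.frame(
    site      = c("A", "A", "B", "B"),
    year      = c("1931", "1932", "1931", "1932"),
    avg_yield = c(30, 25, 40, 45)
)

# One row per site, one column per year; names_prefix builds the column names
wide <- pivot_wider(long,
                    names_from = "year",
                    names_prefix = "avg_yield_",
                    values_from = "avg_yield")
wide
```

Each site now occupies a single row, so the two yearly values can be mapped directly to the x and y aesthetics.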

The basic scatter plot of y = avg_yield_1932 against x = avg_yield_1931:

p <- ggplot(wide_barley_site_year,
            aes(x = avg_yield_1931, y = avg_yield_1932)) +
    geom_point()
p

Adding labels using geom_text_repel identifies the Morris site:

p <- p + geom_text_repel(aes(label = site), vjust = "top")
p

To recognize the reversal for Morris we can add the 45 degree line:

p + geom_abline(aes(intercept = 0, slope = 1), linetype = 2)

A 45 degree line also helps when viewing the full data:

bw <- pivot_wider(barley,
                  names_from = "year", names_prefix = "yield_",
                  values_from = "yield")
ggplot(bw, aes(x = yield_1931, y = yield_1932)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype = 2) +
    geom_point(data = filter(bw, site == "Morris"), color = "red") +
    ggtitle("Barley Yields", "Values for Morris are shown in red.") +
    labs(x = "1931", y = "1932")

A blog post (https://eagereyes.org/blog/2020/in-praise-of-the-diagonal-reference-line) discusses the value of reference lines as plot annotations.

If the primary goal is to show the change from one year to the next then a mean-difference plot is a good choice:

ggplot(wide_barley_site_year,
       aes(x = (avg_yield_1932 + avg_yield_1931) / 2,
           y = avg_yield_1932 - avg_yield_1931)) +
    geom_point() +
    geom_text_repel(aes(label = site), vjust = "top") +
    geom_abline(aes(intercept = 0, slope = 0), linetype = 2)

For the full data:

ggplot(bw,
       aes(x = (yield_1932 + yield_1931) / 2,
           y = yield_1932 - yield_1931)) +
    geom_point() +
    geom_abline(aes(intercept = 0, slope = 0), linetype = 2)

The comparison of changes is now an aligned axis comparison.

Mean-difference plots are also known as

  • Tukey mean-difference plots;
  • MA-plots;
  • Bland-Altman plots.
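
The two coordinates of a mean-difference plot can also be computed explicitly, which makes the sign of each change easy to read off. The sketch below uses three of the site averages printed in the head() output earlier; the column names mean_yield and diff_yield are introduced here for illustration.

```r
# Site averages copied from the printed table (three of the six sites)
yields <- data.frame(
    site           = c("Morris", "Crookston", "Waseca"),
    avg_yield_1931 = c(29.3, 43.7, 54.3),
    avg_yield_1932 = c(41.5, 31.2, 41.9)
)

# x coordinate: mean of the two years; y coordinate: change from 1931 to 1932
md <- transform(yields,
                mean_yield = (avg_yield_1932 + avg_yield_1931) / 2,
                diff_yield = avg_yield_1932 - avg_yield_1931)
md[, c("site", "mean_yield", "diff_yield")]
```

Among these three sites only Morris has a positive difference, so it is the only one that would sit above the horizontal reference line at zero.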

Plotting the difference against the x variable is also often useful:

ggplot(bw,
       aes(x = yield_1931,
           y = yield_1932 - yield_1931)) +
    geom_point() +
    geom_point(data = filter(bw, site == "Morris"), color = "red") +
    geom_abline(aes(intercept = 0, slope = 0), linetype = 2) +
    ggtitle("Barley Yield Differences", "Values for Morris are shown in red.") +
    labs(x = "Yield in 1931", y = "Difference in Yield for 1932")

Father and Son Heights

The basic scatter plot:

p0 <-  ggplot(father.son, aes(x = fheight, y = sheight))
p1 <- p0 + geom_point()
p1

Adding a line with slope one helps identify the regression to the mean phenomenon:

p2 <- p1 + geom_abline(aes(intercept = mean(sheight) - mean(fheight),
                           slope = 1),
                       color = "red", size = 1.5)
p2

Adding a regression line helps further:

p2 + geom_smooth(method = "lm")

But for showing the regression effect it is hard to beat the scatter plot of sheight - fheight against fheight:

ggplot(father.son) +
    geom_point(aes(x = fheight, y = sheight - fheight)) +
    geom_hline(aes(yintercept = 0), linetype = 2)

Old Faithful Eruptions

A scatter plot of the waiting times until the next eruption against the duration of the current eruption for the faithful data set shows the two clusters corresponding to the short and long eruptions:

ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))

For the geyser data set from the MASS package a plot of the two variables shows a different pattern:

data(geyser, package = "MASS")  # load geyser without attaching MASS
ggplot(geyser) + geom_point(aes(x = duration, y = waiting))

The reason for the difference is that in the geyser data set the waiting time reflects the time since the previous eruption, not the time until the next one.

For this ordering it is more natural to plot duration against waiting:

ggplot(geyser) + geom_point(aes(x = waiting, y = duration))

How well does the waiting time predict whether the duration will be longer or shorter?

The question the park service is more interested in is how well the duration predicts the waiting time until the next eruption.

We can adjust these data to pair durations with waiting times until the next eruption using the lag function from dplyr. This produces the same basic pattern as for the faithful data set:

ggplot(geyser) + geom_point(aes(x = lag(duration), y = waiting))
## Warning: Removed 1 rows containing missing values (geom_point).
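
A small sketch of what lag does may help here. The four eruption records below are hypothetical values in time order, and the column name prev_duration is introduced only for illustration.

```r
library(dplyr)

# Hypothetical eruption records, in time order;
# waiting is the time since the previous eruption, as in geyser
toy <- data.frame(duration = c(4.0, 2.0, 4.5, 1.9),
                  waiting  = c(80, 71, 57, 82))

# lag() shifts the values down one position, so each row's waiting time
# is paired with the duration of the eruption that preceded it
toy <- mutate(toy, prev_duration = lag(duration))
toy
```

The first row has no preceding eruption, so its prev_duration is NA. That single missing value is the row the warning reports as removed.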

---
title: "Visualizing Two Numeric Variables"
output:
  html_document:
    toc: yes
    code_download: true
    code_folding: "hide"
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

```{r, include = FALSE}
library(UsingR)
library(lattice)
library(tidyverse)
library(gridExtra)
set.seed(12345)
```


## Slope Graphs

The most used graph for visualizing the relationship between two
numeric variables is the _scatter plot_.

But there is one alternative that can be useful and is increasingly
popular: the _slope chart_ or _slope graph_.


### Tufte's Slope Graph

Two articles on slope graphs with examples:

* http://charliepark.org/slopegraphs/
* http://www.visualisingdata.com/2013/12/in-praise-of-slopegraphs/

Tufte showed this example in _The Visual Display of Quantitative
Information_:

![](img/tufteslope.gif)
<!-- http://charliepark.org/images/slopegraphs/slopegraph.gif -->

Some features of the data that are easy to see:

* order of the countries within each year;

* how each country's values changed;

* how the rates of change compare;

* the country (Britain) that does not fit the general pattern.

The chart uses no non-data ink.

The chart in this form is well suited for small data sets or summaries
with modest numbers of categories.

Scalability in this full form is limited, but better if labels and
values are dropped.

The idea can be extended to multiple periods, though two periods or
levels is most common when labeling is used. Without labeling this
becomes a _parallel coordinates plot_.

![](img/cancer_survival_nash.gif)

<!--
http://charliepark.org/images/slopegraphs/cancer_survival_nash.gif
-->


### Barley Mean Yields

A slope graph for the average yields at each experiment station for
the two years 1931 and 1932:

```{r}
theme_set(theme_minimal() + theme(text = element_text(size = 16)))
library(ggrepel)
barley_site_year <- group_by(barley, site, year) %>%
    summarize(avg_yield = mean(yield)) %>%
    mutate(year = fct_rev(year))
barley_site_year_1932 <- filter(barley_site_year, year == "1932")
ggplot(barley_site_year, aes(x = year, y = avg_yield, group = site)) +
    geom_line() +
    geom_text_repel(aes(label = site),
                    data = barley_site_year_1932,
                    hjust = "left", direction = "y") +
    scale_x_discrete(expand = expansion(mult = c(0.1, .25)),
                     position = "top") +
    labs(x = NULL, y = "Average Yield")
```

The anomalous result for Morris pops out very clearly.

This graph departs from the classic Tufte style:

* it uses an axis instead of showing the numbers;
* it shows labels on only one side.

This is similar to the style used
[here](https://serialmentor.com/dataviz/visualizing-associations.html#associations-paired-data).


#### Creating the Graph

The first step is to compute the averages:

```{r, include = FALSE}
eval_howto <- FALSE
```
```{r, class.source = "fold-show", eval = eval_howto}
barley_site_year <- group_by(barley, site, year) %>%
    summarize(avg_yield = mean(yield))
head(barley_site_year, 2)
```

The `year` variable is a factor with the levels in the wrong order, so
we need to fix that:

```{r, class.source = "fold-show", eval = eval_howto}
levels(barley_site_year$year)
barley_site_year <- mutate(barley_site_year, year = fct_rev(year))
levels(barley_site_year$year)
```

Set the default theme to `theme_minimal` with larger text:

```{r, class.source = "fold-show", eval = eval_howto}
theme_set(theme_minimal() + theme(text = element_text(size = 16)))
```

The core of a slope graph is produced by

```{r, class.source = "fold-show", eval = eval_howto}
p <- ggplot(barley_site_year, aes(x = year, y = avg_yield, group = site)) +
    geom_line()
p
```

Adding the labels on the 1932 side can be done as

```{r, class.source = "fold-show", eval = eval_howto}
barley_site_year_1932 <- filter(barley_site_year, year == "1932")
p + geom_text(aes(label = site),
              data = barley_site_year_1932,
              hjust = "left")
```

The label positions could use further adjusting.

Using `geom_text_repel` from the `ggrepel` package handles this well:

```{r, class.source = "fold-show", eval = eval_howto}
library(ggrepel)
p <- p + geom_text_repel(aes(label = site),
                         data = barley_site_year_1932,
                         hjust = "left", direction = "y")
p
```

Adjust the `x` scale:

```{r, class.source = "fold-show", eval = eval_howto}
p <- p + scale_x_discrete(expand = expansion(mult = c(0.1, .25)),
                          position = "top")
p
```

Final theme adjustments:

```{r, class.source = "fold-show", eval = eval_howto}
p + labs(x = NULL, y = "Average Yield")
```


### Father-Son Heights

The `father.son` data set has `r nrow(father.son)` observations, which
is too large for the labeled slope graph, but the basic representation
is useful:

```{r}
fs <- mutate(father.son, id = seq_len(nrow(father.son))) %>%
    pivot_longer(1:2, names_to = "which", values_to = "height")
ggplot(fs, aes(x = which, y = height)) +
    geom_line(aes(group = id), alpha = 0.1) +
    scale_x_discrete(expand = expansion(mult = c(.1, .1)),
                     labels = c("Father", "Son"),
                     position = "top") +
    labs(x = NULL, y = "Height (Inches)")
```

This very clearly shows the famous _regression to the mean_ effect:

* taller parents tend to be taller than their children;
* shorter parents tend to be shorter than their children.

Conversely,

* taller children tend to be taller than their parents;
* shorter children tend to be shorter than their parents.

<!--
fs <- mutate(fs, TorSF = abs(height - mean(height)) > 1.5 * sd(height) &
                             which == "fheight")
ggplot(fs) +
geom_line(aes(x = which, y = height, group = id, color = TorSF), alpha = 0.1) +
scale_x_discrete(expand = c(.1, 0)) +
scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black"))
-->


#### Creating the Graph

To make creating the graph easier we can convert the data frame into
a longer form with variables

* `height`, the height measurement
* `which`, `fheight` or `sheight`
* `id`, identifying the pair:

Add the `id` variable:
    
```{r, class.source = "fold-show", eval = eval_howto}
fs <- mutate(father.son, id = seq_len(nrow(father.son)))
head(fs)
```

Pivot to the longer format:

```{r, class.source = "fold-show", eval = eval_howto}
fs <- pivot_longer(fs, 1:2, names_to = "which", values_to = "height")
head(fs)
```

The basic plot is quite simple:

```{r, class.source = "fold-show", eval = eval_howto}
ggplot(fs, aes(x = which, y = height, group = id)) + geom_line()
```

With an `alpha` adjustment to reduce over-plotting:

```{r, class.source = "fold-show", eval = eval_howto}
p <- ggplot(fs, aes(x = which, y = height, group = id)) +
    geom_line(alpha = 0.1)
p
```

With an axis adjustment and using a reduced `alpha` level:

```{r, class.source = "fold-show", eval = eval_howto}
p + scale_x_discrete(expand = expansion(mult = c(.1, .1)),
                     labels = c("Father", "Son"),
                     position = "top") +
    labs(x = NULL, y = "Height (Inches)")
```


## Scatter Plots

A scatter plot of two variables maps the values of one variable
to the vertical axis and the other to the horizontal axis of a
Cartesian coordinate system and places a mark for each observation
at the resulting point.

Conventions:

* Plot `A` versus/against `B` means `A` is mapped to the vertical, or
  $y$, axis, and `B` to the horizontal, or $x$, axis.

* If we can think of variation in `A` as being partly explained by `B`
  then we usually plot `A` against `B`.

* If we can think of `B` as helping to predict `A`, then we usually
  plot `A` against `B`.


### Barley Yields

For a scatter plot of mean yield in 1932 against mean yield in 1931
for the different sites it is useful to have a data frame containing
variables for each year.

This requires converting the data frame to a wider format.

```{r}
wide_barley_site_year <- pivot_wider(barley_site_year,
                                     names_from = "year",
                                     names_prefix = "avg_yield_",
                                     values_from = "avg_yield")
head(wide_barley_site_year)
```

The basic scatter plot of `y = avg_yield_1932` against `x =
avg_yield_1931`:

```{r}
p <- ggplot(wide_barley_site_year,
            aes(x = avg_yield_1931, y = avg_yield_1932)) +
    geom_point()
p
```

Adding labels using `geom_text_repel` identifies the Morris site:

```{r}
p <- p + geom_text_repel(aes(label = site), vjust = "top")
p
```

To recognize the reversal for Morris we can add the 45 degree line:

```{r}
p + geom_abline(aes(intercept = 0, slope = 1), linetype = 2)
```

A 45 degree line also helps when viewing the full data:
  
```{r}
bw <- pivot_wider(barley,
                  names_from = "year", names_prefix = "yield_",
                  values_from = "yield")
ggplot(bw, aes(x = yield_1931, y = yield_1932)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype = 2) +
    geom_point(data = filter(bw, site == "Morris"), color = "red") +
    ggtitle("Barley Yields", "Values for Morris are shown in red.") +
    labs(x = "1931", y = "1932")
```

A recent [blog
post](https://eagereyes.org/blog/2020/in-praise-of-the-diagonal-reference-line)
discusses the value of reference lines as plot annotations.

If the primary goal is to show the change from one year to the next
then a mean-difference plot is a good choice:
  
```{r}
ggplot(wide_barley_site_year,
       aes(x = (avg_yield_1932 + avg_yield_1931) / 2,
           y = avg_yield_1932 - avg_yield_1931)) +
    geom_point() +
    geom_text_repel(aes(label = site), vjust = "top") +
    geom_abline(aes(intercept = 0, slope = 0), linetype = 2)
```

For the full data:
    
```{r}
ggplot(bw,
       aes(x = (yield_1932 + yield_1931) / 2,
           y = yield_1932 - yield_1931)) +
    geom_point() +
    geom_abline(aes(intercept = 0, slope = 0), linetype = 2)
```

The comparison of changes is now an aligned axis comparison.

Mean-difference plots are also known as

* Tukey mean-difference plots;
* MA-plots;
* [Bland-Altman plots](https://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot).

Plotting the difference against the `x` variable is also often useful:

```{r}
ggplot(bw,
       aes(x = yield_1931,
           y = yield_1932 - yield_1931)) +
    geom_point() +
    geom_point(data = filter(bw, site == "Morris"), color = "red") +
    geom_abline(aes(intercept = 0, slope = 0), linetype = 2) +
    ggtitle("Barley Yield Differences", "Values for Morris are shown in red.") +
    labs(x = "Yield in 1931", y = "Difference in Yield for 1932")
```


### Father and Son Heights

The basic scatter plot:

```{r}
p0 <-  ggplot(father.son, aes(x = fheight, y = sheight))
p1 <- p0 + geom_point()
p1
```

Adding a line with slope one helps identify the regression to the mean
phenomenon:


```{r}
p2 <- p1 + geom_abline(aes(intercept = mean(sheight) - mean(fheight),
                           slope = 1),
                       color = "red", size = 1.5)
p2
```

Adding a regression line helps further:

```{r}
p2 + geom_smooth(method = "lm")

```

But for showing the regression effect it is hard to beat the scatter
plot of `sheight - fheight` against `fheight`:

```{r}
ggplot(father.son) +
    geom_point(aes(x = fheight, y = sheight - fheight)) +
    geom_hline(aes(yintercept = 0), linetype = 2)
```


### Old Faithful Eruptions

A scatter plot of the waiting times until the next eruption against
the duration of the current eruption for the `faithful` data set shows
the two clusters corresponding to the short and long eruptions:

```{r}
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
```

For the `geyser` data set from the `MASS` package a plot of the two
variables shows a different pattern:

```{r}
ggplot(geyser) + geom_point(aes(x = duration, y = waiting))
```

The reason for the difference is that in the `geyser` data set the
waiting time reflects the time since the _previous_ eruption, not the
time until the _next_ one.

For this ordering it is more natural to plot `duration` against `waiting`:

```{r}
ggplot(geyser) + geom_point(aes(x = waiting, y = duration))
```

How well does the waiting time predict whether the duration will be
longer or shorter?

The question the park service is more interested in is how well
the duration predicts the waiting time until the next eruption.

We can adjust these data to pair durations with waiting times until
the next eruption using the `lag` function from `dplyr`. This produces
the same basic pattern as for the `faithful` data set:

```{r}
ggplot(geyser) + geom_point(aes(x = lag(duration), y = waiting))
```

<!--
Local Variables: 
mode: poly-markdown+R
mode: flyspell
End:
-->


What chart is best for two variables?

A scatter plot, or scattergram, shows the relationship between two different variables and can reveal trends in how the values are distributed.

What graph do you use for two numerical data?

A scatter plot displays values on two numeric variables using points positioned on two axes: one for each variable. Scatter plots are a versatile demonstration of the relationship between the plotted variables—whether that correlation is strong or weak, positive or negative, linear or non-linear.

Which plot is best for numerical variables?

For a single continuous numeric variable, a box plot shows selected quantiles effectively, and box plots are especially useful when stratifying by the categories of another variable. Histograms are also an option.