Visualizing One Quantitative Variable

Bonds

A dataset for illustrating the available visualizations needs enough richness while staying a manageable size. The Bonds dataset contains three categorical variables (Type, Fees, and Risk) and five quantitative indicators, which is sufficient to show what we might wish.

Loading the Data

Bonds <- read.csv(url("https://raw.githubusercontent.com/robertwwalker/DADMStuff/master/BondFunds.csv"))

A Summary

library(tidyverse)  # supplies %>% and ggplot2, which the code below relies on
library(skimr)
Bonds %>%
    skim()
Table 1: Data summary

Name                     Piped data
Number of rows           184
Number of columns        9
_______________________
Column type frequency:
  character              4
  numeric                5
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Fund.Number            0              1    4    6      0       184           0
Type                   0              1   20   23      0         2           0
Fees                   0              1    2    3      0         2           0
Risk                   0              1    7   13      0         3           0

Variable type: numeric

skim_variable   n_missing  complete_rate    mean       sd      p0     p25    p50     p75      p100  hist
Assets                  0              1  910.65  2253.27   12.40  113.72  268.4  621.95  18603.50  ▇▁▁▁▁
Expense.Ratio           0              1    0.71     0.26    0.12    0.53    0.7    0.90      1.94  ▂▇▅▁▁
Return.2009             0              1    7.16     6.09   -8.80    3.48    6.4   10.72     32.00  ▁▇▅▁▁
X3.Year.Return          0              1    4.66     2.52  -13.80    4.05    5.1    6.10      9.40  ▁▁▁▅▇
X5.Year.Return          0              1    3.99     1.49   -7.30    3.60    4.3    4.90      6.80  ▁▁▁▅▇

Most data types are represented. There is no time variable, so dates and the visualizations that go with time series are omitted.

Data Visualization

First, let us look at visualizations for one quantitative variable. Let me focus on Assets.

geom_histogram()

A histogram divides the data into categories (bins) and counts the observations per category. The width of the categories [on x] is set directly with binwidth=, or it can be calculated from the range of the data and the number of bins given in bins=. I will save the base histogram as Gen.Hist.

A Base Histogram

Gen.Hist <- Bonds %>%
    ggplot() + aes(x = Assets) + geom_histogram()
Gen.Hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
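
To make the link between binwidth= and bins= concrete, here is a minimal sketch in base R that bins a small vector by hand. The assets values are made up for illustration and are not the Bonds data. ggplot2 places its bin edges differently by default, so the counts from geom_histogram() need not match a hand calculation like this one exactly.

```r
# Hypothetical asset values, made up for illustration (not the Bonds data)
assets <- c(12, 40, 95, 130, 260, 270, 410, 620, 900, 1450)

# Choose a bin width; the number of bins then follows from the range of the data
binwidth <- 500
breaks <- seq(0, ceiling(max(assets) / binwidth) * binwidth, by = binwidth)

# cut() assigns each value to a bin and table() counts the values per bin
counts <- table(cut(assets, breaks = breaks, include.lowest = TRUE))
counts

# Going the other way: choose the number of bins and derive an approximate width
bins <- 10
approx_width <- diff(range(assets)) / bins
approx_width
```

Either choice pins down the other: a width implies a number of bins over the range, and a number of bins implies a width.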

Histograms [bins]

We can choose more bins. 50? That is far more than the default of 30.

Bin50.Hist <- Bonds %>%
    ggplot() + aes(x = Assets) + geom_histogram(bins = 50)
Bin50.Hist

We can also choose fewer bins. I will choose 10.

Bin10.Hist <- Bonds %>%
    ggplot() + aes(x = Assets) + geom_histogram(bins = 10)
Bin10.Hist

Histograms [binwidth]

We can also set the width of bins in the metric of x; I will choose 500 (a larger width makes fewer bins).

BinW500.Hist <- Bonds %>%
    ggplot() + aes(x = Assets) + geom_histogram(binwidth = 500)
BinW500.Hist

We can also set the width of bins in the metric of x; I will choose 50 (smaller width makes more bins).

BinW50.Hist <- Bonds %>%
    ggplot() + aes(x = Assets) + geom_histogram(binwidth = 50)
BinW50.Hist
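
If it helps to see what a histogram layer actually computed, layer_data() from ggplot2 returns the binned data behind a plot. The sketch below uses simulated values in a data frame I call sim (an assumption for illustration, not the Bonds data) and compares a plot built with bins= to one built with binwidth=.

```r
library(ggplot2)

# Simulated right-skewed data, made up for illustration (not the Bonds data)
set.seed(1)
sim <- data.frame(Assets = rexp(200, rate = 1 / 900))

by_bins  <- ggplot(sim, aes(x = Assets)) + geom_histogram(bins = 10)
by_width <- ggplot(sim, aes(x = Assets)) + geom_histogram(binwidth = 500)

# layer_data() returns the data frame that the layer draws:
# one row per bin, with its edges (xmin, xmax) and its count
bins_df  <- layer_data(by_bins)
width_df <- layer_data(by_width)

nrow(bins_df)                                  # number of bars when bins = 10
unique(round(width_df$xmax - width_df$xmin))   # bar width when binwidth = 500
sum(width_df$count)                            # every observation lands in a bin
```

The same inspection works on any of the histograms above by passing the saved plot object, for example layer_data(BinW500.Hist).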

geom_dotplot()

geom_dotplot() places a dot for every observation in the relevant bin. We can control the size of the bins [in the original metric] with binwidth=.

Small binwidth

Bonds %>%
    ggplot() + aes(x = Assets) + geom_dotplot(binwidth = 10)

Large binwidth

Bonds %>%
    ggplot() + aes(x = Assets) + geom_dotplot(binwidth = 1000)

An "optimal" binwidth

Each dot represents a datum with bins of size 100.

Bonds %>%
    ggplot() + aes(x = Assets) + geom_dotplot(binwidth = 100) + labs(y = "")

geom_freqpoly()

geom_freqpoly() is the line equivalent of a histogram. The arguments are similar, but the output connects the bin counts with a line instead of drawing the bars of a histogram.

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

More bins

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_freqpoly(bins = 50)

Fewer bins

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_freqpoly(bins = 10)
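
One way to see that the frequency polygon is the same binning drawn as a line is to overlay it on a histogram with identical bins. This is a sketch on simulated data (the sim data frame is an assumption for illustration, not the Bonds data).

```r
library(ggplot2)

# Simulated right-skewed data, made up for illustration (not the Bonds data)
set.seed(1)
sim <- data.frame(Assets = rexp(200, rate = 1 / 900))

# Same bins= for both layers, so the line runs through the top of each bar
overlay <- ggplot(sim, aes(x = Assets)) +
    geom_histogram(bins = 10, fill = "grey80") +
    geom_freqpoly(bins = 10)
overlay
```

If the two layers are given different values of bins=, the line and the bars no longer agree, which is a quick check that both are summaries of the same counts.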

geom_area()

geom_area() with stat = "bin" is a relative of the histogram, with lines connecting the midpoints of the bins and an associated fill from zero.

Defaults to 30 bins

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_area(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Small binwidth with a large number of bins

I will color in the area with magenta and clean up the theme.

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_area(stat = "bin", bins = 100, fill = "magenta") +
    theme_minimal()

geom_density()

A relative of the histogram and the area plots above, the density plot smooths out the blocks of a histogram with a moving window [the width of the window is known as the bandwidth].

geom_density() outlines

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_density(outline.type = "upper")

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_density(outline.type = "lower")

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_density(outline.type = "full")

geom_density() adjust

The adjust= argument multiplies the default bandwidth. Numbers greater than 1 make the bandwidth bigger [and the graphic smoother] and numbers less than 1 [but greater than zero] make the bandwidth smaller and the graphic more jagged.

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_density(adjust = 2)

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_density(adjust = 1/2)
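
The effect of adjust= on the bandwidth can be checked directly with density() from base R, which is the function ggplot2 documents as the basis for its density calculation. The sketch below uses simulated values (an assumption for illustration, not the Bonds data) and the default bandwidth rule.

```r
# Simulated right-skewed data, made up for illustration (not the Bonds data)
set.seed(1)
x <- rexp(200, rate = 1 / 900)

d_default <- density(x)                # default bandwidth rule
d_smooth  <- density(x, adjust = 2)    # twice the default bandwidth
d_jagged  <- density(x, adjust = 1/2)  # half the default bandwidth

# The bandwidth actually used by each estimate
c(default = d_default$bw, smooth = d_smooth$bw, jagged = d_jagged$bw)
```

The reported bandwidths differ by exactly the adjust factor, so adjust = 2 and adjust = 1/2 double and halve the width of the smoothing window.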

geom_boxplot()

A boxplot draws a box spanning the first and third quartiles (the hinges) with a line at the median. Whiskers extend from each hinge to the most extreme observation within 1.5*IQR of that hinge and show a range of expected data; individual dots beyond the whiskers mark possible outliers. Because Assets is mapped to x here, the box is horizontal and the dots appear to the left or right. To change the length of the whiskers, adjust the argument coef=, which defaults to 1.5.

Bonds %>%
    ggplot(., aes(x = Assets)) + geom_boxplot()
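
The 1.5*IQR rule is easy to work through by hand. Here is a minimal sketch in base R with a small made-up vector (hypothetical values, not the Bonds data) and R's default quantile definition; it mirrors the rule described above and is not a copy of ggplot2's internal code.

```r
# Hypothetical asset values with one very large fund (not the Bonds data)
assets <- c(12, 40, 95, 130, 260, 270, 410, 620, 900, 18600)

coef <- 1.5                                   # same default as geom_boxplot()
q    <- quantile(assets, c(0.25, 0.75))       # first and third quartiles
iqr  <- unname(diff(q))                       # interquartile range

# Values beyond these limits are drawn as individual dots
lower_limit <- unname(q[1]) - coef * iqr
upper_limit <- unname(q[2]) + coef * iqr
outliers <- assets[assets < lower_limit | assets > upper_limit]

# The whiskers end at the most extreme values that are still inside the limits
whiskers <- range(assets[assets >= lower_limit & assets <= upper_limit])

outliers
whiskers
```

Raising coef widens the limits, so fewer points are flagged; lowering it flags more.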

geom_qq()

geom_qq() compares empirical and theoretical quantiles. Comparing a distribution to the normal or to others is common, and this provides the tool for doing so. The default is the normal.

The empirical quantiles come from sorting the variable. The same sorting gives the empirical cumulative distribution function, taken up in the next section, which shows for each value the proportion of observations at or below it.

Bonds %>%
    ggplot(aes(sample = Assets)) + geom_qq()
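
A reference line makes the comparison easier to read, and the distribution= argument swaps the normal for another theoretical distribution. The sketch below uses simulated exponential data (the sim data frame is an assumption for illustration, not the Bonds data) and compares it first to the normal and then to the exponential.

```r
library(ggplot2)

# Simulated right-skewed data, made up for illustration (not the Bonds data)
set.seed(1)
sim <- data.frame(Assets = rexp(200, rate = 1 / 900))

# Default comparison to the normal, with a reference line through the quartiles
qq_normal <- ggplot(sim, aes(sample = Assets)) +
    geom_qq() +
    geom_qq_line()

# Comparison to the exponential by supplying its quantile function
qq_exp <- ggplot(sim, aes(sample = Assets)) +
    geom_qq(distribution = stats::qexp) +
    geom_qq_line(distribution = stats::qexp)

qq_normal
qq_exp
```

Points that follow the line suggest the theoretical distribution is a reasonable match; systematic curvature away from it suggests it is not.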

stat_ecdf(geom = )

stat_ecdf() computes the empirical cumulative distribution function, and we could draw it with most geometries. I will show a few.

stat_ecdf(geom = "step")

Bonds %>%
    ggplot(aes(x = Assets)) + stat_ecdf(geom = "step") + labs(y = "ECDF: Proportion less than Assets") +
    theme_minimal()

stat_ecdf(geom = "point")

Bonds %>%
    ggplot(aes(x = Assets)) + stat_ecdf(geom = "point") + labs(y = "ECDF: Proportion less than Assets") +
    theme_minimal()

Combining two

Bonds %>%
    ggplot(aes(x = Assets)) + stat_ecdf(geom = "point") + stat_ecdf(geom = "step",
    alpha = 0.1) + labs(y = "ECDF: Proportion less than Assets") + theme_minimal()

stat_ecdf(geom = "line")

Bonds %>%
    ggplot(aes(x = Assets)) + stat_ecdf(geom = "line") + labs(y = "ECDF: Proportion less than Assets") +
    theme_minimal()

stat_ecdf(geom = "area")

Bonds %>%
    ggplot(aes(x = Assets)) + stat_ecdf(geom = "area", alpha = 0.2) + labs(y = "ECDF: Proportion less than Assets") +
    theme_minimal()
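
The quantity on the vertical axis of these plots can be computed directly with ecdf() from base R. This is a minimal sketch on a small made-up vector (hypothetical values, not the Bonds data). Strictly, the function reports the proportion of observations at or below a value.

```r
# Hypothetical asset values, made up for illustration (not the Bonds data)
assets <- c(12, 40, 95, 130, 260, 270, 410, 620, 900, 1450)

# ecdf() returns a function; calling it gives the proportion at or below a value
Fn <- ecdf(assets)
Fn(260)

# The same proportion computed by hand
mean(assets <= 260)

# The function climbs from 1/n at the smallest value to 1 at the largest
Fn(range(assets))
```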

Robert W. Walker
Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.
