Visualizing One Quantitative Variable
Bonds
A dataset for illustrating the various available visualizations needs a certain degree of richness with manageable size. The dataset on Bonds contains three categorical and a few quantitative indicators sufficient to show what we might wish.
Loading the Data
Bonds <- read.csv(url("https://raw.githubusercontent.com/robertwwalker/DADMStuff/master/BondFunds.csv"))
A Summary
library(skimr)
Bonds %>%
skim()
Name | Piped data |
Number of rows | 184 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Fund.Number | 0 | 1 | 4 | 6 | 0 | 184 | 0 |
Type | 0 | 1 | 20 | 23 | 0 | 2 | 0 |
Fees | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Risk | 0 | 1 | 7 | 13 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Assets | 0 | 1 | 910.65 | 2253.27 | 12.40 | 113.72 | 268.4 | 621.95 | 18603.50 | ▇▁▁▁▁ |
Expense.Ratio | 0 | 1 | 0.71 | 0.26 | 0.12 | 0.53 | 0.7 | 0.90 | 1.94 | ▂▇▅▁▁ |
Return.2009 | 0 | 1 | 7.16 | 6.09 | -8.80 | 3.48 | 6.4 | 10.72 | 32.00 | ▁▇▅▁▁ |
X3.Year.Return | 0 | 1 | 4.66 | 2.52 | -13.80 | 4.05 | 5.1 | 6.10 | 9.40 | ▁▁▁▅▇ |
X5.Year.Return | 0 | 1 | 3.99 | 1.49 | -7.30 | 3.60 | 4.3 | 4.90 | 6.80 | ▁▁▁▅▇ |
Most data types are represented. There is no time variable so dates and the visualizations that go with time series are omitted.
Data Visualization
First, let us look at visualizations for one quantitative variable. Let me focus on assets..
geom_histogram()
A histogram divides the data into categories and counts the observations per category. The width of the categories [on x] is determined by binwidth=
or the binwidth can be calculated as a function of the range and the number of bins bin=
. I will define it as Gen.Hist.
A Base Histogram
Gen.Hist <- Bonds %>%
ggplot() + aes(x = Assets) + geom_histogram()
Gen.Hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histograms [bins]
We can choose more bins. 50? That is far more than the default of 30.
Bin50.Hist <- Bonds %>%
ggplot() + aes(x = Assets) + geom_histogram(bins = 50)
Bin50.Hist
We can also choose fewer bins. I will choose 10.
Bin10.Hist <- Bonds %>%
ggplot() + aes(x = Assets) + geom_histogram(bins = 10)
Bin10.Hist
Histograms [binwidth]
We can also set the width of bins in the metric of x
; I will choose 500 (bigger).
BinW500.Hist <- Bonds %>%
ggplot() + aes(x = Assets) + geom_histogram(binwidth = 500)
BinW500.Hist
We can also set the width of bins in the metric of x
; I will choose 50 (smaller width makes more bins).
BinW50.Hist <- Bonds %>%
ggplot() + aes(x = Assets) + geom_histogram(binwidth = 50)
BinW50.Hist
geom_dotplot()
geom_dotplot()
places a dot for every observation in the relevant bin. We can control the size of the bins [in the original metric] with binwidth=
.
Small binwidth
Bonds %>%
ggplot() + aes(x = Assets) + geom_dotplot(binwidth = 10)
Large binwidth
Bonds %>%
ggplot() + aes(x = Assets) + geom_dotplot(binwidth = 1000)
An ?optimal? binwidth
Each dot represents a datum with bins of size 100.
Bonds %>%
ggplot() + aes(x = Assets) + geom_dotplot(binwidth = 100) + labs(y = "")
geom_freqpoly()
geom_freqpoly()
is the line equivalent of a histogram. The arguments are similar, the output doesn’t include the bars as it does in the histogram.
Bonds %>%
ggplot(., aes(x = Assets)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
More bins
Bonds %>%
ggplot(., aes(x = Assets)) + geom_freqpoly(bins = 50)
Fewer bins
Bonds %>%
ggplot(., aes(x = Assets)) + geom_freqpoly(bins = 10)
geom_area()
Is a relative of the histogram with lines connecting the midpoints of the bins and an associated fill from zero.
Defaults to 30 bins
Bonds %>%
ggplot(., aes(x = Assets)) + geom_area(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Small binwidth with a large number of bins
I will color in the area with magenta and clean up the theme.
Bonds %>%
ggplot(., aes(x = Assets)) + geom_area(stat = "bin", bins = 100, fill = "magenta") +
theme_minimal()
geom_density()
A relative of the histogram and the area plots above, the density plot smooths out the blocks of a histogram with a moving window [known as the bandwidth].
geom_density()
outlines
Bonds %>%
ggplot(., aes(x = Assets)) + geom_density(outline.type = "upper")
Bonds %>%
ggplot(., aes(x = Assets)) + geom_density(outline.type = "lower")
Bonds %>%
ggplot(., aes(x = Assets)) + geom_density(outline.type = "full")
geom_density()
adjust
Adjust applies a numeric correction to the bandwidth. Numbers greater than 1 make the bandwidth bigger [and the graphic smoother] and numbers less than 1 [but greater than zero] make the bandwidth smaller and the graphic more jagged.
Bonds %>%
ggplot(., aes(x = Assets)) + geom_density(adjust = 2)
Bonds %>%
ggplot(., aes(x = Assets)) + geom_density(adjust = 1/2)
geom_boxplot
A boxplot shows a box of the first and third quartiles and a notch at the median. The dots above or below denote points outside the hinges. The hinges [default to 1.5*IQR] show a range of expected data while the individual dots show possible outliers outside the hinges. To adjust the hinges, the argument coef=1.5
can be adjusted.
Bonds %>%
ggplot(., aes(x = Assets)) + geom_boxplot()
geom_qq()
To compare empirical and theoretical quantiles. Comparing a distribution to the normal or others is common and this provides the tool for doing so. The default is a normal.
The empirical cumulative distribution function arises when we sort a quantitative variable and show the percentiles below said value.
Bonds %>%
ggplot(aes(sample = Assets)) + geom_qq()
stat_ecdf(geom = )
We could do this with most geometries. I will show a few.
stat_ecdf(geom = "step")
Bonds %>%
ggplot(aes(x = Assets)) + stat_ecdf(geom = "point") + stat_ecdf(geom = "step",
alpha = 0.1) + labs(y = "ECDF: Proportion less than Assets") + theme_minimal()
stat_ecdf(geom = "point")
Bonds %>%
ggplot(aes(x = Assets)) + stat_ecdf(geom = "point") + stat_ecdf(geom = "step",
alpha = 0.1) + labs(y = "ECDF: Proportion less than Assets") + theme_minimal()
Combining two
Bonds %>%
ggplot(aes(x = Assets)) + stat_ecdf(geom = "point") + stat_ecdf(geom = "step",
alpha = 0.1) + labs(y = "ECDF: Proportion less than Assets") + theme_minimal()
stat_ecdf(geom = "line")
Bonds %>%
ggplot(aes(x = Assets)) + stat_ecdf(geom = "line") + labs(y = "ECDF: Proportion less than Assets") +
theme_minimal()
stat_ecdf(geom = "area")
Bonds %>%
ggplot(aes(x = Assets)) + stat_ecdf(geom = "area", alpha = 0.2) + labs(y = "ECDF: Proportion less than Assets") +
theme_minimal()