The Basic Idea
Two classes of statistics have known sampling distributions.
Sample means, standardized by their estimated standard error, follow a t distribution.
Sample proportions are approximately normal if the expected count in each of the two categories exceeds 5.
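The rule of thumb for proportions can be checked directly. The following is a minimal sketch; the helper name `normal_approx_ok` and its `threshold` argument are hypothetical and are not part of the original post.

```r
# Hypothetical helper: TRUE when the expected counts of both categories
# exceed the threshold (5, following the rule of thumb above).
normal_approx_ok <- function(n, p, threshold = 5) {
  expected <- c(success = n * p, failure = n * (1 - p))
  all(expected > threshold)
}

normal_approx_ok(100, 0.5)  # expected counts are 50 and 50
normal_approx_ok(20, 0.1)   # expected counts are 2 and 18
```

With 20 observations and a proportion of 0.1, only two successes are expected, so the normal approximation would not be trusted under this rule.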
The Mean: t
The t distribution
- is entirely defined by degrees of freedom.
- is measured in units of the standard error [in this case, of the mean].
The equations:
\[
\Large
t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \]
and
\[
\Large
\mu = \overline{x} + t\left(\frac{s}{\sqrt{n}}\right)
\]
Because the t distribution is symmetric, the plausible values of the true mean are symmetric about the sample mean, with t defining the number of standard errors of the mean above and below.
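The two equations can be worked through in R. This is a sketch under stated assumptions: the sample is simulated, and the seed, the sample size, the hypothesized mean `mu0`, and the 95% level are all choices made for illustration only.

```r
set.seed(1)                        # assumption: seed chosen for this sketch
x <- rnorm(25, mean = 5, sd = 2)   # assumption: a simulated sample
n <- length(x)
xbar <- mean(x)
se <- sd(x) / sqrt(n)              # standard error of the mean, s / sqrt(n)

mu0 <- 5                           # assumption: hypothesized true mean
t_stat <- (xbar - mu0) / se        # first equation: t for a given mu

t_crit <- qt(0.975, df = n - 1)    # t with n - 1 degrees of freedom
ci <- xbar + c(-1, 1) * t_crit * se  # second equation with t = -/+ t_crit

tt <- t.test(x, mu = mu0)          # built-in cross-check
```

The values in `tt$statistic` and `tt$conf.int` should agree with `t_stat` and `ci`, since `t.test` uses the same standard error and the same n - 1 degrees of freedom.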
Telco Churn
The loss of customers is known as churn. Here are some data on telephone company customers that give us a number of customer features and the billing history relevant to churn.
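# Load packages; readr is also attached by tidyverse
library(tidyverse)
library(readr)
library(skimr)
# Read the churn data directly from GitHub
telco <- read_csv(url("https://github.com/robertwwalker/DADMStuff/raw/master/WA_Fn-UseC_-Telco-Customer-Churn.csv"))
# Summarize every column by type
skim(telco)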
Table 1: Data summary
Name: telco
Number of rows: 7043
Number of columns: 21
Column type frequency:
- character: 17
- numeric: 4
Group variables: None
Variable type: character
Probability: The Logic of Science
Jaynes presents a few core ideas and requirements for his rational system. Probability emerges as the representation of circumstances in which any given realization of a process is either TRUE or FALSE, but both are possible and expressible by probabilities that:
- sum to one over all events
- are greater than or equal to zero for any given event
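The two requirements can be expressed as a short check. This is an illustrative sketch only; the helper `is_valid_probability`, its tolerance, and the example values are hypothetical and do not come from Jaynes or from the original post.

```r
# Hypothetical helper: TRUE when every probability is non-negative
# and the probabilities sum to one (within a numerical tolerance).
is_valid_probability <- function(p, tol = 1e-8) {
  all(p >= 0) && abs(sum(p) - 1) < tol
}

# Assumed probabilities for a process that is either TRUE or FALSE
p <- c("TRUE" = 0.3, "FALSE" = 0.7)
is_valid_probability(p)            # both requirements hold
is_valid_probability(c(0.5, 0.6))  # fails: sums to 1.1
```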
General Representation of Probability
Is of necessity two-dimensional,
Fast Food Data
These data came courtesy of a Tidy Tuesday a while ago. The data consist of menu items for a selection of fast food chains; the units are menu items. We have the chain [restaurant], the item [the item name], and a series of variables (columns) representing sodium, cholesterol, fat, calories, and other information. Some values are missing. The data can be imported from the tidytuesday website on GitHub as .
There is wonderful documentation for the flexdashboard package. Furthermore, because it is built on underlying scripting, things like plotly just work. Here is an example.
Variance in the Outcome: The Black Box
Regression models engage in an exercise in variance accounting: how much of the outcome is explained by the inputs, individually (the slope divided by its standard error is t) and collectively (average explained over average unexplained, with averaging over degrees of freedom, is F)? This, of course, assumes normal errors. This document provides a function for making use of the black box. Just as in common parlance, the black box is what remains unexplained.
Fake Data
I will fake some data to work with according to the following equation.
\[ y = 2 + 2*x_{1} + 1*x_{2} -1*x_{3} + \epsilon \]
where each x and \(\epsilon\) are random draws from a standard normal distribution with mean zero and standard deviation 1.
x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100); e <- rnorm(100)
y <- 2 + 2*x1 + x2 - x3 + e
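The variance accounting can be reproduced by hand on data simulated from the same equation. This is a minimal sketch, not the function the post refers to; the seed and the object names (`fit`, `t_by_hand`, `F_by_hand`) are assumptions made for illustration.

```r
set.seed(42)  # assumption: seed added so this sketch is reproducible
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); e <- rnorm(n)
y <- 2 + 2*x1 + x2 - x3 + e

fit <- lm(y ~ x1 + x2 + x3)
coefs <- summary(fit)$coefficients

# Individually: slope divided by its standard error is t
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Collectively: average explained over average unexplained is F
ss_total <- sum((y - mean(y))^2)
ss_resid <- sum(residuals(fit)^2)   # the unexplained part, the black box
ss_model <- ss_total - ss_resid     # the explained part
df_model <- 3                       # three slopes
df_resid <- df.residual(fit)        # 100 observations - 4 coefficients
F_by_hand <- (ss_model / df_model) / (ss_resid / df_resid)
```

Both hand computations should match what `summary(fit)` reports in its `t value` column and its `fstatistic` element.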
My.