Whenever we observe data, we are usually observing one or a few samples from a much larger population. For example, if we are looking at daily stock market returns for AAPL for last year, we are looking at only a small portion of the overall daily returns. More often than not, we are trying to understand the characteristics of that underlying population, which encompasses ALL of the daily stock market returns, from the sample that we collected. The process that we take to infer those characteristics is called statistical inference.
To draw conclusions on the underlying population, we must first define a few common terms:
- Mean: the average value in a sample or population
- Variance: the spread in a sample or population
- Standard Deviation: the square root of the variance
- Probability Distribution: a distribution of all values in a sample or population
- Bernoulli Distribution: a simple distribution describing binary outcomes
- Normal Distribution: a common bell-shaped distribution of a continuous random variable, with several unique properties:
  - Can be fully described by its mean and standard deviation
  - Has the same mean, median, and mode
  - Approximately 68%, 95.4%, and 99.7% of the population falls within 1, 2, and 3 standard deviations of the mean, respectively
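The 68-95-99.7 rule is easy to check by simulation. A minimal sketch in Python (the mean, standard deviation, and number of draws below are arbitrary choices, not from the text):

```python
import numpy as np

# Draw a large normal sample and measure how much of it falls within
# 1, 2, and 3 standard deviations of the mean (the empirical rule).
rng = np.random.default_rng(0)
draws = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

for k in (1, 2, 3):
    share = np.mean(np.abs(draws) <= k)
    print(f"within {k} SD: {share:.3f}")
```

The printed shares should land very close to 0.683, 0.954, and 0.997.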
Distribution of Sample Means and the Central Limit Theorem
When we draw a sample from the population, we usually do not know the shape of the underlying distribution. If we draw multiple samples from the population and plot the distribution of the means of each sample, we get the distribution of sample means (sometimes called the sampling distribution of sample means). This distribution has a few special properties:
- The distribution is approximately normal for sufficiently large samples, regardless of the shape of the underlying population (this is the Central Limit Theorem)
- The mean of the distribution is the mean of the population
- The standard deviation of the distribution is called the standard error
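These properties can be seen in a quick simulation: draw many samples from a deliberately skewed (exponential) population and look at the distribution of the sample means. The sample size and number of trials below are illustrative:

```python
import numpy as np

# Exponential(scale=1) is heavily skewed, with mean 1 and SD 1.
rng = np.random.default_rng(1)
n, trials = 50, 20_000

# Each row is one sample of size n; take the mean of each sample.
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}")  # close to the population mean, 1.0
print(f"SD of sample means:   {sample_means.std():.3f}")   # close to 1 / sqrt(50), about 0.141
```

Even though the population is far from normal, a histogram of `sample_means` would look bell-shaped, centered on the population mean, with spread equal to the standard error.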
However, drawing enough samples to construct this distribution directly is time-consuming. Instead, we can estimate the standard error from a single sample using the following rule:
Standard Error = Standard Deviation / Square Root of Sample Size
We also need to make a few assumptions:
- The sample is randomly drawn
- The observations are independent
- The sample size is sufficiently large (a common rule of thumb is n > 30)
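Under these assumptions, the standard error can be estimated from one sample. A minimal sketch, with simulated data standing in for 300 daily returns (the mean and standard deviation used to generate the data are hypothetical):

```python
import numpy as np

# Simulated stand-in for a sample of 300 daily returns.
rng = np.random.default_rng(2)
sample = rng.normal(loc=0.001, scale=0.01, size=300)

# Standard Error = sample standard deviation / sqrt(sample size)
std_error = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"standard error: {std_error:.5f}")  # roughly 0.01 / sqrt(300)
```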
Knowing the standard error and the fact that the distribution is normal, we only need the mean to build the distribution. For example, suppose that the average daily return of AAPL over the past year is -3.57%, with a standard deviation of 1%. With roughly 300 observations, the standard error is about 1% / sqrt(300), or roughly 0.06%. Since 95% of sample means fall within +/- 1.96 standard errors of the true mean, we can infer that the true mean lies within +/- 1.96 standard errors of our sample mean, at a 95% confidence level. This range is called the confidence interval.
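The confidence interval can be computed directly. A sketch using the numbers from the example above (sample mean -3.57%, standard deviation 1%, roughly 300 observations):

```python
import math

# Numbers from the AAPL example in the text.
sample_mean, sample_sd, n = -0.0357, 0.01, 300

std_error = sample_sd / math.sqrt(n)
z = 1.96  # critical value for a 95% confidence level

low = sample_mean - z * std_error
high = sample_mean + z * std_error
print(f"95% CI: [{low:.4%}, {high:.4%}]")
```

The interval comes out to roughly [-3.68%, -3.46%]: a narrow band, because the standard error shrinks with the square root of the sample size.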
Understanding the true population is important, but insights are often driven by the relative difference between two sets of data. To test for the existence and significance of a difference, we use hypothesis testing, an extension of what we did above. The steps to hypothesis testing are as follows:
- Form a good hypothesis: this will be driven by the problem we are trying to solve
- Formalize the hypothesis as Null and Alternative: this will define our tests
- Find the test statistic: this can be a z-score or a t-score
- Find the p-value: this is what determines whether our difference is significant
- Decision: this is a written version of our conclusions
A good hypothesis should be specific, testable against data, and grounded in the problem we are trying to solve. For example, suppose we want to see whether AAPL returns last year were different from bank interest (say 2%):
Step 1: Our hypothesis is that AAPL returns were, on average, different from 2%.
Step 2: The null hypothesis is what we are looking to reject, and it always states an equality: that the average return equals 2%. The alternative hypothesis is its complement: that the average return does not equal 2%.
Step 3: To find the test statistic, we would use the distribution of sample means centered at 2% and see where the AAPL sample mean lies. This step is typically automated by Excel or software like Statstools or R. The test statistic comes out as a z-score: the number of standard errors between the null-hypothesis mean and the sample mean. We can compare this z-score to the critical value (usually 1.96 for 95% confidence) to determine statistical significance, but we usually move on to the next step.
Step 4: Excel or another software package would also handle the p-value. The p-value is the share of the distribution that lies at least as far from the null-hypothesis mean as the sample mean does. For example, a one-tailed p-value of 0.02 would mean that 2% of the distribution lies to the left of -3.57% (a two-tailed test doubles this area). We then compare the p-value against a significance threshold called alpha. If the p-value is lower than alpha, the result is statistically significant, and we reject the null hypothesis.
Step 5: We can then translate the statistical jargon into plain English: based on the tests, the returns from AAPL last year were, on average, lower than the returns from bank interest.
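The worked example can be sketched end to end in a few lines. The numbers come from the example above and are purely illustrative; the very large z-score simply reflects how far the sample mean sits from the hypothesized 2%:

```python
import math

# H0: mean return = 2%; sample mean -3.57%, SD 1%, n = 300 (from the text).
mu0, sample_mean, sample_sd, n = 0.02, -0.0357, 0.01, 300

std_error = sample_sd / math.sqrt(n)
z = (sample_mean - mu0) / std_error

# Two-tailed p-value from the normal CDF, via the complementary error function.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.1f}, p = {p_value:.3g}")
if p_value < 0.05:  # alpha = 0.05
    print("reject the null hypothesis")
```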
There are two types of hypothesis tests; we just described the two-tailed test. The alternative is the one-tailed test, where we test whether the sample mean is greater than (or less than) the hypothesized value. The one-tailed test is easier to reject at the same significance threshold (alpha), because the entire rejection region sits in one tail rather than being split between the two tails (alpha in one tail instead of alpha/2 in each). As the two-tailed test is more conservative, it should be used unless:
- We have strong outside evidence (business reasons or practical constraints) that the true mean can only deviate in one direction
- We only care about deviations in one direction; a deviation in the other direction would be treated as if there were no deviation at all
For example, if we are testing to see if the returns on a high-yield bond, on average, are higher than the returns on a government bond, there is justification for a one-tailed test.
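To see why the one-tailed test is easier to reject, compare the two p-values for the same test statistic (the z-score of 1.8 below is hypothetical):

```python
import math

def p_two_tailed(z: float) -> float:
    """P(|Z| >= |z|): tests whether the mean differs in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_one_tailed_greater(z: float) -> float:
    """P(Z >= z): tests only whether the mean is greater than hypothesized."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8  # a hypothetical test statistic
print(f"two-tailed p: {p_two_tailed(z):.4f}")          # about 0.072, not significant at alpha = 0.05
print(f"one-tailed p: {p_one_tailed_greater(z):.4f}")  # about 0.036, significant at alpha = 0.05
```

At z = 1.8, the one-tailed p-value is exactly half the two-tailed one, so the same evidence clears the 0.05 bar in one test but not the other.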