Statistical inference helps us understand the data, and hypothesis testing helps us judge whether one set of data differs from another. These techniques are important when exploring data sets, as they help guide our analysis. However, they are not enough on their own. Often, we want to understand the relationship between two sets of data, such as how AAPL moves with respect to the S&P 500. When we are looking for the relationship between two sets of quantitative data, we can start with correlation and covariance.
Correlation and Covariance
When we want to describe the relationship between two sets of data, we can plot the data sets in a scatter plot and look at four characteristics:
- Direction: Are the data points sloping upwards or downwards?
- Form: Do the data points form a straight line or a curved line?
- Strength: Are the data points tightly clustered or spread out?
- Outliers: Are there data points far away from the main body of data?
The correlation coefficient can describe two of the four: the direction and strength of the relationship. Typically denoted as ρ (the Greek letter rho) or r, the equation for the correlation coefficient is:
ρxy = sxy / (sx * sy)
Where sxy is the covariance of x and y, or how they vary with respect to each other, and sx and sy are the standard deviations of x and y. The covariance is described by this equation:
sxy = 1/(n-1) ∑(xi – x̄)(yi – ȳ)
As we can see from the equation, the covariance sums the term (xi – x̄)(yi – ȳ) over every data point and divides by n – 1, where x̄ (x bar) is the average x value and ȳ (y bar) is the average y value. The term is positive when x and y are both above their averages or both below them, and negative when one is above its average while the other is below. Because the covariance accounts for every data point in the set, a positive covariance means that, on balance, the data points are in sync with respect to x and y (small y when x is small, large y when x is large). Conversely, a negative covariance means that, on balance, the data points are out of sync (small y when x is large, large y when x is small). Points far from the averages contribute the most, so a few extreme points can outweigh many small ones.
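As a minimal sketch of the covariance equation, the Python below computes the sample covariance by hand. The function name `sample_covariance` and the two small data sets are made up for illustration and are not market data.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of (xi - x_bar) * (yi - y_bar), divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # One product per data point; its sign says whether that point is "in sync".
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical illustrative values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_in_sync = [2.0, 4.0, 5.0, 4.0, 5.0]      # tends to rise with x
ys_out_of_sync = [5.0, 4.0, 5.0, 4.0, 2.0]  # tends to fall as x rises

cov_up = sample_covariance(xs, ys_in_sync)
cov_down = sample_covariance(xs, ys_out_of_sync)
print("in sync:", cov_up)
print("out of sync:", cov_down)
```

In the first data set, the point (1, 2) sits below both averages, so its product is positive even though both values are small. That is why it counts as "in sync".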
Covariance is a useful measure for describing the direction of the linear association between two quantitative variables, but it has two weaknesses: a larger covariance does not always mean a stronger relationship, and we cannot compare covariances across different pairs of variables.
Covariance depends in part on the scale of x and y in the data: if the values of x are large and vary by large amounts, the covariance will tend to be large too. For example, if we were to compare the covariance of the S&P 500 and AAPL to the covariance of MSFT and AAPL, we would find that the first covariance is much bigger. The difference would be mainly due to the fact that the S&P 500 is quoted in the thousands, whereas MSFT and AAPL are only in the hundreds, and it does not speak to the strength of the linear association.
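The scale effect can be shown with a small sketch. Here the same hypothetical price series is expressed in dollars and then in cents. The relationship is identical in both cases, yet the covariance grows by a factor of 100. The series and variable names are assumptions for illustration only.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical prices, not real quotes.
prices_dollars = [10.0, 11.0, 12.0, 11.0, 13.0]
other_series = [20.0, 21.0, 23.0, 22.0, 24.0]

# Same data, different unit: dollars converted to cents.
prices_cents = [p * 100 for p in prices_dollars]

cov_dollars = sample_covariance(prices_dollars, other_series)
cov_cents = sample_covariance(prices_cents, other_series)

print("covariance with x in dollars:", cov_dollars)
print("covariance with x in cents:  ", cov_cents)
print("ratio:", cov_cents / cov_dollars)
```

Nothing about the strength of the association changed between the two calls. Only the unit of x did, which is why raw covariances are hard to compare.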
To account for these weaknesses, we normalize the covariance by the product of the standard deviations of the x values and y values to get the correlation coefficient. The correlation coefficient is a value between -1 and 1, and measures both the direction and the strength of the linear association. One important distinction is that correlation does not measure the slope of the relationship: a large correlation only speaks to the strength of the relationship. Some key points on correlation are:
- Correlation measures the direction and strength of the linear association between two quantitative variables
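- The sign (positive or negative) indicates the direction, and the magnitude (large or small) indicates the strength
- Outliers should be noted and may need to be treated, as a single extreme point can noticeably change the correlation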
- Correlation has symmetry: correlation of x and y is the same as correlation of y and x
- Correlation is unitless and normalized
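Several of these points can be checked numerically. The sketch below implements the correlation formula from above (covariance divided by the product of the standard deviations). It then checks symmetry, invariance to rescaling, and the fact that a steep line and a shallow line both have a correlation of 1. The helper names and data are illustrative assumptions.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Correlation coefficient: s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical illustrative values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)                             # symmetry
r_rescaled = correlation([100 * x for x in xs], ys)    # unitless: rescaling x changes nothing

# Two perfect lines with very different slopes.
r_steep = correlation(xs, [2.0 * x for x in xs])
r_shallow = correlation(xs, [0.1 * x for x in xs])

print("r(x, y):", round(r_xy, 4))
print("r(y, x):", round(r_yx, 4))
print("r(100x, y):", round(r_rescaled, 4))
print("steep vs shallow line:", round(r_steep, 4), round(r_shallow, 4))
```

The last two values show why correlation says nothing about slope: both lines fit perfectly, so both score 1, even though one rises twenty times faster than the other.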
Correlation is often presented in a correlation matrix, where the correlation of each pair of variables is reported in a table.
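A correlation matrix can be sketched in a few lines of plain Python. The three series labeled A, B and C are hypothetical placeholders, not real tickers or market data. In practice a data analysis library would typically build this table for you.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient; the 1 / (n - 1) factors cancel, so they are omitted."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical series for illustration only.
series = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 5.0, 4.0, 5.0],
    "C": [5.0, 3.0, 4.0, 2.0, 1.0],
}

names = list(series)
# matrix[i][j] holds the correlation between series i and series j.
matrix = [[correlation(series[a], series[b]) for b in names] for a in names]

# Print the matrix as a small table.
print("    " + "".join(f"{name:>8}" for name in names))
for name, row in zip(names, matrix):
    print(f"{name:<4}" + "".join(f"{value:>8.2f}" for value in row))
```

The diagonal is always 1 (each series is perfectly correlated with itself), and the table is mirrored across the diagonal because correlation is symmetric.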
Correlation and covariance describe the direction of the relationship between two variables, and correlation also describes its strength, but neither accounts for the slope of the relationship. In other words, we do not know how a change in one variable could impact the other variable. Regression is the technique that fills this void: it allows us to make the best guess at how one variable affects the other. The simplest linear regression allows us to fit a “line of best fit” to the scatter plot, and use that line (or model) to describe the relationship between the two variables. The equation for that line is:
y = β0 + β1x + ε
Where y is the dependent variable, and x is the independent variable. The betas are the coefficients (or constants) in the equation — β0 is the y-intercept of the line, and β1 is the slope of the line. The epsilon (ε) is the error (or residual) term. The regression minimizes the sum of squared errors between the actual y values and the y values predicted by the line of best fit.
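For a single independent variable, the least-squares coefficients have a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through (x̄, ȳ). The sketch below uses that closed form. The names `fit_line` and `sum_squared_errors` and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = beta0 + beta1 * x for one independent variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx              # slope: covariance over variance of x
    beta0 = y_bar - beta1 * x_bar    # intercept: line passes through the averages
    return beta0, beta1


def sum_squared_errors(xs, ys, beta0, beta1):
    """Sum of squared gaps between actual y and the y predicted by the line."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustrative values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0, beta1 = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, beta0, beta1)

# Any other line should do worse; here the slope is nudged slightly.
nudged_sse = sum_squared_errors(xs, ys, beta0, beta1 + 0.1)

print("intercept:", round(beta0, 4), "slope:", round(beta1, 4))
print("SSE of fitted line:", round(best_sse, 4))
print("SSE with nudged slope:", round(nudged_sse, 4))
```

Comparing the two sums of squared errors is a quick sanity check that the fitted line is the one that minimizes them.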
You have probably seen this equation many times before, in high school (y = mx + b) and in the CAPM (E(ri) = rF + (E(rM) – rF) * βi). At a high level, the equation describes how the observed data is affected by systematic relationships (β0 + β1x) and by “randomness” (ε). Randomness could come from measurement error, random chance, or systematic relationships not accounted for in the variables present.
For example, if we regress AAPL returns on S&P 500 returns, we will find a systematic relationship between the two, described by β1 or “beta”. We will also find that the relationship between the two is not perfectly described by the model, as there are firm-specific risks involved. If Tim Cook smokes marijuana on a podcast and the stock price tanks, that cannot be accounted for by the variables present, and it goes into the error term. If Bloomberg glitches and reports a wrong number, that would also go into the error term.
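The split between a systematic part and an error term can be illustrated with simulated data. The sketch below does not use real AAPL or S&P 500 returns. It generates hypothetical market returns, builds stock returns from an assumed beta of 1.2 plus random firm-specific noise, and then checks how well the regression recovers that beta. All parameter values are assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = beta0 + beta1 * x for one independent variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


rng = random.Random(42)  # fixed seed so the simulation is repeatable

# Assumed "true" parameters of the simulated relationship.
TRUE_ALPHA = 0.0
TRUE_BETA = 1.2
N_DAYS = 5000

# Simulated daily market returns and firm-specific shocks (the error term).
market = [rng.gauss(0.0, 0.01) for _ in range(N_DAYS)]
noise = [rng.gauss(0.0, 0.015) for _ in range(N_DAYS)]

# Stock return = systematic part + randomness.
stock = [TRUE_ALPHA + TRUE_BETA * m + e for m, e in zip(market, noise)]

alpha_hat, beta_hat = fit_line(market, stock)

# Residuals: what the fitted line cannot explain for each day.
residuals = [s - (alpha_hat + beta_hat * m) for m, s in zip(market, stock)]
mean_residual = sum(residuals) / len(residuals)

print("estimated beta:", round(beta_hat, 3), "(assumed true beta:", TRUE_BETA, ")")
print("mean residual:", mean_residual)
```

The estimated beta should land near the assumed value but not exactly on it, because the firm-specific noise is absorbed by the residuals rather than explained by the market.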