Monday, January 2, 2017

Correlation and Regression (editing, R2 SSE etc)


Good Explanation of Covariance Matrix (aka Variance-Covariance matrix)
https://www.itl.nist.gov/div898/handbook/pmc/section5/pmc541.htm


Variance

Average Distance
Consider points in one dimension. The distance between two points is the Euclidean distance, which in 1D is the absolute difference |a - b|.
Now consider a set of n 1D points and their mean point. Each point has a distance to the mean point. Adding these distances gives the total distance, and dividing the total by n gives the average distance. The absolute value matters here: the signed differences (xi - xm) always sum to zero.

Why prefer average squares over average distance?
Squaring each distance removes the sign, so deviations on either side of the mean cannot cancel each other, and large deviations get more weight. For each point we calculate the square of its distance to the mean and sum the squares. Dividing the total by n gives the average square, also known as the variance.
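To see why the sign matters, here is a minimal Python sketch on made-up data (the list `xs` and all variable names are illustrative assumptions). It shows that the signed distances to the mean cancel out, and compares the average absolute distance with the average square.

```python
xs = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data
n = len(xs)
xm = sum(xs) / n  # the mean point

signed = [x - xm for x in xs]   # signed distance of each point to the mean
sum_signed = sum(signed)        # positive and negative distances cancel

avg_distance = sum(abs(d) for d in signed) / n  # average (absolute) distance
avg_square = sum(d * d for d in signed) / n     # average square

print(sum_signed, avg_distance, avg_square)
```

Note that this divides by n, as in the paragraph above. The sample variance divides by n - 1 instead.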

Sample Variance (average square)
varx = Sum((xi - xm)^2) / (n - 1)
It is the same as the average square, except that the sum is divided by n - 1 instead of n.
It measures how far the data vary from the center (the mean) of the data.
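The formula can be written out directly. This is a minimal sketch on made-up data; the function name `sample_variance` is my own, and the result is cross-checked against `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import statistics

def sample_variance(xs):
    """Sum of squared distances to the mean, divided by n - 1."""
    n = len(xs)
    xm = sum(xs) / n
    return sum((x - xm) ** 2 for x in xs) / (n - 1)

xs = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data
varx = sample_variance(xs)

print(varx)                     # sum of squares is 32, so this is 32 / 7
print(statistics.variance(xs))  # library result for comparison
```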

Take a one-dimensional object as an example. Variance measures the average squared distance between the particles and the center of mass. The larger the variance, the bigger the object. A single dot has the minimum variance of 0, meaning all particles are at the same point, which is the center of mass.

Take a three-dimensional balloon as an example. Variance measures the average squared distance between the air particles and the center of the balloon. The larger the variance, the bigger the balloon. A very large variance means a giant balloon.

Sample Standard Deviation
Just take the square root of the sample variance: sx = sqrt(varx). Geometrically, the square root is the side length of the average square, so it is in the same units as the data.
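As a quick check, the sketch below takes the square root of the sample variance and compares it with `statistics.stdev` from the standard library. The data are made up and the variable names are illustrative.

```python
import math
import statistics

xs = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data
n = len(xs)
xm = sum(xs) / n

varx = sum((x - xm) ** 2 for x in xs) / (n - 1)  # sample variance (the average square)
sx = math.sqrt(varx)                             # side length of that square

print(sx, statistics.stdev(xs))
```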

Covariance

Covariance (average rectangle)
covxy = Sum((xi - xm) * (yi - ym)) / (n - 1)
Consider the scatter plot of x and y as a two-dimensional plane. Each data point and the mean point form opposite corners of a rectangle. The signed area of the rectangle is width x height = (xi - xm) * (yi - ym); it is positive when the point lies above-right or below-left of the mean point, and negative otherwise. Summing the areas and dividing by n - 1 gives the average rectangular area. This is the average rectangle, also known as covariance. The CFA material describes it as a "cross product", but I think the term "rectangles" is better.

Covariance is similar to variance, except that the square becomes a rectangle. Covariance is more general than variance: covariance measures how one data set varies together with another, whereas variance measures how one data set varies with itself. Variance can be considered the covariance of two identical data sets, i.e. xi = yi, so the rectangle becomes a square. In the same way that a square is a special case of a rectangle, variance is a special case of covariance.
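The "variance is covariance with itself" idea can be checked numerically. A minimal sketch, assuming made-up data; the function name `sample_covariance` is my own.

```python
import statistics

def sample_covariance(xs, ys):
    """Sum of signed rectangle areas around the mean point, divided by n - 1."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    return sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]  # made-up sample data
ys = [2, 4, 5, 4, 5]

covxy = sample_covariance(xs, ys)  # average rectangle
covxx = sample_covariance(xs, xs)  # rectangles become squares

print(covxy)
print(covxx, statistics.variance(xs))  # the last two should agree
```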

Correlation Coefficient
r = corr = covxy / (sx * sy), where sx and sy are the sample standard deviations of x and y. r always lies between -1 and 1.
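A minimal sketch of this formula on made-up data (the function name `correlation` is my own). The second call uses points that lie exactly on a rising straight line, where r should come out as 1 up to rounding.

```python
import math

def correlation(xs, ys):
    """r = cov / (sx * sy), using sample (n - 1) covariance and standard deviations."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - xm) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ym) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]  # made-up sample data
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_line = correlation(xs, [2 * x + 1 for x in xs])  # points exactly on a line

print(r, r_line)
```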

Regression

Linear Regression
Find the straight line that best fits the data points: the line for which the sum of squared vertical distances from the points to the line is minimal (the least-squares line).
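To illustrate what "minimal" means here, the sketch below computes the sum of squared errors (SSE) for the fitted line and for a few nearby lines. The data, the helper name `sse`, and the step size of 0.1 are all illustrative assumptions; the slope and intercept use the formulas from the sections that follow.

```python
def sse(xs, ys, m, c):
    """Sum of squared vertical distances from the points to the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # made-up sample data
ys = [2, 4, 5, 4, 5]
n = len(xs)
xm, ym = sum(xs) / n, sum(ys) / n

# Slope = Sxy / Sxx, and the line passes through the mean point (xm, ym)
m = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / sum((x - xm) ** 2 for x in xs)
c = ym - m * xm

best = sse(xs, ys, m, c)

# Nudge the slope and/or intercept and recompute the SSE for each nearby line
others = [sse(xs, ys, m + dm, c + dc)
          for dm in (-0.1, 0, 0.1)
          for dc in (-0.1, 0, 0.1)
          if (dm, dc) != (0, 0)]

print(best, min(others))
```

Every nearby line in this small grid has a larger SSE than the fitted line.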

Slope of Regression Line (cov / varx)
The straight line has the general form y = mx + c, where m is the slope and c is the y-intercept.
The slope of a straight line is the ratio of delta y to delta x. So m = Sum((xi - xm) * (yi - ym)) / Sum((xi - xm)^2), or in short Sxy / Sxx. Dividing both sums by n - 1 shows this is the average rectangle divided by the average square (cov / varx), which gives the ratio of height to width.

Y-intercept of Regression Line
The regression line must pass through the mean point (xm, ym). With the slope m already calculated, the point-slope form y - ym = m * (x - xm) gives the y-intercept: c = ym - m * xm.
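Putting the slope and intercept together, here is a minimal sketch on made-up data (all variable names are illustrative). It also checks that the resulting line passes through the mean point.

```python
xs = [1, 2, 3, 4, 5]  # made-up sample data
ys = [2, 4, 5, 4, 5]
n = len(xs)
xm, ym = sum(xs) / n, sum(ys) / n

sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))  # sum of rectangles
sxx = sum((x - xm) ** 2 for x in xs)                    # sum of squares

m = sxy / sxx     # slope
c = ym - m * xm   # y-intercept from the point-slope form

print(m, c)
print(m * xm + c, ym)  # the line evaluated at xm should give ym
```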

Slope of Regression Line (cov, std)
The slope can also be calculated from the correlation coefficient and the standard deviations. Since cov = r * sx * sy:
Slope m = cov / varx = cov / (sx * sx) = r * (sy / sx)
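The two ways of computing the slope should agree. A minimal sketch on made-up data, using `statistics.stdev` from the standard library for sx and sy; the variable names are illustrative.

```python
import statistics

xs = [1, 2, 3, 4, 5]  # made-up sample data
ys = [2, 4, 5, 4, 5]
n = len(xs)
xm, ym = sum(xs) / n, sum(ys) / n

cov = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / (n - 1)
sx = statistics.stdev(xs)  # sample standard deviation of x
sy = statistics.stdev(ys)  # sample standard deviation of y
r = cov / (sx * sy)        # correlation coefficient

m_from_cov = cov / (sx * sx)  # slope as cov / varx
m_from_r = r * (sy / sx)      # slope as r * (sy / sx)

print(m_from_cov, m_from_r)
```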


