Correlation analysis

From Cybis Wiki
Revision as of 10:30, 1 July 2009 by Lars-Ake

Correlation analysis is a type of scoring method in which you calculate a value for the covariance between two curves when they lie over each other at a certain position. That is, the calculated value, the correlation coefficient, is a measure of how well the two curves match each other at that position. When Curve sliding is used for the analysis during crossdating, the correlation coefficient is calculated at every possible overlapping position of the curves. Hopefully, the highest value found then corresponds to the correct crossdating position.

The correlation coefficient is cumbersome to calculate by hand, so you really need a computer for this type of scoring.

Within CDendro, the correlation coefficient used is the Pearson product-moment correlation coefficient (ref. to Wikipedia).

A coefficient value of 1 means that the two curves follow each other exactly. A value of -1 means that the curves behave exactly contrary to each other, i.e. when one curve goes up, the other goes down. Correlation coefficient values are always within the limits -1 to +1!

It should be noted that the statistical mathematics for the correlation coefficient is defined on relations between random variables. It should also be observed that ring width values are not random in that sense: when a ring is thick, there is a high probability that the next year's ring will also be thick. So the use of any correlation coefficient within dendrochronology is best motivated by practical observations of its efficiency in finding correct crossdatings and ... comment on error rate and reference to Torbjörn's article on TTEST.

Note: When comparing ring width curves, we do the correlation coefficient mathematics on the normalized curves! When you document a best value from such a correlation calculation, you should also document the normalization method used, as the level of the coefficient required to ascertain a dating differs somewhat with the normalization method used (ref 1).

Definition of the correlation coefficient

Define X and Y as paired curve values. There is one X and one Y for each overlapping year when the curves lie at a certain position. Let E denote the mean over these paired values. Define Mx and My as the mean values of each curve, i.e. Mx = E(X) and My = E(Y). Calculate the standard deviations as Sx = sqrt( E((X-Mx)²) ) and Sy = sqrt( E((Y-My)²) ). (The standard deviation is a measure of a "normal" (typical) distance from a point on a curve to the mean value of that curve.)

Calculate the correlation coefficient as r = E( (X-Mx)*(Y-My)) / (Sx * Sy )
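
The definition above can be sketched in a few lines of Python. This is only an illustration of the formula, not CDendro's actual code; the function name pearson_r is made up for this example, and E is taken as the plain mean over the paired values.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of two equally long lists of paired values.

    Hypothetical helper for illustration; follows the definition
    r = E((X-Mx)*(Y-My)) / (Sx * Sy) with E as the plain mean.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mx = sum(xs) / n                                      # Mx = E(X)
    my = sum(ys) / n                                      # My = E(Y)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)    # Sx
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)    # Sy
    if sx == 0 or sy == 0:
        # A completely flat curve has no variation to correlate with.
        raise ValueError("correlation is undefined for a constant curve")
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sx * sy)
```

Two curves that rise together, such as [1, 2, 3, 4] and [2, 4, 6, 8], give a value of 1 (up to rounding), while [1, 2, 3, 4] against [8, 6, 4, 2] gives -1.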

Overlapping

If we slide the curve of one sample so that it hangs out a bit on either side of the other curve, only a part of the first curve overlaps the other curve. It is usually not meaningful to test the curve fitting when the overlap is less than 30 years. For proper crossdating, overlaps shorter than 50-70 years should not be considered.
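
As a rough sketch of how sliding with a minimum overlap might look, the Python below tries every position of a sample against a reference and skips positions where too few years overlap. It is not CDendro's implementation: the names pearson_r, slide and min_overlap are hypothetical, and the curves are assumed to be already normalized lists with one value per year.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists (hypothetical helper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant curve")
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sx * sy)

def slide(sample, reference, min_overlap=30):
    """Return a list of (offset, overlap, r) for every allowed position.

    offset is the index in reference where sample[0] lies; it is negative
    when the sample hangs out before the start of the reference.
    """
    results = []
    for offset in range(-(len(sample) - 1), len(reference)):
        start = max(0, offset)                            # first shared index in reference
        end = min(len(reference), offset + len(sample))   # one past the last shared index
        overlap = end - start
        if overlap < min_overlap:
            continue                                      # too few overlapping years to test
        ref_part = reference[start:end]
        smp_part = sample[start - offset:end - offset]
        try:
            r = pearson_r(smp_part, ref_part)
        except ValueError:
            continue                                      # flat segment, no value to report
        results.append((offset, overlap, r))
    return results
```

The position with the highest r can then be picked with max(results, key=lambda t: t[2]). Raising min_overlap to 50 or more corresponds to the stricter requirement for proper crossdating.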

TTest/T-score/T-value

The TTest value is based on the correlation value, but it also takes into account that a match with a short overlap is worth less than a match with a longer overlap when the correlation values are the same.

TTest values are calculated according to the formula below, where n is the number of overlapping years and r is the correlation coefficient value.

TTest = r * sqrt( (n-2) / (1 - r²) )
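
A minimal Python sketch of this formula follows, reading the root in the formula as a square root. The function name t_value is made up for this example, and the error checks are an assumption about how the edge cases could be handled, not a description of CDendro.

```python
import math

def t_value(r, n):
    """TTest value for correlation coefficient r over n overlapping years.

    Hypothetical helper for illustration of TTest = r * sqrt((n-2) / (1 - r^2)).
    """
    if n <= 2:
        raise ValueError("need more than 2 overlapping years")
    if abs(r) >= 1:
        # 1 - r^2 would be zero (or negative), so the formula breaks down.
        raise ValueError("r must lie strictly between -1 and +1")
    return r * math.sqrt((n - 2) / (1 - r * r))
```

For example, r = 0.5 with an overlap of 50 years gives a TTest value of 4.0, and the same r with a longer overlap gives a higher value, which is how the formula rewards longer overlaps.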