11 篇文章 0 订阅
34 篇文章 0 订阅
6 篇文章 0 订阅

# Pearson correlation coefficient 的简单理解

## 2. 定义

ρ X , Y = c o v ( X , Y ) σ X σ Y = E [ ( X − μ X ) ( Y − μ Y ) ] σ X σ Y \rho_{X,Y}=\frac {cov(X,Y)}{\sigma_X\sigma_Y}=\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}

r = ∑ i = 1 n ( X i − X ‾ ) ( Y i − Y ‾ ) ∑ i = 1 n ( X i − X ‾ ) 2 ∑ i = 1 n ( Y i − Y ‾ ) 2 r= \frac{\sum_{i=1}^{n}{(X_i-\overline{X})(Y_i-\overline{Y})}}{\sqrt{\sum_{i=1}^{n}{(X_i-\overline{X})^2}}\sqrt{\sum_{i=1}^{n}{(Y_i-\overline{Y})^2}}}

r也可以由 ( X i , Y i ) (X_i, Y_i) 样本点的标准分数均值估计，得到与上式等价的表达式

r = 1 n − 1 ∑ i = 1 n ( X i − X ‾ σ X ) ( Y i − Y ‾ σ Y ) r=\frac{1}{n-1}\sum_{i=1}^{n}{(\frac{X_i-\overline{X}}{\sigma_X})(\frac{Y_i-\overline{Y}}{\sigma_Y})}

## 3. 数学特性

c o r r ( X , Y ) = c o r r ( Y , X ) corr(X,Y)=corr(Y,X)

PCC有一个重要的数学特性是，因两个变量的位置和尺度的变化并不会引起该系数的改变，也就是说如果我们对X进行 a + b X a+bX 和把Y进行 c + d Y c+dY 进行平移和尺度缩放，其中a,b,c,d都是大于0的常数，这种操作并不会改变PCC的值

## 4. 代码实现

from scipy import stats
import numpy as np

def ComputePCC(x, y):
dtype = type(1.0 + x[0] + y[0])
xmean = x.mean(dtype=dtype)
ymean = y.mean(dtype=dtype)

xm = x.astype(dtype) - xmean
ym = y.astype(dtype) - ymean

normxm = np.linalg.norm(xm)
normym = np.linalg.norm(ym)
r = np.dot(xm / normxm, ym / normym)
return r

a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)
print(stats.pearsonr(a, b)[0])
print(ComputePCC(a, b))


def pearsonr(x, y):
r"""
Pearson correlation coefficient and p-value for testing non-correlation.

The Pearson correlation coefficient [1]_ measures the linear relationship
between two datasets.  The calculation of the p-value relies on the
assumption that each dataset is normally distributed.  (See Kowalski [3]_
for a discussion of the effects of non-normality of the input on the
distribution of the correlation coefficient.)  Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear relationship.
Positive correlations imply that as x increases, so does y. Negative
correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets.

Parameters
----------
x : (N,) array_like
Input array.
y : (N,) array_like
Input array.

Returns
-------
r : float
Pearson's correlation coefficient.
p-value : float
Two-tailed p-value.

Warns
-----
PearsonRConstantInputWarning
Raised if an input is a constant array.  The correlation coefficient
is not defined in this case, so np.nan is returned.

PearsonRNearConstantInputWarning
Raised if an input is "nearly" constant.  The array x is considered
nearly constant if norm(x - mean(x)) < 1e-13 * abs(mean(x)).
Numerical errors in the calculation x - mean(x) in this case might
result in an inaccurate calculation of r.

--------
spearmanr : Spearman rank-order correlation coefficient.
kendalltau : Kendall's tau, a correlation measure for ordinal data.

Notes
-----
The correlation coefficient is calculated as follows:

.. math::

r = \frac{\sum (x - m_x) (y - m_y)}
{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}

where :math:m_x is the mean of the vector :math:x and :math:m_y is
the mean of the vector :math:y.

Under the assumption that :math:x and :math:m_y are drawn from
independent normal distributions (so the population correlation coefficient
is 0), the probability density function of the sample correlation
coefficient :math:r is ([1]_, [2]_):

.. math::

f(r) = \frac{{(1-r^2)}^{n/2-2}}{\mathrm{B}(\frac{1}{2},\frac{n}{2}-1)}

where n is the number of samples, and B is the beta function.  This
is sometimes referred to as the exact distribution of r.  This is
the distribution that is used in pearsonr to compute the p-value.
The distribution is a beta distribution on the interval [-1, 1],
with equal shape parameters a = b = n/2 - 1.  In terms of SciPy's
implementation of the beta distribution, the distribution of r is::

dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)

The p-value returned by pearsonr is a two-sided p-value.  For a
given sample with correlation coefficient r, the p-value is
the probability that abs(r') of a random sample x' and y' drawn from
the population with zero correlation would be greater than or equal
to abs(r).  In terms of the object dist shown above, the p-value
for a given r and length n can be computed as::

p = 2*dist.cdf(-abs(r))

When n is 2, the above continuous distribution is not well-defined.
One can interpret the limit of the beta distribution as the shape
parameters a and b approach a = b = 0 as a discrete distribution with
equal probability masses at r = 1 and r = -1.  More directly, one
can observe that, given the data x = [x1, x2] and y = [y1, y2], and
assuming x1 != x2 and y1 != y2, the only possible values for r are 1
and -1.  Because abs(r') for any sample x' and y' with length 2 will
be 1, the two-sided p-value for a sample of length 2 is always 1.

References
----------
.. [1] "Pearson correlation coefficient", Wikipedia,
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
.. [2] Student, "Probable error of a correlation coefficient",
Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
.. [3] C. J. Kowalski, "On the Effects of Non-Normality on the Distribution
of the Sample Product-Moment Correlation Coefficient"
Journal of the Royal Statistical Society. Series C (Applied
Statistics), Vol. 21, No. 1 (1972), pp. 1-12.

Examples
--------
>>> from scipy import stats
>>> a = np.array([0, 0, 0, 1, 1, 1, 1])
>>> b = np.arange(7)
>>> stats.pearsonr(a, b)
(0.8660254037844386, 0.011724811003954649)

>>> stats.pearsonr([1, 2, 3, 4, 5], [10, 9, 2.5, 6, 4])
(-0.7426106572325057, 0.1505558088534455)

"""
n = len(x)
if n != len(y):
raise ValueError('x and y must have the same length.')

if n < 2:
raise ValueError('x and y must have length at least 2.')

x = np.asarray(x)
y = np.asarray(y)

# If an input is constant, the correlation coefficient is not defined.
if (x == x[0]).all() or (y == y[0]).all():
warnings.warn(PearsonRConstantInputWarning())
return np.nan, np.nan

# dtype is the data type for the calculations.  This expression ensures
# that the data type is at least 64 bit floating point.  It might have
# more precision if the input is, for example, np.longdouble.
dtype = type(1.0 + x[0] + y[0])

if n == 2:
return dtype(np.sign(x[1] - x[0])*np.sign(y[1] - y[0])), 1.0

xmean = x.mean(dtype=dtype)
ymean = y.mean(dtype=dtype)

# By using astype(dtype), we ensure that the intermediate calculations
# use at least 64 bit floating point.
xm = x.astype(dtype) - xmean
ym = y.astype(dtype) - ymean

# Unlike np.linalg.norm or the expression sqrt((xm*xm).sum()),
# scipy.linalg.norm(xm) does not overflow if xm is, for example,
# [-5e210, 5e210, 3e200, -3e200]
normxm = linalg.norm(xm)
normym = linalg.norm(ym)

threshold = 1e-13
if normxm < threshold*abs(xmean) or normym < threshold*abs(ymean):
# If all the values in x (likewise y) are very close to the mean,
# the loss of precision that occurs in the subtraction xm = x - xmean
# might result in large errors in r.
warnings.warn(PearsonRNearConstantInputWarning())

r = np.dot(xm/normxm, ym/normym)

# Presumably, if abs(r) > 1, then it is only some small artifact of
# floating point arithmetic.
r = max(min(r, 1.0), -1.0)

# As explained in the docstring, the p-value can be computed as
#     p = 2*dist.cdf(-abs(r))
# where dist is the beta distribution on [-1, 1] with shape parameters
# a = b = n/2 - 1.  special.btdtr is the CDF for the beta distribution
# on [0, 1].  To use it, we make the transformation  x = (r + 1)/2; the
# shape parameters do not change.  Then -abs(r) used in cdf(-abs(r))
# becomes x = (-abs(r) + 1)/2 = 0.5*(1 - abs(r)).  (r is cast to float64
# to avoid a TypeError raised by btdtr when r is higher precision.)
ab = n/2 - 1
prob = 2*special.btdtr(ab, ab, 0.5*(1 - abs(np.float64(r))))

return r, prob



ps： 理解还是比较浅显的，主要参考的是wiki上的介绍，公式中没有使用加粗的形式表示向量和矩阵，说明一下哈。

• 0
点赞
• 0
评论
• 1
收藏
• 一键三连
• 扫一扫，分享海报

09-11 1312
08-11 2595
03-15 398
10-30 10万+
09-26 8450
10-02 905
05-15 6168
03-12 5万+
©️2020 CSDN 皮肤主题: Age of Ai 设计师:meimeiellie

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、C币套餐、付费专栏及课程。