功能性模块:(9)Pearson Correlation coeffiicient的简单理解

功能性模块 同时被 3 个专栏收录
11 篇文章 0 订阅
34 篇文章 0 订阅
6 篇文章 0 订阅

Pearson correlation coefficient 的简单理解

1. 背景

最近LZ正在恶补一些nlp相关的知识,其中在判断句子文本相似性的时候,使用了Pearson correlation和Spearman’s rank correlation这两个统计学相关系数,还是做个笔记吧,不然时间久了,又会忘记了。

在统计学中,皮尔逊相关系数(Pearson correlation coefficient, PCC),有的时候也指Pearson’s r, 皮尔逊剂矩相关系数 (Pearson product-moment correlation coefficient, PPMCC),或者双变量相关(bivariate correlation),它主要是用来度量两组数据之间线性相关性的。

具体实现是两个变量的协方差,除以他们的标准差的乘积,因此他本质上是对协方差的归一化,使得结果始终具有介于-1到1之间的值。与协方差本身一样,这个度量只能反映变量的线性相关性,而忽略了许多其他类型的关系或相关性。举个简单的例子,通常家长都会期望初中高中时期孩子的年龄和身高具有相助大于0且小于1的PCC(通常情况下,1表示不切实际的完美相关性)

2. 定义

两个变量之间的PCC定义为两个变量的协方差除以他们标准差的乘积

ρ X , Y = c o v ( X , Y ) σ X σ Y = E [ ( X − μ X ) ( Y − μ Y ) ] σ X σ Y \rho_{X,Y}=\frac {cov(X,Y)}{\sigma_X\sigma_Y}=\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y} ρX,Y=σXσYcov(X,Y)=σXσYE[(XμX)(YμY)]

上式定义了总体相关系数,常用希腊小写字母 ρ \rho ρ作为代表符号,估算样本的协方差和标准差,可以得到样本相关系数(PCC),常用英文小写字母 r r r表示:

r = ∑ i = 1 n ( X i − X ‾ ) ( Y i − Y ‾ ) ∑ i = 1 n ( X i − X ‾ ) 2 ∑ i = 1 n ( Y i − Y ‾ ) 2 r= \frac{\sum_{i=1}^{n}{(X_i-\overline{X})(Y_i-\overline{Y})}}{\sqrt{\sum_{i=1}^{n}{(X_i-\overline{X})^2}}\sqrt{\sum_{i=1}^{n}{(Y_i-\overline{Y})^2}}} r=i=1n(XiX)2 i=1n(YiY)2 i=1n(XiX)(YiY)

r也可以由 ( X i , Y i ) (X_i, Y_i) (Xi,Yi)样本点的标准分数均值估计,得到与上式等价的表达式

r = 1 n − 1 ∑ i = 1 n ( X i − X ‾ σ X ) ( Y i − Y ‾ σ Y ) r=\frac{1}{n-1}\sum_{i=1}^{n}{(\frac{X_i-\overline{X}}{\sigma_X})(\frac{Y_i-\overline{Y}}{\sigma_Y})} r=n11i=1n(σXXiX)(σYYiY)

其中 X i − X ‾ σ X \frac{X_i-\overline{X}}{\sigma_X} σXXiX, X ‾ \overline{X} X, σ X \sigma_X σX分别是 X i X_i Xi样本的标准分数、样本平均值和样本标准差。

3. 数学特性

数据总体和样本的PCC的绝对值是介于0和1之间,如果样本数据点准确的落在直线上(计算样本PCC的情况),或者双变量分布完全在直线上(计算总体PCC的情况),则相关系数等于+1或者等于-1,PCC是对称的,也就是说:

c o r r ( X , Y ) = c o r r ( Y , X ) corr(X,Y)=corr(Y,X) corr(X,Y)=corr(Y,X)

PCC有一个重要的数学特性是,因两个变量的位置和尺度的变化并不会引起该系数的改变,也就是说如果我们对X进行 a + b X a+bX a+bX和把Y进行 c + d Y c+dY c+dY进行平移和尺度缩放,其中a,b,c,d都是大于0的常数,这种操作并不会改变PCC的值

4. 代码实现

分别使用numpy自己实现对应函数,调用scipy中的函数,两者结果是一致的,自己实现的函数比较简单,并没有考虑很多异常情况,有兴趣的小伙伴可以去自己看下scipy中的函数实现。

from scipy import stats
import numpy as np


def ComputePCC(x, y):
    dtype = type(1.0 + x[0] + y[0])
    xmean = x.mean(dtype=dtype)
    ymean = y.mean(dtype=dtype)

    xm = x.astype(dtype) - xmean
    ym = y.astype(dtype) - ymean

    normxm = np.linalg.norm(xm)
    normym = np.linalg.norm(ym)
    r = np.dot(xm / normxm, ym / normym)
    return r


a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)
print(stats.pearsonr(a, b)[0])
print(ComputePCC(a, b))

def pearsonr(x, y):
    r"""
    Pearson correlation coefficient and p-value for testing non-correlation.

    The Pearson correlation coefficient [1]_ measures the linear relationship
    between two datasets.  The calculation of the p-value relies on the
    assumption that each dataset is normally distributed.  (See Kowalski [3]_
    for a discussion of the effects of non-normality of the input on the
    distribution of the correlation coefficient.)  Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear relationship.
    Positive correlations imply that as x increases, so does y. Negative
    correlations imply that as x increases, y decreases.

    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets.

    Parameters
    ----------
    x : (N,) array_like
        Input array.
    y : (N,) array_like
        Input array.

    Returns
    -------
    r : float
        Pearson's correlation coefficient.
    p-value : float
        Two-tailed p-value.

    Warns
    -----
    PearsonRConstantInputWarning
        Raised if an input is a constant array.  The correlation coefficient
        is not defined in this case, so ``np.nan`` is returned.

    PearsonRNearConstantInputWarning
        Raised if an input is "nearly" constant.  The array ``x`` is considered
        nearly constant if ``norm(x - mean(x)) < 1e-13 * abs(mean(x))``.
        Numerical errors in the calculation ``x - mean(x)`` in this case might
        result in an inaccurate calculation of r.

    See Also
    --------
    spearmanr : Spearman rank-order correlation coefficient.
    kendalltau : Kendall's tau, a correlation measure for ordinal data.

    Notes
    -----
    The correlation coefficient is calculated as follows:

    .. math::

        r = \frac{\sum (x - m_x) (y - m_y)}
                 {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}

    where :math:`m_x` is the mean of the vector :math:`x` and :math:`m_y` is
    the mean of the vector :math:`y`.

    Under the assumption that :math:`x` and :math:`m_y` are drawn from
    independent normal distributions (so the population correlation coefficient
    is 0), the probability density function of the sample correlation
    coefficient :math:`r` is ([1]_, [2]_):

    .. math::

        f(r) = \frac{{(1-r^2)}^{n/2-2}}{\mathrm{B}(\frac{1}{2},\frac{n}{2}-1)}

    where n is the number of samples, and B is the beta function.  This
    is sometimes referred to as the exact distribution of r.  This is
    the distribution that is used in `pearsonr` to compute the p-value.
    The distribution is a beta distribution on the interval [-1, 1],
    with equal shape parameters a = b = n/2 - 1.  In terms of SciPy's
    implementation of the beta distribution, the distribution of r is::

        dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)

    The p-value returned by `pearsonr` is a two-sided p-value.  For a
    given sample with correlation coefficient r, the p-value is
    the probability that abs(r') of a random sample x' and y' drawn from
    the population with zero correlation would be greater than or equal
    to abs(r).  In terms of the object ``dist`` shown above, the p-value
    for a given r and length n can be computed as::

        p = 2*dist.cdf(-abs(r))

    When n is 2, the above continuous distribution is not well-defined.
    One can interpret the limit of the beta distribution as the shape
    parameters a and b approach a = b = 0 as a discrete distribution with
    equal probability masses at r = 1 and r = -1.  More directly, one
    can observe that, given the data x = [x1, x2] and y = [y1, y2], and
    assuming x1 != x2 and y1 != y2, the only possible values for r are 1
    and -1.  Because abs(r') for any sample x' and y' with length 2 will
    be 1, the two-sided p-value for a sample of length 2 is always 1.

    References
    ----------
    .. [1] "Pearson correlation coefficient", Wikipedia,
           https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
    .. [2] Student, "Probable error of a correlation coefficient",
           Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
    .. [3] C. J. Kowalski, "On the Effects of Non-Normality on the Distribution
           of the Sample Product-Moment Correlation Coefficient"
           Journal of the Royal Statistical Society. Series C (Applied
           Statistics), Vol. 21, No. 1 (1972), pp. 1-12.

    Examples
    --------
    >>> from scipy import stats
    >>> a = np.array([0, 0, 0, 1, 1, 1, 1])
    >>> b = np.arange(7)
    >>> stats.pearsonr(a, b)
    (0.8660254037844386, 0.011724811003954649)

    >>> stats.pearsonr([1, 2, 3, 4, 5], [10, 9, 2.5, 6, 4])
    (-0.7426106572325057, 0.1505558088534455)

    """
    n = len(x)
    if n != len(y):
        raise ValueError('x and y must have the same length.')

    if n < 2:
        raise ValueError('x and y must have length at least 2.')

    x = np.asarray(x)
    y = np.asarray(y)

    # If an input is constant, the correlation coefficient is not defined.
    if (x == x[0]).all() or (y == y[0]).all():
        warnings.warn(PearsonRConstantInputWarning())
        return np.nan, np.nan

    # dtype is the data type for the calculations.  This expression ensures
    # that the data type is at least 64 bit floating point.  It might have
    # more precision if the input is, for example, np.longdouble.
    dtype = type(1.0 + x[0] + y[0])

    if n == 2:
        return dtype(np.sign(x[1] - x[0])*np.sign(y[1] - y[0])), 1.0

    xmean = x.mean(dtype=dtype)
    ymean = y.mean(dtype=dtype)

    # By using `astype(dtype)`, we ensure that the intermediate calculations
    # use at least 64 bit floating point.
    xm = x.astype(dtype) - xmean
    ym = y.astype(dtype) - ymean

    # Unlike np.linalg.norm or the expression sqrt((xm*xm).sum()),
    # scipy.linalg.norm(xm) does not overflow if xm is, for example,
    # [-5e210, 5e210, 3e200, -3e200]
    normxm = linalg.norm(xm)
    normym = linalg.norm(ym)

    threshold = 1e-13
    if normxm < threshold*abs(xmean) or normym < threshold*abs(ymean):
        # If all the values in x (likewise y) are very close to the mean,
        # the loss of precision that occurs in the subtraction xm = x - xmean
        # might result in large errors in r.
        warnings.warn(PearsonRNearConstantInputWarning())

    r = np.dot(xm/normxm, ym/normym)

    # Presumably, if abs(r) > 1, then it is only some small artifact of
    # floating point arithmetic.
    r = max(min(r, 1.0), -1.0)

    # As explained in the docstring, the p-value can be computed as
    #     p = 2*dist.cdf(-abs(r))
    # where dist is the beta distribution on [-1, 1] with shape parameters
    # a = b = n/2 - 1.  `special.btdtr` is the CDF for the beta distribution
    # on [0, 1].  To use it, we make the transformation  x = (r + 1)/2; the
    # shape parameters do not change.  Then -abs(r) used in `cdf(-abs(r))`
    # becomes x = (-abs(r) + 1)/2 = 0.5*(1 - abs(r)).  (r is cast to float64
    # to avoid a TypeError raised by btdtr when r is higher precision.)
    ab = n/2 - 1
    prob = 2*special.btdtr(ab, ab, 0.5*(1 - abs(np.float64(r))))

    return r, prob

ps: 理解还是比较浅显的,主要参考的是wiki上的介绍,公式中没有使用加粗的形式表示向量和矩阵,说明一下哈。

  • 0
    点赞
  • 0
    评论
  • 1
    收藏
  • 一键三连
    一键三连
  • 扫一扫,分享海报

相关推荐
<p style="color:#666666;"> <span style="font-size:14px;">本门课程重实战,将基础知识拆解到项目里,让你在项目情境里学知识。</span> </p> <p style="color:#666666;"> <span style="font-size:14px;">这样学习方式能让你保持兴趣、充满动力,时刻知道学东西能用在哪、能怎么用。</span> </p> <p style="color:#666666;"> <span style="font-size:14px;">平时不明白知识点,放在项目里去理解就恍然大悟了。</span> </p> <p style="color:#666666;"> <span></span> </p> <p style="color:#666666;"> <span style="font-size:14px;"> </span> </p> <p style="color:#666666;"> <span style="color:#FF0000;font-size:14px;"><strong>一、融汇贯通</strong></span> </p> <p style="color:#666666;"> <span style="font-size:14px;">本视频采用了前后端分离开发模式,前端使用Vue.js+Element UI实现了Web页面呈现,后端使用Python Django框架实现了数据访问接口,前端通过Axios访问后端接口获得数据。在学习完本章节后,真正理解前后端各自承担工作。</span> </p> <p style="color:#666666;"> <span style="font-size:14px;"> </span> </p> <p style="color:#666666;"> <span style="color:#FF0000;font-size:14px;"><strong>二、贴近实战</strong></span> </p> <p style="color:#666666;"> <span style="font-size:14px;">本系列课程为练手项目实战学生管理系统v4.0开发,项目包含了如下几个内容项目总体介绍、基本功能演示、Vuejs初始化、Element UI使用、在Django中实现针对数据增删改查接口、在Vuejs中实现前端增删改查调用、实现文件上传、实现表格分页、实现导出数据到Excel、实现通过Excel导入数据、实现针对表格批量化操作等等,所有功能都通过演示完成、贴近了实战</span> </p> <p style="color:#666666;"> <span style="font-size:14px;"> </span> </p> <p style="color:#666666;"> <span style="color:#FF0000;font-size:14px;"><strong>三、课程亮点</strong></span> </p> <p style="color:#666666;"> <span style="font-size:14px;">在本案例中,最大亮点在于前后端做了分离,真正理解前后端各自承担工作。前端如何和后端交互</span> </p> <p style="color:#666666;"> <span style="font-size:14px;"> </span> </p> <p style="color:#666666;"> <span style="color:#FF0000;font-size:14px;"><strong>适合人群</strong></span> </p> <p style="color:#666666;"> <span style="font-size:14px;">1、有Python语言基础、web前端基础,想要深入学习Python Web框架朋友;</span> </p> <p style="color:#666666;"> <span style="font-size:14px;">2、有Django基础,但是想学习企业级项目实战朋友;</span> </p> <p style="color:#666666;"> <span style="font-size:14px;">3、有MySQL数据库基础朋友</span> </p> <p style="color:#666666;"> <span style="font-size:14px;"> </span> </p> <p style="color:#666666;"> <span style="font-size:14px;"><img alt="" src="https://img-bss.csdnimg.cn/202009070752197496.png" /><br /> </span> </p> <p style="color:#666666;"> <span style="font-size:14px;"><br /> </span> </p>
©️2020 CSDN 皮肤主题: Age of Ai 设计师:meimeiellie 返回首页
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值