All Hands-On! NumPy and Matplotlib Study Notes

Concepts

The Normal Distribution Function

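One-dimensional normal distribution function
Given a sequence $X=[x_1, x_2, ..., x_n]$ with corresponding probabilities $P=[p_1, p_2, ..., p_n]$, where every $p_i \geq 0$ and $p_1+p_2+...+p_n=1$, the following statistics are defined: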

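Mean $\bar{x}$ (read "x bar"):
$\bar{x}=\frac{x_1 + x_2 + ... + x_n}{n}$
Mathematical expectation $\mu$ (expectation for short, read "mu"): $\mu=x_1*p_1 + x_2*p_2+...+x_n*p_n$
Note that many people confuse the expectation with the mean. They are not the same thing: the two coincide only when $p_1=p_2=...=p_n$, so the mean is a special case of the expectation. Whenever all outcomes are equally likely, the mean and the expectation can be used interchangeably.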
Population variance $\sigma^2$ ($\sigma$ is read "sigma"):
$\sigma^2=\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2+...+(x_n-\bar{x})^2}{n}=\sum_{i=1}^n \frac{(x_i-\bar{x})^2}{n}$
The population standard deviation is therefore:
$\sigma = \sqrt{\sum_{i=1}^n \frac{(x_i-\bar{x})^2}{n}}$
Sample standard deviation $s$
Note that the sample standard deviation divides by $n - 1$ rather than $n$, that is:
$s = \sqrt{\sum_{i=1}^n \frac{(x_i-\bar{x})^2}{n - 1}}$
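
As a minimal sketch of the definitions above, the following NumPy snippet computes each quantity for a small data set. The values in `x` and `p` are made up for illustration and do not come from these notes.

```python
import numpy as np

# Hypothetical values and probabilities, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([0.1, 0.2, 0.3, 0.4])    # non-negative and sums to 1

mean = x.mean()                       # (x_1 + ... + x_n) / n
expectation = np.sum(x * p)           # x_1*p_1 + ... + x_n*p_n

# With equal probabilities the expectation reduces to the plain mean.
uniform_p = np.full(len(x), 1 / len(x))
expectation_uniform = np.sum(x * uniform_p)

pop_std = np.std(x)                   # divides by n (ddof=0 is the default)
sample_std = np.std(x, ddof=1)        # divides by n - 1

print(mean, expectation, expectation_uniform, pop_std, sample_std)
```

The `ddof` argument of `np.std` is what switches between the two divisors: the divisor used is `n - ddof`.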

standard deviation: n VS n-1

How Data Is Spread Under a Normal Distribution

The empirical rule, or the 68-95-99.7 rule, tells you where your values lie:

Around 68% of scores are within 1 standard deviation of the mean,
Around 95% of scores are within 2 standard deviations of the mean,
Around 99.7% of scores are within 3 standard deviations of the mean.
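
The rule can be checked empirically. The sketch below draws samples from a standard normal distribution and measures the fraction that falls within 1, 2 and 3 standard deviations of the mean; the seed, the sample size and the choice of `default_rng` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
mu, sigma = 0.0, 1.0
samples = rng.normal(mu, sigma, 1_000_000)

fractions = {}
for k in (1, 2, 3):
    # Fraction of samples whose distance from the mean is at most k * sigma
    fractions[k] = np.mean(np.abs(samples - mu) <= k * sigma)
    print(f"within {k} sigma: {fractions[k]:.4f}")
```

The printed fractions should land close to 0.68, 0.95 and 0.997, with small deviations due to sampling noise.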

average, mean, median and mode

average: let X = [x1, x2, ..., xn], average(X) = sum(X)/n
mean: the formal term for the average in statistics.
median: the middle value of X after sorting. It is the single middle value if len(X) is odd, or the average of the two middle values if len(X) is even.
mode: the value that occurs most often in X
reference: Difference Between the Mean & the Average
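
These four terms map directly onto Python's standard `statistics` module. A small sketch, using a made-up list `X` (not data from these notes):

```python
import statistics

X = [3, 1, 4, 1, 5, 9, 2, 6]          # hypothetical data; len(X) is even

average = sum(X) / len(X)             # the definition given above
mean = statistics.mean(X)             # same value, computed by the library
median = statistics.median(X)         # average of the two middle values
mode = statistics.mode(X)             # the value that occurs most often

# With an odd number of values the median is the single middle value.
median_odd = statistics.median([3, 1, 4, 1, 5])

print(average, mean, median, mode, median_odd)
```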

Population vs. Sample Standard Deviation

Reference: Difference Between Population and Sample Standard Deviation

Population standard deviation
The population standard deviation (PSD) is the standard deviation taken over every member of the entire population. It is denoted $\sigma$ and computed as
$\sigma = \sqrt{\sum_{i=1}^n \frac{(x_i-\mu)^2}{n}}$
For example, take the weights (in kg) of a group of 10 students, $W= [70, 62, 65, 72, 80, 70, 63, 72, 77, 79]$. The mean is $\bar{w}=(70+62+65+72+80+70+63+72+77+79)/10=71$, and each student's deviation from the mean is (70 - 71) = -1, (62 - 71) = -9, (65 - 71) = -6, (72 - 71) = 1, (80 - 71) = 9, (70 - 71) = -1, (63 - 71) = -8, (72 - 71) = 1, (77 - 71) = 6, (79 - 71) = 8. The sum of the squared deviations is $(-1)^2 + (-9)^2 + (-6)^2 + 1^2 + 9^2 + (-1)^2 + (-8)^2 + 1^2 + 6^2 + 8^2 = 366$, so the standard deviation is $\sqrt{366/10} \approx 6.05$. The mean is therefore 71 and the population standard deviation is 6.05.
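Sample standard deviation
The example above treats the group as the whole population. If the same data are instead used to estimate the weights of the entire class, the group is only a sample, and the sample standard deviation applies:
$s = \sqrt{\sum_{i=1}^n \frac{(x_i-\bar{x})^2}{n - 1}}$
The only change is that the divisor goes from $n$ to $n-1$. For the example above, $s =\sqrt{366/9} \approx 6.38$. So $s$ is slightly larger than $\sigma$, and the two agree only in the limit $n \rightarrow \infty$. Why is that?
The reason is that the sample mean is not a faithful stand-in for the population mean. Because $\bar{x}$ is computed from the sample itself, it always sits at the center of the sample, so the squared deviations measured around it are on average smaller than those measured around the true mean $\mu$. Dividing by $n$ therefore underestimates the spread, and reducing the divisor by 1 compensates for that bias. There is also a rigorous mathematical proof that subtracting exactly 1 is the right correction (it makes the sample variance $s^2$ an unbiased estimator of the population variance); it is omitted here for reasons of space. Interested readers can consult these two articles: a theoretical proof (the reply by Zhang Yingfeng is recommended) and an experimental demonstration. Both reach the same conclusion: using $n-1$ as the divisor brings the sample standard deviation closer to the true population standard deviation.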
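
Both the worked example and the bias argument can be checked numerically. The sketch below first reproduces the two values for `W` with `np.std`, then runs a small simulation. The simulated population (mean 170, standard deviation 3), the sample size of 5, the number of trials and the seed are all assumptions picked for illustration.

```python
import numpy as np

# Part 1: the worked example from the text.
W = np.array([70, 62, 65, 72, 80, 70, 63, 72, 77, 79])
pop_std = np.std(W)                   # divide by n
sample_std = np.std(W, ddof=1)        # divide by n - 1
print(round(pop_std, 2), round(sample_std, 2))

# Part 2: draw many small samples from a population with known variance
# and average the two variance estimates over all trials.
rng = np.random.default_rng(0)
true_var = 3.0 ** 2
samples = rng.normal(170.0, 3.0, size=(100_000, 5))   # 100k samples of size 5
var_n = samples.var(axis=1, ddof=0).mean()            # divisor n
var_n1 = samples.var(axis=1, ddof=1).mean()           # divisor n - 1
print(var_n, var_n1, true_var)
```

Averaged over many trials, the estimate with divisor `n` recovers only about (n-1)/n of the true variance, while the estimate with divisor `n - 1` lands close to it.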

Definitions

Initialization

Random Initialization

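This document covers NumPy's random distributions in depth. It shows how numpy.random generates samples from different distributions, including the common ones such as the uniform, exponential, normal and hypergeometric distributions. Highly recommended: for learning NumPy's random distributions, this one article is enough.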
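
As a rough illustration of the distributions named above, the sketch below draws a few samples from each. It uses NumPy's `default_rng` generator interface, which is a choice made for this example; the rest of these notes call the legacy `np.random.*` functions, which offer the same distributions. All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)       # seeded generator for repeatable output
n = 5                                 # number of samples per distribution

uniform = rng.uniform(0.0, 1.0, n)            # low, high
exponential = rng.exponential(2.0, n)         # scale (mean) = 2.0
normal = rng.normal(170.0, 3.0, n)            # mean, standard deviation
hyper = rng.hypergeometric(10, 20, 5, n)      # ngood, nbad, nsample

for name, values in [("uniform", uniform), ("exponential", exponential),
                     ("normal", normal), ("hypergeometric", hyper)]:
    print(name, values)
```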

The zip Function

zip pairs up the elements of two sequences into a single sequence of tuples:

xs, ys=[1,2,3], ['a', 'b', 'c']
for item in zip(xs, ys): #=> [(1, 'a'), (2, 'b'), (3, 'c')]
    print(item)

The output is:

(1, 'a')
(2, 'b')
(3, 'c')
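
A few related idioms are worth knowing. The sketch below reuses the same `xs` and `ys`; the other variable names are introduced only for this example.

```python
xs, ys = [1, 2, 3], ['a', 'b', 'c']

# zip returns a lazy iterator; wrap it in list() to materialize the pairs.
pairs = list(zip(xs, ys))

# Build a dictionary that maps each letter to its number.
lookup = dict(zip(ys, xs))

# "Unzip": split the pairs back into two tuples.
xs_back, ys_back = zip(*pairs)

# When the inputs differ in length, zip stops at the shorter one.
short = list(zip([1, 2, 3], ['a', 'b']))

print(pairs, lookup, xs_back, ys_back, short)
```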

Initializing from a Normal Distribution

Applying the Normal Distribution

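In practice, for example in height statistics, the expectation can also be obtained from the number of people at each height relative to the total number of people. That is:
$H = \frac{h_1 * c_1 + h_2 * c_2+...+h_n *c_n}{c_1+c_2+...+c_n}=\frac{\sum_{i=1}^nh_i*c_i }{\sum_{i=1}^n c_i}$
where $H$ is the expected height, and $h_i$ and $c_i$ are a height value and the number of people with that height ($i = 1 ... n$).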
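
The formula for $H$ is a count-weighted average, which `np.average` computes directly through its `weights` argument. A minimal sketch, assuming made-up height values and head counts:

```python
import numpy as np

# Hypothetical heights h_i (cm) and head counts c_i, for illustration only.
heights = np.array([160, 165, 170, 175, 180])
counts = np.array([5, 20, 50, 20, 5])

# Direct translation of the formula: sum(h_i * c_i) / sum(c_i)
H = np.sum(heights * counts) / np.sum(counts)

# The same value using NumPy's weighted average.
H_avg = np.average(heights, weights=counts)

# Equivalent view: turn counts into probabilities p_i = c_i / sum(c),
# then apply the expectation formula sum(h_i * p_i).
p = counts / counts.sum()
H_exp = np.sum(heights * p)

print(H, H_avg, H_exp)
```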
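Based on the definitions above, the normal distribution is defined as follows: a random variable $X$ is normally distributed with location parameter $\mu$ and scale parameter $\sigma$ if its probability density function is
$f(x) = \frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
where $\mu$ (read "mu") is the expectation and $\sigma$ (read "sigma") is the standard deviation. The following code samples heights from such a distribution and compares them with the density curve: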

import numpy as np
import matplotlib.pyplot as plt

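# Mean height 170, standard deviation 3, 100K samples
mu, sigma, num = 170, 3, 100_000

# Generate all the heights
hs = np.random.normal(mu, sigma, num)

# Draw a 50-bin histogram. density=True normalizes the bars so that they are
# on the same scale as the density curve. xs holds the bin edges.
_, xs, _ = plt.hist(hs, 50, density=True)

# Overlay the curve given by the normal density formula
plt.plot(xs, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(xs - mu)**2 / (2 * sigma**2)))
plt.show()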

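# A 3x4 array of random integers from 1 to 9 (the upper bound 10 is exclusive)
hs = np.random.randint(1, 10, (3, 4))
print(hs)
# shuffle works in place along the first axis, so only the row order changes
np.random.shuffle(hs)
print(hs)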
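
The row-wise behavior of shuffling a 2-D array is easy to verify. The sketch below uses the `default_rng` generator interface and a fixed seed, both assumptions made for this example, and also shows `permutation`, which returns a shuffled copy instead of modifying the array in place.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.arange(12).reshape(3, 4)

shuffled = a.copy()
rng.shuffle(shuffled)                 # in place; reorders whole rows only

permuted = rng.permutation(a)         # returns a shuffled copy; a is untouched

# Every original row still exists intact in both results.
rows_before = sorted(map(tuple, a.tolist()))
rows_shuffled = sorted(map(tuple, shuffled.tolist()))
rows_permuted = sorted(map(tuple, permuted.tolist()))
print(rows_before == rows_shuffled, rows_before == rows_permuted)
```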

Operations

Algorithms

Visualization

Matplotlib has a very complete example gallery with hundreds of examples, each with its rendered figure and source code. Highly recommended.

Basic Curves

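The figure below shows normal distribution curves that share the same mean ($\mu=50$) but have different standard deviations. Drawing these three curves also demonstrates the main features of line plotting.
Some official reference documents: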

Linestyles examples
Marker examples

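These curves lead to one conclusion: the smaller the standard deviation, the more tightly the data cluster around the mean; the larger it is, the more spread out they are.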

import numpy as np
import matplotlib.pyplot as plt

# Curves to draw; each tuple is (sigma, marker, dash pattern, color)
lines = [(6, 'o', [], 'r'), (12, 'x', [4, 2], 'g'), (18, '', [6, 2], 'b')]
mu = 50
xs = np.arange(0, 100)
for sigma, marker, dashes, color in lines:
    # Evaluate the normal density for this sigma
    ys = 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(xs - mu)**2 / (2 * sigma**2))
    # Draw the curve with its own marker, dash pattern and color
    plt.plot(xs, ys, marker=marker, dashes=dashes, color=color, linewidth=2, label='sigma = {}'.format(sigma))

# Set the chart title and its font size
plt.title('Normal Distribution', fontsize=24)
# Set the font size of the tick labels
plt.tick_params(axis='both', which='major', labelsize=14)
# Set the x-axis label and its font size
plt.xlabel('Values', fontsize=14)
# Set the y-axis label and its font size
plt.ylabel('Probability density', fontsize=14)
# Limit the x-axis to the range 0 to 100
plt.xlim(0, 100)
# Show x-axis ticks at [0, 10, 20, ..., 100]
plt.xticks([i * 10 for i in range(11)])
plt.legend()
plt.show()
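
The conclusion about standard deviation and concentration can also be put into numbers. For the same three sigma values, the sketch below computes the peak height of the density and the probability of falling within 10 units of the mean, using the closed form $P(|X-\mu| \le d) = \mathrm{erf}(d / (\sigma\sqrt{2}))$. The distance of 10 is an arbitrary choice for this example.

```python
import math

mu, distance = 50, 10                 # distance from the mean is an arbitrary choice
results = {}
for sigma in (6, 12, 18):
    # Height of the density at x = mu
    peak = 1 / (sigma * math.sqrt(2 * math.pi))
    # Probability that a value lies within `distance` of the mean
    within = math.erf(distance / (sigma * math.sqrt(2)))
    results[sigma] = (peak, within)
    print(f"sigma={sigma}: peak={peak:.4f}, P(|X-mu|<={distance})={within:.4f}")
```

Both the peak height and the probability mass near the mean shrink as sigma grows, which matches the shape of the three curves.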

Drawing Scatter Plots

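Highly recommended reading: the official documentation, scatter graph by official website. It is comprehensive, clearly laid out and easy on the eyes, and it makes sense on the first read. Every parameter of scatter is explained, the parameters are listed from top to bottom, and each one comes with a detailed description. Where a parameter needs more room than the page allows, the description links out to a separate page.

The following example shows some methods and properties that are commonly used when drawing a scatter plot.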

import numpy as np 
import matplotlib.pyplot as plt

sigma = 2 # standard deviation (not the variance)
mu = 10 # mean
num = 100000 # number of samples

rand_data = np.random.normal(mu, sigma, num) 
# for value in rand_data:
#     print(value)
# count, bins, ignored = plt.hist(rand_data, 1000, density=True)
# plt.plot(bins, 1/(sigma * np.sqrt(2*np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)))
# plt.show()
def makedata(height_avg, height_dev, weight_avg, weight_dev, count=100):
    # Draw heights and weights independently, then pair them as (height, weight) tuples
    hs = np.random.normal(height_avg, height_dev, count)
    ws = np.random.normal(weight_avg, weight_dev, count)
    return list(zip(hs, ws))
m1, s1 = 170, 2.1
m2, s2 = 161, 1.5

data = makedata(170, 2.1, 65, 3, 100)

def make_data(mu, sigma, num):
    return np.random.normal(mu, sigma, num)

# generate 200 points for each group
count = 200
hs_m = make_data(172, 5, count) 
ws_m = make_data(75, 6, count) 
hs_w = make_data(161, 4, count) 
ws_w = make_data(60, 5, count) 

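# To show a Chinese title ('散点图示例' means "scatter plot example"), load a font
# with CJK glyphs; the path below is specific to the author's machine:
# from matplotlib.font_manager import FontProperties
# font = FontProperties(fname='/Users/hao/Library/Fonts/msyh.ttf', size=14)
# plt.title(u'散点图示例', fontproperties=font)
# Optional extra arguments for scatter: s=area, c=colors, alpha=0.5
plt.scatter(hs_m, ws_m, c='blue', marker='x')
plt.scatter(hs_w, ws_w, c='red', marker='.')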
plt.show()

#for p in zip(hs_m, ws_m):
#    print(p)
#for data in makedata(170, 2.1, 65, 3, 100):
#    print("height={:.2f}cm, weight={:.2f}kg.".format(data[0], data[1]))
# plt.scatter()
# plt.scatter(X[:,0], X[:, 1], c = y_kmeans, s=0, cmap='viridis')


''' 
np.random.seed(1)
x=np.random.rand(10)
y=np.random.rand(10)
 
colors=np.random.rand(10)
area=(30*np.random.rand(10))**2
'''
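
The commented-out `s`, `alpha` and label-related options above can be combined into a slightly fuller plot. This is only a sketch: it reuses the distribution parameters from the example, while the group names, marker size, transparency, seed and axis labels are assumptions (the units are taken from the `cm`/`kg` format string in the commented-out code).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
count = 200

# (heights, weights, color, marker) per group; names are placeholders
groups = {
    "group m": (rng.normal(172, 5, count), rng.normal(75, 6, count), "blue", "x"),
    "group w": (rng.normal(161, 4, count), rng.normal(60, 5, count), "red", "."),
}

fig, ax = plt.subplots()
for label, (hs, ws, color, marker) in groups.items():
    # s sets the marker area, alpha the transparency, label feeds the legend
    ax.scatter(hs, ws, c=color, marker=marker, s=20, alpha=0.6, label=label)

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Scatter plot example")
ax.legend()
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig(...)` to write it to a file.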

Concepts