Introduction to PyOD
Anomaly detection (also known as outlier analysis or outlier detection) has a wide range of industrial applications:
- Finance: finding "fraud cases" in massive amounts of data, such as credit-card fraud detection and identifying fraudulent loans
- Cybersecurity: finding "intruders" in traffic data and identifying new network intrusion patterns
- Online retail: discovering "malicious buyers" in transaction data, such as review spammers
- Bioinformatics: detecting "lesions" or "mutations" in biological data
It can also be used as a preprocessing step in machine learning tasks, preventing training or prediction failures caused by a small number of anomalous points. In other words, anomaly detection means finding the data points that "look different" from the rest of the data. The detection process is usually complicated, though, and in practice the data is typically unlabeled: we do not know in advance which points are anomalies, so simple supervised learning is rarely applicable. Outlier detection also faces many other difficulties, such as extreme class imbalance, diverse forms of anomalies, and complex root-cause analysis.
Outliers are not necessarily a bad thing. For example, if in a biology experiment one rat survives while all the others die, understanding why would be very interesting and might lead to a new scientific discovery. This is another reason why detecting outliers matters.
Python Outlier Detection (PyOD) is a Python toolkit for outlier detection. Beyond the four models available in scikit-learn, it provides many additional ones, such as:
- Classical outlier detection methods: HBOS, PCA, ABOD, Feature Bagging, and more
- Deep-learning/neural-network-based detection: an AutoEncoder (implemented in Keras)
Its main highlights include:
- Nearly 20 common outlier detection algorithms, from classics such as LOF/LOCI/ABOD to recent deep learning approaches such as generative adversarial networks (GANs) and outlier ensembles
- A unified API shared by all algorithms, so models can be swapped in quickly; supports both Python 2 and 3 as well as multiple operating systems: Windows, macOS, and Linux
- Heavily optimized code: most models are accelerated with JIT compilation and parallelization, improving both speed and scalability so that large datasets can be handled
- Detailed documentation and plenty of examples for getting started quickly
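As a first taste of that unified API, here is a minimal sketch of installing PyOD and running a single detector (kNN is chosen purely as an example; any detector class is used the same way):

```python
# pip install pyod
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

# toy dataset: 200 points, 10% of which are outliers
X_train, y_train = generate_data(n_train=200, train_only=True,
                                 n_features=2, contamination=0.1)

clf = KNN()                        # every PyOD detector shares this interface
clf.fit(X_train)                   # unsupervised: only X is needed
print(clf.labels_[:10])            # binary labels (0 = inlier, 1 = outlier)
print(clf.decision_scores_[:10])   # raw outlier scores on the training data
```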
Algorithms Built into PyOD
The PyOD toolkit consists of three major groups of functionality:
i) Individual Detection Algorithms:
| Type | Abbr | Algorithm | Year | Ref |
|------|------|-----------|------|-----|
| Linear Model | PCA | Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) | 2003 | [24] |
| Linear Model | MCD | Minimum Covariance Determinant (use the Mahalanobis distances as the outlier scores) | 1999 | [9] [22] |
| Linear Model | OCSVM | One-Class Support Vector Machines | 2001 | [23] |
| Linear Model | LMDD | Deviation-based Outlier Detection (LMDD) | 1996 | [5] |
| Proximity-Based | LOF | Local Outlier Factor | 2000 | [6] |
| Proximity-Based | COF | Connectivity-Based Outlier Factor | 2002 | [25] |
| Proximity-Based | CBLOF | Clustering-Based Local Outlier Factor | 2003 | [10] |
| Proximity-Based | LOCI | LOCI: Fast outlier detection using the local correlation integral | 2003 | [19] |
| Proximity-Based | HBOS | Histogram-based Outlier Score | 2012 | [7] |
| Proximity-Based | kNN | k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) | 2000 | [21] |
| Proximity-Based | AvgKNN | Average kNN (use the average distance to k nearest neighbors as the outlier score) | 2002 | [4] |
| Proximity-Based | MedKNN | Median kNN (use the median distance to k nearest neighbors as the outlier score) | 2002 | [4] |
| Proximity-Based | SOD | Subspace Outlier Detection | 2009 | [14] |
| Probabilistic | ABOD | Angle-Based Outlier Detection | 2008 | [13] |
| Probabilistic | FastABOD | Fast Angle-Based Outlier Detection using approximation | 2008 | [13] |
| Probabilistic | SOS | Stochastic Outlier Selection | 2012 | [11] |
| Outlier Ensembles | IForest | Isolation Forest | 2008 | [17] |
| Outlier Ensembles | | Feature Bagging | 2005 | [15] |
| Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | [28] |
| Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) | 2018 | [27] |
| Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies | 2016 | [20] |
| Neural Networks | AutoEncoder | Fully connected AutoEncoder (use reconstruction error as the outlier score) | | [1] [Ch.3] |
| Neural Networks | VAE | Variational AutoEncoder (use reconstruction error as the outlier score) | 2013 | [12] |
| Neural Networks | SO_GAAL | Single-Objective Generative Adversarial Active Learning | 2019 | [18] |
| Neural Networks | MO_GAAL | Multiple-Objective Generative Adversarial Active Learning | 2019 | [18] |
ii) Outlier Ensembles & Outlier Detector Combination Frameworks:
| Type | Abbr | Algorithm | Year | Ref |
|------|------|-----------|------|-----|
| Outlier Ensembles | | Feature Bagging | 2005 | [15] |
| Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | [28] |
| Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) | 2018 | [27] |
| Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies | 2016 | [20] |
| Combination | Average | Simple combination by averaging the scores | 2015 | [2] |
| Combination | Weighted Average | Simple combination by averaging the scores with detector weights | 2015 | [2] |
| Combination | Maximization | Simple combination by taking the maximum scores | 2015 | [2] |
| Combination | AOM | Average of Maximum | 2015 | [2] |
| Combination | MOA | Maximization of Average | 2015 | [2] |
| Combination | Median | Simple combination by taking the median of the scores | 2015 | [2] |
| Combination | Majority Vote | Simple combination by taking the majority vote of the labels (weights can be used) | 2015 | [2] |
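The score-combination methods in the table map onto functions in pyod.models.combination. Below is a small sketch, assuming four kNN detectors with different k values; standardizing the scores before combining follows the library's usual practice:

```python
import numpy as np
from pyod.models.knn import KNN
from pyod.models.combination import average, maximization, aom, moa
from pyod.utils.data import generate_data
from pyod.utils.utility import standardizer

X, y = generate_data(n_train=300, train_only=True, n_features=2)

# train several detectors and collect their raw scores column by column
ks = [3, 5, 10, 20]
scores = np.zeros((X.shape[0], len(ks)))
for i, k in enumerate(ks):
    clf = KNN(n_neighbors=k)
    clf.fit(X)
    scores[:, i] = clf.decision_scores_

scores = standardizer(scores)        # z-normalize before combining
print(average(scores)[:5])           # mean of the scores
print(maximization(scores)[:5])      # max of the scores
print(aom(scores, n_buckets=2)[:5])  # Average of Maximum
print(moa(scores, n_buckets=2)[:5])  # Maximization of Average
```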
iii) Utility Functions:
| Type | Name | Function | Documentation |
|------|------|----------|---------------|
| Data | generate_data | Synthetic data generation; normal data is generated by a multivariate Gaussian and outliers by a uniform distribution | generate_data |
| Data | generate_data_clusters | Synthetic data generation in clusters; more complex data patterns can be created with multiple clusters | generate_data_clusters |
| Stat | wpearsonr | Calculate the weighted Pearson correlation of two samples | wpearsonr |
| Utility | get_label_n | Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores | get_label_n |
| Utility | precision_n_scores | Calculate precision @ rank n | precision_n_scores |
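A short sketch of how these utilities fit together (the n=20 cutoff is illustrative):

```python
from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from pyod.utils.utility import get_label_n, precision_n_scores

X, y = generate_data(n_train=200, train_only=True, contamination=0.1)

clf = KNN().fit(X)
scores = clf.decision_scores_

# binarize: flag the 20 highest-scoring points as outliers
labels = get_label_n(y, scores, n=20)

# precision @ rank n: how many of the top-ranked points are true outliers
print(precision_n_scores(y, scores))
```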
Angle-Based Outlier Detection (ABOD)
- It considers the relationship between each point and its neighbors, but not the relationships among those neighbors. The variance of a point's weighted cosine scores with respect to all of its neighbors is treated as its outlier score
- ABOD performs well on multi-dimensional data
- PyOD provides two versions of ABOD (see the sketch below):
- Fast ABOD: approximates using only the k nearest neighbors
- Original ABOD: considers all training points, at high time complexity
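A minimal sketch of both variants (the parameter values are illustrative, not recommendations):

```python
from pyod.models.abod import ABOD
from pyod.utils.data import generate_data

X, _ = generate_data(n_train=200, train_only=True, n_features=2)

# Fast ABOD: approximate the angle variance using only the k nearest neighbors
fast_abod = ABOD(method='fast', n_neighbors=10).fit(X)

# Original ABOD: consider all pairs of training points (much slower)
full_abod = ABOD(method='default').fit(X)

print(fast_abod.labels_[:10])
print(full_abod.labels_[:10])
```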
k-Nearest Neighbors Detector
- For any data point, the distance to its kth nearest neighbor can be treated as its outlier score
- PyOD supports three kNN detectors, which map onto the detector's method parameter (illustrated below):
- Largest: use the distance to the kth neighbor as the outlier score
- Mean: use the average distance to all k neighbors as the outlier score
- Median: use the median of the distances to the k neighbors as the outlier score
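A minimal sketch of the three variants (n_neighbors is left at its default here):

```python
from pyod.models.knn import KNN

knn_largest = KNN(method='largest')  # distance to the kth neighbor
knn_mean = KNN(method='mean')        # average distance to the k neighbors
knn_median = KNN(method='median')    # median distance to the k neighbors
```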
Isolation Forest
- It uses the scikit-learn library internally. The data is partitioned with an ensemble of trees, and Isolation Forest produces an anomaly score reflecting how easily a point is isolated within that structure; the score is then used to separate outliers from normal observations (see the sketch below)
- Isolation Forest performs well on multi-dimensional data
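A minimal sketch (parameter values are illustrative):

```python
from pyod.models.iforest import IForest

# wraps scikit-learn's IsolationForest; n_estimators is the number of trees
clf = IForest(n_estimators=100, contamination=0.05, random_state=42)
```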
Histogram-based Outlier Detection
- An efficient unsupervised method that assumes the features are independent and computes outlier scores by building histograms (see the sketch below)
- It is much faster than multivariate approaches, at the cost of lower precision
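A minimal sketch (the n_bins value is illustrative):

```python
from pyod.models.hbos import HBOS

# builds one histogram per feature; n_bins controls the resolution
clf = HBOS(n_bins=10, contamination=0.05)
```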
Local Correlation Integral (LOCI)
- LOCI is very effective at detecting both outliers and groups of outliers. For each point it provides a LOCI plot that summarizes a large amount of information about the data in that point's neighborhood, identifying clusters and micro-clusters along with their diameters and inter-cluster distances
- No other existing outlier detection method offers this capability, since they output only a single number per point (PyOD's implementation is sketched below)
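A minimal sketch, assuming PyOD's LOCI implementation and its alpha neighborhood-ratio parameter:

```python
from pyod.models.loci import LOCI

# alpha sets the ratio between the counting and sampling neighborhoods
clf = LOCI(alpha=0.5, contamination=0.05)
```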
Feature Bagging
- A Feature Bagging detector fits a number of base detectors on various sub-samples of the dataset, then uses averaging or another combination method to improve prediction accuracy
- By default, Local Outlier Factor (LOF) is used as the base estimator, but any detector can serve as the base estimator, e.g., kNN or ABOD
- Feature Bagging first constructs n sub-samples by randomly selecting subsets of the features, which brings diversity to the base estimators; the final prediction score is produced by averaging or taking the maximum over all base detectors (see the sketch below)
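A minimal sketch (kNN is swapped in here purely to show that the base estimator is configurable):

```python
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.knn import KNN

# the default base estimator is LOF; combination is 'average' or 'max'
clf = FeatureBagging(base_estimator=KNN(), n_estimators=10,
                     combination='average', contamination=0.05)
```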
Clustering Based Local Outlier Factor
- It partitions the data into small clusters and large clusters. The anomaly score is then computed from the size of the cluster a point belongs to and its distance to the nearest large cluster (see the sketch below)
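A minimal sketch (the parameter values are illustrative; alpha and beta are the thresholds that split clusters into "large" and "small"):

```python
from pyod.models.cblof import CBLOF

clf = CBLOF(n_clusters=8, alpha=0.9, beta=5, contamination=0.05)
```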
Using PyOD
API Overview
Note that outlier detection algorithms are essentially unsupervised, so only X (the input data) is required, not y (labels). Using PyOD feels much like using the clustering module in scikit-learn: every PyOD detector clf exposes the same unified API.
- fit(X): train/fit the detector clf on data X, i.e., after initializing clf, call fit(X) to "train" it
- fit_predict_score(X, y): fit clf on X, predict on X, and evaluate against the ground-truth labels y; here y is used only for evaluation, not for training
- decision_function(X): once clf has been fitted, predict how anomalous unseen data is; the return value is a raw score rather than 0/1, and the higher the score, the more anomalous the point
- predict(X): once clf has been fitted, predict binary outlier labels for unseen data (0 = inlier, 1 = outlier)
- predict_proba(X): once clf has been fitted, predict the probability that each unseen point is an outlier
Once the detector clf has been initialized and fit(X) has been executed, it exposes two important attributes:
- decision_scores_: outlier scores on the training data X; the higher the score, the more anomalous the point
- labels_: binary outlier labels on the training data X (0 = inlier, 1 = outlier)
In short, once a detector clf has been initialized, we can "train" it directly on the data X and then read off the outlier scores (clf.decision_scores_) and outlier labels (clf.labels_) of X. After clf has been trained (i.e., after fit has been executed), we can use decision_function() and predict() to make predictions on unseen data, as in the minimal sketch below.
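A minimal sketch of this train-then-predict workflow, assuming a kNN detector and PyOD's toy data generator:

```python
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

X_train, y_train = generate_data(n_train=200, train_only=True)
X_new, _ = generate_data(n_train=50, train_only=True)

clf = KNN()
clf.fit(X_train)                         # unsupervised fit: only X is needed

print(clf.decision_scores_[:5])          # scores on the training data
print(clf.labels_[:5])                   # binary labels on the training data

print(clf.decision_function(X_new)[:5])  # raw scores for unseen data
print(clf.predict(X_new)[:5])            # binary labels for unseen data
```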
Example code:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers

# generate random data with two features
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=2)

# by default the outlier fraction is 0.1 in the generate_data function
outlier_fraction = 0.1

# store outliers and inliers in different numpy arrays
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)
n_inliers = len(x_inliers)
n_outliers = len(x_outliers)

# separate the two features and use them to plot the data
F1 = X_train[:, [0]].reshape(-1, 1)
F2 = X_train[:, [1]].reshape(-1, 1)

# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))

# scatter plot
plt.scatter(F1, F2)
plt.xlabel('F1')
plt.ylabel('F2')

# create a dictionary with all the models to be used for outlier detection
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outlier_fraction),
    'K Nearest Neighbors (KNN)': KNN(contamination=outlier_fraction)
}

# fit the data to each model in the dictionary, then see how each one detects outliers
# set the figure size
plt.figure(figsize=(12, 6))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the dataset to the model
    clf.fit(X_train)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X_train) * -1
    # prediction of a datapoint category: outlier or inlier
    y_pred = clf.predict(X_train)
    # number of errors in prediction
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)

    # the rest of the code creates the visualization
    # threshold value to consider a datapoint an inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
    # decision_function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 2, i + 1)
    # fill blue colormap from minimum anomaly score to threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                     cmap=plt.cm.Blues_r)
    # draw red contour line where anomaly score equals the threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange where the anomaly score ranges from threshold to maximum
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    # scatter plot of inliers with white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                        c='white', s=20, edgecolor='k')
    # scatter plot of outliers with black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                        c='black', s=20, edgecolor='k')
    subplot.axis('tight')
    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')
    subplot.set_title(clf_name)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
plt.show()
```
PyOD in Practice: Finding Anomalies in Big Mart Sales Data
Dataset: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
Data description:
| Variable | Description |
|----------|-------------|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or not |
| Item_Visibility | The % of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of city in which the store is located |
| Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the outcome variable to be predicted. |
1. Load the required Python packages and modules:
```python
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
import warnings
warnings.filterwarnings('ignore')

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
```
2. Read the data and draw an Item_MRP vs. Item_Outlet_Sales scatter plot to get a feel for the data:
```python
df = pd.read_csv("train_kOBLwZA.csv")
df.plot.scatter('Item_MRP', 'Item_Outlet_Sales')
```
3. Scale the data to the [0, 1] range:
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['Item_MRP', 'Item_Outlet_Sales']] = scaler.fit_transform(
    df[['Item_MRP', 'Item_Outlet_Sales']])
df[['Item_MRP', 'Item_Outlet_Sales']].head()
```
4. Store these values in a NumPy array for later use in our models:
```python
X1 = df['Item_MRP'].values.reshape(-1, 1)
X2 = df['Item_Outlet_Sales'].values.reshape(-1, 1)
X = np.concatenate((X1, X2), axis=1)
```
5. Create a dictionary of models, setting the outlier fraction to 0.05 (5%):
```python
random_state = np.random.RandomState(1024)
outliers_fraction = 0.05

# Define seven outlier detection tools to be compared
classifiers = {
    'Angle-based Outlier Detector (ABOD)':
        ABOD(contamination=outliers_fraction),
    'Cluster-based Local Outlier Factor (CBLOF)':
        CBLOF(contamination=outliers_fraction, check_estimator=False,
              random_state=random_state),
    'Feature Bagging':
        FeatureBagging(LOF(n_neighbors=35), contamination=outliers_fraction,
                       check_estimator=False, random_state=random_state),
    'Histogram-base Outlier Detection (HBOS)':
        HBOS(contamination=outliers_fraction),
    'Isolation Forest':
        IForest(contamination=outliers_fraction, random_state=random_state),
    'K Nearest Neighbors (KNN)':
        KNN(contamination=outliers_fraction),
    'Average KNN':
        KNN(method='mean', contamination=outliers_fraction)
}
```
6. Fit each model in turn and see how each one flags outliers differently:
```python
xx, yy = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
    # prediction of a datapoint category: outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))

    # copy of dataframe
    dfx = df
    dfx['outlier'] = y_pred.tolist()

    # IX1 - inlier feature 1, IX2 - inlier feature 2
    IX1 = np.array(dfx['Item_MRP'][dfx['outlier'] == 0]).reshape(-1, 1)
    IX2 = np.array(dfx['Item_Outlet_Sales'][dfx['outlier'] == 0]).reshape(-1, 1)

    # OX1 - outlier feature 1, OX2 - outlier feature 2
    OX1 = dfx['Item_MRP'][dfx['outlier'] == 1].values.reshape(-1, 1)
    OX2 = dfx['Item_Outlet_Sales'][dfx['outlier'] == 1].values.reshape(-1, 1)

    print('OUTLIERS : ', n_outliers, 'INLIERS : ', n_inliers, clf_name)

    # threshold value to consider a datapoint an inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outliers_fraction)

    # decision_function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)

    # fill blue colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                 cmap=plt.cm.Blues_r)
    # draw red contour line where anomaly score equals the threshold
    a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange where the anomaly score ranges from threshold to maximum
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    b = plt.scatter(IX1, IX2, c='white', s=20, edgecolor='k')
    c = plt.scatter(OX1, OX2, c='black', s=20, edgecolor='k')

    plt.axis('tight')
    # loc=2 is used for the top left corner
    plt.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'inliers', 'outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()
```
Output:
```
OUTLIERS : 447 INLIERS : 8076 Angle-based Outlier Detector (ABOD)
OUTLIERS : 427 INLIERS : 8096 Cluster-based Local Outlier Factor (CBLOF)
OUTLIERS : 392 INLIERS : 8131 Feature Bagging
OUTLIERS : 501 INLIERS : 8022 Histogram-base Outlier Detection (HBOS)
OUTLIERS : 427 INLIERS : 8096 Isolation Forest
OUTLIERS : 311 INLIERS : 8212 K Nearest Neighbors (KNN)
OUTLIERS : 176 INLIERS : 8347 Average KNN
```