
PyOD: A Python Outlier Detection Package

钱魏Way

Introduction to PyOD

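Anomaly detection, also called outlier analysis or outlier detection, has a very wide range of industrial applications:

  • Finance: finding "fraud cases" in massive data, for example credit card fraud prevention and identifying fake loans
  • Network security: finding "intruders" in traffic data and identifying new network intrusion patterns
  • Online retail: spotting "malicious buyers" in transaction data, for example fake-review abuse
  • Biology and genetics: detecting "lesions" or "mutations" in biological data

It can also be used as a preprocessing step in machine learning tasks, to prevent a small number of anomalous points from causing training or prediction to fail. In other words, anomaly detection means finding the data that "looks different" in a sea of data. The detection process is usually fairly complex, and in practice the data typically has no labels, so we do not know which points are anomalous and simple supervised learning is hard to apply directly. Outlier detection faces other difficulties as well, such as extreme class imbalance, the many forms anomalies can take, and the complexity of analyzing what caused them.

Outliers are not necessarily a bad thing. For example, if one mouse in a biology experiment survives while all the others die, understanding why would be very interesting and might lead to a new scientific discovery. This is why detecting outliers matters.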

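Python Outlier Detection (PyOD) is a Python toolkit for outlier detection. Besides the four models available in scikit-learn, it provides many additional ones, such as:

  • Traditional outlier detection methods: HBOS, PCA, ABOD, Feature Bagging, and others.
  • Deep learning and neural network based outlier detection: autoencoders (implemented with Keras)

Its main highlights include:

  • Nearly 20 common outlier detection algorithms, from the classic LOF/LOCI/ABOD to recent deep learning approaches such as generative adversarial networks (GAN), as well as outlier ensembles
  • All algorithms share a common API, so they are quick to call and swap. Both Python 2 and 3 are supported, as are Windows, macOS, and Linux.
  • The code is heavily optimized: most models use JIT compilation and parallelization to speed up execution and improve scalability, so large datasets can be handled
  • Detailed documentation and plenty of examples make it quick to get started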

PyOD Built-in Algorithms

The PyOD toolkit consists of three main functional groups:

i) Individual Detection Algorithms:

Type | Abbr | Algorithm | Year | Ref
Linear Model | PCA | Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) | 2003 | [24]
Linear Model | MCD | Minimum Covariance Determinant (uses the Mahalanobis distances as the outlier scores) | 1999 | [9] [22]
Linear Model | OCSVM | One-Class Support Vector Machines | 2001 | [23]
Linear Model | LMDD | Deviation-based Outlier Detection (LMDD) | 1996 | [5]
Proximity-Based | LOF | Local Outlier Factor | 2000 | [6]
Proximity-Based | COF | Connectivity-Based Outlier Factor | 2002 | [25]
Proximity-Based | CBLOF | Clustering-Based Local Outlier Factor | 2003 | [10]
Proximity-Based | LOCI | LOCI: Fast outlier detection using the local correlation integral | 2003 | [19]
Proximity-Based | HBOS | Histogram-based Outlier Score | 2012 | [7]
Proximity-Based | kNN | k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) | 2000 | [21]
Proximity-Based | AvgKNN | Average kNN (use the average distance to k nearest neighbors as the outlier score) | 2002 | [4]
Proximity-Based | MedKNN | Median kNN (use the median distance to k nearest neighbors as the outlier score) | 2002 | [4]
Proximity-Based | SOD | Subspace Outlier Detection | 2009 | [14]
Probabilistic | ABOD | Angle-Based Outlier Detection | 2008 | [13]
Probabilistic | FastABOD | Fast Angle-Based Outlier Detection using approximation | 2008 | [13]
Probabilistic | SOS | Stochastic Outlier Selection | 2012 | [11]
Outlier Ensembles | IForest | Isolation Forest | 2008 | [17]
Outlier Ensembles |  | Feature Bagging | 2005 | [15]
Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | [28]
Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) | 2018 | [27]
Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies | 2016 | [20]
Neural Networks | AutoEncoder | Fully connected AutoEncoder (use reconstruction error as the outlier score) |  | [1] [Ch.3]
Neural Networks | VAE | Variational AutoEncoder (use reconstruction error as the outlier score) | 2013 | [12]
Neural Networks | SO_GAAL | Single-Objective Generative Adversarial Active Learning | 2019 | [18]
Neural Networks | MO_GAAL | Multiple-Objective Generative Adversarial Active Learning | 2019 | [18]

ii) Outlier Ensembles & Outlier Detector Combination Frameworks:

Type | Abbr | Algorithm | Year | Ref
Outlier Ensembles |  | Feature Bagging | 2005 | [15]
Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | [28]
Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) | 2018 | [27]
Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies | 2016 | [20]
Combination | Average | Simple combination by averaging the scores | 2015 | [2]
Combination | Weighted Average | Simple combination by averaging the scores with detector weights | 2015 | [2]
Combination | Maximization | Simple combination by taking the maximum scores | 2015 | [2]
Combination | AOM | Average of Maximum | 2015 | [2]
Combination | MOA | Maximization of Average | 2015 | [2]
Combination | Median | Simple combination by taking the median of the scores | 2015 | [2]
Combination | Majority Vote | Simple combination by taking the majority vote of the labels (weights can be used) | 2015 | [2]
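
To make the combination rows above concrete, here is a minimal NumPy sketch of four of them. These are standalone illustrative functions, not PyOD's own implementations, and they make two simplifying assumptions: the scores of the different detectors are already on a comparable scale, and the buckets used by AOM and MOA are fixed contiguous groups of detectors rather than randomly drawn ones.

```python
import numpy as np

def average(scores):
    """Mean score of each sample across all detectors."""
    return scores.mean(axis=1)

def maximization(scores):
    """Maximum score of each sample across all detectors."""
    return scores.max(axis=1)

def _buckets(n_detectors, n_buckets):
    # split detector indices into contiguous groups (a simplification)
    return np.array_split(np.arange(n_detectors), n_buckets)

def aom(scores, n_buckets):
    """Average of Maximum: take the max inside each bucket, then average."""
    groups = _buckets(scores.shape[1], n_buckets)
    return np.mean([scores[:, g].max(axis=1) for g in groups], axis=0)

def moa(scores, n_buckets):
    """Maximization of Average: average inside each bucket, then take the max."""
    groups = _buckets(scores.shape[1], n_buckets)
    return np.max([scores[:, g].mean(axis=1) for g in groups], axis=0)

# rows are samples, columns are detectors
scores = np.array([[0.1, 0.2, 0.3, 0.4],
                   [0.9, 0.1, 0.8, 0.2]])

print(average(scores))
print(maximization(scores))
print(aom(scores, n_buckets=2))
print(moa(scores, n_buckets=2))
```

Averaging is the more conservative choice, while maximization reacts to any single detector that raises an alarm; AOM and MOA sit between the two.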

iii) Utility Functions:

Type | Name | Function
Data | generate_data | Synthesized data generation; normal data is generated by a multivariate Gaussian and outliers are generated by a uniform distribution
Data | generate_data_clusters | Synthesized data generation in clusters; more complex data patterns can be created with multiple clusters
Stat | wpearsonr | Calculate the weighted Pearson correlation of two samples
Utility | get_label_n | Turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores
Utility | precision_n_scores | Calculate precision @ rank n
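
The last two utilities are easy to picture in code. The sketch below is a plain NumPy illustration of the idea, with hypothetical function names (label_top_n, precision_at_n) rather than the PyOD functions themselves, and it assumes there are no tied scores at the cut-off.

```python
import numpy as np

def label_top_n(scores, n):
    """Assign 1 to the n highest outlier scores and 0 to everything else."""
    labels = np.zeros(len(scores), dtype=int)
    labels[np.argsort(scores)[-n:]] = 1
    return labels

def precision_at_n(y_true, scores):
    """Precision @ rank n, where n is the number of true outliers."""
    y_true = np.asarray(y_true)
    n = int(y_true.sum())
    y_pred = label_top_n(scores, n)
    return np.sum((y_pred == 1) & (y_true == 1)) / n

y_true = np.array([0, 0, 1, 1, 0])
scores = np.array([0.1, 0.9, 0.8, 0.2, 0.3])

print(label_top_n(scores, 2))          # the two highest scores are flagged
print(precision_at_n(y_true, scores))  # one of the two flagged points is a true outlier
```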

Angle-Based Outlier Detection (ABOD)

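  • It considers the relationship between each point and its neighbors, but not the relationships among those neighbors. The variance of a point's weighted cosine scores to all of its neighbors can be viewed as its outlying score
  • ABOD performs well on multi-dimensional data
  • PyOD provides two versions of ABOD:
    • Fast ABOD: approximates using the k nearest neighbors
    • Original ABOD: considers all training points, at high time complexity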
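
The "variance of weighted cosines" idea can be sketched in a few lines of NumPy. This is a simplified illustration of the original (all-pairs) variant, assuming an unweighted variance and no duplicate points; the function name abof is chosen here for illustration. A point far from the rest sees every other point in roughly the same direction, so its variance is small.

```python
import numpy as np
from itertools import combinations

def abof(X, i):
    """Variance of the distance-weighted cosines between all pairs of
    difference vectors that start at point i (smaller = more outlying)."""
    p = X[i]
    others = np.delete(X, i, axis=0)
    values = []
    for a, b in combinations(others, 2):
        pa, pb = a - p, b - p
        # cosine of the angle, additionally divided by both distances
        values.append(np.dot(pa, pb) / (np.dot(pa, pa) * np.dot(pb, pb)))
    return np.var(values)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]])

factors = np.array([abof(X, i) for i in range(len(X))])
most_outlying = int(np.argmin(factors))
print(most_outlying)  # index of the point with the smallest variance
```

The double loop over pairs is what makes the original variant expensive; restricting the pairs to the k nearest neighbors gives the fast approximation. A detector that follows the "higher score means more anomalous" convention would report the negated variance.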

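k-Nearest Neighbors Detector

  • For any data point, the distance to its kth nearest neighbor can be viewed as its outlying score
  • PyOD supports three kNN detectors:
    • Largest: uses the distance to the kth neighbor as the outlier score
    • Mean: uses the average distance to all k neighbors as the outlier score
    • Median: uses the median of the distances to the k neighbors as the outlier score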
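
The three variants differ only in how the k neighbor distances are reduced to one number. A brute-force NumPy sketch (an illustration, not PyOD's implementation, which relies on neighbor-search structures) looks like this; the function name knn_scores is made up for the example.

```python
import numpy as np

def knn_scores(X, k, method="largest"):
    """Outlier score of each row of X from its k nearest neighbors."""
    # full pairwise Euclidean distance matrix (fine for small data only)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # after sorting, column 0 is the point itself (distance 0), so skip it
    neigh = np.sort(dist, axis=1)[:, 1:k + 1]
    if method == "largest":
        return neigh[:, -1]
    if method == "mean":
        return neigh.mean(axis=1)
    if method == "median":
        return np.median(neigh, axis=1)
    raise ValueError("method must be 'largest', 'mean' or 'median'")

X = np.array([[0.0], [1.0], [2.0], [10.0]])

print(knn_scores(X, k=2, method="largest"))
print(knn_scores(X, k=2, method="mean"))
print(knn_scores(X, k=2, method="median"))
```

In all three cases the isolated point at 10 receives by far the highest score.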

Isolation Forest

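  • It uses the scikit-learn library internally. The data is partitioned with a set of trees, and the isolation forest produces an anomaly score that reflects how isolated a point is within that structure. The anomaly score is then used to separate outliers from normal observations
  • Isolation Forest performs well on multi-dimensional data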
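
Since the detector builds on scikit-learn, the underlying idea can be tried directly with sklearn.ensemble.IsolationForest. The snippet below is a small sketch on made-up data (100 Gaussian points plus one distant point); note that scikit-learn's score_samples is higher for more normal points, so it is negated here to get a "higher means more anomalous" score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic data: a Gaussian cloud plus one point far away from it
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# negate so that larger values mean "easier to isolate"
anomaly_scores = -forest.score_samples(X)
most_isolated = int(np.argmax(anomaly_scores))
print(most_isolated)
```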

Histogram-based Outlier Detection

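  • An efficient unsupervised method that assumes the features are independent and computes outlier scores by building histograms
  • It is much faster than multivariate methods, at the cost of lower precision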
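
The histogram idea can be sketched as follows: build one histogram per feature, look up the density of the bin each value falls into, and add up the negative log densities. This is a bare-bones illustration with fixed-width bins and a made-up function name (hbos_scores); it leaves out the smoothing and tolerance options a full implementation would have.

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Sum over features of -log(histogram density at each value)."""
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for j in range(n_features):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # bin index of every value, using the inner edges only so that
        # the maximum value lands in the last bin
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        # each value sits in a bin that contains at least itself,
        # so the density looked up here is never zero
        scores += -np.log(hist[idx])
    return scores

# a 5x5 grid of inliers plus one distant point
grid = [[x, y] for x in range(5) for y in range(5)]
X = np.array(grid + [[20, 20]], dtype=float)

scores = hbos_scores(X, n_bins=5)
print(int(np.argmax(scores)))  # the distant point sits alone in its bins
```

Because each feature is handled on its own, the cost grows linearly with the number of features, which is where the speed advantage comes from.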

Local Correlation Integral (LOCI)

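  • LOCI is very effective at detecting both outliers and groups of outliers. It provides a LOCI plot for each point that summarizes a great deal of information about the data around that point, identifying clusters, micro-clusters, their diameters, and their inter-cluster distances
  • Methods that output only a single number per point cannot offer this kind of summary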

Feature Bagging

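  • A feature bagging detector fits a number of base detectors on various sub-samples of the dataset, and uses averaging or another combination method to improve prediction accuracy
  • By default, Local Outlier Factor (LOF) is used as the base estimator, but any detector, such as kNN or ABOD, can serve as the base estimator
  • Feature bagging first constructs n sub-samples by randomly selecting subsets of the features, which gives the base estimators their diversity. The final prediction score is produced by averaging, or taking the maximum over, all base detectors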
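
The procedure in the bullets above can be sketched as below. To stay self-contained this illustration swaps in a simple kth-neighbor distance as the base detector instead of LOF, draws between d/2 and d-1 features per round, and averages the raw scores without standardizing them; all function names are made up for the example, and at least two features are assumed.

```python
import numpy as np
from itertools import product

def kth_neighbor_distance(X, k):
    """Base detector for this sketch: distance to the kth nearest neighbor."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.sort(dist, axis=1)[:, k]  # column 0 is the point itself

def feature_bagging_scores(X, n_estimators=10, k=3, seed=0):
    """Average the base detector's scores over random feature subsets."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    all_scores = []
    for _ in range(n_estimators):
        # subset size between d/2 and d-1 (upper bound is exclusive)
        size = rng.integers(n_features // 2, n_features)
        cols = rng.choice(n_features, size=size, replace=False)
        all_scores.append(kth_neighbor_distance(X[:, cols], k))
    return np.mean(all_scores, axis=0)

# inliers: the 16 corners of the unit hypercube; outlier: one distant point
corners = np.array(list(product([0.0, 1.0], repeat=4)))
X = np.vstack([corners, [[10.0, 10.0, 10.0, 10.0]]])

scores = feature_bagging_scores(X)
print(int(np.argmax(scores)))
```

Replacing np.mean with np.max in the last line of feature_bagging_scores gives the maximum-based combination mentioned above.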

Clustering Based Local Outlier Factor

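  • It divides the data into small clusters and large clusters, then computes the anomaly score from the size of the cluster a point belongs to and the distance to the nearest large cluster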

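Using PyOD

API Overview

Note in particular that outlier detection algorithms are almost all unsupervised, so they need only X (the input data) and not y (the labels). Using PyOD is much like using clustering in Sklearn: every PyOD detector clf exposes the same API.

  • fit(X): "train/fit" the detector clf on the data X, i.e. after initializing clf, use X to "train" it.
  • fit_predict_score(X, y): fit clf on X, predict on X, and evaluate the predictions against the ground-truth labels y. Here y is used only for evaluation, not for training
  • decision_function(X): once clf has been fit, this computes how anomalous unseen data is. It returns raw scores rather than 0/1 labels; the higher the score, the more anomalous the point
  • predict(X): once clf has been fit, this predicts outlier labels for unseen data, returning binary labels (0 for normal points, 1 for outliers)
  • predict_proba(X): once clf has been fit, this predicts, for unseen data, the probability that each point is an outlier

Once a detector clf has been initialized and fit(X) has been called, clf has two important attributes:

  • decision_scores_: the outlier scores on the data X; the higher the score, the more anomalous the point
  • labels_: the outlier labels on the data X, as binary labels (0 for normal points, 1 for outliers)

In short, after initializing a detector clf we can "train" it directly on X and then read off X's outlier scores (clf.decision_scores_) and outlier labels (clf.labels_). Once clf has been trained (that is, once fit has been called), decision_function() and predict() can be used to score and label unseen data.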

Example code:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.utils.data import generate_data, get_outliers_inliers

# generate random data with two features
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=2)

# by default the outlier fraction is 0.1 in generate data function
outlier_fraction = 0.1

# store outliers and inliers in different numpy arrays
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)

n_inliers = len(x_inliers)
n_outliers = len(x_outliers)

# separate the two features and use it to plot the data
F1 = X_train[:, [0]].reshape(-1, 1)
F2 = X_train[:, [1]].reshape(-1, 1)

# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))

# scatter plot
plt.scatter(F1, F2)
plt.xlabel('F1')
plt.ylabel('F2')

# build a dictionary holding every model we want to use for outlier detection:
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outlier_fraction),
    'K Nearest Neighbors (KNN)': KNN(contamination=outlier_fraction)
}

# fit the data to each model in the dictionary, then see how each model detects outliers:
# set the figure size
plt.figure(figsize=(12, 6))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the dataset to the model
    clf.fit(X_train)

    # predict raw anomaly score
    scores_pred = clf.decision_function(X_train) * -1

    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X_train)

    # no of errors in prediction
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)

    # rest of the code is to create the visualization

    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)

    # decision function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)

    subplot = plt.subplot(1, 2, i + 1)

    # fill blue colormap from minimum anomaly score to threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10), cmap=plt.cm.Blues_r)

    # draw red contour line where anomaly score is equal to threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')

    # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')

    # scatter plot of inliers with white dots
    # (use the arrays split by label above instead of assuming outliers are stored last)
    b = subplot.scatter(x_inliers[:, 0], x_inliers[:, 1], c='white', s=20, edgecolor='k')
    # scatter plot of outliers with black dots
    c = subplot.scatter(x_outliers[:, 0], x_outliers[:, 1], c='black', s=20, edgecolor='k')
    subplot.axis('tight')

    subplot.legend(
        [a.legend_elements()[0][0], b, c],  # ContourSet.collections is deprecated in newer Matplotlib
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')

    subplot.set_title(clf_name)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
plt.show()

PyOD in Practice: Finding Anomalies in Big Mart Sales Data

Data source: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

Data description:

Variable | Description
Item_Identifier | Unique product ID
Item_Weight | Weight of product
Item_Fat_Content | Whether the product is low fat or not
Item_Visibility | The % of total display area of all products in a store allocated to the particular product
Item_Type | The category to which the product belongs
Item_MRP | Maximum Retail Price (list price) of the product
Outlet_Identifier | Unique store ID
Outlet_Establishment_Year | The year in which the store was established
Outlet_Size | The size of the store in terms of ground area covered
Outlet_Location_Type | The type of city in which the store is located
Outlet_Type | Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales | Sales of the product in the particular store. This is the outcome variable to be predicted.

1. Load the Python packages and modules we need

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
import warnings
warnings.filterwarnings('ignore')

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

2. Read the data and draw a scatter plot of Item MRP vs Item Outlet Sales to get a feel for it:

df = pd.read_csv("train_kOBLwZA.csv")
df.plot.scatter('Item_MRP','Item_Outlet_Sales')

3. Scale both columns to the range [0, 1]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['Item_MRP','Item_Outlet_Sales']] = scaler.fit_transform(df[['Item_MRP','Item_Outlet_Sales']])
df[['Item_MRP','Item_Outlet_Sales']].head()
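
For reference, min-max scaling to [0, 1] maps each value x of a column to (x - min) / (max - min). The NumPy sketch below shows the same arithmetic on a few made-up prices; it assumes the column is not constant, since otherwise the denominator would be zero.

```python
import numpy as np

def min_max_scale(column):
    """Rescale a 1-D array linearly so its minimum is 0 and its maximum is 1."""
    col_min, col_max = column.min(), column.max()
    return (column - col_min) / (col_max - col_min)

prices = np.array([50.0, 100.0, 150.0, 250.0])  # made-up values
scaled = min_max_scale(prices)
print(scaled)
```

Putting both features on the same scale matters here because several of the detectors are distance-based, and an unscaled sales column would otherwise dominate the distances.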

4. Store the values in a NumPy array for later use in our models:

X1 = df['Item_MRP'].values.reshape(-1,1)
X2 = df['Item_Outlet_Sales'].values.reshape(-1,1)
X = np.concatenate((X1,X2),axis=1)

5. Create the dictionary of models, setting the outlier fraction (contamination) to 0.05 (5%):

random_state = np.random.RandomState(1024)
outliers_fraction = 0.05
# Define seven outlier detection tools to be compared
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outliers_fraction),
    'Cluster-based Local Outlier Factor (CBLOF)':CBLOF(contamination=outliers_fraction,check_estimator=False, random_state=random_state),
    'Feature Bagging':FeatureBagging(LOF(n_neighbors=35),contamination=outliers_fraction,check_estimator=False,random_state=random_state),
    'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
    'Isolation Forest': IForest(contamination=outliers_fraction,random_state=random_state),
    'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
    'Average KNN': KNN(method='mean',contamination=outliers_fraction)
}
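
The contamination value is what turns raw scores into 0/1 labels. Roughly speaking, the detector picks a score threshold so that about that fraction of the training data lies above it. The NumPy sketch below illustrates this reading with made-up scores; the function name is hypothetical, and the exact rule inside each detector should be checked against the PyOD documentation.

```python
import numpy as np

def threshold_from_contamination(train_scores, contamination):
    """Score above which roughly `contamination` of the training data falls."""
    return np.percentile(train_scores, 100 * (1 - contamination))

train_scores = np.arange(1, 101, dtype=float)  # made-up scores 1..100
threshold = threshold_from_contamination(train_scores, contamination=0.05)

# points scoring above the threshold are labelled as outliers
labels = (train_scores > threshold).astype(int)
print(threshold, labels.sum())
```

When many points share the same score at the cut-off, the flagged fraction can end up noticeably different from the requested one.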

6. Fit each model in turn and see how differently each one predicts outliers:

xx , yy = np.meshgrid(np.linspace(0,1 , 200), np.linspace(0, 1, 200))

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1
        
    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)
    plt.figure(figsize=(10, 10))
    
    # work on a real copy so the original dataframe is not modified
    dfx = df.copy()
    dfx['outlier'] = y_pred.tolist()
    
    # IX1 - inlier feature 1,  IX2 - inlier feature 2
    IX1 =  np.array(dfx['Item_MRP'][dfx['outlier'] == 0]).reshape(-1,1)
    IX2 =  np.array(dfx['Item_Outlet_Sales'][dfx['outlier'] == 0]).reshape(-1,1)
    
    # OX1 - outlier feature 1, OX2 - outlier feature 2
    OX1 =  dfx['Item_MRP'][dfx['outlier'] == 1].values.reshape(-1,1)
    OX2 =  dfx['Item_Outlet_Sales'][dfx['outlier'] == 1].values.reshape(-1,1)
         
    print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)
        
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction)
        
    # decision function calculates the raw anomaly score for every point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
          
    # fill blue map colormap from minimum anomaly score to threshold value
    plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),cmap=plt.cm.Blues_r)
        
    # draw red contour line where anomaly score is equal to threshold
    a = plt.contour(xx, yy, Z, levels=[threshold],linewidths=2, colors='red')
        
    # fill orange contour lines where range of anomaly score is from threshold to maximum anomaly score
    plt.contourf(xx, yy, Z, levels=[threshold, Z.max()],colors='orange')
        
    b = plt.scatter(IX1,IX2, c='white',s=20, edgecolor='k')
    
    c = plt.scatter(OX1,OX2, c='black',s=20, edgecolor='k')
       
    plt.axis('tight')  
    
    # loc=2 is used for the top left corner 
    plt.legend(
        [a.legend_elements()[0][0], b,c],  # ContourSet.collections is deprecated in newer Matplotlib
        ['learned decision function', 'inliers','outliers'],
        prop=matplotlib.font_manager.FontProperties(size=20),
        loc=2)
      
    plt.xlim((0, 1))
    plt.ylim((0, 1))
    plt.title(clf_name)
    plt.show()

Output:

OUTLIERS :  447 INLIERS :  8076 Angle-based Outlier Detector (ABOD)

OUTLIERS :  427 INLIERS :  8096 Cluster-based Local Outlier Factor (CBLOF)

OUTLIERS :  392 INLIERS :  8131 Feature Bagging

OUTLIERS :  501 INLIERS :  8022 Histogram-base Outlier Detection (HBOS)

OUTLIERS :  427 INLIERS :  8096 Isolation Forest

OUTLIERS :  311 INLIERS :  8212 K Nearest Neighbors (KNN)

OUTLIERS :  176 INLIERS :  8347 Average KNN
