Learning to Rank算法学习之GBRank

文章内容如有错误或排版问题，请提交反馈，非常感谢！

GBRank是一种pair-wise的学习排序算法，他是基于回归来解决pair对的先后排序问题。在GBRank中，使用的回归算法是梯度提升数GBT (Gradient Boosting Tree)

算法原理

Learning To Rank需要解决的问题是给定一个Query，如何选择最相关的Document。GBRank核心为将排序问题转化为一组回归问题，对于回归问题可以用GBDT进行求解，也可以用其他的回归函数。

一般来说在搜索引擎里面，相关性越高的越应该排在前面。对于所有的query-document pair，我们从pair抽取出一系列特征对其进行表示。例如query1-document1记为$x$，query1-document2记为$y$。记$x\succ y$表示，用户发起查询query1时，$x$比$y$更适合，更加满足query1的需求。记训练集合为：

$$S=\{{\left\langle{{x_i},{y_i}}\right\rangle|{x_i}\succ{y_i},i=1,…,N}\}$$

给定排序函数空间$H$，我们希望得到一个排序函数$h$ ($h\in H$)，当$x_i\succ y_i$时，我们有$h\left({{x_i}}\right)\ge h\left({{y_i}}\right)$。损失函数定义为如下形式：

$$R(h)={1\over 2}{\sum\limits_{i=1}^N{({\max\{{0,h({{y_i}})-h({{x_i}})}\}})}^2$$

这个函数可以解读为，对于训练数据中的一个$\left\langle{{x_i},{y_i}}\right\rangle$，如果$h$学到了这种偏序关系，那么有$h(x_i)>h(y_i)$，$h$对于损失函数的贡献为0，否则为$({h({y_i})-h({x_i})})^2$。直接优化loss比较困难，可以通过改变$h(x_i)$或者$h(y_i)$达到减少loss的目的，例如用回归的方式来拟合$h(x_i)$、$h(y_i)$。

为了避免优化函数$h$是一个常量，在loss fuction上增加一个平滑项$\tau$，$0<\tau\le 1$。在实际应用中$\tau$为固定常数。

$$R(h,\tau)={1\over 2}\sum_{i=1}^{N}(\max\{0,h(y_i)-h(x_i)+\tau\})^2-\lambda\tau^2$$

因为当$h$为常量函数时，原$R(h)=0$就没有再优化的空间了。其实加了平衡项，就变相转为：如果希望$x_i\succ y_i$，就得有$h(x_i)\ge h(y_i)+\tau$，更加严格。多了一个gap。

按一般套路，用梯度下降的方法去最小化loss function。假设有未知函数$h(x_i)$，$h(y_i)$，$i=1,…,N$，loss $R(h)$对$h(x_i)$，$h(y_i)$的负梯度分别为：

$$\max\{0,h(y_i)-h(x_i)\},-\max\{0,h(y_i)-h(x_i)\}$$

当$h$满足$<{x_i},{y_i}>$偏序关系，$h(y_i)-h(x_i)<0$，上面两个梯度都会为0。当$h$不满足$<{x_i},{y_i}>$偏序关系时，梯度为：

$$h(y_i)-h(x_i),h(x_i)-h(y_i)$$

接下来，还需要知道如何将梯度作用到$h$的更新上，通过设定$x_i$的目标值为$h(y_i)+\tau$。$y_i$的目标值为$h(x_i)-\tau$。因此在每轮迭代中，当$h$不满足$<{x_i},{y_i}>$会产生一组数据：

$$\{(x_i,h(y_i)+\tau),(y_i,h(x_i)-\tau)\}$$我们需要拟合本轮产生的所有负例。

GBRank算法：

估计一个初始函数$h_0$（任意选择一个都可以）。
根据上一轮的$h_{k-1}$，将数据分为两个不相交的子集（正例和负例两个集合）：${S^+}=\{\left\langle{{x_i},{y_i}}\right\rangle\in S|{h_{k-1}}({x_i})\ge {h_{k-1}}({y_i})+\tau\}$和$S^-=\{\left\langle x_i,y_i\right\rangle\in S|h_{k-1}(x_i)<h_{k-1}(y_i)+\tau\}$。其中$S^-$作为我们下一步的训练集。
使用GBDT拟合负例集合中的数据$\{(x_i,h_{k-1}(y_i)+\tau),(y_i,h_{k-1}(x_i)-\tau)|(x_i,y_i)\in S^-\}$得到$g_k(x)$。（作者设置$\tau=0.1$，其他值也可以尝试。）
进行模型的更新：$h_k(x)=\frac{kh_{k-1}(x)+\eta g_k(x)}{k+1}$其中，$\eta$为伸缩因子。

可以看到step3里面每轮都拟合误判的结果，在迭代中这个集合会越来越小。还有一种做法是将曾经误判的集合维持在训练集中，那么训练集就会始终增长。在这个步骤中使用GBDT模型进行回归预测，当然其他的回归方法也可以使用。

GBrank的Python实现

from itertools import combinations
import numpy as np


class Node(object):
    def __init__(self, parent, data_index, predict_value):
        self.parent = parent
        self.data_index = data_index
        self.predict_value = predict_value
        self.left = None
        self.right = None
        self.split_feature_index = None
        self.split_feature_value = None


class RegressionTree(object):
    def __init__(self, min_data_in_leaf):
        self.min_data_in_leaf = min_data_in_leaf
        self.tree = None

    def fit(self, X, ys):
        tree = Node(None, np.arange(X.shape[0]), np.mean(ys))
        cand_leaves = [tree]

        while len(cand_leaves) != 0:
            min_squared_error = np.inf
            for leaf in cand_leaves:
                X_target = X[leaf.data_index]
                ys_target = ys[leaf.data_index]

                for d in range(X_target.shape[1]):
                    argsort = np.argsort(X_target[:, d])

                    # 以最小二乘法进行分割
                    for split in range(1, argsort.shape[0]):
                        #[0,split),[split,N_target)で分割
                        tmp_left_data_index = argsort[:split]
                        tmp_right_data_index = argsort[split:]

                        left_predict = np.mean(ys_target[tmp_left_data_index])
                        left_squared_error = np.sum((ys_target[tmp_left_data_index] - left_predict)**2)
                        right_predict = np.mean(ys_target[tmp_right_data_index])
                        right_squared_error = np.sum((ys_target[tmp_right_data_index] - right_predict)**2)

                        squared_error = left_squared_error + right_squared_error
                        if squared_error< min_squared_error:
                            min_squared_error = squared_error
                            target_leaf = leaf
                            left_data_index = leaf.data_index[tmp_left_data_index]
                            right_data_index = leaf.data_index[tmp_right_data_index]
                            split_feature_index = d
                            split_feature_value = X_target[:, d][tmp_right_data_index[0]]

            if min_squared_error == np.inf:
                break

            cand_leaves.remove(target_leaf)

            if left_data_index.shape[0] < self.min_data_in_leaf or right_data_index.shape[0] < self.min_data_in_leaf:
                continue

            left_node = Node(target_leaf, np.sort(left_data_index), np.mean(ys[left_data_index]))
            right_node = Node(target_leaf, np.sort(right_data_index), np.mean(ys[right_data_index]))
            target_leaf.split_feature_index = split_feature_index
            target_leaf.split_feature_value = split_feature_value
            target_leaf.left = left_node
            target_leaf.right = right_node

            if left_node.data_index.shape[0] > 1:
                cand_leaves.append(left_node)
            if right_node.data_index.shape[0] > 1:
                cand_leaves.append(right_node)

        self.tree = tree

    def predict(self, X):
        ys_predict = []
        for xs in X:
            node = self.tree
            while node.left is not None and node.right is not None:
                if xs[node.split_feature_index] < node.split_feature_value:
                    node = node.left
                else:
                    node = node.right
            ys_predict.append(node.predict_value)
        return np.array(ys_predict)


class GBrank(object):
    def __init__(self, n_trees, min_data_in_leaf, sampling_rate, shrinkage, tau=0.5):
        self.n_trees = n_trees
        self.min_data_in_leaf = min_data_in_leaf
        self.sampling_rate = sampling_rate
        self.shrinkage = shrinkage
        self.tau = tau
        self.trees = []

    def fit(self, X, ys, qid_lst):
        for n_tree in range(self.n_trees):
            if n_tree == 0:
                rt = RegressionTree(1)
                rt.fit(np.array([[0]]), np.array([0]))
                self.trees.append(rt)
                continue

            target_index = np.random.choice(
                X.shape[0],
                int(X.shape[0] * self.sampling_rate),
                replace=False
            )
            X_target = X[target_index]
            ys_target = ys[target_index]
            qid_target = qid_lst[target_index]

            ys_predict = self._predict(X_target, n_tree)

            qid_target_distinct = np.unique(qid_target)

            X_train_for_n_tree = []
            ys_train_for_n_tree = []
            for qid in qid_target_distinct:
                X_target_in_qid = X_target[qid_target == qid]
                ys_target_in_qid = ys_target[qid_target == qid]
                ys_predict_in_qid = ys_predict[qid_target == qid]

                for left, right in combinations(enumerate(zip(ys_target_in_qid, ys_predict_in_qid)), 2):
                    ind_1, (ys_target_1, ys_predict_1) = left
                    ind_2, (ys_target_2, ys_predict_2) = right

                    if ys_target_1< ys_target_2:
                        ys_target_1, ys_target_2 = ys_target_2, ys_target_1
                        ys_predict_1, ys_predict_2 = ys_predict_2, ys_predict_1
                        ind_1, ind_2 = ind_2, ind_1

                    if ys_predict_1< ys_predict_2 + self.tau:
                        X_train_for_n_tree.append(X_target_in_qid[ind_1])
                        ys_train_for_n_tree.append(ys_target_in_qid[ind_1] + self.tau)
                        X_train_for_n_tree.append(X_target_in_qid[ind_2])
                        ys_train_for_n_tree.append(ys_target_in_qid[ind_2] - self.tau)X_train_for_n_tree = np.array(X_train_for_n_tree)
ys_train_for_n_tree = np.array(ys_train_for_n_tree)

rt = RegressionTree(self.min_data_in_leaf)
rt.fit(X_train_for_n_tree, ys_train_for_n_tree)
self.trees.append(rt)

def _predict(self, X, n_predict_trees):
    predict_lst_by_trees = [self.trees[n_tree].predict(X) for n_tree in range(n_predict_trees)]
    predict_result = predict_lst_by_trees[0]
    for n_tree in range(1, n_predict_trees):
        predict_result = (n_tree * predict_result + self.shrinkage * predict_lst_by_trees[n_tree]) / (n_tree + 1)
    return predict_result

def predict(self, X):
    return self._predict(X, len(self.trees))

算法总结

$h(x)$为最终的排序函数，GBRank在训练时每次迭代中将$h_k(x)$的排序结果与真实结果不一样的样本（就是分错的样本）单独拿出来做训练样本，并且其训练目标为pair的另一个预测值作为回归目标，非常巧妙。

此时重新看这个GBRank模型与AdaBoost、GBT其实很大同小异，都是将上一次训练中分错的样本再拿来训练，也是一个提升的模型

在其paper中的实验结果也是要略好于$RankSvm$，但是其比较疼的时其训练还是比较复杂的，或者说比较耗时,其预测也会比较麻烦一点，所以使用时得慎重~

参考链接：

https://github.com/szdr/my-gbrank

Learning to Rank算法学习之GBRank

算法原理

GBrank的Python实现

算法总结

运营权重合理性评估与检测方法

文本易读性指标

《软件随想录》读书笔记

发表回复取消回复

Learning to Rank算法学习之GBRank

算法原理

GBrank的Python实现

算法总结

运营权重合理性评估与检测方法

文本易读性指标

《软件随想录》读书笔记

发表回复 取消回复

发表回复取消回复