Ensemble Learning: Interview Knowledge Points

Ensemble Learning

These notes were compiled and organized by me personally; if you have different views, corrections and suggestions are welcome.

What ensemble learning methods are there? What is the difference between bagging and boosting? (JD)

  • Boosting: AdaBoost, boosting trees, GBDT, XGBoost, LightGBM
  • Bagging: random forest

Differences (a minimal sklearn comparison follows this list):

  • Sampling: Bagging samples with replacement according to some scheme, and each draw is independent; Boosting uses the same training samples in every round, but their weights change.
  • Sample weights: in Bagging every sample has the same weight, while Boosting increases the weights of misclassified samples.
  • Training order: Bagging learners can be trained in parallel, whereas Boosting learners can only be trained sequentially (each later learner is built as an improvement on the ones before it).
  • Combination strategy: Bagging uses majority voting for classification and (weighted) averaging for regression; Boosting combines the learners with a weighted sum.
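
To make the contrast concrete, here is a minimal, hedged sketch with scikit-learn: a bagging ensemble (random forest) whose trees are trained independently on bootstrap samples, and a boosting ensemble (AdaBoost) whose learners are added sequentially while misclassified samples are up-weighted. The dataset and hyperparameters are arbitrary illustrative choices, not a benchmark.

```python
# Bagging vs. boosting, minimal illustration (synthetic data, arbitrary settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: independent trees on bootstrap samples, combined by voting.
bagging = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting: learners added sequentially, re-weighting misclassified samples.
boosting = AdaBoostClassifier(n_estimators=200, random_state=0)

for name, model in [("bagging (random forest)", bagging),
                    ("boosting (AdaBoost)", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: CV accuracy = {score:.3f}")
```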

How does the stochastic (randomized) version of boosting trees work?

  • In each iteration, the new decision tree is fit to the residuals on a subset of the original training set (rather than the full training set).
    The subset is drawn from the original training set by random sampling without replacement.
  • The subset fraction \(f\) is a hyperparameter and stays fixed across iterations.

    If \(f=1\), the algorithm is identical to the original gradient boosting tree algorithm.
    A smaller \(f\) introduces randomness, which helps against overfitting, so it can be seen as a form of regularization.
    Practical experience suggests that \(0.5 \leq f \leq 0.8\) gives good results.

  • Besides mitigating overfitting, another benefit is that the part of the data left out of the sample can be used to compute an out-of-bag error estimate, so there is no need for a separate validation set (a short sklearn sketch follows this list).
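
As a hedged, concrete illustration: scikit-learn exposes this subset fraction as the `subsample` parameter of its gradient boosting estimators, and when `subsample < 1.0` the out-of-bag improvement per iteration is tracked as well. The dataset and parameter values below are arbitrary.

```python
# Stochastic gradient boosting: each tree is fit on a random fraction of the data.
# `subsample` plays the role of the fraction f discussed above (values arbitrary).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=3000, n_features=10, noise=5.0, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.1,
    subsample=0.7,   # f in [0.5, 0.8] is a commonly cited sweet spot
    random_state=0,
).fit(X, y)

# With subsample < 1.0 the unused samples give an out-of-bag estimate
# of the loss improvement contributed by each iteration.
print(gbr.oob_improvement_[:5])
```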

Write down the principle of GBDT on paper and explain it (JD)

How does GBDT do classification? (JD)

What is the gradient in GBDT computed for? (JD)

In boosting trees, each step of the optimization is simple when the loss function is squared loss or exponential loss, but for a general loss function each step is often not so easy. This is why gradient boosting was proposed. It is an approximation based on steepest descent: the key idea is to take the negative gradient of the loss function, evaluated at the current model, as an approximation of the residual in the regression boosting tree algorithm, and to fit a regression tree to it. For squared loss the negative gradient is exactly the residual; for other loss functions it serves as an approximation of the residual.
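
The idea is easiest to see in a small from-scratch sketch for squared loss, where the negative gradient is exactly the residual \(y - F(x)\); the data, tree depth and learning rate below are arbitrary illustrative choices, not canonical settings.

```python
# Toy gradient boosting for squared loss: each tree fits the negative gradient,
# which for squared loss is exactly the residual y - F(x). Purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

F = np.full_like(y, y.mean())          # start from a constant model
trees, lr = [], 0.1                    # learning rate is an arbitrary choice

for _ in range(100):
    residual = y - F                   # negative gradient of 1/2 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    F += lr * tree.predict(X)          # take a step along the fitted direction

print("training MSE:", np.mean((y - F) ** 2))
```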

How does it differ from the other boosting methods? (JD)

Why does XGBoost use a second-order Taylor expansion rather than a first- or third-order one? / Compared with GBDT, XGBoost additionally uses second-order derivative information, which makes the computation more complex, so why use it? (JD)

  • Accuracy: compared with using only the first derivative, the second derivative makes each descent step faster and more accurate.
  • Extensibility: a second-order Taylor expansion of the loss expresses it in terms of the first and second derivatives only. As long as the loss is twice differentiable, the split and leaf-weight optimization can be computed from the per-sample statistics, regardless of whether the task is classification or regression, which is what makes custom loss functions possible.

A second-order expansion already approximates most loss functions well enough, so there is no need to go to higher orders; moreover, the more terms are kept, the stronger the differentiability requirement on the loss.
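
This is also why XGBoost's custom-objective interface asks for both the first and the second derivative of the loss with respect to the raw predictions. A hedged sketch (synthetic data; the objective simply re-implements squared error as a demonstration):

```python
# Custom objective in xgboost: return the gradient and hessian of the loss
# with respect to the raw predictions. Squared error is re-implemented as a demo.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = X @ rng.random(10) + 0.1 * rng.normal(size=1000)
dtrain = xgb.DMatrix(X, label=y)

def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels              # first derivative of 1/2 * (pred - y)^2
    hess = np.ones_like(preds)         # second derivative
    return grad, hess

booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_obj)
```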

How does XGBoost find the most important features? Is the sampling done with or without replacement?

During training, XGBoost produces a score for each feature, which indicates how important that feature is to the trained model.

XGBoost optimizes the model with gradients, and samples are drawn without replacement: if the same sample were drawn over and over, the gradient would step back and forth, which is clearly bad for convergence.

XGBoost supports subsampling, i.e. a boosting round does not have to use all of the samples.
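
For reference, a hedged sketch of reading the per-feature scores and turning on row/column subsampling through the standard xgboost Python API; the data are synthetic and the parameter values arbitrary.

```python
# Feature importance scores and sampling-related parameters in xgboost.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((2000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)
dtrain = xgb.DMatrix(X, label=y, feature_names=[f"f{i}" for i in range(8)])

params = {
    "objective": "binary:logistic",
    "subsample": 0.8,          # row subsampling per boosting round
    "colsample_bytree": 0.8,   # column subsampling per tree
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=100)

# Importance by total gain; 'weight' (split count) and 'cover' are also available.
print(booster.get_score(importance_type="gain"))
```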

How does XGBoost speed up computation?

  • Pre-sorting (pre-sorted column blocks).
    XGBoost introduces the column block data structure to reduce sorting time.

    Each block corresponds to one feature, and the samples in a block are sorted by their value on that feature.
    These blocks only need to be built once when the program starts; afterwards, split finding only requires a linear scan over the blocks.
    Because the features are independent of one another, the split search along each dimension can run in parallel.

    A block can store only the sample indices rather than the samples themselves, which saves a large amount of storage.
    For example, block_1 holds all samples ordered from small to large by feature_1: sample_no1, sample_no2, ....
    The position at which a sample index appears encodes that sample's rank. (How the exact and approximate split-finding backends are selected in the Python API is sketched after this list.)

  • Cache-aware prefetching.

    Inside a column block the sample order is shuffled, so fetching the gradient statistics of those samples from the gradient array has a low cache hit rate.
    XGBoost therefore uses a cache-aware prefetching scheme to raise the hit rate.

    XGBoost accumulates the statistics in mini-batches and starts a background thread to load the gradients that will be needed.
    There is a trade-off here: a mini-batch that is too large causes cache misses, while one that is too small limits parallelism.

  • Out-of-core computation for large datasets.
    XGBoost uses the disk to handle datasets that exceed memory capacity, with the following techniques:

    Block compression to relieve the memory-disk I/O: data are compressed column by column and decompressed automatically while being transferred from disk to memory.
    Block sharding: the data are randomly sharded onto multiple disks, each with its own prefetching thread, which increases the throughput of memory-disk data exchange.
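
These are internal engineering details, but the choice of split-finding backend does surface in the public API. A hedged sketch (synthetic data, arbitrary parameters; behavior may vary across xgboost versions):

```python
# Choosing xgboost's split-finding backend: 'exact' scans the pre-sorted column
# blocks, while 'approx' and 'hist' use quantile sketches / histograms instead.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((5000, 20))
y = rng.random(5000)
dtrain = xgb.DMatrix(X, label=y)

for method in ["exact", "approx", "hist"]:
    params = {"objective": "reg:squarederror",
              "tree_method": method,
              "max_depth": 6}
    xgb.train(params, dtrain, num_boost_round=20)
```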

Some internal optimizations in XGBoost

  • When looking for the best split, the traditional method enumerates every possible split point of every feature. XGBoost also implements an approximate algorithm: the rough idea is to propose, per feature, a few candidate split points based on percentiles, and then evaluate the split-gain formula above only on those candidates to pick the best split (a small sketch of this idea follows this list).
  • XGBoost handles sparse training data: a default branch direction can be learned for missing values or for a designated value, which greatly improves efficiency; the paper reports a speedup of up to 50x.
  • The feature columns are sorted once and stored in memory as blocks that can be reused across iterations; although the boosting iterations must run serially, the work over the individual feature columns can be parallelized.
  • Column-wise storage helps with finding the best split, but reading gradient statistics row by row leads to non-contiguous memory access and, in severe cases, cache misses that slow the algorithm down. The paper suggests first gathering the data into per-thread buffers and then computing on them, which improves efficiency.
  • XGBoost also considers very large datasets: when memory is insufficient, it combines multi-threading, data compression and sharding to make the best possible use of the disk.
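
The percentile idea itself is easy to sketch outside any library: propose a handful of quantile-based candidate thresholds for a feature and evaluate the split criterion only at those candidates. In the toy code below, variance reduction stands in for XGBoost's gradient/hessian-based gain formula, and the data are synthetic.

```python
# Sketch of percentile-based candidate splits for one feature. Variance
# reduction stands in for xgboost's gradient/hessian-based gain formula.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                    # a single feature column
y = (x > 0.3).astype(float) + 0.1 * rng.normal(size=10_000)

candidates = np.quantile(x, np.linspace(0.1, 0.9, 9))   # only a few percentiles

def gain(threshold):
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    return len(y) * y.var() - (len(left) * left.var() + len(right) * right.var())

best = max(candidates, key=gain)
print("best candidate threshold:", round(best, 3))
```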

How XGBoost prevents overfitting

  • Regularization terms in the objective: the number of leaves plus the L2 norm of the leaf weights.
  • Column subsampling: only part of the features is used for training (the remaining blocks are simply not considered).
  • Row subsampling: each round may use only part of the samples, which makes the algorithm more conservative.
  • Shrinkage: also called the learning rate or step size; it leaves more room for the later rounds to keep learning (a parameter sketch follows this list).
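
Each item above maps onto a parameter of the standard xgboost interface; the sketch below is a hedged illustration with arbitrary values and synthetic data, not recommended settings.

```python
# Overfitting-related parameters in xgboost (arbitrary example values).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((2000, 10))
y = (X.sum(axis=1) > 5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "gamma": 1.0,              # minimum loss reduction per split (leaf-count penalty)
    "reg_lambda": 1.0,         # L2 regularization on leaf weights
    "colsample_bytree": 0.8,   # column subsampling
    "subsample": 0.8,          # row subsampling
    "eta": 0.05,               # shrinkage / learning rate
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```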

What kinds of problems is XGBoost suited for? (Beike)

In terms of model type, both regression and classification work; in terms of task type, structured/tabular data, NLP, recommendation and ranking are all viable.

What are the differences between XGBoost and GBDT? (Microsoft; answer in an organized way: which are algorithmic improvements and which are engineering optimizations; you can also briefly mention LightGBM)

XGBoost is an engineering implementation of GBDT.

  • Base learner: GBDT uses CART trees, while XGBoost additionally supports linear base learners, in which case it amounts to logistic regression or linear regression with a regularization term.
  • Gradient information: GBDT uses only the first-order gradient to approximate the residual, while XGBoost also incorporates second-order gradient information, so the residual is approximated faster and more accurately.
  • Regularization: GBDT has no built-in regularization, while XGBoost adds the number of leaves and the L2 norm of the leaf weights to the objective to prevent overfitting (a form of pre-pruning).
  • Loss function: XGBoost supports custom loss functions, as long as they are twice differentiable.
  • Column subsampling: GBDT uses all features at every split, while XGBoost can subsample the columns, which helps against overfitting.
  • Missing values: traditional GBDT does not handle missing values; XGBoost leaves missing values out when computing split scores, and learns a default direction that sends them down one branch at each split.
  • Parallelization: GBDT is not parallelized; XGBoost pre-sorts each feature and stores it in block structures, then searches the blocks in parallel for the best split when splitting a node, which greatly speeds up training.

Differences between XGBoost and LightGBM

Related note: 机器学习-XGBoost和LightGBM (Machine Learning: XGBoost and LightGBM)

Both are implementations of GBDT. LightGBM optimizes memory usage and speed on top of XGBoost, making a certain trade-off between speed and accuracy.

Lower memory usage:

  • With pre-sorting, XGBoost must record the feature values together with the indices linking them to the statistics of the corresponding samples, whereas LightGBM's histogram algorithm converts feature values into bin values and needs no feature-to-sample index, reducing the space complexity from O(2*#data) to O(#bin) and greatly cutting memory consumption;
  • LightGBM stores bin values instead of raw feature values, which lowers memory consumption;
  • During training, LightGBM uses the Exclusive Feature Bundling algorithm to reduce the number of features, which lowers memory consumption.

Faster training:

  • The histogram algorithm turns "iterate over samples" into "iterate over histogram bins", greatly reducing the time complexity;
  • Gradient-based One-Side Sampling filters out samples with small gradients during training, saving a large amount of computation;
  • The Leaf-wise tree-growth strategy avoids a lot of unnecessary computation;
  • Optimized feature-parallel and data-parallel schemes speed up training, and voting-based parallelism can be used when the data are very large;
  • LightGBM also optimizes for the cache, increasing the cache hit rate (a LightGBM parameter sketch follows this list).
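
A hedged sketch of how these points map onto LightGBM parameters (values are arbitrary; the histogram algorithm and feature bundling are on by default, GOSS is opt-in, and the exact parameter spellings can vary across LightGBM versions):

```python
# LightGBM parameters touching the points above (arbitrary example values).
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((5000, 30))
y = rng.random(5000)
train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "regression",
    "max_bin": 255,          # histogram algorithm: number of bins per feature
    "num_leaves": 31,        # leaf-wise growth is capped by the number of leaves
    "boosting": "goss",      # gradient-based one-side sampling
                             # (newer versions expose this as data_sample_strategy="goss")
    "enable_bundle": True,   # exclusive feature bundling (EFB), on by default
    "learning_rate": 0.05,
}
booster = lgb.train(params, train_set, num_boost_round=200)
```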

What is the role of the learning rate in boosting algorithms, and what are its pros and cons?

The learning rate is part of the regularization: it slows down the model updates (so more iterations are needed).

  • Empirically, a small learning rate (below 0.1) significantly improves the model's generalization ability compared with a learning rate of 1.
  • A large learning rate causes large fluctuations in prediction performance (a short illustration of the trade-off follows this list).
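
A hedged illustration of the trade-off with scikit-learn: a smaller learning rate usually needs more trees but tends to generalize better. The dataset and settings are arbitrary.

```python
# Smaller learning rate + more trees vs. larger learning rate + fewer trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

for lr, n_trees in [(1.0, 50), (0.1, 500), (0.05, 1000)]:
    model = GradientBoostingClassifier(learning_rate=lr, n_estimators=n_trees,
                                       random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()
    print(f"learning_rate={lr:<4} n_estimators={n_trees:<4}  CV accuracy={score:.3f}")
```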

Advantages of random forests

  • Training is efficient, because each decision tree in a random forest only needs to consider a subset of all attributes, and the trees can be grown in parallel.
  • Random forests are simple, easy to implement, and cheap to compute.
  • Random forests show strong performance on many real-world tasks and have been called "a method representative of the state of the art of ensemble learning".

When are random forests a good fit?

When the data dimensionality is relatively low (tens of features) and reasonably high accuracy is required. (Random forests reach decent results without much hyperparameter tuning, so when you are not sure which method to use, trying a random forest first is usually a safe bet.)
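
In that spirit, a random forest with near-default settings makes a reasonable first baseline; a hedged sketch on a standard 30-feature dataset:

```python
# Random forest as a quick baseline on low-dimensional tabular data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # 30 features, i.e. "tens of dimensions"
baseline = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
print(f"CV accuracy: {cross_val_score(baseline, X, y, cv=5).mean():.3f}")
```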

What are the application scenarios of the various machine learning algorithms (e.g. naive Bayes, decision trees, k-nearest neighbors, SVM, logistic regression / maximum entropy)? Answer by xyzh on Zhihu: https://www.zhihu.com/question/26726794/answer/151282052

Advantages of ensemble learning

  • The hypothesis space of a learning task is often very large, and several hypotheses may reach the same performance on the training set.
    With a single learner, picking the wrong one may hurt generalization; combining learners reduces this risk.
  • Learning algorithms often get stuck in local minima, and some local minima correspond to poor generalization; combining learners lowers the risk of ending up in a bad local minimum.
  • The true hypothesis of some learning tasks may not lie in the hypothesis space considered by the current learning algorithm, in which case a single learner is bound to fail.
    Combining learners enlarges the corresponding hypothesis space, so a better approximation may be learned.

Which is better, weighted averaging or simple averaging?

In real tasks the training samples are usually insufficient or noisy, which makes the weights learned for weighted averaging not fully reliable. In particular, for a large ensemble there are many weights to learn, which easily leads to overfitting. As a rule of thumb, weighted averaging suits the case where the individual learners differ considerably in performance, while simple averaging suits the case where their performance is similar.
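
For completeness, a hedged sketch of both combination rules using scikit-learn's voting ensemble; the member models are arbitrary, and the weights are hand-set for illustration rather than learned.

```python
# Simple average vs. weighted average of individual regressors.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=2000, n_features=15, noise=10.0, random_state=0)
members = [("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
           ("ridge", Ridge()),
           ("knn", KNeighborsRegressor())]

simple = VotingRegressor(members)                       # equal weights
weighted = VotingRegressor(members, weights=[3, 1, 1])  # hand-set, illustrative weights

for name, model in [("simple average", simple), ("weighted average", weighted)]:
    print(f"{name}: CV R^2 = {cross_val_score(model, X, y, cv=3).mean():.3f}")
```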

How can diversity be enhanced in ensemble learning methods?

The general idea is to inject randomness into the learning process. Common approaches perturb the data samples, the input attributes, the output representation, or the algorithm parameters.

  • Data sample perturbation: from the initial dataset, different data subsets can be generated, and a different individual learner is trained on each subset.

    Data sample perturbation is usually based on sampling; it is simple, efficient, and the most widely used approach.

    For common base learners such as decision trees and neural networks, a small change in the training samples leads to a markedly different learner; sample perturbation is very effective for such "unstable base learners".

    Some base learners are insensitive to perturbations of the data samples, e.g. linear models, support vector machines, naive Bayes, and nearest-neighbor learners; these are called stable base learners.
    Ensembles of such learners usually need other mechanisms, such as input attribute perturbation.

  • Input attribute perturbation: training samples are usually described by a set of attributes, and different "subspaces" provide different views of the data, so individual learners trained in different subspaces will necessarily differ.

    For data with many redundant attributes, training individual learners in subspaces not only yields highly diverse individuals but also saves a lot of time because fewer attributes are used.
    And since many attributes are redundant, the learners trained on fewer attributes are not much worse.

    For data that contain only a few attributes, or few redundant attributes, input attribute perturbation is not a good choice.

  • Output representation perturbation: manipulate the output representation to increase diversity.
    For example, the class labels of training samples can be altered slightly, as in Flipping Output, which randomly flips the labels of some training samples.
  • Algorithm parameter perturbation: base learning algorithms generally have hyperparameters to set, and choosing them randomly produces individual learners that differ considerably.
    With a single learner, one would normally use cross-validation to determine the best hyperparameter values; that procedure in fact trains multiple learners with different hyperparameters and only keeps the best-performing one.
    Ensemble learning instead makes use of all of these learners.
  • Different diversity-enhancement mechanisms can be combined; for example, random forests use both data sample perturbation and input attribute perturbation (a short sketch of the first two mechanisms follows this list).
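
A hedged sketch of the first two mechanisms using scikit-learn's bagging wrapper, which can perturb both the samples and the input attributes seen by each member (parameter values are arbitrary):

```python
# Data sample perturbation (bootstrap rows) plus input attribute perturbation
# (random feature subsets) for each ensemble member. The default base learner
# is a decision tree, i.e. an "unstable" learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=3000, n_features=40, n_informative=10,
                           random_state=0)

ensemble = BaggingClassifier(
    n_estimators=100,
    max_samples=0.8,     # sample perturbation: each member sees 80% of the rows
    bootstrap=True,      # drawn with replacement
    max_features=0.5,    # attribute perturbation: each member sees half the features
    random_state=0,
).fit(X, y)
print("train accuracy:", ensemble.score(X, y))
```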

RF vs GBT

  • From the model-framework perspective:

    The gradient boosting tree (GBT) is a boosting model: the learners are generated serially.
    The random forest (RF) is a bagging model: the learners are generated in parallel.

  • From the bias-variance perspective:

    GBT uses weak learners (high bias, low variance). It combines these weak learners, reducing the bias at every step while keeping the variance low; it may overfit.
    RF uses fully grown trees (low bias, high variance). It requires the trees to be as uncorrelated as possible, so that averaging reduces the variance while keeping the bias low; with many trees it does not overfit.

  • From the sampling perspective:

    GBT trains every learner on all of the data and all of the features (XGBoost relaxes this).
    RF uses bootstrap sampling over the samples and also samples the features, producing multiple datasets, each of which is used to train one learner.

  • If one has to choose between gradient boosting trees and random forests, the recommendation is almost always gradient boosting trees.

    Advantage of random forests: they naturally support parallel computation, because every tree is built independently.

    Advantages of gradient boosting trees:

    Gradient boosting trees achieve better accuracy with fewer trees.
    This is because in each iteration the new tree is fully accepted (its voting weight is 1), whereas in a random forest every tree is equally important no matter how well it performs, each having voting weight \(\frac{1}{N}\), so no tree is fully accepted.

    Gradient boosting trees can also be modified to run in parallel.

    Gradient boosting trees have an explicit mathematical model, so any task for which a gradient can be written down (e.g. ranking) can use gradient boosting; random forests have no such explicit mathematical model.