A survey of how top Kaggle solutions handle imbalanced datasets

February 24, 2022

1. Porto Seguro's Safe Driver Prediction

Class 0: 573518

Class 1: 21694

Positive rate: 3.6%

None of the top 3 solutions used resampling.

The 1st-place author said upsampling was tried, but it did not work.

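2. TalkingData AdTracking Fraud Detection Challenge

About 180 million clicks

0.19% target rate

Rank 1 used downsampling. The write-up does not name the method, but reading between the lines it was most likely simple random downsampling, repeated 5 times to build 5 different models that were then bagged.

Rank 1 sampled because of compute cost: after downsampling, models could be built and iterated on quickly.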
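
The bagging scheme described for rank 1 could be sketched roughly as follows. This is a minimal sketch in plain Python, not the winner's code: it assumes that only the negative class is downsampled and that every bag keeps all positives. The function names `negative_downsample_bags` and `bagged_scores` and the 10% keep fraction are hypothetical.

```python
import random
from statistics import mean


def negative_downsample_bags(labels, n_bags=5, neg_fraction=0.1, seed=0):
    """Return n_bags index lists: all positives plus a random subset of negatives."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = max(1, round(len(neg_idx) * neg_fraction))
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        # Each bag draws its own negatives without replacement.
        sampled_neg = rng.sample(neg_idx, n_keep)
        bags.append(sorted(pos_idx + sampled_neg))
    return bags


def bagged_scores(per_model_scores):
    """Average the scores of several models, row by row."""
    return [mean(row) for row in zip(*per_model_scores)]


# Toy data: 10 positives among 1000 rows (1% positive rate).
labels = [1] * 10 + [0] * 990
bags = negative_downsample_bags(labels, n_bags=5, neg_fraction=0.1, seed=42)
```

In practice one model would be trained on the rows of each bag, and the per-model predictions on the test set would be passed to `bagged_scores`. Because each bag sees different negatives, averaging recovers part of the information that a single downsample throws away.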

Rank 2's solution

Rank 2 downsampled, keeping 5% of the negative ("good") samples, and apparently also used a smaller scale_pos_weight. According to the author, tuning it made little difference. It is unclear whether scale_pos_weight=1 was tried.

Q: Da Lao, when you use sample dataset, how did you setup your lightgbm hyperparameter? What is the value of scale_pos_weight? How did you deal with sample dataset imbalance?

A: Good question. Intuitively, the scale_pos_weight param is directly proportional to the ratio of negative samples. So if you set scale_pos_weight to 100 when training the entire dataset, you should set it to 100*5%=5 after sampling 5% negative samples. (You can also tune the parameter, it does not make a big difference.)

My impression is that about half of rank 2's reason for sampling was also compute cost and fast iteration.
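
The rule of thumb from the answer above (100 * 5% = 5) can be written as a small helper. This is only a sketch: `rescale_pos_weight` is a hypothetical name, and the parameter dict shows where the value would go in a LightGBM configuration, with everything except `scale_pos_weight` being a placeholder.

```python
def rescale_pos_weight(full_data_weight, neg_keep_fraction):
    """Scale scale_pos_weight by the fraction of negatives that were kept."""
    if not 0 < neg_keep_fraction <= 1:
        raise ValueError("neg_keep_fraction must be in (0, 1]")
    return full_data_weight * neg_keep_fraction


# The example from the answer: weight 100 on the full data, 5% of negatives kept.
weight_after_sampling = rescale_pos_weight(100, 0.05)

# Hypothetical LightGBM parameter dict; only scale_pos_weight comes from the
# discussion, the other values are placeholders.
lgb_params = {
    "objective": "binary",
    "metric": "auc",
    "scale_pos_weight": weight_after_sampling,
}
```

The reasoning is that keeping 5% of the negatives shrinks the negative-to-positive ratio by the same factor, so a weight that was chosen in proportion to that ratio should shrink with it.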

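Rank 3's solution

It does not mention resampling at all, and the hyperparameters do not say whether class weights were used.

Rank 4: it is unclear whether sampling was used. The rank 4 solution is very complex and the team probably had powerful machines, so they may not have needed it, but I am not sure. I later checked rank 4's source code on GitHub and found that scale_pos_weight defaults to 32 there.

Rank 5

The author shared the LightGBM parameters: scale_pos_weight is not set, and downsampling is not mentioned. My guess is that none was used.

Rank 6

Trained on the full data with scale_pos_weight=400.

Rank 11

Explicitly trained on the full data without scale_pos_weight, and still achieved a good result.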

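3. Home Credit Default Risk

8% positive rate, binary classification

Rank 1: no mention of sampling or scale_pos_weight.

The "LB 0.806" thread: judging from one of its code files, neither sampling nor scale_pos_weight was used.

Rank 2: no mention of sampling or scale_pos_weight.

Rank 3: no mention of sampling or scale_pos_weight.

Most likely nobody used them.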

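4. IEEE-CIS Fraud Detection

20663/590540 = 3.5%

Payment fraud detection

Rank 1: no mention of sampling or scale_pos_weight.

A post by one rank 1 team member states explicitly that scale_pos_weight was not used, and it does not mention sampling. Someone asked in the comments whether sampled data had been used to speed up training, and the answer was no.

The rank 1 team member who carried the team said he tried upsampling, implemented as a weight of 2.5 on the positives. It helped CatBoost but not XGBoost or LightGBM, and he did not use CatBoost in the end.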
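
"Upsampling by weight" as described above can be illustrated with per-row weights. The sketch below is plain Python with hypothetical helper names (`positive_sample_weights`, `weighted_log_loss`). It only shows why a weight on the positives acts like duplicating them in the loss; it is not the team's code. To my knowledge the scikit-learn style wrappers of LightGBM, XGBoost and CatBoost accept such a list through the `sample_weight` argument of `fit`.

```python
import math


def positive_sample_weights(labels, pos_weight=2.5):
    """Per-row weights: pos_weight for positives, 1.0 for negatives."""
    return [pos_weight if y == 1 else 1.0 for y in labels]


def weighted_log_loss(labels, probs, weights):
    """Weighted binary log loss, normalised by the total weight."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


labels = [1, 0, 0, 0]
probs = [0.3, 0.2, 0.1, 0.4]
weights = positive_sample_weights(labels, pos_weight=2.5)

# Integer weights are equivalent to duplicating rows: weight 2 on the positive
# gives the same loss as listing that positive twice with weight 1.
loss_weighted = weighted_log_loss(labels, probs, positive_sample_weights(labels, 2))
loss_duplicated = weighted_log_loss(
    [1, 1, 0, 0, 0], [0.3, 0.3, 0.2, 0.1, 0.4], [1.0] * 5
)
```

A fractional weight such as 2.5 has no exact row-duplication counterpart, which is one practical reason to weight rather than physically copy rows.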

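One highly upvoted post is about downsampling. Its author says downsampling helps to find good features and test hypotheses quickly, while the final training still uses the full dataset. After downsampling, the score drops only slightly (some information is lost?).

Rank 21's post does not mention the topic, and I saw no scale parameters in the code.

Rank 2: not mentioned.

Rank 9: not mentioned. The author describes a forward-style feature selection that is very time-consuming and notes that downsampling can be used to speed it up.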
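
A forward-style selection evaluated on a downsampled subset, as mentioned for rank 9, might look like the sketch below. Everything here is an assumption made for illustration: the helper names, the feature names and the toy gains are invented, and `toy_score` stands in for "validation AUC of a model trained on the downsampled rows with these features".

```python
import random


def forward_select(candidates, score_fn, min_gain=1e-4):
    """Greedy forward selection: add the feature that improves the score most."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score - best_score < min_gain:
            break  # no remaining feature helps enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def downsample_rows(n_rows, fraction, seed=0):
    """Pick a random subset of row indices to evaluate candidates on."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), max(1, round(n_rows * fraction))))


# Invented per-feature gains; a real score function would train a model on
# the downsampled rows and return its validation AUC.
toy_gain = {"amount": 0.05, "card_id": 0.03, "noise": 0.0}


def toy_score(features):
    return 0.80 + sum(toy_gain[f] for f in features)


rows = downsample_rows(n_rows=1000, fraction=0.1, seed=1)
chosen, score = forward_select(list(toy_gain), toy_score)
```

Each round trains one model per remaining candidate, so the cost grows quickly with the number of features. Evaluating on a fraction of the rows makes every one of those fits cheaper, and the final model can still be trained on the full data.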

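I looked at 4 heavily imbalanced competitions, all evaluated with ranking metrics such as AUC. Top solutions rarely use imbalance-handling techniques. Only TalkingData saw label weights being used, yet some high-ranking solutions there did without them, and a major reason for downsampling in TalkingData was speed. Where upsampling was used, it was mostly implemented through scale_pos_weight.

So my view is that what matters is whether the absolute number of positives is sufficient, not whether the positive rate is too small. With enough positives, no imbalance handling is needed.

As for threshold metrics, my first thought was that even accuracy could be handled by changing the cutoff. On reflection, moving the cutoff does not fix accuracy, nor precision. It does work for recall. Any metric that mixes positive and negative counts has this problem, because the majority class dominates. It is better to stick with ranking metrics, which is what all of these competitions used.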
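
The point about metrics can be checked on a toy example. The sketch below (plain Python, invented scores, hypothetical helper names) keeps the same scorer and the same positives but repeats every negative 10 times, which lowers the positive rate without changing how well positives are ranked against negatives.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(labels, scores, cutoff):
    """Accuracy, precision and recall at a given score cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = len(labels) - tp - fp - fn
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy scores: 2 positives, 4 negatives.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]

# Same scorer, but every negative repeated 10 times (a lower positive rate).
labels_10x = [1, 1] + [0] * 40
scores_10x = [0.9, 0.4] + [0.6, 0.3, 0.2, 0.1] * 10

auc_small, auc_big = auc(labels, scores), auc(labels_10x, scores_10x)
acc_small, prec_small, rec_small = confusion_metrics(labels, scores, 0.35)
acc_big, prec_big, rec_big = confusion_metrics(labels_10x, scores_10x, 0.35)

# A cutoff above every score predicts "negative" for all rows.
acc_all_negative = confusion_metrics(labels_10x, scores_10x, 1.1)[0]
```

In this toy case AUC and recall are identical on both datasets, precision falls as the negatives multiply, and on the more imbalanced set predicting everything as negative scores a higher accuracy than the cutoff that actually finds both positives.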

Silver Death
