Should one-hot encoding produce unique count minus 1 new features?

January 20, 2022


If a feature has a unique count of 4, does one-hot encoding turn it into 3 features or 4 features?

By default, sklearn.preprocessing.OneHotEncoder generates 4 features. Its drop parameter controls whether one category is discarded, which leaves 3 features.
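A minimal sketch of the two behaviours, assuming scikit-learn and NumPy are installed. The colour values are made-up sample data with a unique count of 4, and drop="first" is just one of the values the drop parameter accepts.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with 4 unique values
X = np.array([["red"], ["green"], ["blue"], ["yellow"], ["red"]])

# Default: one output column per category
full = OneHotEncoder().fit_transform(X).toarray()

# drop="first": the first category (in sorted order) gets no column
dropped = OneHotEncoder(drop="first").fit_transform(X).toarray()

print(full.shape)     # (5, 4)
print(dropped.shape)  # (5, 3)
```

With drop="first", a row belonging to the dropped category is encoded as all zeros.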

Whether to use 4 features or 3 depends on the application.

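With 4 features, any 3 of them determine the fourth: the 4 features always sum to 1, so there are effectively only 3 degrees of freedom, as in x4 = 1 - x1 - x2 - x3. This is perfect collinearity with the constant (intercept) term, so an unregularized generalized linear model with an intercept runs into trouble if all 4 variables are included: its coefficients are no longer uniquely determined.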
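The collinearity can be checked numerically. The sketch below assumes NumPy and builds the one-hot matrix by hand from made-up integer codes; it compares the rank of the design matrix (intercept column plus indicators) with and without the first category.

```python
import numpy as np

# Made-up category codes for a feature with 4 levels
codes = np.array([0, 1, 2, 3, 0, 1, 2, 3])
onehot = np.eye(4)[codes]            # shape (8, 4), every row sums to 1
intercept = np.ones((len(codes), 1))

# Intercept + all 4 indicators: 5 columns, but one is redundant
X_full = np.hstack([intercept, onehot])
# Intercept + 3 indicators (first level dropped): 4 independent columns
X_dropped = np.hstack([intercept, onehot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)        # 5 4
print(X_dropped.shape[1], rank_dropped)  # 4 4
```

A rank lower than the number of columns means the least-squares solution is not unique, which is the problem described above.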

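For an embedding, on the other hand, all 4 categories are needed. In practice this is usually done with the Keras Embedding layer.

Its input_dim is the original unique count, so there are 4 features.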
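Why an embedding needs one row per category can be sketched in plain NumPy. This is not the Keras layer itself: the lookup table below only stands in for the layer's weight matrix, and output_dim = 2 and the random weights are arbitrary choices for illustration.

```python
import numpy as np

unique_count = 4  # plays the role of input_dim
output_dim = 2    # hypothetical embedding size

# Stand-in for the embedding weights: one row per category
rng = np.random.default_rng(0)
table = rng.normal(size=(unique_count, output_dim))

codes = np.array([0, 3, 1, 2, 3])     # integer-encoded categories
onehot = np.eye(unique_count)[codes]  # full 4-column one-hot encoding

looked_up = table[codes]     # embedding as a row lookup
via_matmul = onehot @ table  # same result as one-hot times weights

print(np.allclose(looked_up, via_matmul))
```

Each of the 4 categories gets its own row, so dropping a level would leave that category without a learned vector.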

For algorithms such as xgboost, keeping all 4 features is not a problem either.

As for the downside of dropping one categorical level, I did not really understand sklearn's explanation:

【However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.】
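One possible reading of this passage, offered as an interpretation rather than as sklearn's own example: with a penalty, the dropped level is absorbed into the unpenalized intercept while the remaining levels are shrunk toward it, so the fit depends on which level happens to be dropped. The sketch below assumes NumPy, uses a hand-written ridge solver (ridge_fit_predict is a hypothetical helper, not a library function), and runs on made-up data.

```python
import numpy as np

def ridge_fit_predict(indicators, y, lam=1.0):
    """Ridge regression with an unpenalized intercept; returns fitted values."""
    n = indicators.shape[0]
    X = np.column_stack([np.ones(n), indicators])
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalized
    w = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    return X @ w

# Made-up data: 4 levels, 2 observations each
codes = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 10.0, 10.0])
onehot = np.eye(4)[codes]

pred_full = ridge_fit_predict(onehot, y)               # all 4 levels kept
pred_drop_first = ridge_fit_predict(onehot[:, 1:], y)  # level 0 dropped
pred_drop_last = ridge_fit_predict(onehot[:, :3], y)   # level 3 dropped

# One fitted value per level, for each encoding
print(pred_full[::2])
print(pred_drop_first[::2])
print(pred_drop_last[::2])
```

The two dropped encodings give different fitted values for the same data, whereas the full encoding treats all levels the same way.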

Overall, it is usually fine for one-hot encoding to produce as many new features as the unique count.
