Three commonly used encoders in Python's sklearn library, with examples

Author: LoveAndProgram | Updated: 2022-11-05 | Category: Programming languages

OneHotEncoder: one-hot encoding example

class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

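Purpose: encode categorical features as a one-hot numeric array.
Input: an array-like of integers or strings, representing the values taken by categorical (discrete) features.
The encoder creates one binary column per category and returns a sparse matrix or a dense array, depending on the sparse parameter. By default it derives the categories from the unique values in each feature; they can also be specified manually through the categories parameter.
It works well with GBDT, XGBoost, and LightGBM models.
Note: in recent versions of sklearn the input must be a 2D array of shape (n_samples, n_features). A single column has to be converted with reshape(-1, 1) and a single sample with reshape(1, -1); otherwise the encoder raises ValueError: Expected 2D array, got 1D array instead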
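The reshape requirement is easiest to see on a tiny input. The following is a minimal sketch on made-up data (the colors array is an assumption for illustration, not part of the article's dataset): a single 1D column is reshaped to one column of samples before it is encoded.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with three distinct values (toy data).
colors = np.array(["red", "green", "blue", "green"])

enc = OneHotEncoder()
# The encoder expects shape (n_samples, n_features), so the 1D column
# is reshaped to (4, 1) before fitting.
onehot = enc.fit_transform(colors.reshape(-1, 1)).toarray()

print(enc.categories_)  # categories are stored in sorted order
print(onehot)           # one row per sample, one binary column per category
```

Each row contains exactly one 1, in the column of the category that the sample belongs to.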

Take the following data as an example (data source):

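from sklearn.preprocessing import OneHotEncoder
import pandas as pd

train = pd.read_csv('./train.csv')
# handle_unknown='ignore': categories not seen during fit are encoded as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
# Columns to encode (the list keeps its original name; every column is treated as categorical)
numerical_feature = ['policy_annual_premium','insured_education_level','capital-gains','incident_type','incident_severity',
                     'property_damage','bodily_injuries','police_report_available','total_claim_amount','injury_claim','property_claim','vehicle_claim']
data = train[numerical_feature]
# data is already 2D (n_samples, n_features), so it is passed without reshaping;
# reshape(1, -1) would flatten the whole table into a single sample
c = enc.fit_transform(data)
c.toarray()  # view the encoded data as a dense array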
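The effect of handle_unknown='ignore' in the code above can be sketched on a small example. The category names below are hypothetical stand-ins, not values confirmed from train.csv: a category that was not present during fit is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, used only for illustration.
train_col = np.array([["Minor Damage"], ["Major Damage"], ["Total Loss"]])
new_col = np.array([["Major Damage"], ["Trivial Damage"]])  # second value is unseen

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_col)

# The known category gets its usual one-hot row; the unseen one becomes all zeros.
encoded_new = enc.transform(new_col).toarray()
print(encoded_new)
```

With the default handle_unknown='error', the same transform call would raise a ValueError for the unseen category.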

After preprocessing, the input data has this format:

After encoding, the data (shown as tuples because it is too large) consists entirely of the binary values 0 and 1:

Note: in the one-vs-rest case, the y labels need to be converted from multiclass labels to binary labels with sklearn.preprocessing.LabelBinarizer().
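As a rough illustration of that note, the sketch below binarizes a made-up multiclass target (the animal labels are an assumption, chosen only to keep the output small). Each class gets its own 0/1 column, which is the form a one-vs-rest setup works with.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy multiclass target (not from the article's dataset).
y = ["cat", "dog", "bird", "dog"]

lb = LabelBinarizer()
y_bin = lb.fit_transform(y)          # dense array, one column per class
restored = lb.inverse_transform(y_bin)  # maps the binary rows back to labels

print(lb.classes_)
print(y_bin)
```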

LabelEncoder: label encoding example

Purpose: encode target labels with values between 0 and n_classes-1.
The input can be numeric or non-numeric labels. Note that the return type is a NumPy array, whereas OneHotEncoder() above returns a sparse matrix.
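The difference in return types can be checked directly. This is a small sketch on invented labels, assuming both encoders are used with their default settings:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(["YES", "NO", "?", "NO"])  # toy labels

# LabelEncoder takes the 1D array as-is and returns a NumPy array of integer codes.
codes = LabelEncoder().fit_transform(labels)

# OneHotEncoder needs 2D input and, by default, returns a SciPy sparse matrix.
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1))

print(type(codes), codes)
print(sparse.issparse(onehot), onehot.shape)
```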

import pandas as pd
from sklearn.preprocessing import LabelEncoder

Enc = LabelEncoder()

def yuchuli(data):
    # Label-encode each listed column of `data` and collect the results in a new DataFrame
    numerical_feature = ['policy_annual_premium','insured_education_level','capital-gains','incident_type','incident_severity',
                         'property_damage','bodily_injuries','police_report_available','total_claim_amount','injury_claim','property_claim','vehicle_claim','auto_year']
    encoded = pd.DataFrame()
    for fea in numerical_feature:
        # Enc is refitted on every column, so after the loop it holds the classes of the last one
        encoded.insert(len(encoded.columns), fea, Enc.fit_transform(data[fea].values))
    return encoded

train_data = yuchuli(train)

After encoding, the data looks like this:

The clearest case is the property_damage column (shown in bold): after label encoding, ? maps to 0, No to 1, and Yes to 2.

LabelEncoder() has only one attribute, classes_, which holds the label of each class. Inspecting it after the code above shows the classes of the last feature that was encoded; in plain terms, these are the distinct values that had to be encoded:

As expected, because the encoder is refitted inside a loop, the classes belong to the last feature, auto_year. The original data is shown below:

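Note: the figure below makes clear what was meant at the start by encoded values between 0 and n_classes-1. A column with n distinct values has n classes, numbered from 0 to n-1, because the numbering starts at 0.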
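The relation between classes_ and the range of codes can be verified on a few values. The years below are invented for illustration and only mimic a column such as auto_year:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for a year column (values are assumptions).
years = np.array([2004, 2007, 2004, 2015, 2007])

le = LabelEncoder()
codes = le.fit_transform(years)
n_classes = len(le.classes_)

print(le.classes_)            # the distinct values, in sorted order
print(codes)                  # position of each value within classes_
print(n_classes, codes.max()) # largest code is n_classes - 1
```

Three distinct values give three classes, and the codes run from 0 to 2.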

That said, I rarely use LabelEncoder; I generally handle discrete features with one-hot encoding.

OrdinalEncoder: feature encoding example

Purpose: encode categorical features as an integer array.
Input: an array-like of integers or strings, representing the values taken by categorical (discrete) features. The features are converted to ordinal integers.

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
train = train.drop_duplicates()  # drop_duplicates returns a new DataFrame, so keep the result
Enc = OrdinalEncoder()

def yuchuli(data_train):
    numerical_feature = ['incident_severity', 'insured_hobbies', 'vehicle_claim', 'auto_model', 'insured_education_level', 'insured_zip', 'insured_relationship', 'incident_date','auto_year']
    data = pd.DataFrame()
    for fea in numerical_feature:
        # OrdinalEncoder needs 2D input, so each column is reshaped to (n_samples, 1)
        codes = Enc.fit_transform(data_train[fea].values.reshape(-1, 1))
        data.insert(len(data.columns), fea, codes.ravel())
    return data

train_data = yuchuli(train)

When I printed the result for each feature, however, I found it almost identical to what LabelEncoder() produces. With OrdinalEncoder, the encoded categories are inspected through categories_.
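That observation can be reproduced on a small column. The sketch below uses hypothetical severity values and default encoder settings; it compares the codes from both encoders and the attributes that store the learned categories.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical values for one categorical column.
severity = np.array(["Minor Damage", "Total Loss", "Major Damage", "Minor Damage"])

le = LabelEncoder()
oe = OrdinalEncoder()
label_codes = le.fit_transform(severity)                    # shape (4,), integer codes
ordinal_codes = oe.fit_transform(severity.reshape(-1, 1))   # shape (4, 1), float64 codes

print(label_codes, le.classes_)
print(ordinal_codes.ravel(), oe.categories_)
```

The numeric codes agree; what differs is the shape and dtype of the output, and the attribute name (classes_ for LabelEncoder, categories_ with one array per feature for OrdinalEncoder).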

In short, the resulting values are the same but the types differ. I learned the essential difference between the two from this article:

OrdinalEncoder is for 2D data of shape (n_samples, n_features)
LabelEncoder is for 1D data of shape (n_samples,)

As for why, the two code samples above show it: to call fit_transform with OrdinalEncoder, the data must first be converted to 2D with .reshape(-1, 1), just as with OneHotEncoder, whereas LabelEncoder takes the 1D data directly. This is also the most common source of errors in practice.
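The shape requirement can be demonstrated by passing the same 1D array to both encoders. This is a minimal sketch on toy values; the exact wording of the error message may vary between sklearn versions, so it is only printed, not relied on.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

values = np.array(["a", "b", "a"])  # 1D toy data

# LabelEncoder accepts the 1D array directly.
label_codes = LabelEncoder().fit_transform(values)

# OrdinalEncoder rejects 1D input with a ValueError.
try:
    OrdinalEncoder().fit_transform(values)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Reshaping to a single column of shape (3, 1) resolves it.
ordinal_codes = OrdinalEncoder().fit_transform(values.reshape(-1, 1))

print(label_codes)
print(error_message)
print(ordinal_codes)
```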

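Two more points about OrdinalEncoder are worth noting. First, it accepts np.nan missing values, so you can decide whether to handle them according to your needs. Second, it has the parameter handle_unknown='error' (the default), which raises an error when transform encounters an unknown category; with handle_unknown='use_encoded_value', unknown categories are instead encoded as the value given by unknown_value.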

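# Encode all missing values as -1.
# handle_unknown='use_encoded_value' also requires unknown_value; -2 is an assumed placeholder.
# The column passed to fit_transform is only an example; any 2D feature data works.
Enc.set_params(encoded_missing_value=-1, handle_unknown='use_encoded_value', unknown_value=-2).fit_transform(train[['incident_severity']])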

Original link: https://juejin.cn/post/7142325338553450503
