Missing-Value Preprocessing with Python sklearn and pandas: A Step-by-Step Walkthrough

Author: talle2021 | Updated: 2022-11-16

Note: the code was run in a Jupyter notebook. In each step the code comes first and its output follows.

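1. Import the libraries and generate data with missing values

Use pandas to build a DataFrame with 6 rows and 4 columns named 'col1', 'col2', 'col3', and 'col4', then set two of its entries to missing values (NaN).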

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Generate a 6x4 DataFrame of standard-normal random values
df = pd.DataFrame(np.random.randn(6, 4), columns=['col1', 'col2', 'col3', 'col4'])
# Introduce two missing values
df.iloc[1:2, 1] = np.nan  # row index 1, col2
df.iloc[4, 3] = np.nan    # row index 4, col4
df

       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552       NaN -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038       NaN
5 -0.017370 -0.538245 -2.083904  0.230733
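
Because np.random.randn draws new numbers on every run, the values above will differ on your machine. The sketch below shows one way to make the data reproducible. The seed 42 and the name df_seeded are arbitrary choices for illustration, not part of the original article.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same numbers on every run (the seed is arbitrary)
rng = np.random.default_rng(42)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         columns=['col1', 'col2', 'col3', 'col4'])

# Same two missing entries as in the article
df_seeded.iloc[1, 1] = np.nan
df_seeded.iloc[4, 3] = np.nan

print(df_seeded)
```

The rest of the walkthrough works the same way on df_seeded; only the numbers change.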

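2. Check which values are missing (here row 2, column 2 and row 5, column 4, counting from 1)

nan_all = df.isnull()  # boolean mask that is True wherever the data is NaN
nan_all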

    col1   col2   col3   col4
0  False  False  False  False
1  False   True  False  False
2  False  False  False  False
3  False  False  False  False
4  False  False  False   True
5  False  False  False  False
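
On a larger table, scanning a boolean mask by eye is not practical. As a minimal sketch, the mask can be summarized into per-column counts and a list of (row, column) positions. The frame df_demo and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hand-written frame (illustrative values, not the article's random data)
df_demo = pd.DataFrame({
    'col1': [1.0, 2.0, 3.0],
    'col2': [4.0, np.nan, 6.0],
    'col3': [7.0, 8.0, np.nan],
})

# Number of missing values in each column
missing_per_column = df_demo.isnull().sum()

# Zero-based (row, column) positions of every missing value
rows, cols = np.where(df_demo.isnull().to_numpy())
missing_positions = [(int(r), int(c)) for r, c in zip(rows, cols)]

print(missing_per_column)
print(missing_positions)
```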

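3. Use any() to find columns containing at least one missing value, and all() to find columns that are entirely missing

# Using any()
nan_col1 = df.isnull().any()  # columns that contain at least one NaN
print(nan_col1)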

col1    False
col2     True
col3    False
col4     True
dtype: bool

# Using all()
nan_col2 = df.isnull().all()  # columns in which every value is NaN
print(nan_col2)

col1    False
col2    False
col3    False
col4    False
dtype: bool
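
In the article's data no column is entirely missing, so all() returns False everywhere. The sketch below uses a made-up frame (df_demo, with a col3 that is entirely NaN) to show a case where any() and all() disagree, and how to turn the boolean result into a list of column names.

```python
import numpy as np
import pandas as pd

# Illustrative frame: col2 has one NaN, col3 is entirely NaN
df_demo = pd.DataFrame({
    'col1': [1.0, 2.0, 3.0],
    'col2': [4.0, np.nan, 6.0],
    'col3': [np.nan, np.nan, np.nan],
})

mask = df_demo.isnull()

# Names of columns with at least one NaN
cols_with_any = df_demo.columns[mask.any()].tolist()

# Names of columns where every value is NaN
cols_all_nan = df_demo.columns[mask.all()].tolist()

print(cols_with_any)
print(cols_all_nan)
```

Passing axis=1 to any() or all() applies the same check per row instead of per column.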

4. Method 1: drop the missing values

df1 = df.dropna()  # drop every row that contains a NaN
df1

       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
5 -0.017370 -0.538245 -2.083904  0.230733
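
Dropping every row with any NaN can discard a lot of data. As a sketch of some gentler alternatives, dropna also accepts axis, thresh, and subset. The frame df_demo below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with NaNs concentrated in the first rows
df_demo = pd.DataFrame({
    'col1': [1.0, 2.0, 3.0, 4.0],
    'col2': [np.nan, 5.0, 6.0, 7.0],
    'col3': [np.nan, np.nan, 8.0, 9.0],
})

# Drop columns (instead of rows) that contain any NaN
drop_cols = df_demo.dropna(axis=1)

# Keep only rows that have at least 2 non-missing values
keep_thresh = df_demo.dropna(thresh=2)

# Only consider col3 when deciding which rows to drop
drop_subset = df_demo.dropna(subset=['col3'])

print(drop_cols)
print(keep_thresh)
print(drop_subset)
```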

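5. Method 2: replace missing values with a specific value using sklearn

First create an imputer object with SimpleImputer. By default it replaces missing values with the column mean, i.e. strategy='mean'. The median ('median') and the mode ('most_frequent') are also available, as is a fixed value via strategy='constant' together with fill_value. Then call the imputer's fit_transform on df. The code is as follows: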

# Replace missing values with specific values using sklearn
nan_mean = SimpleImputer(strategy='mean')      # fill with the column mean
nan_median = SimpleImputer(strategy='median')  # fill with the column median
nan_0 = SimpleImputer(strategy='constant', fill_value=0)  # fill with 0
# Fit each imputer and apply it (the result is a NumPy array, not a DataFrame)
nan_mean_result = nan_mean.fit_transform(df)
nan_median_result = nan_median.fit_transform(df)
nan_0_result = nan_0.fit_transform(df)
print(nan_mean_result)
print(nan_median_result)
print(nan_0_result)

 [-0.48014389  1.46399462  0.45481856 -1.53141863]
 [-0.4185523  -0.22575384 -0.93125874 -0.53484561]
 [-0.02808329 -0.42039426  0.925346    0.97579191]
 [-0.14406438 -0.81156913 -0.0134516   0.11048025]
 [-0.96649028 -0.82255505  0.22803842 -0.14985173]
 [-0.01737047 -0.53824538 -2.0839036   0.23073341]

 [-0.48014389  1.46399462  0.45481856 -1.53141863]
 [-0.4185523  -0.53824538 -0.93125874 -0.53484561]
 [-0.02808329 -0.42039426  0.925346    0.97579191]
 [-0.14406438 -0.81156913 -0.0134516   0.11048025]
 [-0.96649028 -0.82255505  0.22803842  0.11048025]
 [-0.01737047 -0.53824538 -2.0839036   0.23073341]

 [-0.48014389  1.46399462  0.45481856 -1.53141863]
 [-0.4185523   0.         -0.93125874 -0.53484561]
 [-0.02808329 -0.42039426  0.925346    0.97579191]
 [-0.14406438 -0.81156913 -0.0134516   0.11048025]
 [-0.96649028 -0.82255505  0.22803842  0.        ]
 [-0.01737047 -0.53824538 -2.0839036   0.23073341]
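
Two practical points follow from the output above: fit_transform returns a plain NumPy array, so the column names are lost, and a fitted imputer can be reused on new data. The sketch below wraps the result back into a DataFrame and applies statistics learned on one table to another. The frames train and test and their values are made up for illustration; the article itself works on a single df.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data: "train" is used to learn the fill values, "test" only receives them
train = pd.DataFrame({'a': [1.0, 2.0, np.nan, 3.0],
                      'b': [10.0, np.nan, 10.0, 20.0]})
test = pd.DataFrame({'a': [np.nan], 'b': [np.nan]})

# Learn the column means from train only
imputer = SimpleImputer(strategy='mean')
imputer.fit(train)

# transform returns a NumPy array; rebuild a DataFrame to keep labels
train_filled = pd.DataFrame(imputer.transform(train),
                            columns=train.columns, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test),
                           columns=test.columns, index=test.index)

# The mode strategy mentioned in the text
mode_imputer = SimpleImputer(strategy='most_frequent')
mode_filled = pd.DataFrame(mode_imputer.fit_transform(train),
                           columns=train.columns, index=train.index)

print(imputer.statistics_)  # the learned per-column means
print(train_filled)
print(test_filled)
print(mode_filled)
```

Fitting on one table and transforming another keeps the fill values from depending on data the model should not see, which matters when the imputer is part of a train/test workflow.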

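6. Method 3: replace missing values with a specific value using pandas

pandas handles missing values with df.fillna(). Its two main parameters are value and method. The former replaces missing values with a fixed or manually specified value; the latter fills them using a strategy built into pandas. Note that pandas 2.1 and later deprecate the method parameter in favor of df.ffill() and df.bfill(). The options supported by method are: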

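(1) pad and ffill: replace a missing value with the preceding value

(2) backfill and bfill: replace a missing value with the following value

(3) In practice, filling with the mean, mode, or median (passed through value rather than method) is the most common choice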

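# Replace missing values with specific values using pandas
# Note: fillna(method=...) is deprecated in pandas 2.1+; df.bfill() / df.ffill() are the replacements
nan_result_pd1 = df.fillna(method='backfill')        # fill with the next valid value
nan_result_pd2 = df.fillna(method='bfill', limit=1)  # same, but fill at most 1 consecutive NaN per gap
nan_result_pd3 = df.fillna(method='pad')             # fill with the previous valid value
nan_result_pd4 = df.fillna(0)                        # fill with 0
nan_result_pd5 = df.fillna({'col2': 1.1, 'col4': 1.2})  # fill col2 with 1.1 and col4 with 1.2
nan_result_pd6 = df.fillna(df.mean()['col2':'col4'])    # fill with each column's mean
nan_result_pd7 = df.fillna(df.median()['col2':'col4'])  # fill with each column's median
print(nan_result_pd1)
print(nan_result_pd2)
print(nan_result_pd3)
print(nan_result_pd4)
print(nan_result_pd5)
print(nan_result_pd6)
print(nan_result_pd7)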

       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552 -0.420394 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038  0.230733
5 -0.017370 -0.538245 -2.083904  0.230733
       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552 -0.420394 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038  0.230733
5 -0.017370 -0.538245 -2.083904  0.230733
       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552  1.463995 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038  0.110480
5 -0.017370 -0.538245 -2.083904  0.230733
       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552  0.000000 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038  0.000000
5 -0.017370 -0.538245 -2.083904  0.230733
       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552  1.100000 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038  1.200000
5 -0.017370 -0.538245 -2.083904  0.230733
       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552 -0.225754 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038 -0.149852
5 -0.017370 -0.538245 -2.083904  0.230733
       col1      col2      col3      col4
0 -0.480144  1.463995  0.454819 -1.531419
1 -0.418552 -0.538245 -0.931259 -0.534846
2 -0.028083 -0.420394  0.925346  0.975792
3 -0.144064 -0.811569 -0.013452  0.110480
4 -0.966490 -0.822555  0.228038  0.110480
5 -0.017370 -0.538245 -2.083904  0.230733
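
In the data above each gap is a single NaN, so limit=1 makes no visible difference. The sketch below uses a made-up column with two consecutive NaNs to show what limit actually does, written with the dedicated ffill() and bfill() methods, which newer pandas versions recommend over fillna(method=...). The frame df_demo is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative column with a gap of two consecutive NaNs
df_demo = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0]})

# Forward fill: carry the previous valid value forward
forward = df_demo.ffill()

# Backward fill: pull the next valid value backward
backward = df_demo.bfill()

# limit=1 fills at most one consecutive NaN in each gap, so one NaN remains
backward_limited = df_demo.bfill(limit=1)

print(forward)
print(backward)
print(backward_limited)
```

With limit=1 only the NaN directly before the value 4.0 is filled; the one at index 1 stays missing.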

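In addition, if you simply want to substitute one specific value, pandas' replace is another option: in this example, df.replace(np.nan, 0) works directly. It is a blunt approach, but it gets the job done. Of course, replace exists to handle substitutions of all kinds; missing values are just one application.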
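
As a quick sketch of both directions, the example below checks that replace(np.nan, 0) matches fillna(0), and also uses replace the other way round to turn a placeholder into NaN. The frame df_demo and the sentinel value -999.0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame; -999.0 stands in for a "missing" placeholder used by some data sources
df_demo = pd.DataFrame({'x': [1.0, np.nan, -999.0],
                        'y': [np.nan, 2.0, 3.0]})

# Filling NaN with 0: replace and fillna give the same result here
via_replace = df_demo.replace(np.nan, 0)
via_fillna = df_demo.fillna(0)
same_result = via_replace.equals(via_fillna)

# The reverse direction: turn the placeholder into a real NaN
cleaned = df_demo.replace(-999.0, np.nan)

print(same_result)
print(cleaned)
```

Converting placeholders to NaN first lets isnull(), dropna(), and fillna() treat them like any other missing value.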

Original article: https://blog.csdn.net/weixin_60200880/article/details/126988094
