
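Python pandas: Finding and Removing Duplicate Data, with Examples

Author: william_cheng666  Updated: 2022-09-03  Category: Programming Languages

Preface

When processing data with pandas, we often run into duplicated records. Finding the duplicates so that their cause can be analyzed, or simply removing them, is a key step in data cleaning. pandas provides two convenient methods for this: duplicated() and drop_duplicates().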

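1. duplicated()

duplicated() is available on three pandas objects: pandas.DataFrame.duplicated, pandas.Series.duplicated, and pandas.Index.duplicated. They are used in much the same way. The first two return a boolean Series, and the last returns a boolean numpy.ndarray.

DataFrame.duplicated(subset=None, keep='first')

subset: defaults to None. A column label or sequence of column labels to consider when identifying duplicates; with None, all columns are compared.

keep: defaults to 'first'. Determines which occurrences are marked as duplicates:

'first': mark every duplicate as True except the first occurrence
'last': mark every duplicate as True except the last occurrence
False: mark all duplicated items as True, including the first occurrence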

Series.duplicated(keep='first')

keep: same as keep in DataFrame.duplicated

Index.duplicated(keep='first')

keep: same as keep in DataFrame.duplicated
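
The difference in return types between the Series and Index variants can be checked directly. The following is a minimal sketch; the sample values in s and idx are made up for illustration and do not come from the article's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
s = pd.Series(['cup', 'cup', 'pack', 'cup'])
idx = pd.Index(['a', 'b', 'a'])

series_mask = s.duplicated()   # boolean Series aligned with s
index_mask = idx.duplicated()  # boolean numpy.ndarray

print(type(series_mask).__name__)  # Series
print(type(index_mask).__name__)   # ndarray
print(series_mask.tolist())        # [False, True, False, True]
print(index_mask.tolist())         # [False, False, True]
```

Because the Index variant returns a plain array, it carries no labels of its own; it is typically used as a positional mask on the object that owns the index.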

Example:

import pandas as pd
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df

     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

df.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool

df.duplicated(keep='last')

0     True
1    False
2    False
3    False
4    False
dtype: bool

df.duplicated(keep=False)

0     True
1     True
2    False
3    False
4    False
dtype: bool

df.duplicated(subset=['brand'])

0    False
1     True
2    False
3     True
4     True
dtype: bool
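
To inspect the duplicated rows themselves rather than only the mask, the boolean Series can be used to filter the DataFrame. This is a small sketch built on the same sample data; the variable names dup_rows and n_extra are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# keep=False marks every member of a duplicate group, so the mask
# selects all copies rather than only the later ones.
dup_rows = df[df.duplicated(keep=False)]
print(dup_rows)

# Count the rows that are repeats of an earlier row (default keep='first').
n_extra = int(df.duplicated().sum())
print(n_extra)  # 1
```

Looking at all copies side by side is usually the quickest way to work out why the duplicates appeared in the first place.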

Marking duplicates in an Index:

df = df.set_index('brand')
df

        style  rating
brand
Yum Yum   cup     4.0
Yum Yum   cup     4.0
Indomie   cup     3.5
Indomie  pack    15.0
Indomie  pack     5.0

df.index.duplicated()

array([False,  True, False,  True,  True])
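
One common use of this array is to drop rows whose index label has already appeared. The sketch below rebuilds the same sample data so that it runs on its own; first_per_brand is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
}).set_index('brand')

# ~ inverts the mask, so only the first row for each index label is kept.
first_per_brand = df[~df.index.duplicated(keep='first')]
print(first_per_brand)
```

This compares index labels only, so rows that differ in their column values are still dropped when their label repeats.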

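2. drop_duplicates()

drop_duplicates() works like duplicated(), except that it removes the duplicated rows directly instead of marking them. Only the parameters whose meaning differs are described below.

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

subset: same as in duplicated()
keep: same as in duplicated(); here it determines which occurrence is retained
inplace: same as inplace in other pandas functions; chooses between modifying the existing object and returning a new one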
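
The subset and keep parameters can be combined. The following sketch, using the same sample data as the rest of the article, shows two variations; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# Compare on 'brand' only and keep the last row seen for each brand.
last_per_brand = df.drop_duplicates(subset=['brand'], keep='last')
print(last_per_brand)

# keep=False drops every row whose (brand, style) pair occurs more than once.
no_repeats = df.drop_duplicates(subset=['brand', 'style'], keep=False)
print(no_repeats)
```

The surviving rows keep their original index labels; chain .reset_index(drop=True) if a fresh 0..n-1 index is wanted.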

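Compared with Series.duplicated(), Series.drop_duplicates() likewise adds an inplace parameter, which behaves as described above. Index.drop_duplicates() takes the same parameters as Index.duplicated(), so it is not covered again. Here is an example: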
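
Before the DataFrame example, here is a brief sketch of the Series and Index variants. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
s = pd.Series(['cup', 'cup', 'pack', 'cup'])
idx = pd.Index(['Yum Yum', 'Yum Yum', 'Indomie'])

unique_s = s.drop_duplicates()                 # keeps rows labelled 0 and 2
unique_idx = idx.drop_duplicates(keep='last')  # keeps the last 'Yum Yum'

print(unique_s.tolist())        # ['cup', 'pack']
print(unique_s.index.tolist())  # [0, 2]
print(unique_idx.tolist())      # ['Yum Yum', 'Indomie']
```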

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df

     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

df.drop_duplicates()

     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

df.drop_duplicates(inplace=True)

df

     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
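
One detail worth noting about inplace=True is that the call returns None, so its result should not be assigned back to a variable. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})

# With inplace=True the DataFrame is modified and nothing is returned.
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 4

# Equivalent without inplace: rebind the name to the returned copy.
df = df.drop_duplicates()
```

Writing df = df.drop_duplicates(inplace=True) would therefore replace df with None.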

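Summary

These notes are better than nothing. pandas has many handy functions, but learning them all systematically is unrealistic; they are best discovered, accumulated, and written down over the course of real projects.

Original article: https://blog.csdn.net/weixin_43887421/article/details/114926685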
