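Duplicate files are a recurring nuisance in everyday life: the same file may sit in both directory A and directory B, and, worse still, the copies may not even share a file name. With only a few files this is manageable; at worst you compare them by hand one by one, although even then it is hard to be sure your eyes caught everything. With many files, manual comparison becomes practically impossible. I have recently been reading 《Python UNIX和Linux系统管理指南》 (Python for Unix and Linux System Administration), which has a section on data comparison. The scripts below build on that section, adapted for practical use.
The script consists of four modules: diskwalk, checksum, find_dupes, and delete. diskwalk walks a directory tree: given a path, it returns every file under that path. checksum computes the MD5 digest of a file. find_dupes imports diskwalk and checksum and treats two files as identical when their size and MD5 digest both match. delete removes files. The details are as follows: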
1. diskwalk.py
    import os
    import sys


    class diskwalk(object):
        def __init__(self, path):
            self.path = path

        def paths(self):
            """Return the full path of every file under self.path."""
            path = self.path
            path_collection = []
            for dirpath, dirnames, filenames in os.walk(path):
                for file in filenames:
                    fullpath = os.path.join(dirpath, file)
                    path_collection.append(fullpath)
            return path_collection


    if __name__ == '__main__':
        for file in diskwalk(sys.argv[1]).paths():
            print(file)
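As a side note, the same traversal can also be written as a generator, which avoids building the whole list in memory when the tree is large. The following is only a minimal Python 3 sketch; the name `iter_files` is made up for this illustration and is not part of the original scripts.

```python
import os
import tempfile


def iter_files(root):
    """Yield the full path of every file under root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as fp:
            fp.write("demo")
    # Show paths relative to the root so the output is easy to read.
    found = sorted(os.path.relpath(p, root) for p in iter_files(root))
    print(found)
```

Whether a list or a generator is the better fit depends on the caller: find_dupes visits each path exactly once, so either form would work there.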
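2. checksum.py
    import hashlib
    import sys


    def create_checksum(path):
        """Return the MD5 digest (raw bytes) of the file at path."""
        fp = open(path, 'rb')  # binary mode, so the raw bytes are hashed
        checksum = hashlib.md5()
        while True:
            buffer = fp.read(8192)  # read the file in 8 KB chunks
            if not buffer:
                break
            checksum.update(buffer)
        fp.close()
        checksum = checksum.digest()
        return checksum


    if __name__ == '__main__':
        create_checksum(sys.argv[1])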
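To see why reading in chunks is safe, the sketch below checks that hashing a file piece by piece gives the same MD5 as hashing its whole content in one call. It is a Python 3 illustration only: `md5_of_file` is a hypothetical helper, and it returns `hexdigest()` (a readable hex string) rather than the raw `digest()` used in the script.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    checksum = hashlib.md5()
    with open(path, "rb") as fp:  # binary mode: hash the raw bytes
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            checksum.update(chunk)
    return checksum.hexdigest()


# Check that chunked hashing matches hashing the whole content at once.
data = os.urandom(20000)  # deliberately larger than one chunk
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    assert md5_of_file(tmp.name) == hashlib.md5(data).hexdigest()
finally:
    os.remove(tmp.name)
```

Chunked reading keeps memory use bounded by `chunk_size`, which matters when the directory contains very large files.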
3. find_dupes.py
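    from checksum import create_checksum
    from diskwalk import diskwalk
    from os.path import getsize
    import sys


    def findDupes(path):
        record = {}  # (size, md5) -> first file seen with that content
        dup = {}     # duplicate file -> original file
        d = diskwalk(path)
        files = d.paths()
        for file in files:
            compound_key = (getsize(file), create_checksum(file))
            if compound_key in record:
                dup[file] = record[compound_key]
            else:
                record[compound_key] = file
        return dup


    if __name__ == '__main__':
        for file in findDupes(sys.argv[1]).items():
            print("The duplicate file is %s" % file[0])
            print("The original file is %s\n" % file[1])
The findDupes function returns the dictionary dup, whose keys are the duplicate files and whose values are the corresponding original files. This answers a question many people have: how can you be sure that what the script reports really is a duplicate? Every reported file is paired with the file it duplicates, so each pair can be checked.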
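The shape of that dictionary is easiest to see on a tiny example. The following self-contained Python 3 sketch re-implements the same idea in one function (`find_dupes` here is a stand-in written for this illustration, not the module above) and runs it on three temporary files, two of which have identical content.

```python
import hashlib
import os
import tempfile


def find_dupes(root):
    """Map each duplicate path to the first path seen with the same content."""
    seen = {}   # (size, md5 digest) -> first path with that content
    dupes = {}  # duplicate path -> original path
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):  # sorted only to make the demo deterministic
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fp:
                # Whole-file read is an assumption that only suits small demo files.
                digest = hashlib.md5(fp.read()).digest()
            key = (os.path.getsize(path), digest)
            if key in seen:
                dupes[path] = seen[key]
            else:
                seen[key] = path
    return dupes


with tempfile.TemporaryDirectory() as root:
    contents = {"a.txt": b"same", "b.txt": b"same", "c.txt": b"different"}
    for name, data in contents.items():
        with open(os.path.join(root, name), "wb") as fp:
            fp.write(data)
    result = {os.path.basename(k): os.path.basename(v)
              for k, v in find_dupes(root).items()}
    print(result)  # b.txt is reported as a duplicate of a.txt
```

Note that "original" simply means the first file encountered during the walk; the script has no way of knowing which copy was created first.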
4. delete.py
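    import os
    import sys

    try:
        input = raw_input  # Python 2: use raw_input under the name input
    except NameError:
        pass  # Python 3: input already returns a string


    class deletefile(object):
        def __init__(self, file):
            self.file = file

        def delete(self):
            print("Deleting %s" % self.file)
            os.remove(self.file)

        def dryrun(self):
            print("Dry Run: %s [NOT DELETED]" % self.file)

        def interactive(self):
            answer = input("Do you really want to delete: %s [Y/N]" % self.file)
            if answer.upper() == 'Y':
                os.remove(self.file)
            else:
                print("Skipping: %s" % self.file)
            return


    if __name__ == '__main__':
        from find_dupes import findDupes
        dup = findDupes(sys.argv[1])
        for file in dup:
            delete = deletefile(file)
            #delete.dryrun()
            delete.interactive()
            #delete.delete()
The deletefile class defines three methods, all related to file deletion: delete removes the file immediately, dryrun is a trial run that only reports the file without deleting it, and interactive asks the user to confirm before deleting. Between them they cover the usual ways a user might want to handle duplicates.
Summary: the four modules are self-contained, and each can be used on its own. Combined, they delete duplicate files in bulk given nothing more than a path.
Finally, here is a complete single-file version that is compatible with both Python 2 and Python 3.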
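One way to choose between the three behaviours at run time, instead of commenting lines in and out, is to pass the mode as a parameter. The sketch below is a Python 3 illustration under that assumption; the function `remove_file`, its `mode` strings, and the injectable `ask` callback are all hypothetical and do not appear in the original script.

```python
import os
import tempfile


def remove_file(path, mode="dryrun", ask=input):
    """Handle one file according to mode; return True if it was deleted.

    mode is "dryrun", "interactive" or "delete" (hypothetical names that
    mirror the three methods). ask is injectable so the interactive branch
    can be exercised without a real prompt.
    """
    if mode == "dryrun":
        print("Dry Run: %s [NOT DELETED]" % path)
        return False
    if mode == "interactive":
        answer = ask("Do you really want to delete: %s [Y/N] " % path)
        if answer.strip().upper() != "Y":
            print("Skipping: %s" % path)
            return False
    elif mode != "delete":
        raise ValueError("unknown mode: %r" % mode)
    print("Deleting %s" % path)
    os.remove(path)
    return True


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "dup.txt")
    open(target, "w").close()
    remove_file(target, "dryrun")  # file is kept
    # Simulate a user answering "n" at the prompt: file is kept.
    kept_after_no = not remove_file(target, "interactive", ask=lambda _: "n")
    deleted = remove_file(target, "delete")  # file is removed
    still_exists = os.path.exists(target)
```

Defaulting to the dry run is a deliberately cautious choice here, since deletion cannot be undone.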
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    from __future__ import print_function
    import os, sys, hashlib


    class diskwalk(object):
        def __init__(self, path):
            self.path = path

        def paths(self):
            path = self.path
            files_in_path = []
            for dirpath, dirnames, filenames in os.walk(path):
                for each_file in filenames:
                    fullpath = os.path.join(dirpath, each_file)
                    files_in_path.append(fullpath)
            return files_in_path


    def create_checksum(path):
        fp = open(path, 'rb')
        checksum = hashlib.md5()
        while True:
            buffer = fp.read(8192)
            if not buffer:
                break
            checksum.update(buffer)
        fp.close()
        checksum = checksum.digest()
        return checksum


    def findDupes(path):
        record = {}
        dup = {}
        d = diskwalk(path)
        files = d.paths()
        for each_file in files:
            compound_key = (os.path.getsize(each_file), create_checksum(each_file))
            if compound_key in record:
                dup[each_file] = record[compound_key]
            else:
                record[compound_key] = each_file
        return dup


    class deletefile(object):
        def __init__(self, file_name):
            self.file_name = file_name

        def delete(self):
            print("Deleting %s" % self.file_name)
            os.remove(self.file_name)

        def dryrun(self):
            print("Dry Run: %s [NOT DELETED]" % self.file_name)

        def interactive(self):
            try:
                answer = raw_input("Do you really want to delete: %s [Y/N]" % self.file_name)
            except NameError:
                # raw_input does not exist on Python 3; fall back to input
                answer = input("Do you really want to delete: %s [Y/N]" % self.file_name)
            if answer.upper() == 'Y':
                os.remove(self.file_name)
            else:
                print("Skipping: %s" % self.file_name)
            return


    def main():
        directory_to_check = sys.argv[1]
        duplicate_file = findDupes(directory_to_check)
        for each_file in duplicate_file:
            delete = deletefile(each_file)
            delete.interactive()


    if __name__ == '__main__':
        main()
The first command-line argument is the directory to check.
Original article: https://www.cnblogs.com/ivictor/p/4377609.html