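While scraping http://skell.sketchengine.eu recently, I found that requests could not retrieve the full page content. So I used selenium to open the page in a simulated browser, read the page source from it, and parsed that with BeautifulSoup to extract the example sentences. To keep the loop working from one word to the next, the loop body calls refresh(): after the browser is given a new URL, the refresh makes it update the page content. To give the page time to load, the script waits 2 seconds after each refresh, which lowers the chance of missing the content. To reduce the risk of being blocked, we also set a Chrome User-Agent. See the code below: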
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import re
import time

# Location of the ChromeDriver executable
path = Service("D:\\MyDrivers\\chromedriver.exe")
# Run Chrome without showing a browser window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Chrome's switch for overriding the User-Agent is --user-agent=<value>
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')
# Create the Chrome instance
driver = webdriver.Chrome(service=path, options=chrome_options)
lst = ["happy", "help", "evening", "great", "think", "adapt"]
for word in lst:
    url = "https://skell.sketchengine.eu/#result?lang=en&query=" + word + "&f=concordance"
    driver.get(url)
    # Refresh the page so it loads the data for the new word
    driver.refresh()
    time.sleep(2)
    # page_source returns the HTML of the current page
    resp = driver.page_source
    # Parse the source
    soup = BeautifulSoup(resp, "html.parser")
    table = soup.find_all("td")
    with open("eps.txt", 'a+', encoding='utf-8') as f:
        # Header line for this word ("的例子" means "examples of")
        f.write(f"\n{word}的例子\n")
        for i in table[0:6]:
            text = i.text
            # Collapse runs of whitespace into a single space
            new = re.sub(r"\s+", " ", text)
            # Write to the text file, starting a new line before each sentence number
            f.write(re.sub(r"^(\d+\.)", r"\n\1", new))
# Shut down the browser and the driver process
driver.quit()
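1. To speed things up, we run Chrome without a visible browser window (headless), configured through chrome.options.
2. Finally, we use re regular expressions to clean up the formatting.
3. We use table[0:6] to get the first three sentences (each sentence apparently takes up two td cells, one for its number and one for its text). The final result is shown below; each group starts with a header line of the form "<word>的例子", meaning "examples of <word>".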
happy的例子
1. This happy mood lasted roughly until last autumn.
2. The lodging was neither convenient nor happy .
3. One big happy family "fighting communism".
help的例子
1. Applying hot moist towels may help relieve discomfort.
2. The intense light helps reproduce colors more effectively.
3. My survival route are self help books.
evening的例子
1. The evening feast costs another $10.
2. My evening hunt was pretty flat overall.
3. The area nightclubs were active during evenings .
great的例子
1. The three countries represented here are three great democracies.
2. Our three different tour guides were great .
3. Your receptionist "crew" is great !
think的例子
1. I said yes immediately without thinking everything through.
2. This book was shocking yet thought provoking.
3. He thought "disgusting" was more appropriate.
adapt的例子
1. The novel has been adapted several times.
2. There are many ways plants can adapt .
3. They must adapt quickly to changing deadlines.
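The whitespace clean-up and line-break step behind this output can be tried without a browser. The following is a minimal sketch, assuming each sentence arrives as two cell texts (a number cell such as "1." and a sentence cell). The function name format_cells and the sample strings are made up for illustration and are not part of the original script.

```python
import re


def format_cells(cell_texts):
    """Join raw cell texts into numbered lines.

    cell_texts is a hypothetical list of strings standing in for
    [td.text for td in table[0:6]] in the script.
    """
    out = []
    for text in cell_texts:
        # Collapse runs of whitespace (newlines, tabs, spaces) into one space.
        cleaned = re.sub(r"\s+", " ", text)
        # Start a new line before a cell that begins with "1.", "2.", ...
        out.append(re.sub(r"^(\d+\.)", r"\n\1", cleaned))
    return "".join(out)


# Made-up sample cells with the kind of stray whitespace HTML tends to contain.
sample = ["1.\n  ", "This happy   mood\n lasted.", "2. ", "The lodging was  fine."]
print(format_cells(sample))
```

Note that the pattern is anchored with ^, so a cell whose text starts with whitespace before the number would not get a line break; stripping the text first would be one way to guard against that.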
Addendum: after optimizing the code, scraping the example sentences is quicker. The code is as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import os
import re
import time

# Location of the ChromeDriver executable
path = Service("D:\\MyDrivers\\chromedriver.exe")
# Run Chrome without showing a browser window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Chrome's switch for overriding the User-Agent is --user-agent=<value>
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36')


def get_wordlist():
    # Read one word per line from wordlist.txt, skipping blank lines
    with open("wordlist.txt", 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def main(lst):
    # Create the Chrome instance
    driver = webdriver.Chrome(service=path, options=chrome_options)
    try:
        for word in lst:
            url = "https://skell.sketchengine.eu/#result?lang=en&query=" + word + "&f=concordance"
            driver.get(url)
            # Refresh the page so it loads the data for the new word
            driver.refresh()
            time.sleep(2)
            # page_source returns the HTML of the current page
            resp = driver.page_source
            # Parse the source
            soup = BeautifulSoup(resp, "html.parser")
            table = soup.find_all("td")
            # Header and sentences go to the same file that is opened at the end
            with open("examples.txt", 'a+', encoding='utf-8') as f:
                f.write(f"\n{word}的例子\n")
                for i in table[0:6]:
                    # Collapse runs of whitespace into a single space
                    new = re.sub(r"\s+", " ", i.text)
                    f.write(new)
                    # f.write(re.sub(r"(\.\s)(\d+\.)", r"\1\n\2", new))
    finally:
        # Shut down the browser even if a page fails to load
        driver.quit()


if __name__ == "__main__":
    lst = get_wordlist()
    main(lst)
    # os.startfile is Windows-only; it opens the result in the default program
    os.startfile("examples.txt")
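Both listings call driver.refresh() right after driver.get(url). A likely reason, offered here as an assumption rather than something the original states, is that the search parameters sit after the # in the URL. That part (the fragment) is handled by JavaScript in the browser and is never sent to the server, which would also explain why requests only receives the page shell without the sentences. The sketch below uses only the standard library to show where the parameters end up; the helper name skell_url is hypothetical.

```python
from urllib.parse import quote, urlsplit


def skell_url(word):
    """Build the SkELL concordance URL used in the listings.

    quote() percent-encodes spaces and other special characters, so a
    multi-word query cannot break the URL.
    """
    return ("https://skell.sketchengine.eu/#result?lang=en&query="
            + quote(word) + "&f=concordance")


parts = urlsplit(skell_url("happy"))
print(parts.path)      # the only path an HTTP client requests from the server
print(parts.query)     # empty: no real query string is sent
print(parts.fragment)  # the search parameters live here, on the client side
```

Because only the fragment changes from one word to the next, the browser may treat each driver.get() as a jump within the same document, and the refresh forces a full reload with the new parameters.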
Original link: https://blog.csdn.net/henanlion/article/details/122757040