lxml offers several ways to extract the text content of HTML tags. This post focuses on how those methods differ.
# Only etree is needed; the original unused imports (lxml, collections) are dropped.
from lxml import etree
doc='''
<html>
<head>
<base />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<h5>test</h5>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
<a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
<p>hello world hello world <strong> hello world,hello world</strong>你好啊,李銀河</p>
</div>
</body>
</html>
'''
html = etree.HTML(doc)
tree = html.getroottree()
all_nodes = html.xpath('//*')
# Build the absolute XPath of every element so each result can be labelled.
xpath = []
for node in all_nodes:
    xpath.append(tree.getpath(node))
print('==============node.text method=====================')
for node, path in zip(all_nodes, xpath):
    print('{}: {}'.format(path, node.text))
print('==============node.itertext method=====================')
for node, path in zip(all_nodes, xpath):
    print('{}: {}'.format(path, ''.join(node.itertext())))
print('==============xpath method=====================')
for node, path in zip(all_nodes, xpath):
    print('{}: {}'.format(path, ''.join(html.xpath(path + '//text()'))))
The node.text results are as follows:
==============node.text method=====================
/html:
/html/head:
/html/head/base: None
/html/head/title: Example website
/html/body:
/html/body/div:
/html/body/div/a[1]: Name: My image 1
/html/body/div/a[1]/br: None
/html/body/div/a[1]/img: None
/html/body/div/h5: test
/html/body/div/a[2]: Name: My image 2
/html/body/div/a[2]/br: None
/html/body/div/a[2]/img: None
/html/body/div/a[3]: Name: My image 3
/html/body/div/a[3]/br: None
/html/body/div/a[3]/img: None
/html/body/div/a[4]: Name: My image 4
/html/body/div/a[4]/br: None
/html/body/div/a[4]/img: None
/html/body/div/a[5]: Name: My image 5
/html/body/div/a[5]/br: None
/html/body/div/a[5]/img: None
/html/body/div/a[6]: None
/html/body/div/a[6]/span: None
/html/body/div/a[6]/span/h5: test
/html/body/div/a[6]/br: None
/html/body/div/a[6]/img: None
/html/body/div/p: hello world hello world
/html/body/div/p/strong: hello world,hello world
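In the output above, the text for /html/body/div/p stops before <strong>, and the text that follows </strong> does not appear at all. lxml stores text that comes after a child element on that child's .tail attribute, not on the parent's .text. The following is a minimal sketch of that behaviour; the fragment is a hypothetical reduction of the <p> element, not part of the original example.

```python
from lxml import etree

# Hypothetical fragment, reduced from the <p> element in the example document.
fragment = etree.fromstring('<p>hello <strong>world</strong> again</p>')
strong = fragment[0]

p_text = fragment.text      # text before the first child element
strong_text = strong.text   # text inside <strong>
strong_tail = strong.tail   # text after </strong>, stored on the child

print(repr(p_text), repr(strong_text), repr(strong_tail))
```

Reading only .text on the parent therefore misses both the child's own text and whatever follows the child.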
The node.itertext results are as follows:
/html:
Example website
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李銀河
/html/head:
Example website
/html/head/base:
/html/head/title: Example website
/html/body:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李銀河
/html/body/div:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李銀河
/html/body/div/a[1]: Name: My image 1
/html/body/div/a[1]/br:
/html/body/div/a[1]/img:
/html/body/div/h5: test
/html/body/div/a[2]: Name: My image 2
/html/body/div/a[2]/br:
/html/body/div/a[2]/img:
/html/body/div/a[3]: Name: My image 3
/html/body/div/a[3]/br:
/html/body/div/a[3]/img:
/html/body/div/a[4]: Name: My image 4
/html/body/div/a[4]/br:
/html/body/div/a[4]/img:
/html/body/div/a[5]: Name: My image 5
/html/body/div/a[5]/br:
/html/body/div/a[5]/img:
/html/body/div/a[6]: testName: My image 6
/html/body/div/a[6]/span: test
/html/body/div/a[6]/span/h5: test
/html/body/div/a[6]/br:
/html/body/div/a[6]/img:
/html/body/div/p: hello world hello world hello world,hello world你好啊,李銀河
/html/body/div/p/strong: hello world,hello world
The xpath results are as follows:
==============xpath method=====================
/html:
Example website
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李銀河
/html/head:
Example website
/html/head/base:
/html/head/title: Example website
/html/body:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李銀河
/html/body/div:
Name: My image 1
test
Name: My image 2
Name: My image 3
Name: My image 4
Name: My image 5
testName: My image 6
hello world hello world hello world,hello world你好啊,李銀河
/html/body/div/a[1]: Name: My image 1
/html/body/div/a[1]/br:
/html/body/div/a[1]/img:
/html/body/div/h5: test
/html/body/div/a[2]: Name: My image 2
/html/body/div/a[2]/br:
/html/body/div/a[2]/img:
/html/body/div/a[3]: Name: My image 3
/html/body/div/a[3]/br:
/html/body/div/a[3]/img:
/html/body/div/a[4]: Name: My image 4
/html/body/div/a[4]/br:
/html/body/div/a[4]/img:
/html/body/div/a[5]: Name: My image 5
/html/body/div/a[5]/br:
/html/body/div/a[5]/img:
/html/body/div/a[6]: testName: My image 6
/html/body/div/a[6]/span: test
/html/body/div/a[6]/span/h5: test
/html/body/div/a[6]/br:
/html/body/div/a[6]/img:
/html/body/div/p: hello world hello world hello world,hello world你好啊,李銀河
/html/body/div/p/strong: hello world,hello world
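Summary:
- node.text does not include text from the node's child elements. It returns only the text that precedes the first child, so it is None for a[6] (which begins with <span>), and for <p> it stops before <strong>. Text that follows a child element is stored in that child's tail.
- node.itertext and the xpath //text() query both include the text of all descendant nodes, and in this example the two produce identical text.
Original link: https://blog.csdn.net/yeshang_lady/article/details/122370152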
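The summary can be checked on a single element. The sketch below uses a hypothetical snippet modelled on the sixth <a> element of the example and compares node.text with node.itertext and a relative XPath query. The string(.) call is an addition not used in the original post; it is the standard XPath function that returns the concatenated text of a node.

```python
from lxml import etree

# Hypothetical snippet modelled on the sixth <a> element of the example.
snippet = ("<div><a href='image6.html'><span><h5>test</h5></span>"
           "Name: My image 6 <br/></a></div>")
root = etree.HTML(snippet)
a = root.xpath('//a')[0]

direct = a.text                            # None: <a> starts with a child element
via_itertext = ''.join(a.itertext())       # text of all descendants, in document order
via_xpath = ''.join(a.xpath('.//text()'))  # same text nodes selected through XPath
via_string = a.xpath('string(.)')          # XPath string value of the element

print(repr(direct))
print(repr(via_itertext))
print(via_itertext == via_xpath == via_string)
```

Using './/text()' relative to the node avoids rebuilding the absolute path with tree.getpath, which is only needed in the post to label each result.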