lxml: Extracting the Content of HTML Tags

Author: Sun_Sherry  Updated: 2022-02-05  Category: Programming Languages

lxml offers several ways to extract the text inside HTML tags. This post focuses on how those methods differ.

from lxml import etree

doc='''
<html>
 <head>
  <base  />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <h5>test</h5>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
   <p>hello world hello world <strong> hello world,hello world</strong>你好啊,李銀河</p>
  </div>
 </body>
</html>
'''
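# Parse the HTML string; etree.HTML returns the root <html> element.
html = etree.HTML(doc)
tree = html.getroottree()
# Collect every element together with its absolute XPath.
all_nodes = html.xpath('//*')
xpath = [tree.getpath(node) for node in all_nodes]

print('==============node.text method=====================')
for node, path in zip(all_nodes, xpath):
    # node.text: only the text before the node's first child element
    print('{}:  {}'.format(path, node.text))
print('==============node.itertext method=====================')
for node, path in zip(all_nodes, xpath):
    # node.itertext(): all text inside the node, descendants included
    print('{}:  {}'.format(path, ''.join(node.itertext())))
print('==============xpath method=====================')
for node, path in zip(all_nodes, xpath):
    # //text(): every text node below the element addressed by path
    print('{}:  {}'.format(path, ''.join(html.xpath(path + '//text()'))))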

The output of node.text is as follows:

==============node.text method=====================
/html:  
 
/html/head:  
  
/html/head/base:  None
/html/head/title:  Example website
/html/body:  
  
/html/body/div:  
   
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  None
/html/body/div/a[1]/img:  None
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  None
/html/body/div/a[2]/img:  None
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  None
/html/body/div/a[3]/img:  None
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  None
/html/body/div/a[4]/img:  None
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  None
/html/body/div/a[5]/img:  None
/html/body/div/a[6]:  None
/html/body/div/a[6]/span:  None
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  None
/html/body/div/a[6]/img:  None
/html/body/div/p:  hello world hello world 
/html/body/div/p/strong:   hello world,hello world
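
The None for /html/body/div/a[6] and the shortened text for /html/body/div/p both follow from how lxml stores text: node.text holds only the text before the first child element, and text that follows a child element is kept in that child's tail attribute. The sketch below illustrates this with two hypothetical one-line fragments (not the sample document above); the variable names are assumptions made for the example.

```python
from lxml import etree

# Hypothetical fragments, modelled on the <p> and a[6] elements of the sample.
p = etree.HTML('<p>hello <strong>bold</strong> after</p>').xpath('//p')[0]
a = etree.HTML('<a><span>x</span>Name</a>').xpath('//a')[0]

strong = p[0]  # first child element of <p>
span = a[0]    # first child element of <a>

print(repr(p.text))       # 'hello '  (text before <strong>)
print(repr(strong.text))  # 'bold'
print(repr(strong.tail))  # ' after'  (text after </strong>, stored on the child)
print(repr(a.text))       # None      (<a> starts directly with <span>)
print(repr(span.tail))    # 'Name'
```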

The output of node.itertext is as follows:

==============node.itertext method=====================
/html:  
 
  
  Example website
 
 
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李銀河
  
 

/html/head:  
  
  Example website
 
/html/head/base:  
/html/head/title:  Example website
/html/body:  
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李銀河
  
 
/html/body/div:  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李銀河
  
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  
/html/body/div/a[1]/img:  
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  
/html/body/div/a[2]/img:  
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  
/html/body/div/a[3]/img:  
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  
/html/body/div/a[4]/img:  
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  
/html/body/div/a[5]/img:  
/html/body/div/a[6]:  testName: My image 6 
/html/body/div/a[6]/span:  test
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  
/html/body/div/a[6]/img:  
/html/body/div/p:  hello world hello world  hello world,hello world你好啊,李銀河
/html/body/div/p/strong:   hello world,hello world
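
The joined itertext() output keeps every newline and indentation space from the source markup, which is why the container nodes above print across several lines. One possible way to tidy this up, shown as a sketch on a hypothetical fragment, is to collapse whitespace runs with str.split() after joining:

```python
from lxml import etree

# Hypothetical indented fragment, similar to the <div> of the sample document.
snippet = '''
<div>
  <a>Name: My image 1 <br/></a>
  <h5>test</h5>
</div>
'''
div = etree.HTML(snippet).xpath('//div')[0]

raw = ''.join(div.itertext())   # includes the indentation between tags
clean = ' '.join(raw.split())   # collapse every whitespace run to one space

print(repr(clean))              # 'Name: My image 1 test'
```

Note that this also collapses whitespace that was meaningful in the original text, so it suits display or matching more than exact reproduction.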

The output of the XPath method is as follows:

==============xpath method=====================
/html:  
 
  
  Example website
 
 
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李銀河
  
 

/html/head:  
  
  Example website
 
/html/head/base:  
/html/head/title:  Example website
/html/body:  
  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李銀河
  
 
/html/body/div:  
   Name: My image 1 
   test
   Name: My image 2 
   Name: My image 3 
   Name: My image 4 
   Name: My image 5 
   testName: My image 6 
   hello world hello world  hello world,hello world你好啊,李銀河
  
/html/body/div/a[1]:  Name: My image 1 
/html/body/div/a[1]/br:  
/html/body/div/a[1]/img:  
/html/body/div/h5:  test
/html/body/div/a[2]:  Name: My image 2 
/html/body/div/a[2]/br:  
/html/body/div/a[2]/img:  
/html/body/div/a[3]:  Name: My image 3 
/html/body/div/a[3]/br:  
/html/body/div/a[3]/img:  
/html/body/div/a[4]:  Name: My image 4 
/html/body/div/a[4]/br:  
/html/body/div/a[4]/img:  
/html/body/div/a[5]:  Name: My image 5 
/html/body/div/a[5]/br:  
/html/body/div/a[5]/img:  
/html/body/div/a[6]:  testName: My image 6 
/html/body/div/a[6]/span:  test
/html/body/div/a[6]/span/h5:  test
/html/body/div/a[6]/br:  
/html/body/div/a[6]/img:  
/html/body/div/p:  hello world hello world  hello world,hello world你好啊,李銀河
/html/body/div/p/strong:   hello world,hello world
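
The listing above builds each query from the absolute path, but XPath can also be evaluated relative to a node. The sketch below, on a hypothetical fragment, compares the relative query .//text() with the XPath string(.) function, which returns the concatenated text directly. It also shows that lxml's text results carry information about their origin.

```python
from lxml import etree

# Hypothetical fragment, modelled on the <p> element of the sample document.
p = etree.HTML('<p>hello <strong>bold</strong> after</p>').xpath('//p')[0]

pieces = p.xpath('.//text()')   # text nodes below <p>, in document order
whole = p.xpath('string(.)')    # the same text as one concatenated string

print([str(t) for t in pieces])         # ['hello ', 'bold', ' after']
print(''.join(pieces) == whole)         # True
print(whole == ''.join(p.itertext()))   # True

# lxml returns "smart strings" that remember where they came from.
last = pieces[-1]
print(last.is_tail, last.getparent().tag)  # True strong
```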

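Summary:

  1. node.text returns only the text that precedes the node's first child element. It includes neither the text inside child nodes nor the text that follows a child element. For example, /html/body/div/a[6] yields None, and /html/body/div/p yields only "hello world hello world ".
  2. node.itertext() and the XPath //text() query both include the text of all descendant nodes, and for this document the two methods return the same text.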
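
Between these two behaviours there is a middle ground that the comparison above does not cover: the text that belongs directly to a node, including the pieces that follow its child elements, but without the children's own text. A minimal sketch, assuming the XPath query ./text() is acceptable for this purpose; direct_text is a hypothetical helper name, not an lxml function.

```python
from lxml import etree

# Hypothetical fragment, modelled on the <p> element of the sample document.
p = etree.HTML('<p>hello <strong>bold</strong> after</p>').xpath('//p')[0]

def direct_text(node):
    """Hypothetical helper: join the text nodes that are direct children of node."""
    return ''.join(node.xpath('./text()'))

print(repr(p.text))                 # 'hello '           (before first child only)
print(repr(direct_text(p)))         # 'hello  after'     (own text, children skipped)
print(repr(''.join(p.itertext())))  # 'hello bold after' (all descendants included)
```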

Original link: https://blog.csdn.net/yeshang_lady/article/details/122370152
