Python Web Scraping Study Notes: A Detailed Guide to the BeautifulSoup4 Library

Posted on 2021-08-25 by WalkonNet

Usage example

from bs4 import BeautifulSoup

# Create the BeautifulSoup object, using lxml as the parser.
# html is assumed to already hold the page source as a string.
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

The output is the parsed document, re-indented with each tag and string on its own line.
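For readers who want to run the example without downloading a page first, here is a minimal sketch. The markup is a made-up sample rather than the page used in this post, and it uses Python's built-in html.parser instead of lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# html.parser ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the document as an indented string.
pretty = soup.prettify()
print(pretty)
```

Both parsers should handle a well-formed document like this one in the same way. They can differ in how they repair broken markup.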

Common objects: Tag

A Tag object corresponds to a single tag in the HTML document.

Building on the example above, add:

from bs4 import BeautifulSoup

# Create the BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html, "lxml")
# print(soup.prettify())

print(soup.title)  # None, because this page has no <title> tag

print(soup.head)  # None, because this page has no <head> tag

print(soup.a)  # <a class="fill-dec" href="//my.csdn.net" target="_blank">編輯自我介紹，讓更多人瞭解你<span class="write-icon"></span></a>

print(type(soup.p))  # <class 'bs4.element.Tag'>

print(soup.p)

Here print(soup.p) prints the first <p> tag in the document, including its attributes and contents.

Likewise, building on the above, add:

print(soup.name)  # [document]; the soup object itself is special, and its name is [document]


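print(soup.head.name)  # head; for any other tag, name is the tag's own name
# Note: on a page without a <head> tag (such as the one above, where soup.head is None), this line raises AttributeError.

print(soup.p.attrs)  # all attributes of the <p> tag, returned as a dict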


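print(soup.p['class'])  # the class attribute of the <p> tag, returned as a list because class can hold several values

soup.p['class'] = "newClass"
print(soup.p)  # attributes and contents can be modified in place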

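The attribute operations above can be tried on a small stand-alone snippet. This is only a sketch: the markup is a made-up sample, and html.parser is used instead of lxml to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a multi-valued class attribute.
html = '<p class="title story" id="first"><b>Hello</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(tag.name)           # the tag's own name
print(tag.attrs)          # all attributes as a dict

classes = tag["class"]    # class is multi-valued, so this is a list
print(classes)

missing = tag.get("href") # get() returns None instead of raising KeyError
print(missing)

# Attributes can be reassigned like dictionary entries.
tag["class"] = "newClass"
print(tag)
```

Using tag.get("name") instead of tag["name"] is a convenient way to avoid a KeyError when an attribute may be absent.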

Common objects: NavigableString

Building on the above, add:

print(soup.p.string)
# The Dormouse's story
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

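One detail worth checking by hand is when .string has a value at all. The sketch below uses made-up sample markup and html.parser (both are assumptions, not part of the original example). It shows that .string follows a chain of single children, and that it is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample markup.
html = "<p><b>The Dormouse's story</b></p><div>one <i>two</i></div>"
soup = BeautifulSoup(html, "html.parser")

# <p> has a single child <b>, which has a single string, so .string resolves.
text = soup.p.string
print(text, type(text))

# NavigableString is a subclass of str, so string methods work on it.
print(isinstance(text, str), text.upper())

# <div> has two children (a string and an <i> tag), so .string is None.
print(soup.div.string)
```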

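Common objects: BeautifulSoup

The BeautifulSoup object represents the whole parsed document. For most purposes it can be treated as a Tag object, and it supports most of the methods used for navigating and searching the document tree. Because it does not correspond to a real HTML or XML tag, it has no tag name or attributes of its own. Its name is instead set to the special value "[document]", which is occasionally useful to inspect.

print(soup.name)
# prints [document]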

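Common objects: Comment

A Comment is a special kind of NavigableString that holds the text of an HTML comment.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
# Name the parser explicitly, and use a separate variable so the soup object from the earlier examples is not overwritten
comment_soup = BeautifulSoup(markup, "lxml")
comment = comment_soup.b.string
print(type(comment))
# <class 'bs4.element.Comment'>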
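Because a Comment is also a NavigableString, .string alone does not tell you whether you got visible text or a comment. A minimal sketch of telling them apart with isinstance, using made-up markup and html.parser as assumptions:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one tag holding a comment, one holding plain text.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b><i>plain text</i>"
doc = BeautifulSoup(markup, "html.parser")

for node in (doc.b.string, doc.i.string):
    # Check for Comment first, since every Comment is also a NavigableString.
    kind = "comment" if isinstance(node, Comment) else "text"
    print(kind, repr(str(node)))
```

This check is useful when extracting text from a page, since comment text is usually not something you want in the output.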

Traversing the document tree

Building on the above, add:

div_tag = soup.div
# .contents returns a list of all direct children
print(div_tag.contents)


Similarly:

div_tag = soup.div

# .children returns an iterator over the direct children
for child in div_tag.children:
    print(child)

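The difference between .contents and .children is easier to see on a small document. This sketch assumes a made-up snippet and html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <div> with two <p> children.
html = "<div><p>one</p><p>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div_tag = soup.div

# .contents is a plain list, so it supports len() and indexing.
children = div_tag.contents
print(len(children), children[0])

# .children yields the same direct children one at a time.
child_names = [child.name for child in div_tag.children]
print(child_names)
```

Both cover direct children only. The <b> tag nested inside the second <p> does not appear in either.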

When a tag contains multiple strings

Use .strings to iterate over all of them:

for string in soup.strings:
    print(repr(string))


.stripped_strings: removing whitespace

for string in soup.stripped_strings:
    print(repr(string))

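To compare the two generators side by side, here is a small sketch on made-up markup that contains indentation and padding (html.parser is assumed here rather than lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with newlines and padded text.
html = "<div>\n  <p> first </p>\n  <p>second</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every string, including whitespace-only ones.
raw = list(soup.strings)
print(raw)

# .stripped_strings drops whitespace-only strings and strips the rest.
clean = list(soup.stripped_strings)
print(clean)
```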

Searching the document tree: find and find_all

Find all matching tags:

print(soup.find_all("a",id='link2'))

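The find method returns as soon as it reaches the first tag that matches the conditions, so it returns a single element (or None if nothing matches). The find_all method collects every tag that matches and returns them all as a list.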
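The difference in return types can be sketched as follows. The markup is a made-up sample in the style of the official documentation's examples, and html.parser is an assumption made to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two links.
html = """
<p>
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                          # a single Tag
all_links = soup.find_all("a")                  # list of every match
by_id = soup.find_all("a", id="link2")          # filter on an attribute
by_class = soup.find_all("a", class_="sister")  # class_ avoids the Python keyword
missing = soup.find("table")                    # no match: None
no_matches = soup.find_all("table")             # no match: empty list

print(first["id"], len(all_links), by_id[0].get_text(), missing, no_matches)
```

Because find can return None, it is worth checking the result before accessing attributes on it.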

The select method (CSS selectors)

# Select by tag name:
print(soup.select('a'))
# Select by class name:
# put a '.' in front of the class name
print(soup.select('.sister'))
# Select by id:
# put a '#' in front of the id
print(soup.select("#link1"))

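On the sample page, select('a') returns a list of all the <a> tags. The other two queries return an empty list, because the page has no element with class sister or id link1.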

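Combined selectors

print(soup.select("p #link1"))  # the element with id link1 anywhere inside a <p> tag

Child selectors

print(soup.select("head > title"))

Attribute selectors

print(soup.select('a[href="http://example.com/elsie"]'))  # the attribute filter applies to the same node as the tag name, so there must be no space between them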
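Since the page used in this post has no matching class or id, the selectors above are easier to check on a document that does. The sketch below uses a made-up sample in the style of the official documentation's examples, and html.parser is an assumption. Note that select relies on the soupsieve package, which is normally installed together with beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing the classes and ids used above.
html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")                 # tag name
by_class = soup.select(".sister")         # class
by_id = soup.select("#link1")             # id
descendant = soup.select("p #link1")      # id link1 anywhere inside a <p>
child = soup.select("head > title")       # <title> directly under <head>
by_attr = soup.select('a[href="http://example.com/elsie"]')  # attribute value
nothing = soup.select("#no-such-id")      # no match: empty list

for result in (by_tag, by_class, by_id, descendant, child, by_attr, nothing):
    print(len(result), result)
```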

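Getting the content

First check the type of the result:

print(type(soup.select('div')))

select returns a list-like collection of Tag objects, so loop over it and call get_text() on each element: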

for title in soup.select('div'):
    print (title.get_text())


print(soup.select('div')[20].get_text())  # text of the div at index 20, i.e. the 21st div, since indexing starts at 0
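A hard-coded index such as 20 only works if the page has at least 21 div tags. A minimal sketch of collecting the text and guarding the index follows. The markup and the index variable are illustrative assumptions, and html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with only two <div> tags.
html = "<div>first</div><div>second <b>bold</b></div>"
soup = BeautifulSoup(html, "html.parser")

divs = soup.select("div")

# get_text() joins all the text inside a tag, including nested tags.
texts = [div.get_text() for div in divs]
print(texts)

# Check the length before indexing to avoid an IndexError.
index = 20
if index < len(divs):
    print(divs[index].get_text())
else:
    print(f"only {len(divs)} div tags found, so index {index} does not exist")
```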

Summary

That is all for this article. I hope it was helpful, and I hope you will keep following WalkonNet for more content!
