A Detailed Guide to Python's urllib Library for Web Scraping

Posted on 2022-02-09 by WalkonNet

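I. Overview

urllib is Python's built-in HTTP request library. The popular third-party requests library is more convenient to use, but it is built on urllib3, a separate third-party package, rather than on the standard-library urllib. Because urllib ships with Python and is the most basic request library, it is still well worth understanding how it works and how to use it.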

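II. The four modules of urllib

urllib.request
The request module (like typing a URL into a browser and pressing Enter)

urllib.error
The exception-handling module (lets you catch the errors raised by failed requests)

urllib.parse
The URL parsing module

urllib.robotparser
The robots.txt parsing module, which determines which URLs on a site may be crawled and which may not; it is used less often

The library is organized differently in Python 2 and Python 3.

In Python 2: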

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

In Python 3:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

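III. urllib.request

1. The urlopen function

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

(The cafile, capath and cadefault parameters were deprecated in Python 3.6 and removed in Python 3.13; use context instead.)

The url parameter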

from urllib import request
response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

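The data parameter

Without the data parameter, urlopen sends a GET request; once data is supplied, the request becomes a POST (the examples here use the test site http://httpbin.org).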

import urllib.request
import urllib.parse

data1= bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data = data1)
print(response.read())

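The data parameter must be of type bytes, so the bytes() function is used to encode it. When an encoding is given, the first argument of bytes() must be a str, so urllib.parse.urlencode is used first to convert the dictionary into a string.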
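As a minimal offline sketch of the encoding step described above, the following builds the POST body without sending anything. The names form, encoded and body are illustrative, and Request.get_method() is used here only to show that supplying data switches the default method from GET to POST.

```python
from urllib import parse, request

form = {'word': 'hello'}

# urlencode turns the dict into a form-encoded str
encoded = parse.urlencode(form)
# str.encode() is equivalent to bytes(encoded, encoding='utf-8')
body = encoded.encode('utf-8')

# No request is sent here; the Request objects are only inspected
get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=body)

print(encoded)                # word=hello
print(type(body))             # <class 'bytes'>
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```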

The timeout parameter

Sets a timeout in seconds; if no response arrives within that time, an exception is raised.

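import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)
    print(response.read())
except (urllib.error.URLError, socket.timeout):
    # A connect timeout surfaces as URLError; a read timeout as socket.timeout
    print('error')

With the timeout set to 0.001 seconds, no response arrives in time, so the program prints error.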

2. The response type

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))

Status code and response headers

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

The read method

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
body = response.read()  # read() consumes the stream, so call it only once
print(type(body))
print(body.decode('utf-8'))

response.read() returns the data as bytes, so it must be decoded with decode('utf-8').

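3. The Request object

To send more complex requests with urllib, you need to use a Request object.

import urllib.request

# Create a Request object directly, passing the URL as an argument
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Here a Request object is created with the URL as its argument, and that object is then passed to the urlopen function.

A more complex request, with headers

# Use a Request object to make a POST request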

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
form = {'word': 'hello'}
# Form-encode the dict (word=hello) before converting it to bytes;
# bytes(str(form)) would send the dict's repr instead of form data
data = urllib.parse.urlencode(form).encode('utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

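The request above specifies the request method, URL, request headers and request body, so its logic is clear.

The Request object also has an add_header method, which can be called repeatedly to add multiple header key-value pairs.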
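A short sketch of add_header, run offline (the Request is built and inspected but never sent). The header values are illustrative placeholders. One detail worth knowing: Request stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
import urllib.request

req = urllib.request.Request('http://httpbin.org/get')

# Each call adds one header; call it once per key-value pair
req.add_header('User-Agent', 'Mozilla/5.0 (illustrative value)')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')

# header_items() lists the headers that would be sent
for name, value in req.header_items():
    print(name, ':', value)

# Names are stored via str.capitalize(), hence 'User-agent'
print(req.get_header('User-agent'))
```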

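4. Advanced requests

Setting a proxy

Many websites track how many times a given IP address visits within a period of time (through traffic statistics, system logs and so on). If the number of visits is too high to look like a normal user, the site blocks that IP. ProxyHandler (the handler for configuring a proxy) lets you send requests through a different IP address.

from urllib import request  # import the request module

url = 'http://httpbin.org'  # target URL
handler = request.ProxyHandler({'http': '122.193.244.243:9999'})  # create a proxy handler with the ProxyHandler class
# For a paid proxy that requires authentication:
# handler = request.ProxyHandler({'http': 'username:password@122.193.244.243:9999'})
opener = request.build_opener(handler)  # build an opener from the handler
resp = opener.open(url)  # send the request with opener.open()
print(resp.read())  # print the response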

Cookies

import urllib.request

url = 'https://weibo.cn/5273088553/info'
# Ordinary access, without a cookie
# headers = {
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
# }
# Access with a cookie attached
# (the 'GET ... HTTP/1.1' request line is not a header, so it is not included)
headers = {
    'Host': 'weibo.cn',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # 'Referer': 'https://weibo.cn/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': '_T_WM=c1913301844388de10cba9d0bb7bbf1e; SUB=_2A253Wy_dDeRhGeNM7FER-CbJzj-IHXVUp7GVrDV6PUJbkdANLXPdkW1NSesPJZ6v1GA5MyW2HEUb9ytQW3NYy19U; SUHB=0bt8SpepeGz439; SCF=Aua-HpSw5-z78-02NmUv8CTwXZCMN4XJ91qYSHkDXH4W9W0fCBpEI6Hy5E6vObeDqTXtfqobcD2D32r0O_5jSRk.; SSOLoginState=1516199821',
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
# Print everything
# print(response.read().decode('gbk'))
# Write the content to a file
with open('weibo.html', 'wb') as fp:
    fp.write(response.read())
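The example above pastes a cookie string into the headers by hand. As an alternative sketch, not taken from the original article, the standard library's http.cookiejar can store cookies that a server sets and resend them automatically. The helper name fetch_with_cookies and the httpbin.org URL in the comment are assumptions made for illustration.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from responses; the processor attaches them to later requests
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_with_cookies(url):
    """Open url through the cookie-aware opener and return the body as bytes."""
    with opener.open(url, timeout=10) as resp:
        return resp.read()


# Example usage (requires network access, so it is left commented out):
# fetch_with_cookies('http://httpbin.org/cookies/set?demo=1')
# for cookie in jar:
#     print(cookie.name, cookie.value)
```

This approach suits sites where the cookie is issued during the session; a cookie copied from a logged-in browser still has to be supplied manually as shown earlier.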

IV. urllib.error

Three exceptions can be caught: URLError, HTTPError (a subclass of URLError) and ContentTooShortError.

URLError has only one attribute: reason.

HTTPError has three attributes: code, reason and headers.

import urllib.request
from urllib import error

try:
    response = urllib.request.urlopen('http://123.com')
except error.URLError as e:
    print(e.reason)

from urllib import error, request

# Catch the HTTP error first, then the URL error
try:
    response = request.urlopen('http://123.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request succeeded!')
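The reason HTTPError must be handled before URLError is that it is a subclass: a URLError handler placed first would also catch every HTTPError. The sketch below shows this offline using hand-built exception objects as stand-ins for real failures; the describe helper and the example URL are hypothetical.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Return a short description, checking the more specific class first."""
    if isinstance(exc, error.HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, error.URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc!r}'


# Hand-built exceptions; no request is sent
http_exc = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                           Message(), io.BytesIO())
url_exc = error.URLError('name resolution failed')

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(http_exc))  # HTTP error 404: Not Found
print(describe(url_exc))   # URL error: name resolution failed
```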

V. URL parsing with urllib.parse

The urlparse function

This function splits the URL passed to it into several components and assigns each one to a named field.

from urllib import parse

result = parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

The result is the URL neatly split into its parts:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

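As the output shows, the components are: scheme (protocol), netloc (domain), path, params, query and fragment.

urlparse takes several parameters: url, scheme and allow_fragments.

When calling urlparse, you can pass scheme='http' to specify a default scheme. If the URL already includes a scheme, the scheme argument has no effect.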
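A small sketch of how the scheme and allow_fragments arguments behave, using made-up variations of the URL from the earlier example. Note that a URL without a leading // is treated as having no netloc, so the host name ends up in path.

```python
from urllib.parse import urlparse

# No scheme in the URL: the default passed via scheme= is used
no_scheme = urlparse('www.baidu.com/index.html?id=5', scheme='https')
# Scheme present in the URL: the scheme= argument is ignored
has_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')
# allow_fragments=False: '#comment' is not split off into fragment
no_frag = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)

print(no_scheme.scheme)                  # https
print(repr(no_scheme.netloc), no_scheme.path)  # '' www.baidu.com/index.html
print(has_scheme.scheme)                 # http
print(no_frag.path, repr(no_frag.fragment))    # /index.html#comment ''
```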

The urlunparse function

The inverse of urlparse: it assembles a URL from its components.
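A minimal sketch of urlunparse, reusing the components from the urlparse example. The function expects exactly six items in the order scheme, netloc, path, params, query, fragment.

```python
from urllib.parse import urlparse, urlunparse

# Six components: scheme, netloc, path, params, query, fragment
parts = ('http', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment')
url = urlunparse(parts)
print(url)  # http://www.baidu.com/index.html;user?id=5#comment

# Round trip: parsing the result gives back the same components
print(tuple(urlparse(url)) == parts)  # True
```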


The urljoin function

Joins a base URL with another URL, which may be relative, to produce a single URL.
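A short sketch of the three common cases for urljoin; the paths and the example.com address are illustrative.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

# Relative path: replaces the last segment of the base path
relative = urljoin(base, 'c.html')
# Path starting with '/': replaces the whole base path
rooted = urljoin(base, '/c.html')
# Absolute URL: the base is ignored entirely
absolute = urljoin(base, 'https://example.com/faq.html')

print(relative)  # http://www.baidu.com/a/c.html
print(rooted)    # http://www.baidu.com/c.html
print(absolute)  # https://example.com/faq.html
```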


The urlencode function

Converts a dictionary into GET request parameters (a query string).
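A minimal sketch of urlencode building a query string and appending it to a base URL. The parameter names and values are made up for illustration; note that spaces are encoded as +.

```python
from urllib.parse import urlencode

params = {'name': 'alice', 'age': 30, 'q': 'hello world'}

# Keys and values are converted to str and percent-encoded as needed
query = urlencode(params)
url = 'http://httpbin.org/get?' + query

print(query)  # name=alice&age=30&q=hello+world
print(url)    # http://httpbin.org/get?name=alice&age=30&q=hello+world
```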


VI. urllib.robotparser

Rarely used; it is enough to know that it exists.
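For reference, here is a small offline sketch of the module. Instead of downloading a real robots.txt with set_url() and read(), it feeds hypothetical rules to parse() as a list of lines; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied line by line
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow crawling the URL
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
allowed = parser.can_fetch('*', 'http://example.com/index.html')

print(blocked)  # False
print(allowed)  # True
```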

