Using Python to Browse Liyang's Photography Circle

Posted on 2022-05-17 by WalkonNet

Target Site Analysis

The list pages of the target site follow this pagination rule:

http://www.jsly001.com/thread-htm-fid-45-page-{page}.html
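
As a small illustration of the pagination rule, the page number can be substituted into the URL template to build the list-page addresses. The helper name `build_page_urls` and the page range are illustrative assumptions, not part of the original code.

```python
# Template taken from the pagination rule above; {page} is the page number.
PAGE_URL_TEMPLATE = "http://www.jsly001.com/thread-htm-fid-45-page-{page}.html"


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [
        PAGE_URL_TEMPLATE.format(page=page)
        for page in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```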

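The code is written with the threading module (for multithreading), the requests module, and the BeautifulSoup module.

Scraping proceeds from the list page to the detail pages: the list page supplies the topic links, and each detail page supplies the images.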

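Liyang Photography Circle Image Scraping Code

This is a hands-on example: the complete code is shown first, followed by an explanation based on its comments and key functions.

The main implementation steps are:

Set the logging output level
Declare a LiYangThread class that inherits from threading.Thread
Initialize the thread object
Have each thread fetch the shared (global) resources
Call the HTML parsing function
Locate the row that separates the forum topics, mainly to avoid collecting pinned topics
Parse with lxml
Extract the titles and detail-page links
Extract the image URLs
Save the images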

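import random
import threading
import logging
from bs4 import BeautifulSoup
import requests
import lxml # not called directly; importing it fails early if lxml is not installed
logging.basicConfig(level=logging.NOTSET) # set the logging output level (NOTSET lets every message through)
# Declare a LiYangThread class that inherits from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self) # initialize the thread object
        self._headers = self._get_headers() # pick a random user agent
        self._timeout = 5 # request timeout in seconds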

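    # Each thread fetches the shared (global) resources
    def run(self):
        # while True: # the multithreaded loop would start here
        res = None # stays None if the request fails, so the check below is safe
        try:
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html", headers=self._headers,
                               timeout=self._timeout) # test by fetching the first page
        except Exception as e:
            logging.error(e)
        if res is not None:
            html_text = res.text
            self._format_html(html_text) # call the HTML parsing function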

    def _format_html(self, html):
        # Parse with lxml
        soup = BeautifulSoup(html, 'lxml')

        # Locate the row that separates the forum topics, mainly to avoid collecting pinned topics
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})

        if part_tr is not None:
            items = part_tr.find_all_next(attrs={"name": "readlink"}) # detail-page links after the separator
        else:
            items = soup.find_all(attrs={"name": "readlink"})
        # Extract the titles and detail-page URLs
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # Visit each topic's detail page
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """Extract the image URLs from a detail page"""
        res = None # stays None if the request fails
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        # Image extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            # Prefer the imgList container; fall back to the post body
            content = origin_div2 if origin_div2 else origin_div1

            if content is not None:
                imgs = content.find_all('img')

                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs) # save the images
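    def _save_img(self, name, imgs):
        """Save the images"""
        name = name.replace("/", "_") # "/" cannot appear in a file name
        for index, img in enumerate(imgs):
            url = img.get("src")
            if not url or 'http' not in url:
                continue
            # Look up the id attribute on the enclosing <span> tag
            parent = img.find_parent('span')
            # Fall back to the loop index when there is no enclosing <span>
            id_ = parent.get("id") if parent is not None else f"img{index}"

            res = None # reset so a failed request never reuses the previous response
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)

            if res is not None:
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f: # create the imgs folder in the working directory beforehand
                    f.write(res.content)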
    def _get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers
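if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.start() # start() executes run() in a new thread; calling run() directly stays in the main thread
    my_thread.join() # wait for the thread to finish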
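
The listing above only fetches page 1, and its commented-out `while True` line marks where a multithreaded loop would go. One possible way to let several threads share the list pages is a `queue.Queue`; the following is a minimal sketch under that assumption. `PageWorker`, `crawl_pages`, and `handle_page` are hypothetical names for illustration, and the handler here only records each URL instead of issuing a request.

```python
import queue
import threading


class PageWorker(threading.Thread):
    """Hypothetical worker: pulls list-page URLs from a shared queue."""

    def __init__(self, page_queue, handle_page):
        super().__init__()
        self._page_queue = page_queue
        self._handle_page = handle_page  # callable invoked once per URL

    def run(self):
        while True:
            try:
                url = self._page_queue.get_nowait()
            except queue.Empty:
                break  # no pages left, so the thread exits
            self._handle_page(url)


def crawl_pages(urls, handle_page, thread_count=3):
    """Distribute urls across thread_count workers and wait for them."""
    page_queue = queue.Queue()
    for url in urls:
        page_queue.put(url)
    workers = [PageWorker(page_queue, handle_page) for _ in range(thread_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    template = "http://www.jsly001.com/thread-htm-fid-45-page-{}.html"
    seen = []
    # list.append is safe to call from several threads in CPython
    crawl_pages([template.format(n) for n in range(1, 6)], seen.append)
    print(sorted(seen))
```

In a real crawler, `handle_page` would be the code that requests the page and passes the HTML to the parsing function.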

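In this example, BeautifulSoup parses the HTML with the lxml parser, and later examples mostly use the same parser. Make sure lxml is installed before running the code: BeautifulSoup loads the parser by name, so the explicit import lxml in the listing only serves as an early check that the package is available.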
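
A minimal sketch of selecting the parser: BeautifulSoup raises `bs4.FeatureNotFound` when the requested parser is not installed, so one option is to fall back to Python's built-in `html.parser`. The helper name `make_soup` and the sample HTML (including its `href` value) are assumptions for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(html, "html.parser")


if __name__ == "__main__":
    sample = '<a name="readlink" href="sample-topic.html">Sample topic</a>'
    soup = make_soup(sample)
    link = soup.find(attrs={"name": "readlink"})
    print(link.text, link["href"])
```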

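Data extraction relies on soup.find() and soup.find_all(), with find_all_next() collecting the links that follow the separator row. The code also uses find_parent() to read the id attribute from an image's parent tag.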

# Look up the id attribute on the parent <span> tag
id_ = img.find_parent('span').get("id")
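
To see these lookups in isolation, the sketch below runs them against a small hand-written HTML fragment. The fragment is made up to mimic the structure the scraper expects and is not copied from the site; it uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the shape the scraper expects; not real site data.
SAMPLE_HTML = """
<table>
  <tr><td><a name="readlink" href="pinned.html">Pinned topic</a></td></tr>
  <tr class="bbs_tr4"><td>Forum topics</td></tr>
  <tr><td><a name="readlink" href="topic-1.html">Topic 1</a></td></tr>
</table>
<div class="tpc_content">
  <span id="td_att1"><img src="http://example.com/a.jpg"></span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match; find_all_next() only looks after that node,
# which is how links that precede the separator row are skipped.
separator = soup.find(attrs={"class": "bbs_tr4"})
links = separator.find_all_next(attrs={"name": "readlink"})
hrefs = [link["href"] for link in links]

# find_parent() walks up from the <img> to the nearest enclosing <span>.
img = soup.find("img")
span_id = img.find_parent("span").get("id")

print(hrefs, span_id)
```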

DEBUG messages appear while the code runs; adjust the logging output level to control them.
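
One way to adjust the level is sketched below. The choice of INFO for the script and WARNING for urllib3 is an assumption for illustration; pick whatever thresholds suit the run.

```python
import logging

# Show INFO and above from this script; DEBUG messages are dropped.
# force=True (Python 3.8+) replaces any handlers configured earlier.
logging.basicConfig(level=logging.INFO, force=True)

# requests sends its HTTP traffic through urllib3, which logs connection
# details at DEBUG level; raise that logger's threshold separately.
logging.getLogger("urllib3").setLevel(logging.WARNING)

logging.debug("hidden")   # below INFO, not printed
logging.info("visible")   # printed
```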
