Python Scrapy in Practice: Scraping the Gushiwen Poetry Site (gushiwen.cn)
Requirements
Using Python and the Scrapy framework, scrape poetry data from gushiwen.cn, specifically each poem's title, author, dynasty, poem text, and translation. The crawl proceeds page by page, four pages in total. The first page's URL is https://www.gushiwen.cn/default_1.aspx.
1. Creating the Scrapy Project
First, create the Scrapy project and the spider.
In the target directory, create a project named prose:
scrapy startproject prose
Then enter the project directory and create a spider named gs whose crawl scope is gushiwen.cn:
cd prose
scrapy genspider gs gushiwen.cn
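After these two commands, the project skeleton should look roughly like the following (exact files can vary slightly between Scrapy versions); gs.py under spiders/ is the spider we edit in step 3:

prose/
├── scrapy.cfg
└── prose/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── gs.py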
2. Global Configuration: settings.py
Edit the configuration file settings.py as follows:
① Do not obey the robots.txt rules (ROBOTSTXT_OBEY = False)
② Set the download delay to 1 second (DOWNLOAD_DELAY = 1)
③ Add default request headers and enable the item pipeline
④ Set the log level: LOG_LEVEL = "WARNING"
The full file:
# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
3. The Spider: gs.py
The first step is analyzing the page structure; that process is not repeated here.
This file is the core piece of code to write.
First, change the start URL to the first page of the listing we want to crawl, rather than the gushiwen.cn homepage.
Requirements recap: we want each poem's title, author, dynasty, poem text, and translation, crawled page by page.
On the listing page, the title, author, dynasty, poem text, and the link to the detail page all sit inside the same <div> tag (class="sons"); the translation has to be fetched from the detail page.
To demonstrate two different approaches:
The title, author, dynasty, and poem text are extracted directly in the parse method, with a try...except inside the for loop so that indexing an empty result does not raise an error.
For the translation, we define a separate parse_detail method and pass it as the callback to scrapy.Request().
For pagination, the idea is: after all items on a page have been extracted (i.e. once the loop over the page finishes), grab the next-page link from the current page and check whether it is empty. If it is not empty, yield another scrapy.Request() with that link, using parse as the callback again. If it is empty, this is the last page and the crawl ends there.
The full code:
import scrapy
from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse the listing page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                author = source[0]
                dynasty = source[1]
                # Poem text
                content_list = div.xpath('.//div[@class="contson"]//text()').getall()
                content_plus = ''.join(content_list).strip()
                # URL of the poem's detail page
                detail_url = div.xpath('.//p/a/@href').get()
                item = ProseItem(title=title, author=author, dynasty=dynasty,
                                 content_plus=content_plus, detail_url=detail_url)
                # print(item)
                yield scrapy.Request(
                    url=detail_url,
                    callback=self.parse_detail,
                    meta={'prose_item': item}
                )
            except:
                pass
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse the detail page
    def parse_detail(self, response):
        item = response.meta.get('prose_item')
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        # print(item)
        yield item
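One robustness note: this code assumes, as in the original walkthrough, that detail_url and next_url come back as absolute URLs. If the site ever returns them as relative paths, they can be made absolute with response.urljoin() before building the request, e.g. yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse).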
4. Data Structure: items.py
Here we define the ProseItem class used by the spider above. (Note that the spider imports this module; if your IDE cannot resolve the import, mark the appropriate folder as the project's root/sources directory.)
import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem text
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
5. The Pipeline: pipelines.py
The pipeline is where we implement how the data is stored.
from itemadapter import ItemAdapter
import json


class ProsePipeline:
    def __init__(self):
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict, then to a JSON string
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
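As a side note: if all you need is a raw dump of the items, Scrapy's built-in feed exports can write them without a custom pipeline (for example, scrapy crawl gs -o gs.jl for JSON Lines output). The custom pipeline is used here to control the output explicitly: one JSON object per line in gs.txt.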
6. Running the Program: start.py
Define a small script that runs the crawl command.
from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())
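Run this script (for example with python start.py) from the project root, i.e. the directory containing scrapy.cfg. It is equivalent to typing scrapy crawl gs in a terminal there, but makes it easy to start and debug the crawl from an IDE.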
After the program runs, the data we need is saved in a text file named gs.txt.
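Each line in gs.txt holds one poem as a JSON object produced by the pipeline; with the values elided, a line has roughly this shape (key order may differ):

{"title": "...", "author": "...", "dynasty": "...", "content_plus": "...", "detail_url": "...", "translation": "..."}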
This concludes the detailed walkthrough of scraping gushiwen.cn with Python and Scrapy. For more material on scraping with Python Scrapy, please follow WalkonNet's other related articles!