Scraping weather data with Python Scrapy and exporting it to a CSV file

Posted on 2022-08-06 by WalkonNet

Scraping the weather for xxx

Target URL: https://tianqi.2345.com/today-60038.htm

Installation

pip install scrapy

I am using Scrapy 2.5.

Creating the Scrapy project

Enter the following command on the command line:

scrapy startproject name

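Here name is the project name, for example: scrapy startproject spider_weather
Then, from inside the project directory, run:

scrapy genspider spider_name domain

For example: scrapy genspider changshu tianqi.2345.com

The resulting folder layout: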

– spider_weather
   – spiders
       – __init__.py
       – changshu.py
   – __init__.py
   – items.py
   – middlewares.py
   – pipelines.py
   – settings.py
– scrapy.cfg

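File descriptions

Name	Purpose
scrapy.cfg	Project configuration; mainly provides basic configuration for the Scrapy command-line tool. (The crawler-related settings live in settings.py.)
items.py	Defines the data storage templates used to structure scraped data, similar to a Django Model
pipelines.py	Data processing behaviour, typically persisting the structured data
settings.py	Configuration file: crawl depth, concurrency, download delay, and so on
spiders	Spider directory: create spider files here and write the crawling rules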

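Writing the spider

1. Write the scraping logic in the spider file you created inside the spiders folder; in this example that is spiders/changshu.py.

The code is as follows: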

import scrapy

class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
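        # Date, weather state, temperature, wind level
        # The data is parsed with XPath; if you have not used XPath before, the syntax is simple and quick to pick up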
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()
        # Yield one record per day
        for date, state, temp, wind in zip(dates, states, temps, winds):
            yield {
                'date' : date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
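
The parse method above pairs four parallel lists with zip(). A caveat worth knowing is that zip() stops at the shortest list, so if one XPath query matches fewer nodes than the others, rows are dropped without any warning. The following is a minimal sketch of that behaviour using only the standard library; the sample values are hypothetical stand-ins for what .getall() might return, not real output from the site.

```python
from itertools import zip_longest

# Hypothetical values standing in for the lists returned by .getall().
dates = ["08/06", "08/07", "08/08"]
states = ["Sunny", "Cloudy", "Rain"]
temps = ["27~36", "27~35"]  # one value missing on purpose
winds = ["S 3", "SE 2", "E 2"]

# zip() stops at the shortest list, so the third day is silently dropped.
paired = list(zip(dates, states, temps, winds))

# zip_longest() keeps every day and pads the short list with None,
# which at least makes the gap visible in the output.
padded = list(zip_longest(dates, states, temps, winds, fillvalue=None))

# Build the same dict shape that the spider yields.
keys = ("date", "state", "temp", "wind")
rows = [dict(zip(keys, values)) for values in padded]

print(len(paired), len(padded))
print(rows[-1])
```

Note that zip_longest() only pads at the end of the short list, so it cannot tell which day actually lacked a value. If misalignment is a real concern, selecting each day's container node first and reading the four fields relative to it is the more robust design.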

2. Configure settings.py

Change the User-Agent:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

Turn off robots.txt checking:

ROBOTSTXT_OBEY = False

The complete file:

# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Then run the following on the command line:

scrapy crawl changshu -o weather.csv

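Note: run this from inside the spider_weather directory.
The general form is scrapy crawl <spider name> -o weather.csv, where the spider name is the name attribute set in the spider class and weather.csv is the export file.

4. The result is a weather.csv file containing the date, state, temp, and wind fields for each day.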
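
One way to check the exported file is to read it back with the standard csv module. The sketch below is only an illustration: it first writes a small sample file with hypothetical rows (so that it runs without a crawl), then reads it the same way you would read the real weather.csv.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the real crawl output.
sample = (
    "date,state,temp,wind\n"
    "08/06,Sunny,27~36,South wind level 3\n"
    "08/07,Cloudy,27~35,Southeast wind level 2\n"
)

# Write the sample to a temporary weather.csv so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "weather.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(sample)

# Read the file back; DictReader uses the header line as the field names.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print(row["date"], row["state"], row["temp"], row["wind"])
```

To inspect a real export, point path at the weather.csv produced by the crawl and skip the step that writes the sample.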

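Supplement: field issues when Scrapy exports CSV

When exporting with scrapy -o in CSV format, I found that the order of the fields in the output file matched neither the order in items.py nor the order in which the spider wrote them, which makes the exported data awkward to read. In addition, the items in the exported CSV file were separated by blank lines. This section describes how to solve both problems.

1. Field order:

1) In the project's spiders directory, create a new file csv_item_exporter.py with the following content (the file name can be changed, but the location is fixed, because settings.py refers to it by module path)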

from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Load the project settings (settings.py)
        settings = get_project_settings()
        # Delimiter for the CSV writer, a comma by default
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        # Field order for the export, if one is configured
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)

2) Add the following to settings.py

# Register the custom exporter for the csv format
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}
# Order of the fields in the CSV output
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info'
]
# Field delimiter
CSV_DELIMITER = ','

With this in place, running scrapy crawl spider -o spider.csv writes the fields in the specified order.
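
If all you need is a fixed column order, Scrapy also has a built-in FEED_EXPORT_FIELDS setting that may make the custom exporter unnecessary. The fragment below is a sketch for settings.py; the field names are assumed to be the four keys yielded by the changshu spider in this article, so adjust them to match your own items.

```python
# settings.py (fragment)
# Built-in Scrapy setting: the fields to export and the order of the columns.
# The names below assume the keys yielded by the changshu spider.
FEED_EXPORT_FIELDS = ["date", "state", "temp", "wind"]
```

Fields that are not listed are left out of the export, so the list needs to name every column you want to keep.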

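2. Blank lines in the CSV output

At this point you may find blank lines between the rows of the CSV file. This typically shows up on Windows: the csv writer already ends each row with its own line terminator, and if the output stream translates newlines again, every row ends up followed by an extra blank line.

Solution:

Find the CsvItemExporter class in Scrapy's exporters.py and, at around line 215 where the output stream is created, add newline="". Recent Scrapy releases appear to include this fix already, so check your installed version before editing the library.

Alternatively, subclass CsvItemExporter and override the relevant code there.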
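
The effect of newline="" can be reproduced with the standard library alone. The sketch below is not Scrapy code; it writes two rows through an io.TextIOWrapper, once with newline="\r\n" (used here as an assumption to simulate the newline translation that happens on Windows) and once with newline="", and compares the raw bytes. The helper name write_rows is made up for this illustration.

```python
import csv
import io


def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(
        raw, encoding="utf-8", newline=newline, write_through=True
    )
    writer = csv.writer(stream)  # csv.writer ends each row with "\r\n"
    writer.writerow(["date", "temp"])
    writer.writerow(["08/06", "30"])
    stream.flush()
    return raw.getvalue()


# newline="\r\n" translates every "\n" again, turning "\r\n" into "\r\r\n",
# which spreadsheet programs show as a blank line after each row.
translated = write_rows("\r\n")

# newline="" disables translation, so the rows keep a single "\r\n".
untranslated = write_rows("")

print(translated)
print(untranslated)
```

This is the same reason the csv module documentation recommends opening files with newline="" when they are passed to csv.writer.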

Summary

That concludes this walkthrough of scraping weather data with Python Scrapy and exporting it to a CSV file.