Scraping weather data with Python Scrapy and exporting it to a CSV file

Scraping the weather for xxx

Target URL: https://tianqi.2345.com/today-60038.htm

Installation

pip install scrapy

I am using Scrapy 2.5.

Creating the Scrapy project

Enter the following command on the command line:

scrapy startproject name

name is the project name.
For example: scrapy startproject spider_weather
Then change into the project directory and enter:

scrapy genspider spider_name domain

For example: scrapy genspider changshu tianqi.2345.com

Inspect the project folder

– spider_weather
    – spiders
        – __init__.py
        – changshu.py
    – __init__.py
    – items.py
    – middlewares.py
    – pipelines.py
    – settings.py 
– scrapy.cfg

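File descriptions

Name          Purpose
scrapy.cfg    Project configuration, mainly basic settings for the Scrapy command-line tool. (The crawler-specific settings live in settings.py.)
items.py      Defines the data storage templates used to structure scraped data, similar to a Django Model.
pipelines.py  Data processing behavior, typically persisting the structured data.
settings.py   Configuration file, e.g. recursion depth, concurrency, download delay.
spiders       Spider directory, where you create spider files and write the crawl rules.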
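
This tutorial never edits items.py, because the spider yields plain dicts. As a rough sketch of what a structured item could look like: Scrapy 2.2 and later accept dataclass objects as items, so a hypothetical WeatherItem (the class name is an assumption, not part of the original project) might be written like this:

```python
from dataclasses import dataclass, asdict


@dataclass
class WeatherItem:
    """Hypothetical item with the four fields scraped in this tutorial."""
    date: str
    state: str
    temp: str
    wind: str


# Made-up values, only to show the shape of one record.
example = WeatherItem(date="11/08", state="Sunny", temp="8~17", wind="NW 3")
record = asdict(example)
```

The field declaration order is also the key order of the resulting dict, which makes the expected columns explicit in one place.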

Start crawling

1. In the spiders folder, write the scraping logic in the spider file you created. In this example that is spiders/changshu.py.

The code is as follows:

import scrapy

class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
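        # Date, weather condition, temperature, wind level
        # Parse the data with XPath; if you have not used XPath before, it is worth a quick look, the syntax is simple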
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()
        # Yield one record per day
        for date, state, temp, wind in zip(dates,states,temps,winds):
            yield {
                'date' : date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
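
One caveat with the zip approach above: zip stops at the shortest list, so if a single day is missing one value (say, the wind), the remaining rows can end up misaligned. A common alternative is to select each day's node first and then query inside it. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree on an invented, well-formed snippet; the markup and the values are assumptions, not the real page.

```python
import xml.etree.ElementTree as ET

# Invented markup that mimics the structure targeted by the spider's XPath.
SAMPLE = """
<div>
  <a class="seven-day-item "><em>11/08</em><i>Sunny</i><span class="tem-show">8~17</span><span class="wind-name">NW 3</span></a>
  <a class="seven-day-item "><em>11/09</em><i>Cloudy</i><span class="tem-show">9~16</span></a>
</div>
"""


def parse_days(markup):
    """Return one dict per day; a missing field becomes None instead of shifting rows."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//a[@class='seven-day-item ']"):
        rows.append({
            "date": item.findtext("em"),
            "state": item.findtext("i"),
            "temp": item.findtext("span[@class='tem-show']"),
            "wind": item.findtext("span[@class='wind-name']"),
        })
    return rows


rows = parse_days(SAMPLE)
```

In the spider itself the same pattern would loop over response.xpath('//a[@class="seven-day-item "]') and call .xpath('./em/text()').get() and so on for each node.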

2. Configure settings.py

Change the User-Agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

Change the robots.txt setting

ROBOTSTXT_OBEY = False
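
ROBOTSTXT_OBEY controls whether Scrapy checks a site's robots.txt before requesting a page; setting it to False skips that check. To get a feel for what such a check involves, here is a small sketch using the standard library's urllib.robotparser on an invented rule set (the rules and the example.com URLs are assumptions, not the target site's actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Invented rules, for illustration only.
RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def is_allowed(url, user_agent="*"):
    """Return True if the invented rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(user_agent, url)


allowed = is_allowed("https://example.com/today-60038.htm")
blocked = is_allowed("https://example.com/private/report.htm")
```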

The complete file is as follows:

# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Then enter the following command on the command line:

scrapy crawl changshu -o weather.csv

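Note: run this from inside the spider_weather directory.
The general form is scrapy crawl spider_name -o weather.csv, where spider_name is the spider's name attribute (here changshu) and weather.csv is the export file.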
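
To check the export, the file can be read back with the standard csv module. The sketch below runs on an in-memory sample so that it is self-contained; the sample rows are made up, and only the four column names come from the spider above.

```python
import csv
import io

# Hypothetical contents of weather.csv; the values are invented for illustration.
SAMPLE_CSV = (
    "date,state,temp,wind\r\n"
    "11/08,Sunny,8~17,NW 3\r\n"
    "11/09,Cloudy,9~16,N 2\r\n"
)


def load_weather(stream):
    """Read CSV rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(stream))


rows = load_weather(io.StringIO(SAMPLE_CSV, newline=""))
for row in rows:
    print(row["date"], row["state"], row["temp"], row["wind"])
```

For the real file, replace the StringIO object with open("weather.csv", newline="", encoding="utf-8").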

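4. Result: weather.csv should contain a header row (date, state, temp, wind) followed by one row per forecast day.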

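Addendum: field issues when Scrapy exports CSV

When exporting CSV with scrapy -o, I found that the field order in the output file followed neither the order in items.py nor the order written in the spider file, which makes the exported data awkward to read. In addition, the exported CSV file had blank lines between items. This section describes how to fix both problems.

1. Field order

In the project's spiders directory, create a file named csv_item_exporter.py with the following content. (The file name can be changed, but its location must match the module path used in FEED_EXPORTERS.)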

from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings

# scrapy.conf and scrapy.contrib no longer exist in current Scrapy releases;
# the project settings are loaded with get_project_settings() instead.
settings = get_project_settings()

class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Read the delimiter and the column order from settings.py
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)

Then add the following to settings.py:

# Register the custom exporter for the csv format
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}
# Order of the fields in the CSV output
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info',
]
# Field delimiter
CSV_DELIMITER = ','

With this in place, running scrapy crawl spider -o spider.csv writes the fields in the configured order.
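
As an alternative that needs no custom exporter, Scrapy also has a built-in FEED_EXPORT_FIELDS setting that fixes which fields are exported and in what order. A minimal sketch for settings.py, assuming the four field names yielded by the changshu spider earlier in this article:

```python
# settings.py (sketch): built-in Scrapy setting that controls the exported
# columns and their order. The field names assume the changshu spider's output.
FEED_EXPORT_FIELDS = ["date", "state", "temp", "wind"]
```

If only the column order matters and the default comma delimiter is fine, this single setting may be enough on its own.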

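2. Blank lines in the CSV output

You may also notice an empty line after every row in the CSV file. This is typically seen on Windows: the csv module already ends each row with \r\n, and if the output stream then translates the newline a second time, each row ends in \r\r\n, which spreadsheet programs show as a blank line.

Solution:

Find the CsvItemExporter class in Scrapy's exporters.py (around line 215) and add newline="" to the io.TextIOWrapper call. Newer Scrapy releases appear to include this argument already, so check your installed copy before changing it.

Alternatively, subclass CsvItemExporter and override the relevant part instead of editing the installed library.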
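
The effect of newline="" can be reproduced with the standard library alone. The sketch below is not Scrapy's code; it writes the same two made-up rows through io.TextIOWrapper twice, once with newline="\r\n" to imitate the translation Windows applies by default, and once with newline="" to disable it.

```python
import csv
import io


def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline=newline, write_through=True)
    writer = csv.writer(stream)  # the csv module ends each row with \r\n by default
    writer.writerow(["date", "temp"])
    writer.writerow(["11/08", "8~17"])
    stream.flush()
    return raw.getvalue()


translated = write_rows("\r\n")  # imitates Windows newline translation
untouched = write_rows("")       # newline="" leaves the csv line endings alone

print(translated)
print(untouched)
```

With translation enabled every row ends in \r\r\n, which is what appears as a blank line; with newline="" each row ends in a single \r\n.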
