Crawling weather data with Python Scrapy and exporting it to a CSV file
Crawling the weather for xxx
Target URL: https://tianqi.2345.com/today-60038.htm
Installation
pip install scrapy
The version used here is Scrapy 2.5.
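You can confirm which version is installed by running:

scrapy version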
Creating the Scrapy project
Enter the following command on the command line:
scrapy startproject name
where name is the project name.
For example: scrapy startproject spider_weather
Then run
scrapy genspider spider_name domain
For example: scrapy genspider changshu tianqi.2345.com
The resulting folder structure looks like this:

– spider_weather
  – spider_weather
    – spiders
      – __init__.py
      – changshu.py
    – __init__.py
    – items.py
    – middlewares.py
    – pipelines.py
    – settings.py
  – scrapy.cfg
File descriptions

Name | Purpose |
---|---|
scrapy.cfg | Project configuration; it mainly supplies base configuration for the Scrapy command-line tool (the real crawler settings live in settings.py) |
items.py | Data model template for structured items, similar to Django's Model (see the sketch after this table) |
pipelines.py | Item-processing behaviour, e.g. persisting the structured data |
settings.py | Configuration file, e.g. crawl depth, concurrency, download delay |
spiders | Spider directory: the spider files and their crawl rules are written here |
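For illustration, an items.py for this project might look like the minimal sketch below. The WeatherItem name and its fields are assumptions for demonstration only; the spider in this article yields plain dicts and never actually uses items.py.

```python
import scrapy


class WeatherItem(scrapy.Item):
    # One forecast entry: date, weather condition, temperature, wind level
    date = scrapy.Field()
    state = scrapy.Field()
    temp = scrapy.Field()
    wind = scrapy.Field()
```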
Writing the spider

1. Write the scraping code in the spider file generated under the spiders folder, in this case spiders/changshu.py.

The code looks like this:
```python
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
        # Date, weather condition, temperature, wind level.
        # The data is extracted with XPath; if you haven't used XPath before,
        # it is worth a quick look, the syntax is simple.
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()
        # Yield one record per day
        for date, state, temp, wind in zip(dates, states, temps, winds):
            yield {
                'date': date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
```
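Before running the full spider, the XPath expressions can be checked interactively with Scrapy's shell. Note that the class value "seven-day-item " keeps a trailing space, presumably because the class attribute in the page's HTML ends with one and XPath's @class test is an exact string match. For example:

```
scrapy shell "https://tianqi.2345.com/today-60038.htm"
>>> response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
```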
2. Configure settings.py

Change the User-Agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
Turn off obeying robots.txt:
ROBOTSTXT_OBEY = False
The whole file then looks like this:
```python
# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#     'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
3. Then run the following on the command line:
scrapy crawl changshu -o weather.csv
Note: the command must be run from inside the spider_weather project directory.
The general form is scrapy crawl <spider name> -o weather.csv (the file to export to).
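The same -o option also supports other formats, e.g. scrapy crawl changshu -o weather.json. One optional extra (not part of the original write-up): since the scraped text is Chinese, the CSV may show garbled characters when opened directly in Excel; setting the export encoding in settings.py to UTF-8 with a BOM usually fixes that:

```python
# settings.py: write a UTF-8 BOM so spreadsheet software detects the encoding correctly
FEED_EXPORT_ENCODING = 'utf-8-sig'
```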
4. The resulting weather.csv contains one row per day, with the date, state, temp and wind columns.
Appendix: some field-related issues when exporting CSV from Scrapy
When exporting with scrapy -o in CSV format, the columns in the output file follow neither the order defined in items.py nor the order written in the spider, which makes the exported data harder to read. In addition, the exported CSV separates items with blank lines. The rest of this section describes how to fix both problems.
1. Field order
1) Create a new file csv_item_exporter.py under the spiders package with the content below (the file name can be changed, but the location must match the module path used in FEED_EXPORTERS later):
```python
# Scrapy 2.x: CsvItemExporter lives in scrapy.exporters; project settings are read
# via get_project_settings() (the old scrapy.conf / scrapy.contrib paths are gone).
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        settings = get_project_settings()
        # Read the delimiter and the desired field order from settings.py
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)
```
2) Add the following to settings.py (project_name and the field names below are the placeholders from the original snippet; for this tutorial they would be spider_weather and date, state, temp, wind):

```python
# Register the custom exporter for the csv output format
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}

# The order in which columns are written to the CSV
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info'
]

# The delimiter to use
CSV_DELIMITER = ','
```
Once this is configured, the columns come out in the specified order when you run scrapy crawl spider -o spider.csv.
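For reference, Scrapy also ships a built-in setting that fixes the export column order without a custom exporter (an alternative to the snippet above, not part of the original article); the custom exporter remains useful mainly for the delimiter:

```python
# settings.py: built-in way to control the CSV column order
FEED_EXPORT_FIELDS = ['date', 'state', 'temp', 'wind']
```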
2. Blank lines in the CSV output
You may also find blank lines between rows in the CSV file. This is the classic Windows CSV issue: the output stream is not opened with newline="", so every row ends up followed by an extra empty line.
Fix:
Find the CsvItemExporter class in Scrapy's exporters.py (around line 215) and add newline="" where the output stream is created.
Alternatively, subclass CsvItemExporter and override that behaviour yourself.
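If you would rather not edit the installed library at all, a simple workaround (a minimal sketch, assuming the export file is named weather.csv) is to strip the blank lines from the file after the crawl finishes:

```python
# strip_blank_lines.py: remove the empty rows left between items in the exported CSV
with open('weather.csv', 'r', encoding='utf-8') as src:
    rows = [line for line in src if line.strip()]

with open('weather.csv', 'w', encoding='utf-8', newline='') as dst:
    dst.writelines(rows)
```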
Summary
This concludes this article on crawling weather data with Python Scrapy and exporting it to a CSV file. For more on exporting crawled weather data to CSV with Scrapy, search WalkonNet's earlier articles or keep browsing the related articles below. We hope you will continue to support WalkonNet!
Recommended reading:
- Python crawling in practice: using Scrapy to crawl Douban images
- Python Scrapy in practice: crawling the 古詩文網 site
- Python Scrapy crawler tutorial in detail: storing NBA player data in a MySQL database
- Python: a detailed walkthrough of crawling Baidu COVID-19 data with the Scrapy framework
- Python in practice: crawling Weibo trending searches with the Scrapy framework