Scrapy's normal request flow cannot scrape content that is rendered by JavaScript. In that case you need some way to simulate a browser and render the page, and scrapy-splash handles this job well.
An example of a conventional crawl
# -*- coding: utf-8 -*-
import scrapy
from fake_useragent import UserAgent
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_splash import SplashRequest

from creeper.items import SimpleBlog

ua = UserAgent()


class ZgjmSpider(scrapy.spiders.CrawlSpider):
    name = 'example'
    allowed_domains = ['99166.com']
    start_urls = ['http://www.99166.com/']
    all_urls = set()

    rules = (
        # Rule(LinkExtractor(allow=(r'category\.php',))),
        # Extract links matching 'dream/' and parse them with the spider's parse_item method
        Rule(LinkExtractor(allow=(r'dream/',), allow_domains=('99166.com',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Follow every absolute link we have not seen before
        for href in response.css('a::attr(href)').getall():
            if href is not None:
                if href not in self.all_urls and href.startswith('http'):
                    self.all_urls.add(href)
                    yield response.follow(href, self.parse)
        # Parse the page content
        jiemeng_div = response.css('div.jiemeng')
        if jiemeng_div:
            if jiemeng_div.xpath('//div[@class="ltbox"]/h2'):
                title = jiemeng_div.xpath('//div[@class="ltbox"]/h2/text()').get()
                table1 = jiemeng_div.xpath('(//div[@class="listb"]//table)[1]//text()').getall()
                content = "".join(table1)
                item = self.createItem(response)
                item['content'] = content
                item['title'] = title
                yield item

    def createItem(self, response):
        # Fill in the fields shared by every item
        item = SimpleBlog()
        item['site'] = self.name
        item['link'] = response.request.url
        return item
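Assuming the spider lives in a Scrapy project named creeper (as the import above suggests), it is started in the usual way:
scrapy crawl example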
While crawling I noticed that a lot of pages were missing. I opened the pages in a browser and searched the source for the links I expected, but could not find them, so I concluded they are written into the page by JavaScript. This is where Splash comes in.
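A minimal sketch of that check: fetch the raw HTML, with no JavaScript executed, and look for a link fragment (the fragment below is only a placeholder):

import requests

# Fetch the page source exactly as the crawler sees it: no JavaScript runs.
raw_html = requests.get('http://www.99166.com/').text

# If a link that is visible in the browser does not appear here,
# it was most likely injected by JavaScript after the page loaded.
print('/dream/' in raw_html)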
What is Splash
Splash itself is a rendering service, used to render JavaScript.
Here is the official introduction:
Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5. The (twisted) QT reactor is used to make the service fully asynchronous allowing to take advantage of webkit concurrency via QT main loop. Some of Splash features:
- process multiple webpages in parallel;
- get HTML results and/or take screenshots;
- turn OFF images or use Adblock Plus rules to make rendering faster;
- execute custom JavaScript in page context;
- write Lua browsing scripts;
- develop Splash Lua scripts in Splash-Jupyter Notebooks.
- get detailed rendering info in HAR format.
Roughly speaking: Splash is a lightweight web browser with an HTTP API, implemented in Python 3 with Twisted and QT5. It has a lot of features, such as processing multiple pages in parallel, taking screenshots, selectively blocking resources, executing JavaScript in the page context to render it, and running custom Lua browsing scripts (which seem useful when you need things like cookie and session handling), and so on.
Installing Splash
I have to marvel at Docker again here: it really is one of the greatest inventions of the software industry in the past 20 years. With Docker this becomes very simple, a single command starts the service. (How to install Docker... sort that out yourself.)
docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash:latest
In fact only port 8050 needs to be exposed here. Then open http://127.0.0.1:8050 in a browser to verify; once you see the page shown below, Splash is ready to use.
(Screenshot: the Splash console, showing the service port is up and working normally.)
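To check from code that the rendering API itself responds, here is a minimal sketch that calls Splash's render.html endpoint directly (the target URL is just the site from the spider above):

import requests

# Ask Splash to load the page, wait 0.5s for its JavaScript to run,
# and return the rendered HTML.
resp = requests.get(
    'http://127.0.0.1:8050/render.html',
    params={'url': 'http://www.99166.com/', 'wait': 0.5},
)
print(resp.status_code, len(resp.text))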
Installing the scrapy-splash Python package
It is just a pip install, so I will skip the details.
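For reference, the command is:
pip install scrapy-splash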
Using Splash
1. Modify settings
Add the relevant middleware entries:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Add the duplicate filter:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Add the Splash server address:
SPLASH_URL = 'http://127.0.0.1:8050'
2. Modify the spider code
# -*- coding: utf-8 -*-
import scrapy
from fake_useragent import UserAgent
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_splash import SplashRequest

from creeper.items import SimpleBlog

ua = UserAgent()


class ZgjmSpider(scrapy.spiders.CrawlSpider):
    name = 'example'
    allowed_domains = ['99166.com']
    start_urls = ['http://www.99166.com/']
    all_urls = set()

    rules = (
        # Rule(LinkExtractor(allow=(r'category\.php',))),
        # Extract links matching 'dream/' and parse them with the spider's parse_item method
        Rule(LinkExtractor(allow=(r'dream/',), allow_domains=('99166.com',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for href in response.css('a::attr(href)').getall():
            if href is not None:
                if href not in self.all_urls and href.startswith('http'):
                    self.all_urls.add(href)
                    # yield response.follow(href, self.parse)
                    yield SplashRequest(href, self.parse_item,
                                        args={
                                            # optional; parameters passed to Splash HTTP API
                                            'wait': 0.5,
                                            # 'url' is prefilled from request url
                                            # 'http_method' is set to 'POST' for POST requests
                                            # 'body' is set to request body for POST requests
                                        },
                                        # endpoint='render.json',  # optional; default is render.html
                                        # splash_url='<url>',  # optional; overrides SPLASH_URL
                                        # slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
                                        )
        # Parse the page content
        jiemeng_div = response.css('div.jiemeng')
        if jiemeng_div:
            if jiemeng_div.xpath('//div[@class="ltbox"]/h2'):
                title = jiemeng_div.xpath('//div[@class="ltbox"]/h2/text()').get()
                table1 = jiemeng_div.xpath('(//div[@class="listb"]//table)[1]//text()').getall()
                content = "".join(table1)
                item = self.createItem(response)
                item['content'] = content
                item['title'] = title
                yield item

    def createItem(self, response):
        # Fill in the fields shared by every item
        item = SimpleBlog()
        item['site'] = self.name
        item['link'] = response.request.url
        return item
Put simply, all we do is let SplashRequest take part in fetching: the requests go through Splash, so the pages are rendered before they are parsed.
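If you need more control over the rendering than a simple wait, scrapy-splash can also call Splash's execute endpoint with a Lua browsing script passed as lua_source (one of the features listed above). A minimal sketch of how the request above could be built that way; the helper function name is just for illustration:

from scrapy_splash import SplashRequest

# A small Lua browsing script: load the page, wait for its JavaScript, return the HTML.
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(0.5)
    return splash:html()
end
"""

def make_splash_request(url, callback):
    # endpoint='execute' runs the Lua script instead of the default render.html
    return SplashRequest(url, callback,
                         endpoint='execute',
                         args={'lua_source': LUA_SCRIPT})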
For details, see the official Splash documentation and the scrapy-splash documentation.