splash快速开始

splash服务

splash是一个javascript渲染服务。轻量浏览器,http API (python, twisted, qt5)

安装splash(linux)
  1. install docker
  2. pull the image
    $ sudo docker pull scrapinghub/splash
  3. start the container
    $ sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
splash http api
  1. render.html 返回js渲染页面(html)
    curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

  2. render.png 返回js渲染页面的截屏(png)

    1
    2
    3
    4
    # render with timeout
    curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&timeout=10'
    # 320x240 thumbnail
    curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&width=320&height=240'
  3. render.jpeg 返回js渲染页面的截屏(jpeg)

    1
    2
    3
    4
    # render with default quality
    curl 'http://localhost:8050/render.jpeg?url=http://domain.com/'
    # render with low quality
    curl 'http://localhost:8050/render.jpeg?url=http://domain.com/&quality=30'
  4. render.har 返回信息(request, respones, headers)

  5. render.josn 返回信息(html,png,etc)

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&png=1&html=1&iframes=1'
    # HTML and meta information of page itself and all its iframes
    curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&iframes=1'
    # only meta information (like page/iframes titles and urls)
    curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&iframes=1'
    # render html and 320x240 thumbnail at once; do not return info about iframes
    curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&png=1&width=320&height=240'
    # Render page and execute simple Javascript function, display the js output
    curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; } getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1'
    # Render page and execute simple Javascript function, display the js output and the console output
    curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; }; console.log("some log"); console.log("another log"); getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1&console=1'

scrapy-splash插件

1. 安装scrapy-splash

$ pip install scrapy-splash

2. 运行splash

docker run -p 8050:8050 scrapinghub/splash

3. 配置scrapy (setting.py)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# 增加splash服务
SPLASH_URL = 'http://192.168.59.103:8050'
# 开启splash middleware
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# 开启middleware
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# 设置一个自定义dupefilterclass
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
#
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
4. 用法
1
2
3
4
5
6
7
8
9
10
11
12
13
yield SplashRequest(url, self.parse_result,
args={
# optional; parameters passed to Splash HTTP API
'wait': 0.5,

# 'url' is prefilled from request url
# 'http_method' is set to 'POST' for POST requests
# 'body' is set to request body for POST requests
},
endpoint='render.json', # optional; default is render.html
splash_url='<url>', # optional; overrides SPLASH_URL
slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN, # optional
)