Splash Quick Start
The Splash service
Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python using Twisted and Qt5.
Installing Splash (Linux)
- install docker
- pull the image
$ sudo docker pull scrapinghub/splash
- start the container
$ sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
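Once the container is running, it can help to confirm that something is actually listening on the published port before sending requests. A minimal sketch in Python, assuming the default port mapping from the command above; the helper name `is_listening` is made up for illustration, and the check only proves that a TCP port is open, not that Splash itself is healthy.

```python
import socket


def is_listening(host="localhost", port=8050, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Splash port open:", is_listening())
```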
Splash HTTP API
render.html returns the HTML of the JavaScript-rendered page
curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
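The curl call above passes the target URL unescaped, which works only while that URL contains no `&` or other reserved characters of its own. A small Python sketch of the same request with the parameters percent-encoded; `SPLASH`, `splash_url` and `render_html` are illustrative names, and the base address assumes the docker command shown earlier.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050"  # assumption: default port mapping


def splash_url(endpoint, **params):
    """Build a Splash API URL, percent-encoding every parameter value."""
    return f"{SPLASH}/{endpoint}?{urlencode(params)}"


def render_html(page_url, timeout=10, wait=0.5):
    """Fetch the JavaScript-rendered HTML of page_url via render.html."""
    url = splash_url("render.html", url=page_url, timeout=timeout, wait=wait)
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    html = render_html("http://domain.com/page-with-javascript.html")
    print(html[:200])
```

Encoding matters as soon as the page URL carries its own query string; otherwise Splash would read the page's parameters as its own.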
render.png returns a screenshot of the JavaScript-rendered page (PNG)
# render with timeout
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&timeout=10'
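The response body of render.png is the raw image, so saving a screenshot amounts to writing those bytes to a file. A hedged Python sketch follows; the function names and the output file name are made up for illustration, and extra keyword arguments are simply forwarded as query parameters such as `timeout`.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050"  # assumption: default port mapping


def png_url(page_url, **options):
    """Build a render.png URL; options become query parameters."""
    params = {"url": page_url, **options}
    return f"{SPLASH}/render.png?{urlencode(params)}"


def save_png(page_url, path, **options):
    """Download the screenshot of page_url and write it to path."""
    with urlopen(png_url(page_url, **options)) as resp, open(path, "wb") as out:
        out.write(resp.read())


if __name__ == "__main__":
    save_png("http://domain.com/page-with-javascript.html", "page.png", timeout=10)
```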
# 320x240 thumbnail
curl 'http://localhost:8050/render.png?url=http://domain.com/page-with-javascript.html&width=320&height=240'
render.jpeg returns a screenshot of the JavaScript-rendered page (JPEG)
# render with default quality
curl 'http://localhost:8050/render.jpeg?url=http://domain.com/'
# render with low quality
curl 'http://localhost:8050/render.jpeg?url=http://domain.com/&quality=30'
render.har returns information about the page load in HAR format (requests, responses, headers)
render.json returns a JSON object with information about the rendered page (HTML, PNG, etc.)
# HTML, PNG screenshot and iframe information in one response
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&png=1&html=1&iframes=1'
# HTML and meta information of page itself and all its iframes
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&iframes=1'
# only meta information (like page/iframes titles and urls)
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&iframes=1'
# render html and 320x240 thumbnail at once; do not return info about iframes
curl 'http://localhost:8050/render.json?url=http://domain.com/page-with-iframes.html&html=1&png=1&width=320&height=240'
# Render page and execute simple Javascript function, display the js output
curl -X POST -H 'content-type: application/javascript' \
-d 'function getAd(x){ return x; } getAd("abc");' \
'http://localhost:8050/render.json?url=http://domain.com&script=1'
# Render page and execute simple Javascript function, display the js output and the console output
curl -X POST -H 'content-type: application/javascript' \
-d 'function getAd(x){ return x; }; console.log("some log"); console.log("another log"); getAd("abc");' \
'http://localhost:8050/render.json?url=http://domain.com&script=1&console=1'
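The POST examples above can also be issued from Python. The sketch below builds the same request with the standard library and decodes the `png` field, which render.json returns base64-encoded when `png=1` is set. The helper names are illustrative, and the base address assumes the docker command shown earlier.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SPLASH = "http://localhost:8050"  # assumption: default port mapping


def build_script_request(page_url, js_source, **options):
    """Build a POST to render.json that runs js_source in the page."""
    params = {"url": page_url, "script": 1, **options}
    return Request(
        f"{SPLASH}/render.json?{urlencode(params)}",
        data=js_source.encode("utf-8"),
        headers={"Content-Type": "application/javascript"},
        method="POST",
    )


def decode_png(result):
    """Return PNG bytes from a render.json result, or None if absent."""
    encoded = result.get("png")
    return base64.b64decode(encoded) if encoded else None


def render_json(request):
    """Send the request and parse the JSON body."""
    with urlopen(request) as resp:
        return json.load(resp)


if __name__ == "__main__":
    js = 'function getAd(x){ return x; }; console.log("some log"); getAd("abc");'
    result = render_json(build_script_request("http://domain.com", js, console=1))
    print(result.get("script"), result.get("console"))
```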
The scrapy-splash plugin
1. Install scrapy-splash
$ pip install scrapy-splash
2. Run Splash
$ docker run -p 8050:8050 scrapinghub/splash
3. Configure Scrapy (settings.py)
# add the Splash service
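Only the opening comment of the settings snippet survives here. As a rough guide, the sketch below follows the configuration suggested in the scrapy-splash README, assuming Splash listens on localhost:8050. Class names and middleware priorities can differ between scrapy-splash versions, so treat it as a starting point and check the README for the installed release.

```python
# settings.py (sketch; values assumed from the scrapy-splash README)

# Address of the running Splash instance (assumption: local docker container)
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies and compression working
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments with every queued request
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```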
4. Usage
yield SplashRequest(url, self.parse_result,
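The usage line above is cut off after its second argument. For context, here is a minimal spider sketch showing where such a call typically sits. The spider name, start URL, `wait` value and the yielded fields are all assumptions for illustration, not the original author's code. It requires scrapy and scrapy-splash to be installed and the settings from step 3 to be in place.

```python
import scrapy
from scrapy_splash import SplashRequest


class JsPageSpider(scrapy.Spider):
    # Hypothetical spider name and start URL
    name = "js_page"
    start_urls = ["http://domain.com/page-with-javascript.html"]

    def start_requests(self):
        for url in self.start_urls:
            # args are forwarded to the Splash endpoint (render.html by default);
            # wait=0.5 gives scripts half a second to run before rendering
            yield SplashRequest(url, self.parse_result, args={"wait": 0.5})

    def parse_result(self, response):
        # response holds the rendered HTML, so normal selectors apply
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```

With the spider saved inside a Scrapy project, `scrapy crawl js_page` would run it against the Splash instance configured in `SPLASH_URL`.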