Documentation
Introduction
αScraper is a scraping service API. You can choose from two options based on your needs:
- The Cloud Crawler: a lightweight, superfast, scalable basic crawler. Supports sessions, custom headers, custom cookies, and proxy rotation
- The Chrome Cluster: for parsing complex or dynamic JavaScript sites, where a real browser may be needed for rendering
Some basics about our service:
- 2 MB limit per request
- If you exceed your allotted request limit, we’ll return a 429 status code
- Max 60-second timeout per request; please take this into account in your code
- You will not be charged for unsuccessful requests (we handle all 5xx status codes). You are only charged for 200 and 404 responses
- Images and other binary content can be scraped (2 MB per request)
Using our services
Basically we have one endpoint, http://api.ascraper.com/crawl, which accepts two request types:
- GET request
Returns the result in JSON format. To get plain HTML instead, set the GET parameter format=html. Supports just three GET parameters: userId=API_KEY, url, and selector. Simple and fast.
- POST request
Supports custom headers, custom cookies, sessions, and JS rendering. Accepts and returns JSON.
Example
curl "http://api.ascraper.com/crawl?userId=API_KEY&url=https://amazon.com"
Result
{ "status":{ "code":"OK" }, "cookies":[ ], "headers":{ "Server":"gunicorn/19.9.0", "Access-Control-Allow-Origin":"*", "Access-Control-Allow-Credentials":"true", "Connection":"keep-alive", "Content-Length":"33", "Date":"Sat, 31 Oct 2020 16:19:39 GMT", "Content-Type":"application/json" }, "html_source":"{\n \"origin\": \"185.233.83.124\"\n}\n" }If you want to return only one HTML selector, you can add a jquery-style selector &selector='SELECTOR' to the query string. Don't forget to apply the urlencode css selector.
curl "http://api.ascraper.com/crawl?userId=API_KEY&selector=title&url=https://google.com"
Result
{ "source":"[\"<title>Google</title>\"]", "status":{ "code":"OK" }, "cookies":[ { ... } ], "headers":{ ... } }
Example
If you need plain HTML, use the format=html parameter
curl "http://api.ascraper.com/crawl?userId=API_KEY&url=https://google.com&format=html"
Result
<!doctype html> <html itemscope="" itemtype="http://schema.org/WebPage" lang="ru"> <head> <meta charset="UTF-8"> <meta content="origin" name="referrer"> <link href="/searchdomaincheck?format=opensearch" title="Поиск в Google" rel="search" type="application/opensearchdescription+xml"> <link href="/manifest?pwa=webhp" crossorigin="use-credentials" rel="manifest"> ...
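The same request from Python, again a sketch assuming the requests library: with format=html the response body is the page itself rather than a JSON envelope.
import requests

resp = requests.get(
    "http://api.ascraper.com/crawl",
    params={"userId": "API_KEY", "url": "https://google.com", "format": "html"},
    timeout=70,  # a bit above the service's 60-second cap
)
html = resp.text  # raw HTML, ready for your own parser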
Responses
We respond with the following status codes:
- 200 Successful request
- 400 Bad Request — Check your request parameters (url, userId, or other parameters). Contact support
- 401 Unauthorized — The provided userId doesn’t exist
- 406 Not Acceptable — Chrome requests are not allowed on your plan
- 408 Request Timeout — The 60-second timeout was reached. Contact support
- 429 Too Many Requests — Request limit exceeded
- 451 Unavailable For Legal Reasons — We were unable to reach the URL. Contact support
- 503 Service Unavailable — Something went wrong. Check your request and contact support
- 509 Bandwidth Limit Exceeded — Concurrent session limit exceeded, or you are requesting the same URL too fast
- 510 Not Extended — Chrome concurrent session limit exceeded
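A sketch of client-side handling for these codes, assuming the Python requests library: fail fast on configuration errors (400, 401, 406), and back off and retry on 408, 429, and 5xx. The retry count and delays are illustrative, not service recommendations.
import time
import requests

def crawl(url, api_key, retries=3):
    for attempt in range(retries):
        resp = requests.get(
            "http://api.ascraper.com/crawl",
            params={"userId": api_key, "url": url},
            timeout=70,  # the service itself gives up at 60 seconds
        )
        if resp.status_code in (200, 404):
            return resp                # both count (and are billed) as successful
        if resp.status_code in (400, 401, 406):
            resp.raise_for_status()    # misconfiguration: retrying won't help
        time.sleep(2 ** attempt)       # 408/429/5xx: back off and retry
    raise RuntimeError(f"gave up after {retries} attempts (last status {resp.status_code})")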
Custom Headers
To make a request with custom headers or custom cookies, send a POST request to this endpoint:
http://api.ascraper.com/crawl
Example
curl -X POST -H "Content-Type: application/json" --data '{"url": "http://httpbin.org/headers","userId": "API_KEY","headers": [{"name" : "name", "value" : "value"}]}' 'http://api.ascraper.com/crawl'
Result
{ "status":{ "code":"OK" }, "sessionId":"21798b55-153d-4ae8-b785-271c40f761ca", "cookies":[ ], "headers":{ "Server":"gunicorn/19.9.0", "Access-Control-Allow-Origin":"*", "Access-Control-Allow-Credentials":"true", "Connection":"keep-alive", "Content-Length":"485", "Date":"Sat, 31 Oct 2020 19:45:20 GMT", "Content-Type":"application/json" }, "html_source":"{\n \"headers\": {\n \"Accept\": \"*/*\", \n \"Host\": \"httpbin.org\", \n \"Name\": \"value\", \n \"User-Agent\": \"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 OPR/43.0.2442.1144\", \n \"X-Amzn-Trace-Id\": \"Root=1-5f9dbed0-597dd69e27590a322779a228\", \n \"X-B3-Parentspanid\": \"c15cd9bc743da160\", \n \"X-B3-Sampled\": \"1\", \n \"X-B3-Spanid\": \"a7c628a49b2b6db0\", \n \"X-B3-Traceid\": \"5f9dbecf7a33412fae164ec446116fd9\"\n }\n}\n" }
Custom Cookies
To make a request with custom headers or custom cookies, send a POST request to this endpoint:
http://api.ascraper.com/crawl
Example
curl -X POST -H "Content-Type: application/json" --data '{"url": "http://httpbin.org/cookies","userId": "API_KEY","cookies": [{"name" : "name", "value" : "value"}]}' 'http://api.ascraper.com/crawl'
Result
{ "status":{ "code":"OK" }, "sessionId":"31d5efdf-069e-4bcc-98de-0e99eea024af", "cookies":[ { "domain":".httpbin.org", "hostOnly":false, "httpOnly":false, "name":"name", "path":"/", "sameSite":"None", "secure":false, "session":false, "storeId":false, "value":"value", "id":null } ], "headers":{ "Server":"gunicorn/19.9.0", "Access-Control-Allow-Origin":"*", "Access-Control-Allow-Credentials":"true", "Connection":"keep-alive", "Content-Length":"43", "Date":"Sat, 31 Oct 2020 19:51:44 GMT", "Content-Type":"application/json" }, "html_source":"{\n \"cookies\": {\n \"name\": \"value\"\n }\n}\n" }
Sessions
By default we rotate the IP with every request. But if you need to reuse an IP or cookies, simply add the session_id parameter (e.g. session_id=123). The value can be any integer; send a new integer to create a new session, which lets you keep using the same proxy for every request with that session id. For POST requests, pass back the sessionId value returned in a previous response, as the example below shows. Sessions expire 15 minutes after the last use.
Example
curl -X POST -H "Content-Type: application/json" --data '{"url": "http://httpbin.org/cookies","userId": "API_KEY","cookies": [{"name" : "name", "value" : "value"}]}' 'http://api.ascraper.com/crawl'
curl -X POST -H "Content-Type: application/json" --data '{"url": "http://httpbin.org/cookies","userId": "API_KEY","sessionId": "31d5efdf-069e-4bcc-98de-0e99eea024af"}' 'http://api.ascraper.com/crawl'
Result
{ "status":{ "code":"OK" }, "sessionId":"31d5efdf-069e-4bcc-98de-0e99eea024af" }
Chrome Cluster
If you need a real browser to fetch page contents or render JavaScript, use the render=true parameter. By default we disable loading of all CSS files and images
Example
curl -X POST -H "Content-Type: application/json" --data '{"url": "http://httpbin.org/ip","userId": "API_KEY","render" : true}' 'http://api.ascraper.com/crawl'
curl "http://api.ascraper.com/crawl?userId=API_KEY&selector=title&url=https://google.com&render=true"
Result
{ "sessionId":"3caf3d06-cce2-4ba2-a7e0-03db4b65f8d4", "cookies":[ ], "headers":{ }, "html_source":"<pre style=\"word-wrap: break-word; white-space: pre-wrap;\">{\n \"origin\": \"172.19.0.1, 185.233.80.89\"\n}\n</pre>" }
Proxy Mode
You can also send all your requests through a proxy front end. Proxy mode passes every request through our service, so you get all the same benefits, such as IP rotation and automatic retries. Just like in basic mode, we handle responses the same way:
- 200, 404 — successful requests
- 500 — unsuccessful requests
- 429 — request limit exceeded
You can use the proxy for binary content scraping; we handle it like normal traffic.
You can also pass all service parameters through the proxy:
- render
- session_id
- headers
- cookies
Set the parameters like this:
ascraper;render=true;session_id=session@API_KEY:proxy.ascraper.com:8080
Any headers you set for your proxy requests will be automatically sent to the site you are scraping.
To pass your requests through the API properly, your code must be configured not to verify SSL certificates.
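A minimal sketch of proxy mode in Python, assuming the requests library; the credential string is copied from the example above, and SSL verification is disabled as the service requires. Any HTTP client with proxy support can be configured the same way.
import requests

# Service parameters ride in the proxy credential string, separated by
# semicolons, following the example above; API_KEY is a placeholder.
PROXY = "http://ascraper;render=true;session_id=session@API_KEY:proxy.ascraper.com:8080"

resp = requests.get(
    "https://amazon.com",
    proxies={"http": PROXY, "https": PROXY},
    verify=False,   # the service requires SSL verification to be off
    timeout=70,
)
print(resp.status_code)  # 200/404 successful, 500 unsuccessful, 429 over limit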