Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions and answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Remember to check the Help Center!
Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.
Do I need to install Scrapy, Python and all the stuff for that?
project.schedule('ScrapeVerge')
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 196, in schedule
    result = self._post('schedule', 'json', params)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 178, in _post
    return self._request_proxy._post(method, format, params, headers, raw, files)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 105, in _post
    return self._request(url, params, headers, format, raw, files)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 125, in _request
    return self._decode_response(response, format, raw)
  File "/usr/local/lib/python2.7/dist-packages/scrapinghub.py", line 137, in _decode_response
    raise APIError(data['message'])
scrapinghub.APIError: You cannot have more than 1 active jobs in beta mode.
This was fixed a long time ago.
I was testing your scraping service and it looks great.
Unfortunately, for my use case, where I'm trying to scrape Arabic websites, the encoding returned is incorrect, so I'm unable to read the scraped content.
Here is the site URL I scraped:
And this is a screenshot of the resulting scrape:
Thanks for reporting. Even though the link does not work anymore, we believe this is fixed now. Otherwise, please comment.
I want to suggest removing the "Confirm navigation" dialog triggered on Compare With pages. It's a bit irritating.
This is the URL that is generated, but there are no CSV files.
In addition to the suggestion below about regular expressions, you also need to ensure that the filters do not ban the pages that lead to the products. Check this section of the documentation: http://doc.scrapinghub.com/autoscraping.html#considerations-when-using-url-filters
For example, ask for an email to be sent if some condition is met (or not met), possibly compared to previous jobs:
- job crawled 0 items
- job crawled 50% fewer items than the previous N job(s)
- job is taking 200% longer than usual
- some stats counter(s) are too high
Something à la Pingdom, maybe.
Email is one option, Sentry or Graphite events would be another.
This is supported through the powerful (although still heavily under-documented) Monitoring addon.
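To make the conditions in the request above concrete, here is a minimal sketch of how they could be evaluated against a job's stats. It does not use the Monitoring addon or any Scrapy Cloud API; `JobStats` and `check_job` are hypothetical names invented for illustration, and "200% longer" is read here as three times the usual runtime.

```python
from typing import List, NamedTuple


class JobStats(NamedTuple):
    """Hypothetical summary of one finished job (not a Scrapy Cloud type)."""
    items: int
    runtime_seconds: float


def check_job(current: JobStats, previous: List[JobStats]) -> List[str]:
    """Return a list of alert messages for the current job."""
    alerts = []

    # Condition 1: the job produced nothing at all.
    if current.items == 0:
        alerts.append("job crawled 0 items")

    # The remaining conditions need a history to compare against.
    if previous:
        avg_items = sum(job.items for job in previous) / len(previous)
        avg_runtime = sum(job.runtime_seconds for job in previous) / len(previous)

        # Condition 2: at least 50% fewer items than the average of the last N jobs.
        if current.items <= 0.5 * avg_items:
            alerts.append("job crawled 50% fewer items than previous jobs")

        # Condition 3: 200% longer than usual, i.e. more than 3x the average runtime.
        if current.runtime_seconds > 3.0 * avg_runtime:
            alerts.append("job is taking 200% longer than usual")

    return alerts
```

The returned messages could then be handed to whatever notifier is preferred (email, Sentry, Graphite events); that part is left out because it depends on the service used.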
I want to get all the external URLs on a site.
I have uploaded the package to your project. In order to use it, you would need to enable the middleware with:
LINKSEXTRACTOR_ENABLED = 1
LINKSEXTRACTOR_EXCLUDE_PATTERNS = <some patterns to exclude>
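For readers who want to see what "external URLs with exclude patterns" means in practice, here is a rough standard-library sketch. It is not the code of the uploaded package; `external_links` and its `exclude_patterns` argument are hypothetical names, and a link is treated as external whenever its host differs from the page's host (so subdomains count as external).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(page_url, html, exclude_patterns=()):
    """Return absolute http(s) links in `html` that point to another host."""
    site_host = urlparse(page_url).netloc.lower()
    collector = _LinkCollector()
    collector.feed(html)

    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() == site_host:
            continue  # same host, so not external
        if any(re.search(pattern, absolute) for pattern in exclude_patterns):
            continue  # matches one of the exclude patterns
        if absolute not in found:
            found.append(absolute)
    return found
```

For example, `external_links("http://site.example/", html, [r"ads\."])` would drop both internal links and any external link whose URL matches `ads\.`.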
Portia supports CSS selectors now.
It would be nice to have an additional field in the custom settings (project-wide, per-spider, etc) to add a comment about why that specific setting was added.
For example: "Auto throttle disabled because the spider does broad crawls and scrapes few pages within a single domain."
HubStorage: writing items to http://storage.scrapinghub.com/items/1891/1/1
It is just debug information and is not intended to be followed as a link. The log renders it as a link because it is a URL, but it is only the API URL that the job uses to store items; calling it requires other parameters (for instance, the user's API key).
- On Portia V1.0 this is already implemented, from the project menu:
- On Portia V2.0 it is planned to be implemented in two phases: first via the API in the next Portia release, and second via the user interface.