Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Remember to check the Help Center!
Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.
Can we send signals in our spider?
For example, there is a case where I want to collect data from different pages but return only one item, when spider_idle occurs.
In some of my spiders I often need to test a particular item URL with a given callback (e.g. parse_item). I write more or less the same code in every spider to achieve this, and I think it would be legitimate to add it to bare Scrapy.
The following code lets me scrape just a single item out of the whole lot found on a page.
class BaseSpider(scrapy.Spider):

    item_requests = []

    def __init__(self, item_url=None, item_callback=None, **kwargs):
        super(BaseSpider, self).__init__(
            item_url=item_url, item_callback=item_callback, **kwargs)
        if item_url and item_callback:
            self.item_requests.append(
                Request(url=item_url, callback=getattr(self, item_callback)))

    def start_requests(self):
        if self.item_requests:
            for req in self.item_requests:
                yield req
        else:
            reqs = super(BaseSpider, self).start_requests()
            for req in reqs:
                yield req

Here's how I use it:

scrapy shell [spider-name] -a item_url="http://item-details-page-url..." -a item_callback="parse_item"
Is this good to add to Scrapy, or should it be kept project-specific?
This is already there.
and more precisely example 2.
I've disabled the AutoThrottle extension and increased the concurrency in the Panel, but I'm not sure whether those changes are applied to the currently running jobs. Are they?
Settings are applied when a job starts, so changing settings has no effect on running jobs. It does affect pending jobs that haven't started yet, though.
In addition to the annotated items on the page, I would also like to scrape specific components from the HTML source.
Question is in the subject line. I feel like several challenges I'm having will be resolved by understanding the use cases for that particular bit.
I created a spider and ran it, but when I went to download the CSV file it gave me a strange file titled "1" that is not recognised as a CSV file. Can someone help?
Annotating mode is only for template development. Once you have developed the templates, you have to remove the annotating tag and make a new run.
Hi, just started exploring the Autoscraper, and I'm very impressed by what it has to offer. Natalia's screencast has been very helpful for seeing how it works.
As I use it I'm keeping a list of UI suggestions. Do you want me to post these right now?
If you're still hashing out the UI, I don't want to bombard you with suggestions about things that will be changed anyway.
If you want me to post the suggestions just let me know.
We are aware of many things to improve in the UI, and we are already implementing many changes that will be released in a few weeks, with even more improvements to the annotation tool coming later. Feel free to make any suggestion, but bear in mind that many of them may already have been filed or planned, and others may turn out to be no longer applicable.
Are services down? It won't start any scheduled jobs; they simply get stuck in pending.
This was caused by over-demand last night, which affected customers with no dedicated servers. Load is back to normal now.
I've scraped some items from a website, but when I download the data they don't show up; there are only URL, body and cookies in the JSON. Shouldn't there be an ITEM column?
Check the documentation:
in particular the first section (basic concepts and procedures).
In order to switch from annotating mode to normal mode, you have to remove the "annotating" tag from the spiders and run again. But important: if you did not get good results in annotating mode, you will not get good results in normal mode, so ensure you have thoroughly tested in annotating mode.
As a quick intro, AS basically runs in two different modes: annotating mode and normal mode. Annotating mode is only for capturing pages, adding templates and testing them. Normal mode is what you need to actually get the items, once you have tested everything properly in annotating mode.
It takes about 1 second to scrape one item group. Why is it so slow on this platform? My spider has been running for about 20 hours and is still very slow. I have used Scrapy locally and it was pretty fast.
It is because of the AutoThrottle addon. Check this documentation for Scrapinghub users, which explains why we limit spider speed with AutoThrottle and how to change the behaviour:
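For reference, in a plain Scrapy project the equivalent knobs are ordinary project settings. A sketch with illustrative values only (not a speed recommendation; on Scrapy Cloud these are changed through the Panel rather than settings.py):

```python
# settings.py -- illustrative values, using standard Scrapy setting names
AUTOTHROTTLE_ENABLED = False       # disable the AutoThrottle addon entirely
CONCURRENT_REQUESTS = 16           # overall request concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # concurrency per target domain
DOWNLOAD_DELAY = 0                 # no fixed delay between requests
```

Note that raising concurrency without AutoThrottle shifts the responsibility for being polite to the target site onto you.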
I got the message closespider_pagecount. What does that mean?
It means the spider has reached the maximum number of pages allowed to crawl, and was terminated because of that.
Autoscraping runs in annotating mode always have that limit in place.
The actual name "closespider_pagecount" comes from the Scrapy extension that powers the shutdown: https://scrapy.readthedocs.org/en/latest/topics/extensions.html#closespider-pagecount
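For completeness, in a plain Scrapy project that extension is driven by a single setting (the value below is illustrative):

```python
# settings.py -- close the spider after it has crawled 1000 pages;
# the default of 0 means no limit.
CLOSESPIDER_PAGECOUNT = 1000
```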
I followed the instructions to annotate in the template, but the spider is not extracting any fields. Only the body and url items are extracted.
If you cannot see extracted data in an annotating-mode run, it usually means that the templates are not extracting (or have not annotated) all the required fields. Check how you defined the fields of the item, in particular their Required flag, and also check that the template annotates all the required ones; alternatively, remove the Required flag from those fields that you don't actually expect to annotate or extract with every template.
For more detailed info please check the autoscraping documentation, in particular the section that explains how templates are used in the extraction process.