Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Umair Ashraf 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 4

Can we send signals in our spider?


For example, there is a case in which I want to collect data from different pages but return only one item, when spider_idle occurs.
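
For reference, a minimal sketch of this pattern using Scrapy's signals API (all URLs and names here are hypothetical, and the engine.crawl signature has changed across Scrapy versions):

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider

class CollectSpider(Spider):
    name = 'collect'
    start_urls = ['http://example.com/page1', 'http://example.com/page2']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CollectSpider, cls).from_crawler(crawler, *args, **kwargs)
        # spider_idle fires when there are no more pending requests.
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(CollectSpider, self).__init__(*args, **kwargs)
        self.collected = {}
        self.finished = False

    def parse(self, response):
        # Accumulate partial data instead of yielding an item per page.
        self.collected[response.url] = response.xpath('//title/text()').extract_first()

    def on_idle(self, spider):
        # Schedule one last request whose callback emits the single item.
        if not self.finished:
            self.finished = True
            req = Request('http://example.com/done', callback=self.emit_item,
                          dont_filter=True)
            self.crawler.engine.crawl(req, spider)  # older Scrapy signature
            raise DontCloseSpider  # keep the spider open for that request

    def emit_item(self, response):
        yield dict(self.collected)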

0
Declined
Umair Ashraf 4 years ago in Scrapy Cloud • updated 4 years ago 2

There is a case where I often need to test a particular item URL with a callback (e.g., parse_item) in some of my spiders. I write more or less the same code in every spider to cover my testing needs, and I think it is legitimate to add it to bare Scrapy.


The following code helps me scrape just a single item out of all the items found on a page.


from scrapy import Request, Spider


class BaseSpider(Spider):

    def __init__(self, item_url=None, item_callback=None, **kwargs):
        super(BaseSpider, self).__init__(**kwargs)
        # Build a one-off request when both arguments are supplied.
        self.item_requests = []
        if item_url and item_callback:
            self.item_requests.append(
                Request(url=item_url, callback=getattr(self, item_callback)))

    def start_requests(self):
        # Crawl only the single item URL if one was passed; otherwise
        # fall back to the spider's normal start requests.
        if self.item_requests:
            for req in self.item_requests:
                yield req
        else:
            for req in super(BaseSpider, self).start_requests():
                yield req

Here's how I use it.


scrapy crawl [spider-name] -a item_url="http://item-details-page-url..." -a item_callback="parse_item"


Is this good to add to Scrapy, or should it be kept project-specific?

Answer
Umair Ashraf 4 years ago

This is already there.

0
Answered
Martin Olveyra (Engineer) 4 years ago in Portia • updated 3 years ago 4
Answer
In short, you have to add extra required annotations. Check the Autoscraping documentation, in particular the section

http://help.scrapinghub.com/autoscraping.html#extra-required-annotations

and more precisely example 2.


0
Answered
Andrés Moreira (Technical Sales) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 1

I've disabled the AutoThrottle extension and increased the concurrency in the Panel, but I'm not sure whether those changes are applied to currently running jobs. Are they?


Thanks!

Answer

Settings are applied when a job starts, so changing settings has no effect on running jobs. It does affect pending jobs that haven't started yet, though.

0
Answered
drsumm 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 2

In addition to the annotated items on the page, I would also like to scrape specific components from the HTML source.

For example, some of the latitude/longitude data is hardcoded as JavaScript assignments inside a script tag, and is straightforward to extract from the HTML source. How can I do this?
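
For reference, outside Portia this kind of extraction is usually done with a regular expression over the raw page body in a code-based Scrapy spider. A minimal sketch, assuming assignments like var latitude = 12.34 (the variable names and URL are hypothetical):

import re

from scrapy import Spider

class CoordsSpider(Spider):
    name = 'coords'
    start_urls = ['http://example.com/listing']

    def parse(self, response):
        body = response.text  # full HTML source, script tags included
        # Assumed JavaScript: var latitude = 12.34; var longitude = 56.78;
        lat = re.search(r'latitude\s*=\s*(-?\d+(?:\.\d+)?)', body)
        lng = re.search(r'longitude\s*=\s*(-?\d+(?:\.\d+)?)', body)
        if lat and lng:
            yield {'latitude': float(lat.group(1)),
                   'longitude': float(lng.group(1))}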

0
Answered
Brad Attaway 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 9

The question is in the subject line. I feel like several challenges I'm having will be resolved by understanding the use cases for that particular bit.

0
Answered
hsantos 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 3

I created a spider, ran it, and when I went to download the CSV file it gave me a strange file titled "1" that was not recognised as a CSV file. Can someone help?

Answer
It opens OK for me. In any case, the program you are using probably has trouble with cells containing large data and HTML source. One of the problems is that you are downloading items from a job that ran in annotating mode, so you are not downloading the scraped data but the captured pages instead.

Annotating mode is only for template development. Once you have developed the templates, you have to remove the annotating tag and make a new run.
0
Completed
Alexander Dorsk 4 years ago in Scrapy Cloud • updated by Oleg Tarasenko (Support Engineer) 3 years ago 0

Hi, just started exploring the Autoscraper, and I'm very impressed by what it has to offer. Natalia's screencast has been very helpful for seeing how it works.


As I use it I'm keeping a list of UI suggestions. Do you want me to post these right now? 


If you're still hashing out the UI, I don't want to bombard you with things that will be changed anyway.


If you want me to post the suggestions just let me know.

Answer
Hi, Alexander. Thanks.

We are aware of many things to improve in the UI, and we are already implementing changes that will be released in a few weeks, with even more improvements to the annotation tool coming later. Feel free to make any suggestions, but keep in mind that many of them may already have been filed or planned, and others may no longer be applicable.

0
Fixed
mustaffa 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 11

Are services down? It won't start any scheduled job; they simply get stuck on pending.

Answer

This was caused by over-demand last night, which affected customers with no dedicated servers. Load is back to normal now.

0
Answered
Rodolpho Ramirez 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 9

I've scraped some items from a website, but when I download the data they don't show up: the JSON contains only URL, body, and cookies. Shouldn't there be an ITEM column?

Answer

Check the documentation:


http://help.scrapinghub.com/autoscraping.html

first section (basic concepts and procedures)


As a quick intro, AS basically runs in two different modes: annotating mode and normal mode. Annotating mode is only for capturing pages, adding templates, and testing them. Normal mode is what you need to actually get the items, once you have tested everything properly in annotating mode.

In order to switch from annotating mode to normal mode, you have to remove the "annotating" tag from the spider's properties and run it again. But important: if you did not get good results in annotating mode, you will not get good results in normal mode, so make sure you have tested thoroughly in annotating mode.

0
Answered
drsumm 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 3

It takes about 1 second to scrape one item group. Why is it so slow on this platform? My spider has been running for about 20 hours and is still very slow. I have used Scrapy on its own and it was pretty fast.

Answer

It is because of the AutoThrottle addon. Check this documentation for Scrapinghub users, which explains why we limit spider speed with AutoThrottle and how to change the behaviour:


http://help.scrapinghub.com/addons.html#autothrottle
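
For reference, a minimal sketch of the plain-Scrapy settings involved (the values are illustrative; on Scrapy Cloud you would normally change them through the project settings in the panel rather than settings.py):

# settings.py (illustrative values)
AUTOTHROTTLE_ENABLED = False          # turn the AutoThrottle extension off
CONCURRENT_REQUESTS = 32              # raise the global concurrency limit
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # raise the per-domain limit
DOWNLOAD_DELAY = 0                    # no fixed delay between requests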

0
Answered
drsumm 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 1

I got the message closespider_pagecount. What does that mean?

Answer

It means the spider has reached the maximum number of pages allowed to crawl, and was terminated because of that.


Autoscraping runs in annotating mode always have that limit in place.


The actual name "closespider_pagecount" comes from the Scrapy extension that powers the shutdown: https://scrapy.readthedocs.org/en/latest/topics/extensions.html#closespider-pagecount
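
For reference, in a plain Scrapy project that limit corresponds to the CLOSESPIDER_PAGECOUNT setting of the CloseSpider extension (the value shown is illustrative):

# settings.py (illustrative value)
CLOSESPIDER_PAGECOUNT = 1000  # stop after 1000 responses; 0 disables the limit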

+1
Answered
drsumm 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 10

I had followed the instructions to annotate in the template, but the spider is not extracting any fields. Only the body and url items are extracted.

Answer

If you cannot see extracted data in an annotating-mode run, it usually means the templates are not extracting (or have not annotated) all the required fields. Check how you defined the fields of the item, in particular their Required flag, and also check that the template annotates all the required ones; alternatively, remove the Required flag from fields that you don't actually expect to annotate or extract with every template.

For more detailed info please check the Autoscraping documentation, in particular the section that explains how templates are used in the extraction process.

+1
Completed
Nicolas Ramírez 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 3 years ago 1

It would be nice to be able to download a single item (instead of all) from the panel in JSON format, for testing.

0
Answered
Nicolas Ramírez 4 years ago in Crawlera • updated by Pablo Hoffman (Director) 12 months ago 0
Answer
Pablo Hoffman (Director) 12 months ago

Use the X-Crawlera-Cookies header.
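
For reference, a minimal sketch of setting that header on a Scrapy request (the URL is hypothetical, and "disable" is the documented value for turning Crawlera's cookie handling off; verify the exact value against the Crawlera docs for your use case):

from scrapy import Request

# Inside a spider callback or start_requests; the URL is hypothetical.
req = Request('http://example.com',
              headers={'X-Crawlera-Cookies': 'disable'})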