Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

0
Answered
Jean Maynier 4 years ago in Portia • updated by Paul Tremberth (Engineer) 2 years ago 5
I would like to update the start URLs for an autoscraping spider. I tried


curl http://dash.scrapinghub.com/api/schedule.json -d project=155 -d spider=myspider -u <your api key>: -d start_urls="$(cat start_urls.txt)"

but it schedules a job and only uses the start URLs for that execution. My autoscraping spider is a periodic job, and I want my start URLs to persist. Is this possible?
Thanks 

Answer

Hi, Jean,

You can edit the start URLs of a spider at any moment by editing the spider's autoscraping properties in the panel. At the top right of the panel there is a red button to view the autoscraping properties of a spider, and once there, another red button to edit them.

The URL you used is not for that. As you noted, it just schedules a job and sets the start URLs for that run only.
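
For completeness, here is the same schedule.json call as a minimal Python sketch (this assumes the requests library; like the curl version above, the start_urls passed this way only apply to the job being scheduled):

import requests

API_KEY = "<your api key>"  # same key used in the curl example

with open("start_urls.txt") as f:
    start_urls = f.read()

response = requests.post(
    "http://dash.scrapinghub.com/api/schedule.json",
    auth=(API_KEY, ""),            # API key as the basic-auth user, empty password
    data={
        "project": 155,
        "spider": "myspider",
        "start_urls": start_urls,  # applies only to the job being scheduled
    },
)
print(response.json())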


0
Fixed
Jean Maynier 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 2

One of my periodic jobs stopped working for the last 6 days without notice. It appears that a running job was still active for those 6 days (usually the job takes a few minutes to complete).

0
Declined
Serge 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 7

Hello to the Scrapinghub Team!

While making my first tests I've seen that the spider has no INCLUDE pattern option (for URLs).

I suppose that, based on scraping logic, a user should first check the website that is planned for spidering and scraping.

I also suppose that 95% of the pages to be scraped follow one standard pattern on a website.

Let's say usually all products will be under a /product/*.html URL pattern.


So it seems logical to have an INCLUDE pattern option for spidering, which means the spider goes through all pages but collects and scrapes only addresses where the URL contains

/product/

but ignores all others like

/contact/

/aboutus/

/news/

and so on...


It would be easier to adjust in the spider settings and maybe even better for spider speed.

I would be glad to know your opinions about this feature.

Have a nice day !

Answer

Hi Serge,


You already have that. In "Links to follow", you have the option "Follow links that match the following patterns".


Also check the documentation:


http://help.scrapinghub.com/autoscraping.html#url-filters
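
For comparison, if you ever write the spider in code instead of Portia, the same "follow everything, scrape only /product/" idea can be expressed with link-extractor patterns; a rough sketch (domain and selectors are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSpider(CrawlSpider):
    name = "products"
    start_urls = ["http://example.com/"]  # placeholder

    rules = (
        # scrape only pages whose URL matches /product/...html
        Rule(LinkExtractor(allow=(r"/product/.*\.html",)),
             callback="parse_product", follow=True),
        # keep following everything else (/contact/, /news/, ...) without scraping it
        Rule(LinkExtractor(), follow=True),
    )

    def parse_product(self, response):
        yield {"url": response.url}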
0
Answered
Pavel Liubinski 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 0

I have scraped a site and set up the template. On the "Items" page I see many scraped pages. Many of them have only "url" and "body" fields. There are also pages with scraped items (the ones I set up in the template). How can I export the data with structured items only?

There is an image attached to the post explaining what I want.


Answer

Check this thread:


http://support.scrapinghub.com/topic/171296-scraped-items-dont-show-up-in-json-only-url-body-and-cookies/


(Note that you can search for similar questions asked by other users before posting a new one.)
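
If it helps in the meantime, one way to do the filtering client-side is a small script over the downloaded JSON export: keep only the items that carry fields beyond the raw page capture ones. A rough sketch, assuming the export is a JSON array saved as items.json:

import json

CAPTURE_FIELDS = {"url", "body", "cookies"}  # the raw-page fields mentioned in the linked thread

with open("items.json") as f:
    items = json.load(f)

# keep only items that have at least one template (structured) field
structured = [item for item in items if set(item) - CAPTURE_FIELDS]

with open("structured_items.json", "w") as f:
    json.dump(structured, f, indent=2)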

+3
Answered
Pavel Liubinski 4 years ago in Portia • updated by Paul Tremberth (Engineer) 3 years ago 38

I would like to parse, every day, a site about forthcoming concerts in my town. For example this page: http://www.samru.ru/?module=article&action=showAll&id=198&subrazdel_id=41

All concerts are simply listed in a table; there is no page for each concert. I would like each concert to be an item in autoscraping. How should I set up autoscraping for scraping lists of items?

Thank you


Answer
At the moment the method is indirect. You annotate the products as variants


http://help.scrapinghub.com/autoscraping.html#variants


and then use a post processor to split the variants into separate products (we can deploy a split-variants post processor to your project).


In the future we will allow annotating separate products directly.
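
To illustrate the idea, a split-variants step could look roughly like the sketch below (the post processor Scrapinghub actually deploys may differ; field names are illustrative):

def split_variants(item):
    # yield one product per variant, merging variant fields over the shared ones
    variants = item.pop("variants", None)
    if not variants:
        yield item
        return
    for variant in variants:
        product = dict(item)      # copy the shared fields
        product.update(variant)   # overlay the variant-specific fields
        yield product

# e.g. {"venue": "Club X", "variants": [{"date": "May 1"}, {"date": "May 2"}]}
# becomes two separate items, one per concert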

+2
Answered
Dimitry Izotov 4 years ago in Portia • updated by Pablo Hoffman (Director) 9 months ago 11

Hi, I have a website that sells 10,000 parts; they all have part numbers, and I would like to scrape additional details for every part. Can I feed in a .csv file with all the parts and return defined fields (image, description, weight, price, etc.)? I cannot seem to find an option to scrape from a list...

Answer
Shane Evans (Director) 9 months ago

We are thinking of allowing a URL to contain the start URLs to seed the crawl. I guess we could extend this idea to allow the URL to point to a CSV document, and have a pattern to make URLs from it, e.g. http://mysite.com/part-{0} would create start URLs from the first field in the CSV.
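
In the meantime, if you are comfortable deploying a coded Scrapy spider instead of a Portia one, the workaround can be sketched as below (the CSV file name, column position, and URL pattern reuse the part-{0} example above and are assumptions):

import csv
import scrapy

class PartsSpider(scrapy.Spider):
    name = "parts"

    def start_requests(self):
        # parts.csv is assumed to have the part number in its first column
        with open("parts.csv") as f:
            for row in csv.reader(f):
                url = "http://mysite.com/part-{0}".format(row[0])
                yield scrapy.Request(url, callback=self.parse_part)

    def parse_part(self, response):
        # field selectors are placeholders
        yield {
            "url": response.url,
            "description": response.css("title::text").extract_first(),
        }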

0
Thanks
Juan Catalano 4 years ago in Crawlera • updated by Martin Olveyra (Engineer) 4 years ago 0

It really helps me throttle my requests and avoid being banned by servers. It's incredibly easy to use and totally transparent! Bravo!

0
Answered
Rodolpho Ramirez 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 2

Hi guys.


First of all, Autoscraping is awesome. After learning the basic concepts, it has everything one needs to scrape the hell out of the web.


I am running a spider on a website, and the lame coder on the other side forgot to prefix some external links with "http://" on some pages (there are over 13,000 pages/products on that website).


So what happens is that the spider interprets that as an internal link, like this:

http://www.beeingscrapedwebsite.com/www.externallink.com


The problem is that this renders (loads) the exact same webpage, with the same broken link, which the spider again interprets as an internal link, and it then tries to scrape the following page:


http://www.beeingscrapedwebsite.com/www.externallink.com/www.externallink.com


And that goes on in an infinite loop until the spider stops for "few items scraped".


I would like to know if there is a way to stop the spider from doing that.


Thanks in advance.


Rodolpho

Answer
Hi, Rodolpho.

You can add the pattern
www.externallink.com


to the excluded patterns property of the spider.

0
Answered
Umair Ashraf 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 4

Can we send signals in our spider?


For example, there is a case in which I want to collect data from different pages but return only one item when spider_idle occurs.
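
For reference, signals can be connected from inside a spider via from_crawler; below is a rough sketch of the spider_idle pattern described above (URLs and field names are placeholders, and the engine.crawl(request, spider) call matches older Scrapy versions; recent releases take only the request):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class AggregatingSpider(scrapy.Spider):
    name = "aggregate"
    start_urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholders

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(AggregatingSpider, cls).from_crawler(crawler, *args, **kwargs)
        # connect a handler to the spider_idle signal
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(AggregatingSpider, self).__init__(*args, **kwargs)
        self.collected = {}    # data accumulated across pages
        self.emitted = False

    def parse(self, response):
        # accumulate data instead of yielding one item per page
        self.collected[response.url] = response.css("title::text").extract_first()

    def on_idle(self, spider):
        if not self.emitted:
            self.emitted = True
            # schedule one final request whose callback yields the single item
            self.crawler.engine.crawl(
                scrapy.Request("http://example.com/", callback=self.emit,
                               dont_filter=True),
                spider)
            raise DontCloseSpider  # keep the spider alive until that request runs

    def emit(self, response):
        yield {"pages": self.collected}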

0
Declined
Umair Ashraf 4 years ago in Scrapy Cloud • updated 4 years ago 2

There is a case where I usually need to test a particular item URL with a callback (e.g., parse_item) in some of my spiders. I write more or less the same code for all spiders to cover my testing needs, and I think it would be legitimate to add it to bare Scrapy.


The following code helps me scrape just a single item out of the whole lot of items found on a page.


from scrapy import Request, Spider


class BaseSpider(Spider):
    # assumed context: the snippet lives on the shared BaseSpider class the
    # project's spiders inherit from

    item_requests = []

    def __init__(self, item_url=None, item_callback=None, **kwargs):
        super(BaseSpider, self).__init__(item_url=item_url,
                                         item_callback=item_callback, **kwargs)
        if item_url and item_callback:
            self.item_requests.append(
                Request(url=item_url, callback=getattr(self, item_callback)))

    def start_requests(self):
        # crawl only the single item request when one was passed on the
        # command line, otherwise fall back to the normal start requests
        if self.item_requests:
            for req in self.item_requests:
                yield req
        else:
            reqs = super(BaseSpider, self).start_requests()
            for req in reqs:
                yield req

Here's how I use it.


scrapy crawl [spider-name] -a item_url="http://item-details-page-url..." -a item_callback="parse_item"


Is this good to add to Scrapy, or should it be kept project specific?

Answer
Umair Ashraf 4 years ago

This is already there.

0
Answered
Martin Olveyra (Engineer) 4 years ago in Portia • updated 3 years ago 4
Answer
In short, you have to add extra required attributes. Check the autoscraping documentation, in particular the section

http://help.scrapinghub.com/autoscraping.html#extra-required-annotations

and more precisely example 2.


0
Answered
Andrés Moreira (Technical Sales) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 1

I've disabled the AutoThrottle extension and increased the concurrency in the Panel, but I'm not sure if those changes are applied to the currently running jobs, are they? 


Thanks!

Answer

Settings are applied when a job starts, so changing settings doesn't have any effect on currently running jobs. It does affect pending jobs that haven't started yet, though.

0
Answered
drsumm 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 2

In addition to the annotated items on the page, I would also like to scrape specific components from the HTML source.

For example, some latitude/longitude data is hardcoded as JavaScript assignments inside a script tag, which is straightforward to extract from the HTML source. How can I do this?
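
In a coded Scrapy spider (outside the Portia annotations), the usual approach is a regular expression over the raw HTML; a small sketch, with the JavaScript variable names assumed:

import re

def extract_coordinates(html):
    # "latitude"/"longitude" are assumptions about the page's JavaScript variable names
    lat = re.search(r"latitude\s*=\s*(-?\d+\.\d+)", html)
    lon = re.search(r"longitude\s*=\s*(-?\d+\.\d+)", html)
    if lat and lon:
        return float(lat.group(1)), float(lon.group(1))
    return None

# inside a spider callback this would typically be:
#     coords = extract_coordinates(response.text)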

0
Answered
Brad Attaway 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 9

The question is in the subject line. I feel like several challenges I'm having will be resolved by understanding the use cases for that particular bit.

0
Answered
hsantos 4 years ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 4 years ago 3

I created a spider, ran it, and when I went to download the CSV file, it provided a strange file titled "1" that is not recognised as a CSV file. Can someone help?

Answer
It opens OK for me. The program you are using probably has problems with cells that contain large data and HTML source. Another issue is that you are downloading items from a job that ran in annotating mode, so you are not getting the scraped data but the captured pages instead.

Annotating mode is only for template development. Once you have developed the templates, you have to remove the annotating tag and make a new run.