Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

Remember to check the Help Center!

Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.

+26
Completed
Shane Evans (Director) 4 years ago in Portia • updated by Pablo Hoffman (Director) 2 years ago 4
We could have an addon to execute JavaScript (say, using splash). This would be suitable for smaller websites that require JavaScript execution in order for Portia to work.
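For context, this kind of JavaScript execution can already be wired into a plain Scrapy project with the scrapy-splash library pointed at a running Splash instance. A minimal settings sketch (the local Splash URL is an assumption for illustration, not Portia's addon configuration):

```python
# settings.py fragment for scrapy-splash (assumes a Splash instance
# running locally; adjust SPLASH_URL for your deployment).
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Splash requests carry extra arguments, so the dupefilter must be
# aware of them to deduplicate correctly.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With this in place, spiders issue `SplashRequest` objects instead of plain `Request` objects to have pages rendered before parsing.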
+1
Completed
Shane Evans (Director) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 10 months ago 2

It should be possible to share job data (items, logs or pages) publicly.


Project permissions should also be more flexible and allow public access (read-only, permission to schedule jobs, etc.).

Answer
Pablo Hoffman (Director) 10 months ago

This is possible now through the Datasets catalog:

https://blog.scrapinghub.com/2016/06/09/introducing-the-datasets-catalog/

+4
Completed
Shane Evans (Director) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 3 years ago 1

Create a new page for each user showing their profile. This could use a Gravatar. Users can share something about themselves. When viewing other users, projects in common should be shown. Later we can add more social features.

+3
Completed
Shane Evans (Director) 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 10 months ago 3

When viewing logs: allow jumping to the end (like tail -f), make scrolling quicker, and preserve the illusion of a single huge document. Move "go to line" to the top with the other controls.


Make the items and pages tabs the same as the improved logs tab.



Answer
Pablo Hoffman (Director) 10 months ago

We're going to be incorporating these features as part of the work for:

http://support.scrapinghub.com/topics/1941-show-logs-in-terminal-format/

+1
Answered
Pablo Hoffman (Director) 4 years ago in Scrapy Cloud • updated 3 years ago 2

I'm seeing a lot of "referred to by" lines in the log, like this one:


referred to by <built-in method __reduce_ex__ of _UnixWaker object at 0x20c0ed0>

Answer

This is usually caused by non-primitive objects passed in request.meta, like Item Loaders or Responses. It happens because Scrapy Cloud internally uses a disk-based scheduler to reduce memory consumption.

0
Answered
fernando Almeida 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 0
Answer

Hi Fernando,


Please check this topic, which answers the same question:


http://support.scrapinghub.com/topic/194782-scrape-page-with-list-of-items/

0
Completed
fernando Almeida 4 years ago in Portia • updated by Andrés Pérez-Albela H. 3 years ago 4

Insert a range like <start:end> at the end of the URL.

For example, if you know a website has 500 products, you would set it up like this:

http://www.example.com/catalog/product_id=<1:500> and set it to not follow links.
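The requested syntax could be expanded with a small helper along these lines (expand_url_range is a hypothetical name for illustration; this is not a shipped Portia feature):

```python
import re

def expand_url_range(url_template):
    """Expand a URL containing one <start:end> range into concrete URLs."""
    match = re.search(r"<(\d+):(\d+)>", url_template)
    if not match:
        return [url_template]
    start, end = int(match.group(1)), int(match.group(2))
    # Substitute each integer in the range for the <start:end> marker.
    return [
        url_template[:match.start()] + str(i) + url_template[match.end():]
        for i in range(start, end + 1)
    ]

urls = expand_url_range("http://www.example.com/catalog/product_id=<1:500>")
# urls[0] is '...product_id=1', urls[-1] is '...product_id=500'
```

The resulting list could then feed a spider's start URLs with link following disabled, as the request describes.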

0
Answered
Narcissus Emi 4 years ago in Crawlera • updated by Martin Olveyra (Engineer) 4 years ago 1

Hi,


I'm a newcomer to Crawlera. I'm creating a spider to crawl a site that requires session keys and tokens to validate a form, so an IP change causes the server not to recognize the request.


Is there any option or way to achieve this?

Answer

Sorry, this is not supported yet for public access.

0
Thanks
Castedo Ellerman 4 years ago in Scrapy Cloud • updated by Andrés Pérez-Albela H. 3 years ago 1

I just "scrapy deploy"'d the navscraper git project to Scrapy Cloud, ran some spiders, and it really worked!


0
Answered
Castedo Ellerman 4 years ago in Scrapy Cloud • updated by Pablo Hoffman (Director) 4 years ago 3

Under "Settings" > "Scrapy Deploy" > "Git" I entered

https://github.com/scrapinghub/navscraper

then "Save"

then "Deploy from Git"

I get a dark screen with "Processing..."

and then a red popup saying "Deploy Failed".


When I follow the instructions under "Deploy configuration" and run "scrapy deploy" from my local project clone it works fine.

0
Answered
drsumm 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 3

How is it possible to download images in Autoscraping? Also, if we have our own datastore, how can it be connected to Autoscraping?

Answer

You can use the Images addon.


Please check this doc on how to use it:


http://doc.scrapinghub.com/addons.html#images
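For reference, the addon builds on the same mechanism as Scrapy's built-in ImagesPipeline; a settings sketch for a plain Scrapy project (the storage path below is a made-up example):

```python
# settings.py fragment enabling Scrapy's built-in images pipeline.
# Items need an `image_urls` field (input) and an `images` field
# (output) for the pipeline to act on them.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# Where downloaded images are stored: a local directory or an S3
# bucket URI (this path is illustrative only).
IMAGES_STORE = '/tmp/scraped-images'
```

Connecting a custom datastore is then a matter of writing your own item pipeline that ships the stored image paths wherever you need them.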

+1
Completed
drsumm 4 years ago in Portia • updated by Pablo Hoffman (Director) 10 months ago 3

Often I find that items are not defined upfront, and only when I see the template can I decide on the item fields to be extracted. So there should be a feature to create new item fields while in template mode. It's not efficient to go back and define items each time.

Answer
Pablo Hoffman (Director) 10 months ago

Portia already supports this.

0
Answered
drsumm 4 years ago in Portia • updated by Martin Olveyra (Engineer) 4 years ago 2

I have marked fields with sticky annotations and need their data to be extracted as well, but it is not being extracted.

Answer

Hi drsumm,


Data extracted by sticky annotations is intentionally thrown away: sticky annotations exist only to require a match.

If you want the extracted data in your item, you must use a normal item field (and mark the annotation as required). A sticky annotation that also extracts data wouldn't make sense, because it would be the same as a normal field.


0
Answered
drsumm 4 years ago in Portia • updated by Oleg Tarasenko (Support Engineer) 3 years ago 1

I need to display only those pages where no items were extracted, so I can investigate those pages. Is that possible now?