Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
Remember to check the Help Center!
Please remember to check the Scrapinghub Help Center before asking here; your question may already be answered there.
It should be possible to share job data (items, logs, or pages) publicly.
Project permissions should also be more flexible and allow public access (read-only, permission to schedule jobs, etc.).
This is possible now through the Datasets catalog:
Create a new page for each user showing their profile. This could use a Gravatar. Users can share something about themselves. When viewing other users, projects in common should be shown. Later we can add more social features.
When viewing logs: allow jumping to the end (like tail -f), make scrolling quicker, and preserve the illusion that you have a single huge document. Move "go to line" to the top with the other controls.
Make the items and pages tabs the same as the improved logs tab.
We're going to be incorporating these features as part of the work for:
I'm seeing a lot of "referred to by" lines in the log, like these:
|referred to by <built-in method __reduce_ex__ of _UnixWaker object at 0x20c0ed0>|
This is usually caused by non-primitive objects passed in request.meta, like Item Loaders or Responses. It happens because Scrapy Cloud internally uses a disk-based scheduler to reduce memory consumption.
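A minimal sketch (plain Python, no Scrapy imports) of why this happens: the disk-based scheduler serializes each queued request, including its meta dict, so any non-primitive object stashed there drags its whole object graph into the serialization, producing those "referred to by" lines. The class and field names below are illustrative stand-ins, not Scrapy internals.

```python
import pickle

class FakeResponse:
    """Illustrative stand-in for a Scrapy Response object."""
    def __init__(self):
        self.body = b"<html>...</html>"
        # A real Response also references the request, headers, etc.,
        # which is the object graph that shows up in the log lines.

# Risky: the whole object graph goes to disk with the request.
risky_meta = {"response": FakeResponse()}

# Safer: keep only primitive, serializable values in request.meta
# and re-fetch or recompute anything heavier in the callback.
safe_meta = {"product_url": "http://example.com/p/1", "retry": 0}

# Primitives round-trip through serialization cleanly.
assert pickle.loads(pickle.dumps(safe_meta)) == safe_meta
```

In short: pass IDs, URLs, and plain dicts through meta, and rebuild loaders or re-request pages inside the callback instead of carrying the objects along.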
Is there a way to extract all the items when a page lists multiple products, like http://www.tecon-gmbh.de/advanced_search_result.php?keywords=+&x=26&y=12&categories_id=&inc_subcat=1&manufacturers_id=&pfrom=1&pto=&dfrom=&dto=
Please check this feedback thread, which asks the same question.
Insert at the end of the URL something like <starturl:endurl>;
for example, if you know a website has 500 products, you would set it up like this:
http://www.example.com/catalog/product_id=<1:500> and set the spider not to follow links.
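As a rough sketch of what that placeholder does, here is plain Python that expands a `<start:end>` pattern into concrete start URLs. The helper name and domain are illustrative; this is not Portia's own code, just the equivalent expansion.

```python
import re

def expand_url_range(pattern):
    """Expand one <start:end> numeric placeholder in a URL pattern
    into the full list of concrete URLs. Hypothetical helper."""
    match = re.search(r"<(\d+):(\d+)>", pattern)
    if not match:
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    return [pattern[:match.start()] + str(i) + pattern[match.end():]
            for i in range(start, end + 1)]

urls = expand_url_range("http://www.example.com/catalog/product_id=<1:3>")
# → ['http://www.example.com/catalog/product_id=1',
#    'http://www.example.com/catalog/product_id=2',
#    'http://www.example.com/catalog/product_id=3']
```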
I'm a newcomer to Crawlera. I'm creating a spider to crawl a site, but I need to use a session key and tokens to validate a form; an IP change will cause the server not to recognize the request.
Is there any option or way to achieve this?
Sorry, this is not supported yet for public access.
Under "Settings" > "Scrapy Deply" then "Git" I entered
then "Deploy from Git"
I get a dark screen with "Processing..."
and then a red popup of "Deploy Failed".
When I follow the instructions under "Deploy configuration" and run "scrapy deploy" from my local project clone, it works fine.
How is it possible to download images in Autoscraping? Also, if we have our own datastore, how can it be connected to Autoscraping?
Often I find that items are not defined by me, and only when I see the template can I decide on the item fields to be extracted. So there should be a feature to create new item fields while in template mode. It's not efficient to go back and define items each time.
Portia already supports this.
I annotated sticky field annotations and I need their values to be extracted too, but they are not being extracted.
Data extracted by sticky annotations is intended to be thrown away; that is the purpose of this kind of annotation.
If you want the extracted data in your item, you must use a normal item field (and mark the annotation as required). It doesn't make sense to have a sticky annotation that extracts data, because that would be the same as a normal field.