It should be possible to share job data (items, logs, or pages) publicly.
Project permissions should also be more flexible, and allow public access (read only, or schedule jobs, etc.).
This is possible now through the Datasets catalog:
Create a new page for each user showing their profile. This could use a Gravatar. Users can share something about themselves. When viewing other users, projects in common should be shown. Later we can add more social features.
When viewing logs: allow jumping to the end (like tail -f), make scrolling quicker, and preserve the illusion that you are looking at a single huge document. Move "go to line" to the top with the other controls.
Make the items and pages tabs the same as the improved logs tab.
We're going to be incorporating these features as part of the work for:
I'm seeing a lot of "referred to by" lines in the log, like these:
referred to by <built-in method __reduce_ex__ of _UnixWaker object at 0x20c0ed0>
This is usually caused by non-primitive objects passed in request.meta, like Item Loaders or Responses. It happens because Scrapy Cloud internally uses a disk-based scheduler to reduce memory consumption.
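For illustration, here is a minimal sketch of the fix (the spider name, selectors, and meta keys are hypothetical, not from this thread): keep request.meta limited to primitive values so requests serialize cleanly to the disk queue.

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"  # hypothetical spider

        def parse(self, response):
            # Problematic: a Response (or an ItemLoader) is not a primitive,
            # so serializing the request to the disk queue drags in its whole
            # object graph and produces the "referred to by" log lines:
            # yield scrapy.Request(url, meta={"page": response})

            # Safer: pass only primitives (str, int, float, dict, list).
            for href in response.css("a.product::attr(href)").getall():
                yield scrapy.Request(
                    response.urljoin(href),
                    callback=self.parse_product,
                    meta={"category_url": response.url},  # a plain string
                )

        def parse_product(self, response):
            yield {
                "url": response.url,
                "category_url": response.meta["category_url"],
            }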
Is there a way to extract all the items when a page lists multiple products, like http://www.tecon-gmbh.de/advanced_search_result.php?keywords=+&x=26&y=12&categories_id=&inc_subcat=1&manufacturers_id=&pfrom=1&pto=&dfrom=&dto=
Please check this feedback, which asks the same question:
Insert at the end of the URL something like <starturl:endurl>.
For example, if you know a website has 500 products, you would set it up like this:
http://www.example.com/catalog/product_id=<1:500> and set it to not follow links (see the sketch below for the equivalent in a plain Scrapy spider).
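For comparison, a plain-Scrapy sketch of the same idea (this is not the Autoscraping <start:end> feature itself; the spider name and URL pattern are placeholders): generate one request per product id and do not follow links.

    import scrapy

    class CatalogSpider(scrapy.Spider):
        name = "catalog"  # hypothetical name

        def start_requests(self):
            # Same effect as http://www.example.com/catalog/product_id=<1:500>
            for product_id in range(1, 501):
                yield scrapy.Request(
                    f"http://www.example.com/catalog/product_id={product_id}"
                )

        def parse(self, response):
            # No link following: only the generated start URLs are visited.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }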
I'm a newcomer to Crawlera. I'm creating a spider to crawl a site, but I need to use a session key and tokens to validate a form, and an IP change will cause the server not to recognize the request.
Is there any option or way to achieve this?
Sorry, this is not supported yet for public access.
Under "Settings" > "Scrapy Deply" then "Git" I entered
then "Deploy from Git"
I get a dark screen with "Processing..."
and then a red popup of "Deploy Failed".
When I follow the instructions under "Deploy configuration" and run "scrapy deploy" from my local project clone it works fine.
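For reference, here is a sketch of the kind of scrapy.cfg that "scrapy deploy" reads (a sketch only: the settings module, endpoint, project id, and API key below are placeholders in the style of the Scrapy Cloud docs of the time, not values from this thread):

    [settings]
    default = myproject.settings

    [deploy]
    url = https://dash.scrapinghub.com/api/scrapyd/
    username = YOUR_SCRAPINGHUB_API_KEY
    password =
    project = 12345

If the local deploy works with this configuration while "Deploy from Git" fails, the problem is likely on the Git integration side rather than in the project itself.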
How is it possible to download images in Autoscraping? Also, if we have our own datastore, how can it be connected to Autoscraping?
Often I find that items are not defined by me, and only when I see the template can I decide on the item fields to be extracted. So there should be a feature to create new item fields while in template mode. It's not efficient to go back and define items each time.
Portia already supports this.
I added sticky field annotations and I need their data to also be extracted, but it is not being extracted.
Sticky annotations are intended to mark data that is matched but then thrown away.
If you want the extracted data in your item, you must use a normal item field (and mark the annotation as required). A sticky annotation that extracts data would make no sense, because it would be the same as a normal field.
I am scraping the Play Store and it's scraping at a rate of 6 items/min, which is unrealistically slow. Here is the job id:
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_IP = 10
DELTAFETCH_ENABLED = 1
DOTSCRAPY_ENABLED = 1
Hi drsumm, do you remember AutoThrottle? Check this feedback: http://support.scrapinghub.com/topic/168025-slow-scraping/
About CONCURRENT_ITEMS, I don't think that is what you need. Check this:
That setting gives the maximum number of items processed concurrently in the item pipelines; it will not accelerate the crawling speed.
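For context, a sketch of the standard Scrapy settings that actually govern crawl speed (the values here are illustrative, not recommendations):

    # AutoThrottle adapts delays to server load and can slow a crawl down;
    # check whether the addon is enabled before blaming concurrency.
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0

    CONCURRENT_REQUESTS = 16         # global download concurrency
    CONCURRENT_REQUESTS_PER_IP = 10  # per-IP cap, as in the settings above
    DOWNLOAD_DELAY = 0               # fixed delay between requests

    # CONCURRENT_ITEMS only caps parallel item processing in the pipelines;
    # raising it does not make the downloader fetch pages any faster.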
Also, the number of duplicates indicates that the same items are being scraped again and again, which drastically reduces the items/pages ratio. You should add URL/parameter filters in order to avoid scraping unneeded pages. Check issues 6 and 8 in this section of the help:
If you look at the dropped log lines, for example:
Dropped: Duplicate product scraped at <https://play.google.com/store/apps/details?id=com.app.vodio&reviewId=Z3A6QU9xcFRPR0FvM1p0aGRCSmtpN2ZMTExEWjR2ZUhQZzhoRUE1X2pRb0Q4UXhvWUFBLTZkb0pXYk1zN3Z0SXpkLWszVDZiLXZCNU5ya0t2ZE1CdHRpamc>, first one was scraped at <https://play.google.com/store/apps/details?id=com.app.vodio>
You will see that the same page is being visited with two different URLs. You must use the QueryCleaner addon in order to remove the reviewId parameter from the URLs.
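For example, with the QueryCleaner addon enabled, a setting along these lines (a sketch, assuming the addon's QUERYCLEANER_REMOVE pattern setting) would normalize both URLs to the same page:

    # Strip the reviewId query parameter so duplicate product pages
    # collapse to a single URL before the duplicate filter sees them.
    QUERYCLEANER_REMOVE = "reviewId"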