Answered

Add URLs to Delta Fetch manually

How can I manually add URLs to Crawlera, so that a preset list of URLs is not crawled?


Best Answer

You can try writing to the .scrapy folder using the DotScrapy Persistence addon: https://support.scrapinghub.com/support/solutions/articles/22000200401-dotscrapy-persistence-addon. Note, however, that DeltaFetch (DF) doesn't store URLs; it stores a key (identifier) derived from the URLs that produced items on previous runs.
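To illustrate what "seeding" such a key store could look like, here is a minimal sketch using only the Python standard library. It rests on several assumptions that should be checked against the installed DeltaFetch version: that the keys live in a dbm-style file under the .scrapy folder, and that the spider supplies the same key for each request (scrapy-deltafetch reads a custom key from `deltafetch_key` in request.meta). The key scheme (`url_key`), the function names, and the database path are hypothetical and are not DeltaFetch's own.

```python
import dbm
import hashlib
import os
import time


def url_key(url):
    # Hypothetical key scheme: SHA-1 hex digest of the raw URL.
    # This is NOT DeltaFetch's default key; the spider would have to
    # send the same value as its custom key for a lookup to match.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def seed_seen_keys(db_path, urls):
    """Add a key for each URL to a dbm file; return how many were new."""
    os.makedirs(os.path.dirname(db_path) or ".", exist_ok=True)
    added = 0
    with dbm.open(db_path, "c") as db:  # "c": create the file if missing
        for url in urls:
            key = url_key(url).encode("ascii")
            if key not in db:
                # Store a timestamp as the value (an assumption about format).
                db[key] = str(time.time()).encode("ascii")
                added += 1
    return added


def is_seen(db_path, url):
    """Return True if the URL's key is already in the dbm file."""
    with dbm.open(db_path, "r") as db:
        return url_key(url).encode("ascii") in db
```

In a spider, the matching request would then carry the same key, for example `meta={"deltafetch_key": url_key(url)}`, so that pre-seeded URLs are treated as already seen.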



These URLs come from:

1) Jobs crawled BEFORE DF was active

2) Another crawler


Is there a way to manually add these URLs?

I renamed and moved the topic to the appropriate section.

Where do those old URLs come from? If they were crawled in previous jobs of the spider while DeltaFetch was enabled, DF adds them automatically to the list of URLs not to crawl.

Sorry Nestor, the title should read:


Add URLs to Delta Fetch manually. 


Currently, we have DeltaFetch enabled and it is working for any new URL, but I have 250,000 old URLs that I do not want crawled again. I am looking to add these to the list of crawled URLs so that they are skipped.


I'm sorry, but I don't quite understand what you want to do. You mention Crawlera first, but nothing can be set on Crawlera itself, because it is a proxy API. The point of DeltaFetch is to skip the URLs you have already crawled. Please provide more details about what you want to do, so I can assist.

DeltaFetch and DotScrapy Persistence are blocking re-crawled data. I am surprised I cannot add to that data myself, wherever it is being kept.

This is not something you set at the Crawlera level, but in your spider. If you use Scrapy, you can manually set dont_proxy in request.meta for the URLs that should not go through the proxy.
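As a rough sketch of that suggestion, the helper below builds the meta dict for a request and sets `dont_proxy` only for URLs in a given set. The function name `request_meta_for` and the `no_proxy_urls` set are hypothetical, introduced only for illustration; the sketch assumes the proxy middleware in use honors `dont_proxy`.

```python
def request_meta_for(url, no_proxy_urls):
    """Return the meta dict to pass to a Scrapy Request for this URL."""
    meta = {}
    if url in no_proxy_urls:
        # Ask the proxy middleware to bypass the proxy for this request.
        meta["dont_proxy"] = True
    return meta


# Inside a spider callback this might be used as (assuming Scrapy is installed):
#   yield scrapy.Request(url, meta=request_meta_for(url, NO_PROXY_URLS))
```

Note that this only controls whether a request goes through the proxy; it does not stop the URL from being crawled.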
