
How to add crawling date to items?

Hi all,


I'd like to run my Portia crawler against a specific website once every hour, but I need it to pick up only the fresh articles, in other words to skip anything it has already crawled. My understanding is that the DeltaFetch addon ignores duplicates within the same crawling job but not across subsequent jobs. Is that right?


If that is the case, I was thinking about somehow adding the crawl date (which is shown in the UI next to each item number in the job items list) to the scraped data, so that when I download the data I can identify the duplicates myself.
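Roughly what I have in mind for the post-download step, assuming a JSON Lines export where each item carries a url field plus the added crawl date in ISO format (the field names are just my guess at how it would look):

    import json

    def dedupe_export(path):
        """Keep only the earliest-crawled copy of each URL from a JSON Lines export."""
        seen = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                url = item["url"]
                # ISO timestamps sort as plain strings, so the earliest crawl wins
                if url not in seen or item["crawl_date"] < seen[url]["crawl_date"]:
                    seen[url] = item
        return list(seen.values())

    fresh_items = dedupe_export("items.jl")
    print(len(fresh_items), "unique articles")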


Anyone got ideas/experience on this front?


Thanks!


Best Answer

DeltaFetch will ignore duplicates across subsequent jobs: https://support.scrapinghub.com/support/solutions/articles/22000200411-delta-fetch-addon. The dupefilter is what handles duplicates within the same job.
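If you are running the spider outside Scrapy Cloud, the equivalent behaviour comes from the scrapy-deltafetch spider middleware. A minimal settings.py sketch, with the option names as I recall them from that package, so double-check its README:

    # settings.py -- skip requests whose items were already scraped in earlier jobs
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True
    # DELTAFETCH_RESET = True  # uncomment to forget crawl history and re-fetch everything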



Thanks @nestor, I have also managed to inject the crawl date into the crawled items with Magic Fields in the meantime.
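For anyone reading this without the Magic Fields addon, a small item pipeline can stamp the crawl date in much the same way. This is only a sketch: the pipeline and field names are made up, and with a Portia/scrapy Item the crawl_date field would need to be declared on the item first:

    # pipelines.py -- add the scrape time to every item that passes through
    from datetime import datetime, timezone

    class CrawlDatePipeline:
        def process_item(self, item, spider):
            # ISO-8601 UTC timestamp, e.g. "2017-05-04T13:00:00+00:00"
            item["crawl_date"] = datetime.now(timezone.utc).isoformat()
            return item

It would then be switched on with something like ITEM_PIPELINES = {"myproject.pipelines.CrawlDatePipeline": 300} in settings.py.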
