Andreas Dreyer Hysing 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1

Since we started using version 2 of Portia, we have been seeing unwanted deduplication of similar items during crawls. The logs of these crawls show that the discarded items do have different values for at least one field. As we see it, these items are not duplicates and should not be discarded.


Note that all fields in the item are configured with the Vary option enabled and both Required options disabled.



Crawl logs were read on the web interface from https://app.scrapinghub.com/p/110257/36/4/log


Answer


Hi Andreas!

Our experts suggest disabling the Vary option. This should improve your crawling in this particular case. All the fields in the data format used by that spider have vary = True set, so every field is ignored when checking for duplicates, and items that differ only in those fields are treated as duplicates.
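
To illustrate the effect, here is a minimal sketch that assumes the duplicate check hashes only the fields not marked as vary. It is not Portia's actual code: `item_fingerprint`, `dedupe`, and the sample items are hypothetical names made up for this example.

```python
import hashlib
import json


def item_fingerprint(item, vary_fields):
    """Hash only the fields that are NOT marked as vary (assumed behaviour)."""
    stable = {k: v for k, v in item.items() if k not in vary_fields}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def dedupe(items, vary_fields):
    """Keep the first item seen for each fingerprint and drop the rest."""
    seen = set()
    kept = []
    for item in items:
        fingerprint = item_fingerprint(item, vary_fields)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(item)
    return kept


# Hypothetical sample data: two items that differ in every field.
items = [
    {"name": "Widget A", "price": "10"},
    {"name": "Widget B", "price": "12"},
]

# Every field marked as vary: nothing is left to compare, so the items collide.
all_vary = dedupe(items, vary_fields={"name", "price"})

# Only "price" marked as vary: "name" still tells the items apart.
price_vary = dedupe(items, vary_fields={"price"})

print(len(all_vary), len(price_vary))
```

In this sketch, marking every field as vary leaves nothing to fingerprint, so the second item is dropped. Once at least one distinguishing field has Vary disabled, both items are kept.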

Let me know if this was helpful.

Kind regards,

Pablo
