Scraping with Portia a website that uses schema

Hi there,


I need help scraping a website that uses schema.

On the detail page I can select the 10 items I need to scrape, but I can't make the spider navigate to the detail pages.

The detail page URL looks like www.website.com/products/item-054245

but the list page URL looks like www.website.com/folder1/folder2/folder3/


I have set the start page to: www.website.com

I have set the crawling rules to follow links matching the pattern: /products/


This way I can scrape at most 8 items (I should get about 500 items).


Does anyone have an idea how to set this up correctly?

Thanks,


Best Answer

Sorry for the late reply. I was not able to get it to work. The website depends heavily on JavaScript, and the product page links are not being followed from the list view, so the spider cannot reach them. I would suggest trying Scrapy instead: http://scrapy.readthedocs.io/en/latest/index.html


You might need to add a pattern for pagination.
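
To illustrate the idea, here is a minimal sketch in plain Python (not Portia itself) of how a set of follow patterns decides which links get crawled. It assumes the patterns are regular expressions searched against each URL. The `should_follow` helper and the `page` query parameter are hypothetical; the two paths come from the question above.

```python
import re

# Hypothetical follow patterns: the original /products/ rule plus
# patterns for the list pages and their pagination links.
FOLLOW_PATTERNS = [
    r"/products/",                 # detail pages (from the question)
    r"/folder1/folder2/folder3/",  # list pages (from the question)
    r"[?&]page=\d+",               # pagination; parameter name is an assumption
]


def should_follow(url, patterns=FOLLOW_PATTERNS):
    """Return True if the URL matches at least one follow pattern."""
    return any(re.search(pattern, url) for pattern in patterns)


if __name__ == "__main__":
    candidates = [
        "http://www.website.com/products/item-054245",
        "http://www.website.com/folder1/folder2/folder3/",
        "http://www.website.com/folder1/folder2/folder3/?page=2",
        "http://www.website.com/about",
    ]
    for url in candidates:
        print(url, should_follow(url))
```

With only the /products/ pattern, the list and pagination pages are never followed, so the spider only sees the detail links that happen to be on the start page. That may explain why so few items come back.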

Ok.

Would you or anyone else be interested in helping me with the scraping setup for several webshops?

Could you provide the website you're trying to scrape with Portia? Along with a sample of a detail and a list page.

Thanks a lot,


Here is the website: www.skjold-burne.dk

List page: www.skjold-burne.dk/vin/typer/roedvin

Detail page: www.skjold-burne.dk/produkter/carlo-alfano-roccamora-0551007


From the detail page I am trying to scrape the name, image, price and some more fields, which is working OK.

OK, no problem. Thanks for trying.

I guess I will have to look into using Scrapy then.
