Answered
Eric 3 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago

I am trying to download comments from Kickstarter campaigns using Portia. For each campaign, there is a "show older comments" link after the most recent set of 50 comments. Selecting that link shows an additional 50 comments. I'm not a developer and have been trying to come up with a workaround to download all comments with one scrape job.


I've been able to use the "configure url patterns" feature to go from 50 to 100 comments (details below). However, the crawl stops there and doesn't follow the link to the next set of 50 comments (in total there are ~600 comments). Since the "show older comments" URL doesn't change, I'm not sure why it is stopping. Any help would be much appreciated!


Website to be crawled: https://www.kickstarter.com/projects/1865494715/apollo-7-worlds-most-compact-true-wireless-earphon/comments


"show older comments" URL: https://www.kickstarter.com/projects/1865494715/apollo-7-worlds-most-compact-true-wireless-earphon/comments?cursor=14162300


Regular expression for the "follow links that match this expression": (https:\/\/www\.kickstarter\.com\/projects\/1865494715\/apollo-7-worlds-most-compact-true-wireless-earphon\/comments\?cursor=14162300)+



Answer


Hi Eric,

I've tried to replicate your project and obtained the same results as you.

Then I enabled JavaScript in the spider config in Portia, kept the same regex, and changed the follow pattern to follow all in-domain links.

The job is still running and has scraped more than 1000 items so far, but I suspect that is not what you want, since you are only interested in the comments for Apollo 7, correct?

I'm also not sure whether, once all items are scraped this way, filtering the data again with a regex would leave you with anything more than the same initial 50 comments for Apollo 7.

Another solution could be using Splash to interact with the page and open all comments first and then parse the data.

If you are interested in using Splash: https://splash.readthedocs.io/en/stable/
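To give an idea of what the Splash route might look like, here is a rough sketch rather than a tested solution. It assumes a Splash instance is reachable over HTTP (for example at http://localhost:8050) and posts a small Lua script to its `execute` endpoint. The script keeps clicking the "show older comments" link until it is gone or a click limit is reached, then returns the page HTML. The CSS selector `a.older_comments` and the helper names `build_payload` and `fetch_all_comments_html` are placeholders made up for illustration; the real selector has to be taken from the page source.

```python
import json
import urllib.request

# Lua script executed by Splash: load the page, click the "show older
# comments" link until it is no longer found (or max_clicks is reached),
# then return the rendered HTML.
LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  for _ = 1, args.max_clicks do
    local link = splash:select(args.selector)
    if not link then break end
    link:mouse_click()
    assert(splash:wait(args.wait))
  end
  return splash:html()
end
"""


def build_payload(url, selector, max_clicks=20, wait=2.0):
    """Build the JSON body for Splash's execute endpoint.

    `selector` is an assumption: it must match the real
    "show older comments" link on the page.
    """
    return {
        "lua_source": LUA_SOURCE,
        "url": url,
        "selector": selector,
        "max_clicks": max_clicks,
        "wait": wait,
    }


def fetch_all_comments_html(splash_base_url, payload):
    """POST the payload to a running Splash instance and return the HTML."""
    request = urllib.request.Request(
        splash_base_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


payload = build_payload(
    "https://www.kickstarter.com/projects/1865494715/"
    "apollo-7-worlds-most-compact-true-wireless-earphon/comments",
    selector="a.older_comments",  # hypothetical selector, check the page
)
# With Splash running locally, the call would be:
# html = fetch_all_comments_html("http://localhost:8050", payload)
```

With roughly 600 comments at 50 per click, about a dozen clicks should be enough, so the default limit of 20 leaves some margin. The returned HTML would still need to be parsed for the individual comments.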

I'll keep trying to get this working with regex instead. If you have any updates, don't hesitate to share them with us!
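On the regex side, one thing that may be worth checking is the cursor value. The pattern in the question hard-codes `cursor=14162300`, so if each "show older comments" link carries a different cursor (an assumption, but typical for this kind of paging), only the first such link would match and the crawl would stop at 100 comments. The small Python check below compares the fixed pattern with a generalized `cursor=\d+` variant. The second URL uses a made-up cursor value purely for illustration, and whether Portia follows the generalized pattern as hoped has not been verified here.

```python
import re

# Escaped base URL of the comments page (taken from the question).
BASE = (
    r"https://www\.kickstarter\.com/projects/1865494715/"
    r"apollo-7-worlds-most-compact-true-wireless-earphon/comments"
)

# Pattern as used in the question: matches one specific cursor only.
FIXED = re.compile(BASE + r"\?cursor=14162300")

# Generalized variant: matches any numeric cursor value.
GENERAL = re.compile(BASE + r"\?cursor=\d+")

COMMENTS_URL = (
    "https://www.kickstarter.com/projects/1865494715/"
    "apollo-7-worlds-most-compact-true-wireless-earphon/comments"
)
first_page = COMMENTS_URL + "?cursor=14162300"   # from the question
second_page = COMMENTS_URL + "?cursor=14100000"  # hypothetical cursor value


def follows(pattern, url):
    """Return True if the whole URL matches the follow pattern."""
    return pattern.fullmatch(url) is not None


for name, pattern in (("fixed", FIXED), ("general", GENERAL)):
    print(name, follows(pattern, first_page), follows(pattern, second_page))
```

If the cursors do differ per page, the generalized pattern is the one to try in the "follow links that match this expression" field.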

Best regards!

GOOD, I'M SATISFIED

Great response time and helpful info.

Satisfaction mark by Eric 3 months ago

Pablo, really appreciate the guidance! I'll see if I can make progress with Splash and will update here if I find a good solution.