URL generation not working as expected

I'm using URL generation to set the list of starting pages where pagination is present on a list of categories:


Sample URL:

https://basedomain/first_path/category/#/page-2

The generated URLs that Portia shows are valid, but when I run the project it requests the base category (first page) three times instead of the following pages.
Note that the first page can also be: 

I also tried setting the starting pages for each category and following the "next" pagination links, but the regex is not working either:

I use "page-2" as the expression. Do I need to put the full URL, or escape the characters as I would in a regex?

Any ideas on any of it?


Hey,


Could you provide the real URL and/or the name of your spider?

Hi Nestor,


Spider is www.milcervezas.com.


Thanks!

The URLs you are setting don't exist, which is why they are repeating. Some filtered product types only have 3-4 products, so they don't have more than one page. Also note that some categories say "cerveza" and not "cervezas".

Hi again Nestor,


Good catch there! I already adjusted that. Still having some issues though.

For the 2 styles that have a 2nd page (cervezas-ale and cervezas-ipa-apa) the URLs that show up in the requests section are not the ones that should be generated:

Any ideas?
The other option would be to use https://www.milcervezas.com/cervezas-por-estilos/cervezas-ale/ as the starting page and set a pattern for link crawling. I didn't manage to get this one working either. I tried "page-2" and a few other combinations. What is the correct syntax?

Thanks!

Answer

The URL is being stripped at the #; unfortunately, this is a known bug.

I checked the website and it seems you can use the parameter "?p=<pagenumber>" to work around this issue.

For example:

- https://www.milcervezas.com/cervezas-por-estilos/cervezas-ale/?p=2
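
To illustrate why the fragment-style URLs all end up as the same request, here is a small sketch in plain Python. It is not Portia's actual code: `urldefrag` from the standard library stands in for whatever stripping happens internally, and the page range 1 to 3 is just an assumption for the example.

```python
from urllib.parse import urldefrag

BASE = "https://www.milcervezas.com/cervezas-por-estilos/cervezas-ale/"

# Fragment-style pagination: everything from '#' onward is dropped,
# so every generated URL collapses to the base category URL.
fragment_urls = [f"{BASE}#/page-{n}" for n in range(1, 4)]
stripped = [urldefrag(u).url for u in fragment_urls]

# Query-string pagination: '?p=N' is part of the URL proper,
# so it survives fragment stripping and each page stays distinct.
query_urls = [f"{BASE}?p={n}" for n in range(1, 4)]
kept = [urldefrag(u).url for u in query_urls]

print(stripped)  # three copies of the same base URL
print(kept)      # three distinct page URLs
```

This matches the symptom in the question: three generated start URLs, but the first page requested three times.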

Thanks Nestor, this did the trick :)
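
On the earlier question about pattern syntax: the sketch below uses Python's `re` module to show how a link pattern for the "?p=" pages could be written. It assumes the pattern is treated as a regular expression searched anywhere in the URL, which is how I understand it but is not confirmed in this thread; `is_valid_pattern` and `matches` are helper names made up for the example.

```python
import re


def is_valid_pattern(pattern: str) -> bool:
    """Return True if the pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def matches(pattern: str, url: str) -> bool:
    """Search for the pattern anywhere in the URL (no full URL needed)."""
    return re.search(pattern, url) is not None


page_url = "https://www.milcervezas.com/cervezas-por-estilos/cervezas-ale/?p=2"

# A bare '?' is a regex quantifier, so it must be escaped to match literally.
print(is_valid_pattern(r"?p=\d+"))        # unescaped '?' does not compile
print(matches(r"\?p=\d+$", page_url))     # escaped form matches the page URL

# "page-2" needs no escaping, but it can only match while the
# fragment is still part of the URL being tested.
print(matches("page-2", "https://basedomain/first_path/category/#/page-2"))
print(matches("page-2", "https://basedomain/first_path/category/"))
```

So a partial expression is enough under that assumption, but characters such as "?" and "." need a backslash when they are meant literally.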
