Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
Answered
Tristan Bailey 1 week ago in Portia • updated by Pablo Vaz (Support Engineer) 7 days ago 1

What is the best way to deal with robots.txt and crawl blocking?


This is for a site that wants to approve crawling.


Just do something like this in robots.txt:

User-agent: Scrapy
Disallow:


Or something more.

Answer

Hey Tristan!

According to our more experienced support agents, you could check this article posted on our blog:
https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/

which gives some ideas and tips on the content of those types of files and how to handle them (or not).

Best regards!
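For reference, here is a minimal sketch of polite-crawling settings for a Scrapy/Portia project, along the lines of the blog post above. The exact values and the contact URL are illustrative assumptions, not recommendations tuned for your site:

# settings.py -- minimal politeness sketch; values are illustrative assumptions
ROBOTSTXT_OBEY = True                   # respect the target site's robots.txt
USER_AGENT = 'mybot (+https://example.com/bot-info)'   # hypothetical bot identity/contact
DOWNLOAD_DELAY = 1.0                    # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True             # back off automatically when the server slows down
CONCURRENT_REQUESTS_PER_DOMAIN = 2      # keep per-domain concurrency low

On the site side, an entry like the one you posted (a User-agent line with an empty Disallow) does allow crawling for that agent; pairing it with ROBOTSTXT_OBEY = True on the spider side keeps both ends in agreement.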

0
Answered
Andreas Dreyer Hysing 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 1

After we started to use version 2 of Portia we have been experiencing unwanted deduplication of similar items in our crawls. Looking through the logs of these crawls reveals that the items do indeed have different values for at least one field in each item. As we see it these items are not duplicates, and should not be discarded.


As a note: all fields in the item are configured with the Vary option enabled, and both Required options disabled.



Crawl logs were read on the web interface from https://app.scrapinghub.com/p/110257/36/4/log


Answer

Hi Andreas!

Our experts suggest disabling the Vary option. This should improve your crawling for this particular case. All the fields in the data format used by that spider have vary = True set, so they're all ignored when checking for duplicates, which is why every item ends up looking like a duplicate of the first one.

Let me know if this was helpful.

Kind regards,

Pablo
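For anyone curious about the mechanics, here is a conceptual Python sketch of field-based duplicate detection. It is not Portia's actual implementation, just an illustration of the behaviour described above: fields flagged as vary are excluded from the fingerprint, so if every field is flagged, every item gets the same fingerprint and all but the first are dropped.

# Conceptual sketch only -- not Portia's real code.
import hashlib
import json

def fingerprint(item, vary_fields):
    # Build the fingerprint only from fields NOT marked as vary
    stable = {k: v for k, v in item.items() if k not in vary_fields}
    return hashlib.sha1(json.dumps(stable, sort_keys=True).encode()).hexdigest()

seen = set()

def is_duplicate(item, vary_fields):
    fp = fingerprint(item, vary_fields)
    if fp in seen:
        return True      # if vary_fields covers every field, this fires for all items after the first
    seen.add(fp)
    return False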

0
Answered
brandonmp 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 3

Searched for this topic but no luck, apologies if this is a duplicate


I'm scraping a few pages where I need to extract a few non-visible pieces of data. Specifically, the `src` attributes on images, some `data-*` attributes on misc. `html` tags, and some raw text from the content of a few `<script>` tags.


Is this possible to do in Portia? I haven't been able to figure it out on my own.


If not possible, is it possible to augment a Portia scraper with custom python? Or does a job have to be either all-Scrapy/Python or All-Portia?

Answer

Yes Brandon! You can add an extractor to the annotation. In the same options where you configured the CSS selector you can add an extractor which will process the text with your pre-defined regex.
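If you do end up dropping to Scrapy for part of this, here is a minimal sketch of pulling non-visible data with selectors. The CSS selectors, the data-* attribute name and the field names are assumptions for illustration, not taken from your pages:

# Minimal Scrapy parse-method sketch; selectors and field names are hypothetical
def parse(self, response):
    yield {
        # src attributes of images
        'image_srcs': response.css('img::attr(src)').getall(),
        # a data-* attribute on an arbitrary tag (attribute name is an assumption)
        'data_ids': response.css('[data-product-id]::attr(data-product-id)').getall(),
        # raw text content of script tags
        'script_text': response.css('script::text').getall(),
    }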

+1
Fixed
ched 2 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 5

I'm having massive trouble with Portia today: almost every action triggers a backend error (red notifications). Interacting with a website doesn't work, and neither does deleting data formats.


It appears as if you'd switched my account from the old Portia to Portia 2.0, because yesterday the interface was totally different.


I also have a suggestion for an improvement: Your error messages are not very helpful :(.


Can anyone help me out or let me know what else I should post in order to resolve these problems?

Answer

Thanks, Ched, for your feedback! I will forward your suggestion to our Portia team; what you propose makes a lot of sense.
Best regards!

0
Answered
Tristan Bailey 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 4

Is there an API command to submit, or a setting to add to the project on the website, to limit the pages crawled?

I would like to stop after 1000 pages for my testing phase.

Answer

Hi Tristan, you can also try DEPTH_LIMIT, setting it to values of 3 to 5 for example, together with CLOSESPIDER_PAGECOUNT set to 100.
I've obtained a different number of requests and items by changing DEPTH_LIMIT for the same CLOSESPIDER_PAGECOUNT.
Regards.
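As a minimal sketch, both settings can usually be added under the spider's settings in Scrapy Cloud or in settings.py; the values below match your 1000-page test and are only an example:

# settings.py sketch -- stop the crawl after roughly 1000 pages (values are examples)
CLOSESPIDER_PAGECOUNT = 1000   # close the spider once this many responses have been received
DEPTH_LIMIT = 3                # don't follow links deeper than 3 levels from the start URLs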

0
Answered
kochhar 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 7 days ago 3

Hi, I'm having trouble getting Portia to crawl pages linked from the starting page.


I've tried both regex-based configuration and following all in-domain links. Neither option resulted in following any further links -- the scraped results only contain data from the first page.

Answer

Hi Kochhar. We switched our Portia support to version 2.0 this week and are working hard on backend issues, QA tests, and bugs reported by users.
Please let me know if the issues you experienced still appear. If so, please provide as much detail as you can so we can help you further. Thanks.
Best regards!

0
Answered
Eric 3 weeks ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 3

I am trying to download comments from kickstarter campaigns by using Portia. For each campaign, there is a "show older comments" link after the most recent set of 50 comments. Selecting that link shows an additional 50 comments. I'm not a developer and have been trying to come up with a work around to download all comments with one scrape job.


I've been able to use the "configure url patterns" feature to go from 50 to 100 comments (details below). However, the crawl stops there and doesn't follow the link to the next set of 50 comments (in total there are ~600 comments). Since the "show older comments" URL doesn't change, I'm not sure why it is stopping. Any help would be much appreciated!


Website to be crawled: https://www.kickstarter.com/projects/1865494715/apollo-7-worlds-most-compact-true-wireless-earphon/comments


"show older comments" URL: https://www.kickstarter.com/projects/1865494715/apollo-7-worlds-most-compact-true-wireless-earphon/comments?cursor=14162300


Regular expression for the "follow links that match this expression": (https:\/\/www\.kickstarter\.com\/projects\/1865494715\/apollo-7-worlds-most-compact-true-wireless-earphon\/comments\?cursor=14162300)+



Answer

Hi Eric,

I've tried to replicate your project and obtained the same results as you.

Then I enabled JavaScript, kept the same regex in the spider config in Portia, and changed the follow pattern to follow all in-domain links.

At the moment it is still running and more than 1000 items have been scraped, but I suspect this is not what you want, since you are interested just in the comments for Apollo-7, correct?

I'm not sure whether, once you have all items scraped with this method and filter the data once more with a regex, you would end up with anything beyond the same 50 initial comments for Apollo-7.

Another solution could be using Splash to interact with the page, open all comments first and then parse the data.

If you're interested in using Splash: https://splash.readthedocs.io/en/stable/

I'll keep trying to accomplish this with regex instead; if you have any updates, don't hesitate to share them with us!

Best regards!
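In case the Splash route looks attractive, here is a rough sketch of what it could look like with scrapy-splash. The Lua click loop, the number of clicks, and the "show older comments" and ".comment" selectors are all assumptions that would need to be adapted to the actual page:

# Rough scrapy-splash sketch; selectors and click count are hypothetical
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:wait(2)
  -- click "show older comments" a few times; the selector is an assumption
  for i = 1, 12 do
    splash:runjs("var a = document.querySelector('.older_comments a'); if (a) { a.click(); }")
    splash:wait(2)
  end
  return {html = splash:html()}
end
"""

class CommentsSpider(scrapy.Spider):
    name = 'kickstarter_comments'

    def start_requests(self):
        url = ('https://www.kickstarter.com/projects/1865494715/'
               'apollo-7-worlds-most-compact-true-wireless-earphon/comments')
        yield SplashRequest(url, self.parse, endpoint='execute',
                            args={'lua_source': LUA_SCRIPT})

    def parse(self, response):
        # Parse the fully expanded comment list; '.comment' is a placeholder selector
        for comment in response.css('.comment'):
            yield {'text': ' '.join(comment.css('::text').getall()).strip()}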

+1
Answered
Mark Fallu 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 7 days ago 4

Within the Portia 2.0 beta, I can create a project and a spider within that project... but I can not create samples.


I have tried on multiple browsers.


Via the console it appears there are a few errors:


I am receiving internal server 500 errors for https://portia-beta.scrapinghub.com/api/projects/132127/spiders/app.qualaroo.com/samples 500 (Internal Server Error)


This means I can't create samples - this persists over browser restarts and fresh logins.



Answer

Hi Mark and Alex. We switched our Portia support to version 2.0 this week and are working hard on backend issues, QA tests, and bugs reported by users.
Please let me know if the issues you experienced still appear. If so, please provide as much detail as you can so we can help you further. Thanks.
Best regards!

+1
Answered
ben.hammond 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 2 weeks ago 6

Hello, I'm trying to work out how to upload a list of URLs, extract 3 pieces of data from each page (each page is laid out the same), and then export the data to a CSV file. (So just to clarify, I'm just looking to extract data from specific pages, not crawl a whole website.)


Can anyone give me any advice on how to set this up as a Scrapy or Portia project, or point me in the direction of a tutorial on how to do it? I really just don't know where to start.


Much appreciated!

Thank you

Ben

Answer

Hi Ben!


Our developer team is working on a new release of Portia which will allow you to set a bulk of start URLs using a list (from Dropbox, for example).


We hope to ship this new feature, among others, very soon!

Regards.
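In the meantime, a small Scrapy spider along these lines is one way to do exactly what you describe. The file name, field names and CSS selectors are placeholders you would replace with your own:

# Minimal sketch: read start URLs from a text file, extract three fields per page,
# then export with "scrapy crawl pages -o output.csv".
# File name, field names and selectors are placeholders.
import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'

    def start_requests(self):
        with open('urls.txt') as f:            # one URL per line
            for url in f:
                url = url.strip()
                if url:
                    yield scrapy.Request(url, self.parse)

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
        }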

0
Answered
refplane 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 4

I have tables on the site I want to download. http://prntscr.com/dhsauz

Is it possible to load such pages in the form of multiple item/value pairs, or only by marking each value as a field, so that 10 rows in the table become 20 different fields?


Answer

Hey Refplane!

Some other users have tried to parse a table using Portia; a customer tried to do this before:
https://support.scrapinghub.com/topics/2301-returning-only-one-field-when-it-should-return-multiple/

When your work involves tables, maybe the best solution, as I commented before, is using Scrapy. Even though Portia is powerful and very intuitive, it has its limitations.


The Portia team will soon release a new version with enhanced features, so stay tuned. This could be a feature to improve in the next releases.

Best regards!
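If you do go the Scrapy route, a sketch like the following turns each table row into one label/value item instead of a flat list of 20 fields. The start URL and the table/row/cell selectors are assumptions, not taken from the real page:

# Sketch of parsing a two-column table into label/value items; selectors are assumptions
import scrapy

class TableSpider(scrapy.Spider):
    name = 'table'
    start_urls = ['http://example.com/page-with-table']   # placeholder URL

    def parse(self, response):
        for row in response.css('table tr'):
            cells = row.css('td::text').getall()
            if len(cells) >= 2:
                # one item per row: first cell as the label, second as the value
                yield {'label': cells[0].strip(), 'value': cells[1].strip()}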

0
Answered
refplane 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 week ago 4

I want to download a site which has a search results page with pagination at the bottom. So I am starting e.g. from
https://rospravosudie.com/vidpr-ugolovnoe/result-obvinitelnyj-prigovor-s/section-acts/page-10000/

this is page 10000. It contains links to pages 10001 and 10002, those pages contain links to pages 10002, 10003 and 10004, and so on. Each page also contains 20 links to article pages, for which I also created a URL rule and a sample page.

All I get is 70-100 items downloaded (I tried 3 times). I limited pages to 10* and I suspect the following depth is 12, but this should still get about 23 pages and 460 articles. Am I doing something wrong?


The site has an anti-DDoS system: it blocks immediately if more than 3 requests are made simultaneously from one IP, or after some time with fewer simultaneous requests (after about 500 requests). After this it starts to return error 503. Does Scrapinghub retry the same page after one downloader was blocked while trying to download it? Will it "self-learn" the anti-DDoS system's pattern, or will all download hosts sooner or later be blocked?

Answer

You're welcome, Refplane. Regarding your questions, you first need to set up a Crawlera account; sorry if I didn't mention earlier that this isn't free!

Many of our customers have started with free plans and once they need to accomplish more complex crawling projects they move to paid plans. Please take a few minutes to read more about Crawlera and our plans:

https://scrapinghub.com/crawlera/

The Crawlera addon is robust, easy to configure and has a lot of features that help you avoid being banned. Of course you have to balance it against your project's needs. Sometimes it saves you a lot of time and resources; other times you just don't need it.


About the containers: they improve efficiency and give you more CPU power and more data storage.

Let me know if you need further assistance.

Best regards!
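For completeness, enabling Crawlera in a Scrapy project typically comes down to a few settings for the scrapy-crawlera middleware. This is only a sketch, assuming you enable it from the project settings rather than the Scrapy Cloud addon page, and the API key is a placeholder:

# settings.py sketch for the scrapy-crawlera middleware (API key is a placeholder)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'

# Stay well under the site's "3 simultaneous requests per IP" threshold
CONCURRENT_REQUESTS = 2

If I recall correctly, on Scrapy Cloud you can achieve the same thing by enabling the Crawlera addon from the project settings instead of editing settings.py.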

0
Fixed
Patrick 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 3 weeks ago 2

When I add this url to Portia: http://www.saford.com/new-inventory/index.htm


Then, it loads for a while. When I click "New Spider" I start seeing errors. It doesn't save that page as the start URL, errors show in red rectangles, and there is no way of annotating the page to tell it what data to collect.
Error I get:

"The backend responded with an error. An error occurred while communicating with the server. Our developers have already been notified."


Really hoping I can get past this to finalize a scraper.
+1
Answered
temp2 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 0

I see that frames are not allowed, but I don't see a "Toggle page styling" button.

Is there a web scraper that doesn't use any of the targeted site's code to scrape? One that visually sees what's on the page and visually parses the text?





Answer

Hi! You can enable the "Toggle page styling" button after setting up your new spider and clicking the "Edit sample" button.

There, you will find a menu with the "Toggle" button.



On the other hand, iframes are not supported on Portia at this moment.

I hope this has been useful. Regards.

0
Answered
bjoern.juergensen 3 months ago in Portia • updated by Nestor Toledo Koplin (Support Engineer) 2 months ago 1

Hi, I am sorry, I don't have a clue what to type into the "run spider" dialogue.

I mean, I have created one, but typing its name into the field has no effect.

Can you assist, please?

thanks

Answer

Hello,


You'll need to publish the spider from Portia's UI and then you'll be able to run the spider in Scrapy Cloud. Note: the icon looks like a green cloud.

+1
Answered
Mahmood 3 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I have created a new spider in Portia. Is it possible to set requests per second for this spider when it's running in the jobs section?

Answer

Hi Mahmood,


You are absolutely right.



You can also check the requests/min rate, by looking at the request data shown on the running job, to confirm that the rate does not exceed the settings values. In your case that's 0.25 requests per second, or 15 requests per minute.


Thanks for participating in the forum!
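As a sketch, the rate itself is usually set through the spider settings (on the Scrapy Cloud settings page or in settings.py); the values below correspond to roughly 15 requests per minute and are only an example:

# settings.py sketch -- roughly 0.25 requests/second, i.e. ~15 requests per minute
DOWNLOAD_DELAY = 4                     # wait 4 seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # one request at a time per domain
# Alternatively, AUTOTHROTTLE_ENABLED = True lets Scrapy adjust the delay dynamically.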