Our Forums have moved to a new location - please visit Scrapinghub Help Center to post new topics.

You can still browse older topics on this page.


Welcome to the Scrapinghub community forum! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting for the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.

0
Answered
Isaan Online 1 month ago in Portia • updated 1 month ago 2

I am trying Portia for the first time. The first URL I want to extract data from never stops loading. Is there anything I can do to prevent that? In a web browser it loads normally.


Before I start buying slots for Scrapy Cloud, I would prefer to be sure I can use Portia.


Supatra

Answer

Hi Isaan, for most sites Portia loads the page correctly and lets you extract data without any worries. Unfortunately, some others are too complex to interact with through this browser.


For this kind of site I suggest starting with Scrapy as soon as possible; it involves more coding, but it's more powerful.

If you are interested in learning more, please visit our new learning center:

https://learn.scrapinghub.com

BTW, there are more resources for Portia there too.


Best regards,


Pablo

0
Answered
1044204605 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 1 month ago 1

I bought the Crawlera service just now and successfully ran my Scrapy project using the Crawlera proxy, but the usage of my Crawlera service is always 0% and there's nothing in the histogram on the website. Why?

Answer

Hi,


The stats plots are slightly delayed. You will start to see the requests you've made after 20 to 30 minutes.


If the stats still show 0% usage, let us know through our support platform so we can check our internal stats and see whether you are indeed making requests through Crawlera or bypassing it.


Best regards,


Pablo Vaz

Support Engineer

0
Answered
han 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hi, I know that there is a 24 hour limit for free accounts.

If I have a large scraping job that will definitely take more than 24 hours to run, is there a way I can continue from where the scraping has stopped?

Answer

Hey Han, yes you can.

When you upgrade your Scrapy Cloud unit, you can crawl for as long as you need.


Here are more details:

Scrapy Units


Best regards,

Pablo 

0
Answered
Sergej 1 month ago in Crawlera • updated 4 days ago 5

Hello,


I am experiencing issues using Crawlera on HTTPS sites with Scrapy and PhantomJS. My config is:


        service_args = [
            '--proxy=proxy.crawlera.com:8010',
            '--proxy-type=http',
            '--proxy-auth=XXXXXXXXXXXXXXXXX:',
            '--webdriver-logfile=phantom.log',
            '--webdriver-loglevel=DEBUG',
            '--ssl-protocol=any',
            '--ssl-client-certificate-file=crawlera-ca.crt',
            '--ignore-ssl-errors=true',
            ]


However, I always get this error and the result is empty:

"errorCode":99,"errorString":"Cannot provide a certificate with no key, "


I am stuck for hours on this problem. Any help much appreciated. 


Thank you!

Sergej

Answer

Hey Sergej, even though there's no exact ETA, we expect it within a few weeks.

There has been increasing demand for Casper, Nightmare and PhantomJS support in recent months, and our team has made it a priority.

We commonly use our blog to post new releases and info about cool features:
Scrapinghub Blog


Best regards,

Pablo

0
Answered
oday_merhi 1 month ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hello guys,


So I'm trying to scrape articles with certain text in them. Can I teach my spider to scrape only specific articles? For example, if I was on a food site that had articles and I only wanted recipes with banana, is there a way for me to set up the spider to only scrape the articles with the keyword "banana"?


Thank you for your help!

Answer

Hey Oday, yes, I think it is possible to set up some kind of extractor using regular expressions.


If that's not possible with Portia, you can try Scrapy:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000201028-learn-scrapy-video-tutorials-
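
If you end up writing a Scrapy spider for this, a minimal sketch could look like the one below. It's only an illustration: the listing URL and the CSS selectors are hypothetical placeholders that would need to match the real site.

import scrapy


class BananaRecipeSpider(scrapy.Spider):
    # Sketch: only yield articles whose text mentions the keyword "banana".
    name = "banana_recipes"
    start_urls = ["https://example-food-site.com/articles"]  # hypothetical listing page

    def parse(self, response):
        # Follow each article link found on the listing page (hypothetical selector).
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Join the visible text and check for the keyword before yielding an item.
        body_text = " ".join(response.css("article ::text").getall())
        if "banana" in body_text.lower():
            yield {
                "url": response.url,
                "title": response.css("h1::text").get(),
            }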


Best regards,


Pablo

0
Not a bug
MikeLambert 1 month ago in Splash • updated by Pablo Vaz (Support Engineer) 4 weeks ago 1

It seems Distil is blocking my use of Splash now. I'm not sure if the site I'm accessing just started using Distil, or if Distil just started detecting Splash.


Is there any information as to how Distil is detecting Splash? I've seen examples for Selenium needing to delete certain properties from document.window, but I am unclear as to exactly how Splash is automating things, and what it might stick in the browser that makes it detectable.


I did find https://www.quora.com/How-do-you-crawl-Crunchbase-and-bypass-the-bot-protection , where Pablo Hoffman (scrapinghub co-founder) suggests contacting scrapinghub to help with crawling. I'm not sure what the costs for a full consulting gig to do this are (any estimates?)


I'm already using scrapinghub/splash and pay for an instance, but if it's impossible to get through Distil, I'll just have to turn off my Splash JS instance and remove the feature, so any pointers (public or private to mlambert@gmail.com) would be appreciated!

Answer

Hi Mike, Distil Networks is a powerful anti-bot system. I can see you contacted our team through: https://scrapinghub.com/quote

Our team will evaluate what could be the most suitable option in your case.


Best regards!


Pablo

0
Not a bug
joaquin 1 month ago in Crawlera • updated by Pablo Vaz (Support Engineer) 3 weeks ago 14

I am running some crawlers on a site, all of them using Crawlera, and I am getting several 429 error statuses (which means too many requests), so Crawlera doesn't seem to be adjusting its throttling to accommodate these errors.


Does your throttling algorithm consider 429 status codes?


I am using scrapy plus the crawlera middleware btw.

Answer

Also, the plan's max concurrency applies to overall account usage, which may be spread across crawls of many different sites at the same time.
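
On the Scrapy side, one thing worth trying (a sketch, not an official Crawlera recommendation) is to let Scrapy's retry middleware handle 429 responses and to keep concurrency within your plan's limit:

# settings.py sketch: retry 429s and reduce pressure on Crawlera.
# The numbers are illustrative; tune them to your Crawlera plan.
CONCURRENT_REQUESTS = 8                            # stay at or below the plan's concurrency
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 408]  # include 429 so Scrapy retries it
RETRY_TIMES = 5                                    # retry a few times before giving up
DOWNLOAD_TIMEOUT = 600                             # allow for slower proxied responses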

0
Answered
ysappier 1 month ago in Scrapy Cloud • updated by Martin Olveyra (Engineer) 1 month ago 2
0
Answered
han 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hi there, I started using scrapinghub a week ago, and have been using it to scrape some ecommerce websites.


I notice that the crawl job for a particular website keeps ending prematurely without any error logs.

On some occasions, I tried to visit the website and found out that I had been blocked.

So I activated Crawlera and the result is the same.


What could I be missing out?

Answer

Hi Han, even though we can't provide ban assistance or crawl tuning for standard accounts, there are some best practices you can keep in mind when enabling Crawlera for a particular project.


Please take a few minutes to explore the details in:

Crawlera best practices
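
For reference, enabling Crawlera in a Scrapy project usually comes down to a few settings like the sketch below (based on the scrapy-crawlera middleware; the API key is a placeholder):

# settings.py sketch: route requests through Crawlera.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'  # placeholder

# Commonly recommended alongside Crawlera:
AUTOTHROTTLE_ENABLED = False  # let Crawlera handle throttling
DOWNLOAD_TIMEOUT = 600        # proxied requests can take longer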


Best regards,

Pablo

0
Answered
Rodrigo Barria 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 1 month ago 1

Hi,


I would like to know how to make the spider go inside each one of the real-estate properties listed on this page:


http://www.portalinmobiliario.com/venta/departamento/santa-isabel-santiago-santiago-metropolitana?tp=2&op=1&ca=2&ts=1&dd=2&dh=6&bd=0&bh=6&or=f-des&mn=1&sf=1&sp=0&sd=47%2C00



In addition, the main listing consists of several pages; how can I make the spider go to the 2nd page and so on?


thank you very much

Answer

Hi Rodrigo,


You have many options for how Portia crawls the site; try "Follow all in-domain links".



If this doesn't work, try different alternatives. Take a few minutes to explore these articles:

Portia > List of URLs

Portia > Pagination
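
If Portia's options fall short, the same crawl can be expressed in Scrapy as a CrawlSpider with one rule for the pagination links and one for the property detail pages. This is only a sketch: the pagination selector and the URL pattern are guesses and would need to be checked against the site's real markup.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PropertiesSpider(CrawlSpider):
    name = 'portalinmobiliario'
    allowed_domains = ['portalinmobiliario.com']
    # Use the full listing URL from the question as the starting point.
    start_urls = ['http://www.portalinmobiliario.com/venta/departamento/santa-isabel-santiago-santiago-metropolitana']

    rules = (
        # Follow the pagination links within the listing (selector is a guess).
        Rule(LinkExtractor(restrict_css='.pagination')),
        # Follow each individual property link and parse it (pattern is a guess).
        Rule(LinkExtractor(allow=r'/venta/departamento/'), callback='parse_property'),
    )

    def parse_property(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
        }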


Best regards,


Pablo


0
Not a bug
parulchouhan1990 2 months ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I am trying to read input URLs from a text file.

def start_requests(self):
    # read file data
    with open(self.url, 'r') as f:
        content = f.readlines()

    for url in content:
        yield scrapy.Request(url)


I am using the above code but getting this error:

IOError: [Errno 2] No such file or directory
Answer

Hi Parul, 


As seen in our Article:


You need to declare the files in the package_data section of your setup.py file.

For example, if your Scrapy project has the following structure:

myproject/
  __init__.py
  settings.py
  resources/
    cities.txt
scrapy.cfg
setup.py

You would use the following in your setup.py to include the cities.txt file:

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={
        'myproject': ['resources/*.txt']
    },
    entry_points={
        'scrapy': ['settings = myproject.settings']
    },
    zip_safe=False,
)

Note that the zip_safe flag is set to False, as this may be needed in some cases.

Now you can access the cities.txt file content from settings.py like this:

import pkgutil
data = pkgutil.get_data("myproject", "resources/cities.txt")

Note that this code works for the example Scrapy project structure defined at the beginning of the article. If your project has a different structure, you will need to adjust the package_data section and your code accordingly.

For advanced resource access, take a look at the setuptools pkg_resources module.
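
Putting it together, a sketch of the spider reading the bundled file (assuming the myproject/resources/cities.txt layout above, with a hypothetical spider name) would be:

import pkgutil

import scrapy


class CitiesSpider(scrapy.Spider):
    name = 'cities'  # hypothetical spider name

    def start_requests(self):
        # Read the file bundled via package_data instead of opening it from disk.
        data = pkgutil.get_data('myproject', 'resources/cities.txt')
        for url in data.decode('utf-8').splitlines():
            url = url.strip()
            if url:
                yield scrapy.Request(url)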


Best regards,


Pablo

0
Answered
Matthew Sealey 2 months ago in Portia • updated 2 months ago 5

I'm trying to use Portia to pull data from a set of possible pages generated from a list. I know a lot of the pages don't exist, but I don't know which ones.


So far Portia gets stuck in a loop of reattempting pages multiple times. That uses up my request limit unnecessarily. Is there a way of limiting Portia to just two attempts at a single page before it discards it and stops retrying?

Answer

Hi Matthew!

Have you tried some extra settings using regex? Perhaps you don't know exactly which pages are unnecessary, but if you know something about their URLs you can exclude them.


Check this article:

Portia > Regex
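
If the concern is specifically the number of retries rather than which URLs get visited, Portia projects deployed to Scrapy Cloud also accept standard Scrapy settings; a sketch (assuming you add these in the project's settings) would be:

# Scrapy settings sketch: give up on a failing page after two attempts in total.
RETRY_ENABLED = True
RETRY_TIMES = 1  # one retry after the first attempt = two attempts per page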


Best regards!

Pablo

0
Answered
mark80 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 5

It's my first project and it seems like a great app!!

Here http://www.sia.ch/it/affiliazione/elenco-dei-membri/socii-individuali/ there are 255 pages (the little number on top of the list), and I need to extract not only these 4 visible columns but also the email and telephone inside each name on the list.

I've already extracted the 255 pages with the main 4 columns from the sample link, but I don't know how to go one level deeper into each name.

Can I do the whole job with a single crawler project?

Answer

Hey Mark, I think I managed to do it. I've sent you an invitation to take a look at the project.

Feel free to open Portia to check the settings I made.


Best,


Pablo

0
Answered
Jorge 2 months ago in Portia • updated by Pablo Vaz (Support Engineer) 2 months ago 1

I know it is possible to apply a filter to scraped data, but I would like to download the .JSON output with the filter criteria already applied and skip the rest of the data. Is that possible?


thanks in advance

Answer

Hola Jorge!


I think you can experiment a bit with sharing data between spiders, as shown in this article:

https://helpdesk.scrapinghub.com/support/solutions/articles/22000200420-sharing-data-between-spiders

but I'm not sure if this is efficient for your purposes.


I would prefer to filter locally, but of course that depends on the project.
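
If the project runs on Scrapy rather than pure Portia, one way to get a filtered .JSON export is to drop unwanted items in a small item pipeline, so only matching items reach the feed. A sketch, where the field name and the criterion are placeholders:

from scrapy.exceptions import DropItem


class KeepOnlyMatchingPipeline:
    # Drop items that don't satisfy the filter criteria before export.
    def process_item(self, item, spider):
        # 'category' and its value are hypothetical; use your own field and rule.
        if item.get('category') != 'wanted':
            raise DropItem("Filtered out: %s" % item.get('url', item))
        return item

Enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.KeepOnlyMatchingPipeline': 300} (the module path is a placeholder).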


Best,


Pablo

0
Answered
tsvet 2 months ago in Splash • updated by Pablo Vaz (Support Engineer) 2 months ago 1

Is it possible to choose the version of Splash for my Splash instance? Mine is v2.1, but I need to use a function that only appears to be available in v2.3.

Answer

Hi Tsvet,


Yes, it is possible from our internal setup. Feel free to contact us through the support help desk so we can help you further.

Kind regards,

Pablo Vaz