Answered
mouch 1 month ago in Scrapy Cloud • updated by Pablo Vaz (Support Engineer) 3 weeks ago

Hi there,



I have a spider that is running perfectly well without proxy - on ScrapingHub also.

I then implemented a rotating proxy and bought a few proxies for my use. Locally, it is running like a charm.
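
For reference, the rotation is wired up roughly like this, assuming the rotating_proxies entries in the logs below come from the scrapy-rotating-proxies package (the proxy hosts are placeholders, not my real ones):

# settings.py - sketch of the rotating proxy setup (placeholder values)
ROTATING_PROXY_LIST = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    # ... and the other proxies I bought
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}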


So, I decided to move this to ScrapingHub, but the spider is not working anymore. It actually never ends.


See my logs below:

2017-05-28 14:07:27 INFO [scrapy.core.engine] Spider opened
2017-05-28 14:07:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:07:27 INFO TelnetConsole starting on 6023
2017-05-28 14:07:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:07:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:08:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:08:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:08:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:09:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:09:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:09:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:10:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:10:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:10:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:11:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-28 14:11:27 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)
2017-05-28 14:11:57 INFO [rotating_proxies.middlewares] Proxies(good: 0, dead: 0, unchecked: 5, reanimated: 0, mean backoff time: 0s)



I'm still wondering what is going wrong here. Why is the rotating proxy extension doing... nothing?

Could it be that ScrapingHub is actually blocking the use of proxy extensions to make sure we use Crawlera instead? Still, it is hard for me to understand how it could technically detect this :)


Thank you for your feedback on this,

Cyril


While digging a bit more into the Scrapy data flow (following this link: https://doc.scrapy.org/en/latest/topics/architecture.html), I can now say that ScrapingHub is not calling the process_request method of my downloader middleware (the rotating proxy extension, which I copy-pasted locally to add some loggers).


Thus, locally, I'm able to see my DEBUG log at the beginning of the process_request method, but when I push this project to ScrapingHub, I don't see anything.


Going by the architecture schema shared above, would that mean there is an issue in the ScrapingHub Scrapy engine?

What should I do to troubleshoot more precisely? Maybe there are some silent errors that are not displayed in the Scrapy log?
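
One way to check whether the middleware is even being reached, without relying on log lines at all, is to bump a custom stats counter: the stats are dumped at the end of every job regardless of log level. A minimal sketch (the counter name is just an example):

class ProbeMiddleware(object):
    """Downloader middleware that only counts how often it is called."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware and hands us the crawler,
        # whose stats collector ends up in the "Dumping Scrapy stats" block.
        return cls(crawler.stats)

    def process_request(self, request, spider):
        self.stats.inc_value('probe/process_request_calls')
        return None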



Your help would be very much appreciated :)

Cyril


Edit: I wrote a new middleware that does nothing other than kill the spider and shout some logs. Looking at the ScrapingHub log, we can see the spider has been killed (which means the middleware is indeed loaded and processed), but there is no log from the middleware... So ScrapingHub seems to hide logs from middlewares (and that's why I previously assumed my rotating proxy was not being called - it is, but the logs aren't showing).


Well, that doesn't help me understand what's going on there...


See the ScrapingHub logs below:

2017-05-29 08:52:11 INFO Log opened.
2017-05-29 08:52:11 INFO [scrapy.log] Scrapy 1.3.3 started
2017-05-29 08:52:11 INFO [scrapy.utils.log] Scrapy 1.3.3 started (bot: cotextractor)
2017-05-29 08:52:11 INFO [scrapy.utils.log] Overridden settings: {'NEWSPIDER_MODULE': 'cotextractor.spiders', 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['cotextractor.spiders'], 'AUTOTHROTTLE_ENABLED': True, 'LOG_ENABLED': False, 'MEMUSAGE_LIMIT_MB': 950, 'TELNETCONSOLE_HOST': '0.0.0.0', 'BOT_NAME': 'cotextractor', 'MEMUSAGE_ENABLED': True, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'}
2017-05-29 08:52:11 INFO [scrapy.middleware] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.debug.StackTraceDump',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.spiderstate.SpiderState',
 'scrapy.extensions.throttle.AutoThrottle',
 'sh_scrapy.extension.HubstorageExtension']
2017-05-29 08:52:11 INFO [cotextractor.spiders.spiders] Start URL found in kwargs: myStartURL
2017-05-29 08:52:12 INFO [scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'cotextractor.middlewares.NothingMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
2017-05-29 08:52:12 INFO [scrapy.middleware] Enabled spider middlewares:
['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware',
 'sh_scrapy.middlewares.HubstorageSpiderMiddleware',
 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-29 08:52:12 INFO [scrapy.middleware] Enabled item pipelines:
[]
2017-05-29 08:52:12 INFO [scrapy.core.engine] Spider opened
2017-05-29 08:52:12 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-29 08:52:12 INFO TelnetConsole starting on 6023
2017-05-29 08:52:12 ERROR [scrapy.core.scraper] Error downloading <GET myStartURL>
Traceback (most recent call last):
  File "/app/python/lib/python2.7/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/tmp/unpacked-eggs/__main__.egg/cotextractor/middlewares.py", line 28, in process_request
    raise CloseSpider('You shall not pass!')
CloseSpider
2017-05-29 08:52:12 INFO [scrapy.core.engine] Closing spider (finished)
2017-05-29 08:52:12 INFO [scrapy.statscollectors] Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.CloseSpider': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 5, 29, 8, 52, 12, 232336),
 'log_count/ERROR': 1,
 'log_count/INFO': 8,
 'memusage/max': 53587968,
 'memusage/startup': 53587968,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/disk': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/disk': 1,
 'start_time': datetime.datetime(2017, 5, 29, 8, 52, 12, 26821)}
2017-05-29 08:52:12 INFO [scrapy.core.engine] Spider closed (finished)
2017-05-29 08:52:12 INFO Main loop terminated.


And here is the NothingMiddleware:

import logging
from scrapy.exceptions import CloseSpider

logger = logging.getLogger(__name__)

class NothingMiddleware(object):
    def process_request(self, request, spider):
        # This DEBUG line shows up locally but never appears in the Scrapy Cloud log.
        logger.debug('Here!!')
        raise CloseSpider('You shall not pass!')

    def process_response(self, request, response, spider):
        return response
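
One possible explanation for the missing 'Here!!' line, other than ScrapingHub hiding middleware logs: the overridden settings in the job log above show 'LOG_LEVEL': 'INFO', which would filter out DEBUG messages. A quick way to test that assumption would be to log at INFO level instead:

    def process_request(self, request, spider):
        # INFO instead of DEBUG, so it survives LOG_LEVEL = 'INFO'
        spider.logger.info('Here!!')
        raise CloseSpider('You shall not pass!')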


Hi again,



I have to conclude that I don't know what to do to make ScrapingHub work with our own proxies... I actually think it is not possible at all.


I did one last small test: I just added request.meta['proxy'] = 'http://myproxy:8080' to my NothingMiddleware.

Whereas locally it runs perfectly fine, once uploaded to ScrapingHub the spider gets stuck: the downloader middleware never receives its response from the downloader. Eventually, the job fails with the same error (TCP connection timed out: 110: Connection timed out).
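
For reference, the variant used for that test looked roughly like this (the proxy URL is a placeholder):

class NothingMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through my own proxy (placeholder URL).
        # Works locally; from Scrapy Cloud the download times out.
        request.meta['proxy'] = 'http://myproxy:8080'
        return None

    def process_response(self, request, response, spider):
        return response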


So here are my assumptions, which still need to be confirmed by ScrapingHub staff:

  1. ScrapingHub uses its own proxies (even without a Crawlera subscription), making our own proxies useless
  2. For some reason (political more than technical?), we are not allowed to modify the proxy
  3. Middlewares' logs definitely do not show in the output log displayed in the web interface :)



I now have to choose between building my own scrapyd server or subscribing to Crawlera.

Unfortunately, I'll probably go for the first option; Crawlera's entry price is quite high (and I need to edit my UA, btw).


In the end, this was really instructive for me :) I hope my topic can help someone.



Cyril

Whoa. I completely forgot that my proxies are IP-bound: only my local IP is allowed to use them :) That would explain a lot. Such a stupid mistake!
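
One way to double-check that theory is to call the proxy from a machine whose IP is not whitelisted and see whether it times out the same way. A quick sketch using the requests library (the proxy URL is a placeholder):

import requests

# Placeholder proxy URL - the same one used in the middleware test above.
proxies = {'http': 'http://myproxy:8080', 'https': 'http://myproxy:8080'}

try:
    # httpbin echoes back the IP the request came from.
    r = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
    print(r.text)
except requests.exceptions.RequestException as exc:
    # From a non-whitelisted IP this is expected to time out or be refused.
    print('Proxy not reachable from this host: %s' % exc)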


Anyway, can someone from the ScrapingHub team confirm that it is pointless to use our own proxies? Since they get proxied one more time by ScrapingHub's own proxies, they would be useless for avoiding IP bans.



Thanks,

Cyril

Answer
Answered

Hey Cyril,


Nice post, your contributions help to improve this forum and we encourage you to continue doing that. Well done!


About your last question: indeed, your own proxies won't be used. We use our own proxies with Scrapy Cloud projects (Scrapy or Portia), and of course when Crawlera is enabled all requests are made from its pool of proxies.


Best regards,


Pablo
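
For anyone landing here later: enabling Crawlera in a Scrapy project is typically done through the scrapy-crawlera middleware. A minimal sketch, assuming that package (the API key is a placeholder):

# settings.py - minimal Crawlera setup via the scrapy-crawlera package
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'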