Not a bug
Braulio Ríos Ferreira 1 month ago in Crawlera • updated 1 month ago

The test spider code is the following (I've removed irrelevant code, but this spider has been tested and reproduces the same error):


# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy.spiders import Spider

class AlohaTestSpider(Spider):
    name = "aloha_test"

    def __init__(self, *args, **kwargs):
        super(AlohaTestSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        site = 'https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList/'
        yield Request(url=site,
                      method='POST',
                      callback=self.parse,
                      headers={"Content-Type": "application/json"})

    def parse(self, response):
        print(response.body)

When I run this spider:

$ scrapy crawl aloha_test


I keep getting the following error:

2017-03-20 12:33:11 [scrapy] DEBUG: Retrying <POST https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList/> (failed 1 times): 400 Bad Request


In the original spider, I have a retry decorator, and this error repeats for 10 retries.


I only get this error with this specific request. In the real spider, which makes several HTTPS requests before this one, it only fails when this request is reached (the previous HTTPS requests return 200 OK).


Please note that this is a POST request that doesn't carry any data. I don't know if this is relevant, but it's the only particularity of this request in my spider.


If I deactivate "CrawleraMiddleware" and activate "CustomHttpProxyMiddleware" in DOWNLOADER_MIDDLEWARES (settings.py), I can make the request without error.
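For reference, the toggle described above looks roughly like this in settings.py. The `scrapy_crawlera.CrawleraMiddleware` path is the one documented by scrapy-crawlera; the `CustomHttpProxyMiddleware` path and the priorities are illustrative and depend on the project layout:

```python
# settings.py -- sketch of the middleware toggle described above.
# 'scrapy_crawlera.CrawleraMiddleware' is the documented scrapy-crawlera path;
# the CustomHttpProxyMiddleware path is hypothetical and project-specific.
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy_crawlera.CrawleraMiddleware': 610,               # deactivated
    'myproject.middlewares.CustomHttpProxyMiddleware': 610,    # activated instead
}
```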


If I make this request using curl, I can't reproduce the error, even through Crawlera; both of the following requests work fine:


$ curl --cacert ~/crawlera-ca.crt -H 'Content-Type: application/json' -H 'Content-Length: 0' -X POST -vx proxy.crawlera.com:8010 -U MY_API_KEY https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList


$ curl -H 'Content-Type: application/json' -H 'Content-Length: 0' -X POST https://aroogas.alohaorderonline.com/OrderEntryService.asmx/GetSiteList


I've tried everything I can think of (Crawlera sessions, disabling Crawlera cookies, different HTTP headers), but I can't figure out a way to get this request to work with Crawlera.


I guess it has to do with the Crawlera middleware in Scrapy, but I don't know what Crawlera might be doing with the HTTP headers that causes this request to fail.

Any suggestions about what could be causing this error?

Answer
Not a bug

Hi Braulio,


As you correctly verified with curl, it seems your Crawlera account is working fine.

Also, all projects using the scrapy-crawlera integration are working fine on our platform.


Regarding the integration with Scrapy, I suggest you review the information provided here:

http://help.scrapinghub.com/crawlera/using-crawlera-with-scrapy


To learn more, please see the official documentation:

http://scrapy-crawlera.readthedocs.io/en/latest/


If your project needs urgent attention, you can also consider hiring our experts. We can set up scrapy-crawlera projects that fit your needs, saving you a lot of time and resources. If you're interested, let me invite you to fill out our free quote request: https://scrapinghub.com/quote


Best regards,


Pablo Vaz

Support team


Were you able to reproduce the situation with the test spider that I provided?

As I mentioned, my Crawlera configuration is correct and is part of a much bigger project which is currently in production (I was already familiar with the crawlera-scrapy documentation you provided).

The only problem is with this specific request, so for now I'm disabling Crawlera for this spider and everything works fine.

So I've come to the conclusion that Crawlera, or the crawlera-scrapy middleware, is doing something that is not well documented and not visible to me, which is causing this request to be rejected. This has made me waste a lot of time searching through the entire documentation and blog posts, trying to discover which feature might be causing the problem, but nothing has worked.

You'll understand, then, that I'm more willing to look at alternatives to Crawlera than to hire an expert, unless you can tell me that the test spider I provided works under your configuration with Crawlera enabled/disabled and that you could not reproduce this bug.

Looking forward to your comments.