Welcome to the Scrapinghub feedback & support site! We discuss all things related to Scrapy Cloud, Portia and Crawlera. You can participate by reading posts, asking questions, providing feedback, helping others, and voting on the best questions & answers. Scrapinghub employees regularly pop in to answer questions, share tips, and post announcements.
0
CSPlusC 3 hours ago in Scrapy Cloud 0

My local project uses ITEM_PIPELINES to pass all items to a custom XLSX pipeline that writes them in the desired format (using openpyxl). After deploying the project to the cloud I was also able to deploy the egg for openpyxl, so the project runs and completes successfully. However, I can't find where to download the created XLSX file.
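
For reference, a simplified sketch of the kind of pipeline I mean (the class name, field names and output filename are illustrative):

    # enabled via ITEM_PIPELINES in settings.py
    from openpyxl import Workbook

    class XlsxExportPipeline(object):
        def open_spider(self, spider):
            self.wb = Workbook()
            self.ws = self.wb.active
            self.ws.append(["field_a", "field_b"])  # header row

        def process_item(self, item, spider):
            self.ws.append([item.get("field_a"), item.get("field_b")])
            return item

        def close_spider(self, spider):
            # the workbook is saved to the job's local filesystem
            self.wb.save("output.xlsx")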


Sorry if I'm missing something obvious. After combing through the dashboard for a way to download it, I also searched the knowledge base and these forums and couldn't find an explanation. Thanks in advance!

0
Roney Hossain 4 days ago in Portia 0

I am new to Scrapinghub and have just started testing Portia 2 (beta). On the start page I want to add more URLs. In the beta version I have to add URLs one by one, which is not convenient for hundreds of start URLs. (Note: I don't want to crawl all links, just those start pages.)

It seems that in the current stable version of Portia I can paste bulk links as start pages, which is missing in the Portia 2 beta. Can you please give us an option to paste bulk URLs?

Thank you

0
Lightyagami 1 week ago in Crawlera 0

Hi,

I am using Crawlera to scrape LinkedIn pages but I keep getting a 503 error like the following:


< HTTP/1.1 503 Service Unavailable

< Connection: close
< Content-Length: 17
< Content-Type: text/plain
< Date: Mon, 22 Aug 2016 07:13:01 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< Retry-After: 1
< X-Crawlera-Error: slavebanned
< X-Crawlera-Version: 1.11.9-4-g8eb8d9f
<
* Closing connection 0
Website crawl ban

Does Crawlera not support LinkedIn anymore?

0
Answered
6jh7ln 2 weeks ago in Portia • updated by Tomas Rinke 2 weeks ago 1

Here's an example: I'm trying to scrape data about the courses on this university site. How do I make Portia follow the links from the start page to a page with course data, but not follow any further links? After it gets the course page, I want it to go back to the start page and try the next link. On the example site above, I would want it to click "Aerospace Studies" and scrape that page, but then go back to the start page and click "African American Studies".

Answer
Tomas Rinke 2 weeks ago

Hi, first of all you could configure your link crawling with a URL pattern: set a regular expression that matches the course pages, for example /course/, and toggle the link highlighting to see which links will be followed (green) and which will not (red).


Another option is to follow all domain links and set the Scrapy setting DEPTH_LIMIT to 1 in your project, but this will match every domain link present on that site.
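
A minimal sketch of that second option, assuming the setting is placed in the project's settings.py:

    # settings.py
    DEPTH_LIMIT = 1  # follow links at most one level deep from the start URLs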

0
Answered
Ijoin 2 weeks ago in Portia • updated by Tomas Rinke 2 weeks ago 2

Hi,

Just a noob question, but I found no solution; I tried everything and nothing works.


I tried to scrape these links:

https://www.example.com/search?q=&page=1 to 4000


I made crawling rules like this


Start url

https://www.example.com


Crawling rules

search?q=&page=[0-9]+/


I also tried


search?q=&page=\d+



Still, it doesn't work at all


Any idea?

Answer
Tomas Rinke 2 weeks ago

Hi,

Checking your regular expressions using http://pythex.org/:


- you should escape the ? character, since it is a quantifier

- in the first regex there is an extra / at the end, so it doesn't match the sample
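
For example, a quick check of the corrected pattern (? escaped, no trailing /) with Python's re module:

    import re

    pattern = re.compile(r"search\?q=&page=\d+")

    print(bool(pattern.search("https://www.example.com/search?q=&page=1")))     # True
    print(bool(pattern.search("https://www.example.com/search?q=&page=4000")))  # True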


regards,


Tomas

0
Kaio Mano 2 weeks ago in Crawlera 0

Hello,

Almost 100% of my requests to the domain sinespcidadao.sinesp.gov.br are getting 504 responses.

How can I solve this?

0
AppleGaming 2 weeks ago in Portia 0

I'm very new to this stuff, so bear with me. I'm trying to get data from this website:

http://www.hotslogs.com/Sitewide/HeroAndMapStatistics

I want the hero (heroName), games played, games banned, popularity, and win %. When I run the spider I created with Portia, I only get the heroName field back, and it contains the combined data for the entire table in one field.


Why is this happening when the "test spider" button seems to return exactly what I want?

0
Shoamtal 3 weeks ago in Crawlera 0

Hello

Is it possible to set the content type header myself?

I'm sending 'content-type': 'application/json; charset=utf-8',

but Crawlera passes 'content-type': 'application/json' to the destination.
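
For illustration, a minimal way to reproduce this with the requests library (the API key is a placeholder, and httpbin is used only because it echoes the headers it receives):

    import requests

    proxies = {"http": "http://<CRAWLERA_APIKEY>:@proxy.crawlera.com:8010/"}
    headers = {"Content-Type": "application/json; charset=utf-8"}

    resp = requests.post(
        "http://httpbin.org/post",
        data='{"key": "value"}',
        headers=headers,
        proxies=proxies,
    )
    # shows what Content-Type actually arrived at the destination
    print(resp.json()["headers"].get("Content-Type"))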


0
Waiting for Customer
Robbieone 3 weeks ago in Scrapy Cloud • updated 2 weeks ago 7

Two questions.

1. Is there a paid option that I can buy that will increase the speed of my spider? Some of my sites take more than a day to crawl.
2. Is there a paid option that will allow me to use IP addresses from the United States, or a way to set the IPs to use a source like trustedproxies.com?

Answer
Tomas Rinke 2 weeks ago

no, it's not the same.


All the spiders you develop in a Scrapy project can be deployed and run in Scrapy Cloud.

The resources allocated to each job run are called job units.

Each job unit provides 1GB of RAM, 2.5GB of storage, a given amount of CPU capacity, and 1 concurrent crawl.


So depending on the number of units available in your plan, you can:

  • run concurrent jobs (with more than one job unit)
  • run a job with multiple job units (if it needs more resources)

Thanks

0
Completed
Juniorgerdet 3 weeks ago in Crawlera • updated by Tomas Rinke 2 weeks ago 1

How do I send the authorization token? It does not work with wget.

Answer
Tomas Rinke 2 weeks ago

Hi, here is a sample that applies to this post:


wget -e use_proxy=yes -e http_proxy=http://CRAWLERA_APIKEY:@proxy.crawlera.com:8010/ http://httpbin.org/ip 

thanks to https://twitter.com/theresiatanzil/status/763589779036393472


0
Answered
Robbieone 3 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 3 weeks ago 1

I purchased a "Scrapy Cloud Container Unit" thinking that it would allow me to run more job at a time, but my 2nd job appears to still be pending the first job's completion. What option do I need to purchase to be able to run more than 1 job at a time.

Answer

Hello,

In order to run 2+ jobs at a time, you'd need to purchase additional Scrapy Cloud Units. If you only need to run 2 jobs at a time, then only 2 Units would need to be purchased.

Please note that the free unit is replaced by the purchased one. See: http://doc.scrapinghub.com/scrapy-cloud.html#pricing for more information.

0
Answered
Robbieone 3 weeks ago in Scrapy Cloud • updated by Nestor Toledo Koplin (Support Engineer) 3 weeks ago 1

This is the first time I've ever noticed this, but there is a "wait time" ticking away next to a job that I started. It shows the job in pending status. Why is there a "wait time" before the job starts? Is there a way to bypass the wait time, a way to determine how long I will have to wait, or a way to upgrade so I don't have to wait? Thanks.

Answer

Hello,


The wait time is the time the job has spent in pending status, waiting for a free unit to run it. In order to run 2+ jobs at a time, you'd need to purchase additional units. You need at least 1 unit per running job.

0
Answered
Sandbox 3 weeks ago in Crawlera • updated by Nestor Toledo Koplin (Support Engineer) 3 weeks ago 7

I'm trying to use Crawlera and Scrapy to scrape data from Amazon (which uses HTTPS).


When I run the script from my PC I get error code 407 (Bad Auth).

I've been searching for hours in many different places to try and learn how to get this to work. I've seen the example that goes with Python Requests, but how do I get HTTPS to work with Scrapy (and Crawlera)?
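
(For context, the usual Scrapy-side setup, as I understand it from the scrapy-crawlera package, looks roughly like this; the key below is a placeholder:)

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawlera.CrawleraMiddleware': 610,
    }
    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = '<your Crawlera API key>'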


Thank you!

Answer

Hello,


A 407 error code from Crawlera is an authentication error; there's possibly a typo in the APIKEY, or perhaps you are not using the correct one. Please make sure you are using the one displayed here (click on the gear icon next to your user):

app.scrapinghub.com/o/<org_id>/crawlera/overview


Regards,


Nestor

0
Mench9@163.com 3 weeks ago in Splash 0

My scenario is simple: there are multiple pages of data and only one next button, which triggers an AJAX call to load the next page when clicked. How can I crawl all of the data using Splash? It seems that Splash is stateless, i.e. it doesn't remember the page from the previous call, and therefore it loses the pagination position it was at last time. Any hints are appreciated.
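
One possible approach (a sketch only, not verified against any particular site): do the whole pagination inside a single Splash Lua script, so the browser state persists between clicks. The URL, the '.next-button' selector, the page count, and the spider name below are illustrative, and it assumes the scrapy-splash plugin is installed and configured:

    import scrapy
    from scrapy_splash import SplashRequest

    LUA_PAGINATE = """
    function main(splash)
        splash:go(splash.args.url)
        splash:wait(1)
        local pages = {}
        for i = 1, splash.args.max_pages do
            pages[i] = splash:html()
            -- click the (illustrative) next button and wait for the AJAX content
            splash:runjs("document.querySelector('.next-button').click()")
            splash:wait(1)
        end
        return {pages = pages}
    end
    """

    class PaginatedSpider(scrapy.Spider):
        name = "paginated"  # illustrative

        def start_requests(self):
            yield SplashRequest(
                "http://example.com/listing",  # illustrative URL
                self.parse_pages,
                endpoint="execute",
                args={"lua_source": LUA_PAGINATE, "max_pages": 5},
            )

        def parse_pages(self, response):
            # response.data is the JSON object returned by the Lua script
            for html in response.data["pages"]:
                # parse each page's HTML here
                yield {"length": len(html)}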

0
Planned
Joep B 4 weeks ago in Scrapy Cloud • updated by Tomas Rinke 6 days ago 2

I have checked some related topics, and there are more people asking for this, but it seems there is no way to tell the scraping system to use a certain file (sitemap, RSS feed, single HTML file with links) in the scraping process.

The only way I found was to add all URLs to Portia, manually and by command, but there is no way to tell the scraping system that it should use a file with all start URLs defined.

I hope someone is willing to implement this.
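
For comparison, in a hand-written Scrapy spider (outside Portia) this is straightforward; a minimal sketch, where the filename, spider name and parsed fields are illustrative:

    import scrapy

    class FileSeededSpider(scrapy.Spider):
        name = "file_seeded"  # illustrative

        def start_requests(self):
            # read start URLs from a local file, one per line
            with open("start_urls.txt") as f:
                for url in f:
                    url = url.strip()
                    if url:
                        yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}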

Related posts:

https://support.scrapinghub.com/topics/746-start-from-sitemapxml-for-portia-spiders/