Since setting up proxy authentication in Selenium is not trivial, a simple option is to employ crawlera-headless-proxy as a middle layer between Crawlera and the browser. To run the headless-proxy, we need to have it available on the system; the project link above contains all the necessary information to get it running. Here we chose to use the Docker image for added simplicity.


$ docker run -p 3128:3128 scrapinghub/crawlera-headless-proxy -d -a $CRAWLERA_API_KEY


Let's digest that command: we're using docker run, so we're (creating, if required, and) running a container from the scrapinghub/crawlera-headless-proxy image, with -p 3128:3128 publishing the proxy's port to the host. The -d flag puts the headless-proxy in debug mode, so we can see all the activity going through it. Finally, the required -a flag specifies the user's API key from Crawlera, which in this case is stored in an environment variable.
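
If you'd rather not use Docker, the same flags apply when running the binary directly (assuming you've installed it by following the project's instructions):


$ crawlera-headless-proxy -d -a $CRAWLERA_API_KEY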


This command should produce a lot of output and stay attached to your terminal session. We need it running for our browser to connect to. The next step is writing the Selenium script.
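
Before writing the full script, you can sanity-check that the proxy is up and routing traffic through Crawlera. A minimal sketch, assuming you have the requests library installed (httpbin.org is just a convenient service that echoes back the requester's IP):


import requests

# Route both HTTP and HTTPS through the local headless-proxy
proxies = {
    "http": "http://127.0.0.1:3128",
    "https": "http://127.0.0.1:3128",
}
# verify=False because the headless-proxy re-signs TLS traffic
# with its own certificate
response = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
print(response.json())  # should show a Crawlera outgoing IP, not your own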


For Python and Chrome:


from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Route all browser traffic through the local headless-proxy
headless_proxy = "127.0.0.1:3128"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': headless_proxy,
    'ftpProxy': headless_proxy,
    'sslProxy': headless_proxy,
    'noProxy': ''
})

chrome_options = webdriver.ChromeOptions()

# disable images to speed up the page loading
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

# Merge the proxy settings into Chrome's desired capabilities
capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)

driver = webdriver.Chrome(desired_capabilities=capabilities, options=chrome_options)
# Requests routed through Crawlera can be slow, so allow a generous timeout
driver.set_page_load_timeout(600)

driver.get("https://www.whatismyip.com/")
# Site-specific selector: a button present on whatismyip.com at the time of writing
elem = driver.find_element_by_css_selector("a.btn.btn-success.btn-md.btn-block")
actions = ActionChains(driver)
actions.click(on_element=elem)
actions.perform()
print("Clicked on a button!")
driver.close()
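
A couple of caveats: the snippet above uses Selenium 3 APIs (find_element_by_css_selector and the desired_capabilities keyword argument were removed in Selenium 4). Also, because the headless-proxy intercepts and re-signs TLS traffic, Chrome may flag certificate errors on HTTPS pages. If that is acceptable in your environment, a quick workaround is to add the flag below before creating the driver (the cleaner fix is to trust the proxy's CA certificate):


chrome_options.add_argument("--ignore-certificate-errors")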


For Java and Chrome:


import org.openqa.selenium.By;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class CrawleraTest {

    public static void main(String[] args) {

        // Route all browser traffic through the local headless-proxy
        Proxy proxy = new Proxy();
        proxy.setHttpProxy("127.0.0.1:3128");
        proxy.setSslProxy("127.0.0.1:3128");

        // Merge the proxy settings into Chrome's desired capabilities
        DesiredCapabilities capabilities = DesiredCapabilities.chrome();
        capabilities.setCapability("proxy", proxy);

        ChromeOptions options = new ChromeOptions();
        options.addArguments("start-maximized");

        capabilities.setCapability(ChromeOptions.CAPABILITY, options);

        WebDriver driver = new ChromeDriver(capabilities);
        
        driver.get("https://www.whatismyip.com/");

        // Wait up to 60 seconds for a known element, confirming the page loaded through the proxy
        WebDriverWait wait = new WebDriverWait(driver, 60);
        wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//a[@role='button'][contains(text(),'Questions & Answers')]")));

        driver.quit();
    }
}
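
As with the Python example, this code relies on Selenium 3 APIs: DesiredCapabilities.chrome(), the ChromeDriver(Capabilities) constructor, and the WebDriverWait(WebDriver, long) constructor were all removed in Selenium 4, where you would instead set the proxy directly on ChromeOptions, pass the options to ChromeDriver, and construct the wait with a java.time.Duration.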