How to implement a custom proxy middleware?

As discussed in the documentation, a custom proxy is implemented as two factory functions: one returns a requests.Session or requests_futures.sessions.FuturesSession instance, the other returns a selenium.webdriver.Chrome instance; both come with the proxy presets applied.

The following example is taken from the documentation.

from darc.proxy import register

Session Factory

The session factory returns a requests.Session or requests_futures.sessions.FuturesSession instance (depending on the futures argument) with presets, e.g. proxies, user agent, etc.

The type annotation of the function can be described as

@typing.overload
def get_session(futures=False) -> requests.Session: ...

@typing.overload
def get_session(futures=True) -> requests_futures.sessions.FuturesSession: ...

For example, suppose you are implementing a Socks5 proxy at localhost:9293, with all other presets the same as in the default factory function, cf. darc.requests.null_session().

import requests
import requests_futures.sessions

from darc.const import DARC_CPU
from darc.requests import default_user_agent


def socks5_session(futures=False):
    """Socks5 proxy session.

    Args:
        futures: Whether to return a
            :class:`requests_futures.sessions.FuturesSession`.

    Returns:
        Union[requests.Session, requests_futures.sessions.FuturesSession]:
        The session object with Socks5 proxy settings.

    """
    if futures:
        session = requests_futures.sessions.FuturesSession(max_workers=DARC_CPU)
    else:
        session = requests.Session()

    session.headers['User-Agent'] = default_user_agent(proxy='Socks5')
    session.proxies.update({
        'http': 'socks5h://localhost:9293',
        'https': 'socks5h://localhost:9293',
    })
    return session

In this case, when darc needs a Socks5 session for its crawler worker nodes, it calls the socks5_session function to obtain a preset session instance.
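The proxy URLs in the session factory are hard-coded. A minimal sketch of how the same mapping could be built for an arbitrary host and port is shown here; socks5_proxies and its remote_dns parameter are hypothetical names introduced for illustration and are not part of darc. The sketch relies on the requests convention that the socks5h scheme lets the proxy resolve hostnames, while socks5 resolves them locally.

```python
from typing import Dict


def socks5_proxies(host: str = 'localhost', port: int = 9293,
                   remote_dns: bool = True) -> Dict[str, str]:
    """Build a ``proxies`` mapping for a Socks5 proxy (hypothetical helper)."""
    # 'socks5h' asks the proxy to resolve hostnames; 'socks5' resolves locally.
    scheme = 'socks5h' if remote_dns else 'socks5'
    url = f'{scheme}://{host}:{port}'
    # The same proxy URL is used for both plain and TLS traffic.
    return {'http': url, 'https': url}


proxies = socks5_proxies()
print(proxies)
```

Inside a session factory, the result could be passed to session.proxies.update(...) in place of the literal dictionary. Note that requests needs its optional SOCKS support (the requests[socks] extra) for these schemes to work.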

Driver Factory

The driver factory returns a selenium.webdriver.Chrome instance with presets, e.g. proxies, options/switches, etc.

The type annotation of the function can be described as

def get_driver() -> selenium.webdriver.Chrome: ...

For example, suppose you are implementing a Socks5 proxy at localhost:9293, with all other presets the same as in the default factory function, cf. darc.selenium.null_driver().

import selenium.webdriver
import selenium.webdriver.common.proxy

from darc.selenium import BINARY_LOCATION


def socks5_driver():
    """Socks5 proxy driver.

    Returns:
        selenium.webdriver.Chrome: The web driver object with Socks5 proxy settings.

    """
    options = selenium.webdriver.ChromeOptions()
    options.binary_location = BINARY_LOCATION
    options.add_argument('--proxy-server=socks5://localhost:9293')
    options.add_argument('--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE localhost"')

    proxy = selenium.webdriver.Proxy()
    proxy.proxyType = selenium.webdriver.common.proxy.ProxyType.MANUAL
    proxy.http_proxy = 'socks5://localhost:9293'
    proxy.ssl_proxy = 'socks5://localhost:9293'

    capabilities = selenium.webdriver.DesiredCapabilities.CHROME.copy()
    proxy.add_to_capabilities(capabilities)

    driver = selenium.webdriver.Chrome(options=options,
                                       desired_capabilities=capabilities)
    return driver

In this case, when darc needs a Socks5 driver for its loader worker nodes, it calls the socks5_driver function to obtain a preset driver instance.
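The driver factory passes desired_capabilities to the Chrome constructor, a keyword that recent Selenium 4 releases are understood to no longer accept. The sketch below is one possible variant that configures the proxy through Chrome switches only. Both function names are hypothetical, the switches are copied from the example with host and port made configurable, and the binary_location preset is omitted so that the block does not depend on darc.

```python
from typing import List


def socks5_chrome_arguments(host: str = 'localhost', port: int = 9293) -> List[str]:
    """Return the Chrome switches used for a Socks5 proxy (hypothetical helper)."""
    return [
        f'--proxy-server=socks5://{host}:{port}',
        f'--host-resolver-rules="MAP * ~NOTFOUND , EXCLUDE {host}"',
    ]


def socks5_driver_options_only():
    """Variant of ``socks5_driver`` that relies on options alone (assumption)."""
    # Imported lazily so the helper above can be used without Selenium installed.
    import selenium.webdriver

    options = selenium.webdriver.ChromeOptions()
    for argument in socks5_chrome_arguments():
        options.add_argument(argument)
    # No desired_capabilities: the proxy is configured via --proxy-server only.
    return selenium.webdriver.Chrome(options=options)


print(socks5_chrome_arguments())
```

Whether this variant is appropriate depends on the Selenium version installed alongside darc; with an older Selenium, the capabilities-based factory can be kept as is.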

What should I do to register the proxy?

All proxies are managed in the darc.proxy module and you can register your own proxy through darc.proxy.register():

# register proxy
register('socks5', socks5_session, socks5_driver)

As the code above suggests, darc.proxy.register() takes three positional arguments: the proxy type, the session factory function and the driver factory function.
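To illustrate the idea behind registration, here is a toy registry that maps a proxy type to its pair of factories. It is not darc's implementation: toy_register, toy_lookup, the _FACTORIES dictionary, the placeholder factories and the case-insensitive lookup are all assumptions made for this sketch.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for the bookkeeping a proxy registry might do.
_FACTORIES: Dict[str, Tuple[Callable, Callable]] = {}


def toy_register(proxy: str, session: Callable, driver: Callable) -> None:
    """Store the session and driver factories under the proxy type."""
    _FACTORIES[proxy.casefold()] = (session, driver)


def toy_lookup(proxy: str) -> Tuple[Callable, Callable]:
    """Return the (session factory, driver factory) pair for a proxy type."""
    return _FACTORIES[proxy.casefold()]


def fake_session(futures=False):
    """Placeholder session factory; returns a label instead of a session."""
    return 'futures-session' if futures else 'session'


def fake_driver():
    """Placeholder driver factory; returns a label instead of a driver."""
    return 'driver'


toy_register('socks5', fake_session, fake_driver)
session_factory, driver_factory = toy_lookup('SOCKS5')
print(session_factory(futures=True), driver_factory())
```

The point of the sketch is only that the factories are stored as callables and invoked later, when a worker needs a session or driver for that proxy type.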

See also

socks5.py