How to implement a sites customisation?

As discussed elsewhere in the documentation, implementing a sites customisation is dead simple: inherit from the darc.sites.BaseSite class and override the corresponding crawler() and loader() abstract static methods.

See below an example from the documentation.

from darc.sites import BaseSite, register

As the class below suggests, you may implement and register your sites customisation for mysite.com and www.mysite.com using the MySite class, where the hostname attribute contains the list of hostnames with which the class should be associated.

NB: Implementation details of the crawler and loader methods will be discussed in the following sections.

class MySite(BaseSite):
    """This is a site customisation class for demonstration purpose.
    You may implement a module as well should you prefer."""

    #: List[str]: Hostnames the sites customisation is designed for.
    hostname = ['mysite.com', 'www.mysite.com']

    @staticmethod
    def crawler(timestamp, session, link): ...

    @staticmethod
    def loader(timestamp, driver, link): ...

Should your sites customisation be associated with multiple sites, you can just add them all to the hostname attribute; when you call darc.sites.register() to register your sites customisation, the function will automatically associate the class with every hostname listed.

# register sites implicitly
register(MySite)

Nonetheless, if you would rather specify the hostnames at runtime (instead of adding them to the hostname attribute), you may leave the hostname attribute as None and pass your list of hostnames to the darc.sites.register() function call.

# register sites explicitly
register(MySite, 'mysite.com', 'www.mysite.com')

Crawler Hook

The crawler method is based on requests.Session objects and returns a requests.Response instance representing the crawled web page.

The method can be annotated as

@staticmethod
def crawler(timestamp: datetime.datetime, session: requests.Session, link: darc.link.Link) -> requests.Response: ...

where timestamp is the timestamp of the worker node reference, session is the requests.Session instance with proxy presets and link is the target link (parsed by darc.link.parse_link() to provide more information than a mere string).

For example, let’s say you would like to inject a cookie named SessionID and an Authorization header with some fake identity; you may then write the crawler method as below.

@staticmethod
def crawler(timestamp, session, link):
    """Crawler hook for my site.

    Args:
        timestamp (datetime.datetime): Timestamp of the worker node reference.
        session (requests.Session): Session object with proxy settings.
        link (darc.link.Link): Link object to be crawled.

    Returns:
        requests.Response: The final response object with crawled data.

    """
    # inject cookies
    session.cookies.set('SessionID', 'fake-session-id-value')

    # insert headers (HTTP Basic credentials are carried by the Authorization header)
    session.headers['Authorization'] = 'Basic fake-identity-credential'

    response = session.get(link.url, allow_redirects=True)
    return response

In this case, when darc crawls the link, the HTTP(S) request will carry the session cookie and HTTP header, so that it may bypass potential authorisation checks and land on the target page.
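
A crawler hook of this shape can be exercised without network access by handing it stand-in objects. The sketch below is only an illustration under stated assumptions: FakeCookies, FakeSession, demo_crawler and the SimpleNamespace link are hypothetical helpers, not part of darc or requests, and demo_crawler is a trimmed-down hook that only sets the cookie.

```python
from types import SimpleNamespace


class FakeCookies(dict):
    """Stand-in for the session cookie jar; only ``set`` is modelled."""

    def set(self, name, value):
        self[name] = value


class FakeSession:
    """Stand-in for requests.Session that records calls instead of sending them."""

    def __init__(self):
        self.cookies = FakeCookies()
        self.headers = {}
        self.requested = []

    def get(self, url, **kwargs):
        # Record the call and return a minimal response-like object.
        self.requested.append((url, kwargs))
        return SimpleNamespace(url=url, status_code=200)


def demo_crawler(timestamp, session, link):
    """Trimmed-down hook (hypothetical): inject a cookie, then fetch the link."""
    session.cookies.set('SessionID', 'fake-session-id-value')
    return session.get(link.url, allow_redirects=True)


session = FakeSession()
link = SimpleNamespace(url='https://www.mysite.com/', host='www.mysite.com')
response = demo_crawler(None, session, link)
```

After the call, session.cookies and session.requested show what the hook would have sent, which makes it easy to check a customisation before running it inside darc.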

Loader Hook

The loader method is based on selenium.webdriver.Chrome objects and returns the original web driver instance containing the loaded web page.

The method can be annotated as

@staticmethod
def loader(timestamp: datetime.datetime, driver: selenium.webdriver.Chrome, link: darc.link.Link) -> selenium.webdriver.Chrome: ...

where timestamp is the timestamp of the worker node reference, driver is the selenium.webdriver.Chrome instance with proxy presets and link is the target link (parsed by darc.link.parse_link() to provide more information than a mere string).

For example, let’s say you would like to simulate a user login and go to the target page after a successful attempt; you may then write the loader method as below.

@staticmethod
def loader(timestamp, driver, link):
    """Loader hook for my site.

    Args:
        timestamp (datetime.datetime): Timestamp of the worker node reference.
        driver (selenium.webdriver.Chrome): Web driver object with proxy settings.
        link (darc.link.Link): Link object to be loaded.

    Returns:
        selenium.webdriver.Chrome: The web driver object with loaded data.

    """
    # land on login page
    driver.get('https://%s/login' % link.host)

    # simulate login attempt
    form = driver.find_element_by_id('login-form')
    form.find_element_by_id('username').send_keys('admin')
    form.find_element_by_id('password').send_keys('p@ssd')
    form.submit()  # submit the form; clicking the form element would not send it

    # check if the attempt succeeded
    if driver.title == 'Please login!':
        raise ValueError('failed to login %s' % link.host)

    # go to the target page
    driver.get(link.url)

    # wait for page to finish loading
    # NB: both imports should've been put with the top-level import statements
    import time
    from darc.const import SE_WAIT
    if SE_WAIT is not None:
        time.sleep(SE_WAIT)

    return driver

In this case, when darc loads the link, the web driver will first perform the user login, so that it may bypass potential authorisation checks and land on the target page.

In case you need to drop the link from the task queue…

In some scenarios, you may want to remove the target link from the task queue. There are basically two ways to do so:

do it like a wildling: remove the link directly from the database

darc uses three databases: the task queues for the crawler (requests database) and loader (selenium database) worker nodes, and a track record of known hostnames (hostname database). You will need to call the corresponding function to remove the target link from the desired database.

The available functions are:

darc.db.drop_hostname()
darc.db.drop_requests()
darc.db.drop_selenium()

Each takes one positional argument, link, i.e. the darc.link.Link object to be removed.

Say you would like to remove https://www.mysite.com from the requests database, then you may just run

from darc.db import drop_requests
from darc.link import parse_link

link = parse_link('https://www.mysite.com')
drop_requests(link)

or do it the elegant way

When implementing the sites customisation, you may wish to drop certain links at runtime. To do so, simply raise darc.error.LinkNoReturn in the corresponding crawler and/or loader methods.

For instance, if you would like to proceed with mysite.com but NOT www.mysite.com in the sites customisation, you may implement your class as

from darc.error import LinkNoReturn
from darc.sites import BaseSite

class MySite(BaseSite):

    ...

    @staticmethod
    def crawler(timestamp, session, link):
        if link.host == 'www.mysite.com':
            raise LinkNoReturn(link)

        ...

    @staticmethod
    def loader(timestamp, driver, link):
        if link.host == 'www.mysite.com':
            raise LinkNoReturn(link)

        ...
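
How a worker might react to this exception can be sketched as follows. The code is a hypothetical model and not darc's implementation: LinkNoReturn here is a local stand-in class, and process_queue, the list-based queue and the single-argument hook signature are assumptions made purely for illustration.

```python
from types import SimpleNamespace


class LinkNoReturn(Exception):
    """Local stand-in for darc.error.LinkNoReturn (assumption)."""

    def __init__(self, link):
        super().__init__(link.url)
        self.link = link


def crawler(link):
    """Trimmed-down hook: refuse www.mysite.com, accept everything else."""
    if link.host == 'www.mysite.com':
        raise LinkNoReturn(link)
    return 'crawled %s' % link.url


def process_queue(queue, hook):
    """Run ``hook`` on every queued link; drop links that raise LinkNoReturn."""
    results, dropped = [], []
    for link in queue:
        try:
            results.append(hook(link))
        except LinkNoReturn as error:
            # The link is recorded as dropped and never queued again.
            dropped.append(error.link.url)
    return results, dropped


queue = [
    SimpleNamespace(url='https://mysite.com/', host='mysite.com'),
    SimpleNamespace(url='https://www.mysite.com/', host='www.mysite.com'),
]
results, dropped = process_queue(queue, crawler)
```

The point of the pattern is that the hook only signals its decision by raising; removing the link from the queue stays the responsibility of the caller.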

Then what should I do to include my sites customisation?

Simple as well!

Just install your code wherever you’re running darc, e.g. the Docker container, remote server, etc.; then change the startup by injecting your code before the entrypoint.

Say the structure of the working directory is as below:

.
|-- .venv/
|   |-- lib/python3.8/site-packages
|   |   |-- darc/
|   |   |   |-- ...
|   |   |-- ...
|   |-- ...
|-- mysite.py
|-- ...

where .venv is the virtual environment folder with darc installed and mysite.py is the file with your sites customisation.

Then you just need to add a few lines to your mysite.py, as below:

# mysite.py

import sys

from darc.__main__ import main
from darc.sites import BaseSite, register

class MySite(BaseSite):

    ...

# register sites
register(MySite)

if __name__ == '__main__':
    sys.exit(main())

And now you can start darc with python mysite.py [...] instead of python -m darc [...], with your sites customisation registered to the system.
