How to implement a sites customisation?
As already discussed in the documentation, implementing a sites
customisation is dead simple: just inherit from the darc.sites.BaseSite
class and override the corresponding crawler() and loader() abstract
static methods.
See below an example from the documentation.
from darc.sites import BaseSite, register
As the class below suggests, you may implement and register your sites
customisation for mysite.com and www.mysite.com using the MySite class,
where the hostname attribute contains the list of hostnames with which
the class should be associated.
NB: Implementation details of the crawler and loader methods will be
discussed in the following sections.
class MySite(BaseSite):
    """This is a site customisation class for demonstration purpose.
    You may implement a module as well should you prefer."""

    #: List[str]: Hostnames the sites customisation is designed for.
    hostname = ['mysite.com', 'www.mysite.com']

    @staticmethod
    def crawler(timestamp, session, link): ...

    @staticmethod
    def loader(timestamp, driver, link): ...
Should your sites customisation be associated with multiple sites, you can
just add them all to the hostname attribute; when you call
darc.sites.register() to register your sites customisation, the function
will automatically handle the registry association information.
# register sites implicitly
register(MySite)
Nonetheless, in case you would rather specify the hostnames at runtime
(instead of adding them to the hostname attribute), you may just leave
the hostname attribute as None and specify your list of hostnames in the
darc.sites.register() function call.
# register sites explicitly
register(MySite, 'mysite.com', 'www.mysite.com')
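To make the two registration styles concrete, here is a minimal, self-contained sketch of a hostname registry. It is not darc's actual implementation: SITE_MAP, the stand-in BaseSite, LateBoundSite and this register function are assumptions that only mimic the behaviour described above (hostnames are taken from the call when given, otherwise from the hostname attribute).

```python
from typing import Dict, List, Optional, Type


class BaseSite:
    """Stand-in for darc.sites.BaseSite (hypothetical, for illustration only)."""

    #: Hostnames the sites customisation is designed for.
    hostname: Optional[List[str]] = None


#: Hypothetical registry mapping each hostname to its customisation class.
SITE_MAP: Dict[str, Type[BaseSite]] = {}


def register(site: Type[BaseSite], *hostname: str) -> None:
    """Associate ``site`` with explicit hostnames, or with ``site.hostname``."""
    names = list(hostname) or (site.hostname or [])
    for name in names:
        SITE_MAP[name.casefold()] = site


class MySite(BaseSite):
    hostname = ['mysite.com', 'www.mysite.com']


class LateBoundSite(BaseSite):
    hostname = None  # hostnames are supplied at registration time


register(MySite)                          # implicit: uses MySite.hostname
register(LateBoundSite, 'other.example')  # explicit: uses the arguments
```

After both calls, looking up any of the three hostnames in SITE_MAP yields the class that should handle it, which is the association the real register function is described as maintaining.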
Crawler Hook
The crawler method is based on requests.Session objects and returns a
requests.Response instance representing the crawled web page.
Type annotations of the method can be described as
@staticmethod
def crawler(timestamp: datetime.datetime, session: requests.Session, link: darc.link.Link) -> requests.Response: ...
where timestamp is the timestamp of the worker node reference, session is
the requests.Session instance with proxy presets and link is the target
link (parsed by darc.link.parse_link() to provide more information than a
mere string).
For example, say you would like to inject a cookie named SessionID and an
Authentication header with some fake identity; you may then write the
crawler method as below.
@staticmethod
def crawler(timestamp, session, link):
    """Crawler hook for my site.

    Args:
        timestamp (datetime.datetime): Timestamp of the worker node reference.
        session (requests.Session): Session object with proxy settings.
        link (darc.link.Link): Link object to be crawled.

    Returns:
        requests.Response: The final response object with crawled data.

    """
    # inject cookies
    session.cookies.set('SessionID', 'fake-session-id-value')

    # insert headers
    session.headers['Authentication'] = 'Basic fake-identity-credential'

    response = session.get(link.url, allow_redirects=True)
    return response
In this case, when darc crawls the link, the HTTP(S) request will carry
the session cookie and the HTTP header, so that it may bypass potential
authorisation checks and land on the target page.
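A hook like this can be sanity-checked without any network access by passing it stand-in objects. The sketch below is only an illustration: FakeCookies and FakeSession are hypothetical replacements for the parts of requests.Session that the hook touches, and the link is a plain namespace rather than a real darc.link.Link.

```python
from types import SimpleNamespace


class FakeCookies:
    """Minimal stand-in exposing the ``set`` method used by the hook."""

    def __init__(self):
        self.jar = {}

    def set(self, name, value):
        self.jar[name] = value


class FakeSession:
    """Hypothetical replacement for requests.Session; records calls, sends nothing."""

    def __init__(self):
        self.cookies = FakeCookies()
        self.headers = {}
        self.requested = []

    def get(self, url, allow_redirects=True):
        self.requested.append(url)
        # snapshot what would have been sent along with this request
        return SimpleNamespace(url=url, status_code=200,
                               sent_cookies=dict(self.cookies.jar),
                               sent_headers=dict(self.headers))


def crawler(timestamp, session, link):
    """Same hook body as above, written as a plain function for the check."""
    session.cookies.set('SessionID', 'fake-session-id-value')
    session.headers['Authentication'] = 'Basic fake-identity-credential'
    return session.get(link.url, allow_redirects=True)


# stand-in for a parsed link; only the attributes the hook reads are provided
link = SimpleNamespace(url='https://mysite.com/page', host='mysite.com')
session = FakeSession()
response = crawler(None, session, link)
```

The returned namespace exposes the cookie and header that were in place when the request was issued, so a test can assert on them directly.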
Loader Hook
The loader method is based on selenium.webdriver.Chrome objects and
returns the original web driver instance containing the loaded web page.
Type annotations of the method can be described as
@staticmethod
def loader(timestamp: datetime.datetime, driver: selenium.webdriver.Chrome, link: darc.link.Link) -> selenium.webdriver.Chrome: ...
where timestamp is the timestamp of the worker node reference, driver is
the selenium.webdriver.Chrome instance with proxy presets and link is the
target link (parsed by darc.link.parse_link() to provide more information
than a mere string).
For example, say you would like to simulate a user login and go to the
target page after a successful attempt; you may then write the loader
method as below.
@staticmethod
def loader(timestamp, driver, link):
    """Loader hook for my site.

    Args:
        timestamp: Timestamp of the worker node reference.
        driver (selenium.webdriver.Chrome): Web driver object with proxy settings.
        link (darc.link.Link): Link object to be loaded.

    Returns:
        selenium.webdriver.Chrome: The web driver object with loaded data.

    """
    # land on login page
    driver.get('https://%s/login' % link.host)

    # simulate login attempt
    form = driver.find_element_by_id('login-form')
    form.find_element_by_id('username').send_keys('admin')
    form.find_element_by_id('password').send_keys('p@ssd')
    form.submit()  # submit the form; clicking the form element itself would not

    # check if the attempt succeeded
    if driver.title == 'Please login!':
        raise ValueError('failed to login %s' % link.host)

    # go to the target page
    driver.get(link.url)

    # wait for page to finish loading
    import time  # should've been put with the top-level import statements
    from darc.const import SE_WAIT  # should've been put with the top-level import statements
    if SE_WAIT is not None:
        time.sleep(SE_WAIT)

    return driver
In this case, when darc loads the link, the web driver will first perform
the user login, so that it may bypass potential authorisation checks and
land on the target page.
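Both outcomes of the login check can be exercised without launching a browser. The following is a rough sketch under stated assumptions: FakeDriver and FakeElement are hypothetical stand-ins that imitate only the few web driver calls the hook uses, the fake site accepts a single password, and the loader shown is a condensed variant of the hook above with the final wait left out.

```python
from types import SimpleNamespace


class FakeElement:
    """Hypothetical stand-in for a Selenium web element."""

    def __init__(self, driver, name):
        self.driver = driver
        self.name = name

    def find_element_by_id(self, name):
        return FakeElement(self.driver, name)

    def send_keys(self, text):
        self.driver.typed[self.name] = text

    def submit(self):
        # the fake site accepts exactly one password
        if self.driver.typed.get('password') == self.driver.valid_password:
            self.driver.title = 'Welcome'


class FakeDriver:
    """Hypothetical stand-in for selenium.webdriver.Chrome; opens no browser."""

    def __init__(self, valid_password):
        self.valid_password = valid_password
        self.typed = {}
        self.visited = []
        self.title = ''

    def get(self, url):
        self.visited.append(url)
        self.title = 'Please login!' if url.endswith('/login') else 'Target page'

    def find_element_by_id(self, name):
        return FakeElement(self, name)


def loader(timestamp, driver, link):
    """Condensed variant of the hook above (no final wait)."""
    driver.get('https://%s/login' % link.host)
    form = driver.find_element_by_id('login-form')
    form.find_element_by_id('username').send_keys('admin')
    form.find_element_by_id('password').send_keys('p@ssd')
    form.submit()
    if driver.title == 'Please login!':
        raise ValueError('failed to login %s' % link.host)
    driver.get(link.url)
    return driver


link = SimpleNamespace(url='https://mysite.com/page', host='mysite.com')

# successful login: the driver ends up on the target page
ok = loader(None, FakeDriver(valid_password='p@ssd'), link)

# rejected login: the hook raises instead of loading the target page
try:
    loader(None, FakeDriver(valid_password='something-else'), link)
    rejected = False
except ValueError:
    rejected = True
```

The success path visits the login page and then the target page, while the failure path stops at the login page and surfaces the error to the caller.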
In case you want to drop the link from the task queue…
In some scenarios, you may want to remove the target link from the task queue. There are basically two ways:
do it like a wildling, remove it directly from the database
As there are three task queues used in darc, namely the task queues for
the crawler (requests database) and loader (selenium database) worker
nodes and a track record of known hostnames (hostname database), you will
need to call the corresponding function to remove the target link from
the desired database.
Possible functions are as below:
darc.db.drop_hostname()
darc.db.drop_requests()
darc.db.drop_selenium()
All take one positional argument link, i.e. the darc.link.Link object to
be removed.
Say you would like to remove https://www.mysite.com from the requests
database, then you may just run
from darc.db import drop_requests
from darc.link import parse_link
link = parse_link('https://www.mysite.com')
drop_requests(link)
or do it in an elegant way
When implementing the sites customisation, you may wish to drop certain
links at runtime. To do so, simply raise darc.error.LinkNoReturn in the
corresponding crawler and/or loader methods.
For instance, if you would like to proceed with mysite.com but NOT
www.mysite.com in the sites customisation, then you may implement your
class as
from darc.error import LinkNoReturn

class MySite(BaseSite):

    ...

    @staticmethod
    def crawler(timestamp, session, link):
        if link.host == 'www.mysite.com':
            raise LinkNoReturn(link)
        ...

    @staticmethod
    def loader(timestamp, driver, link):
        if link.host == 'www.mysite.com':
            raise LinkNoReturn(link)
        ...
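The control flow behind this pattern can be pictured with a small, self-contained sketch. Nothing here is darc's own code: the LinkNoReturn class is a local stand-in for darc.error.LinkNoReturn, and run_queue is a hypothetical worker loop that merely assumes a raised exception means the link is discarded rather than retried.

```python
from collections import deque
from types import SimpleNamespace


class LinkNoReturn(Exception):
    """Stand-in for darc.error.LinkNoReturn (hypothetical, illustration only)."""

    def __init__(self, link):
        super().__init__(link.url)
        self.link = link


def crawler(timestamp, session, link):
    """Hook that refuses one hostname and handles the rest."""
    if link.host == 'www.mysite.com':
        raise LinkNoReturn(link)
    return 'crawled %s' % link.url


def run_queue(queue):
    """Hypothetical worker loop: a raised LinkNoReturn drops the link for good."""
    results, dropped = [], []
    while queue:
        link = queue.popleft()
        try:
            results.append(crawler(None, None, link))
        except LinkNoReturn as error:
            dropped.append(error.link.url)  # not put back on the queue
    return results, dropped


queue = deque([
    SimpleNamespace(url='https://mysite.com/', host='mysite.com'),
    SimpleNamespace(url='https://www.mysite.com/', host='www.mysite.com'),
])
results, dropped = run_queue(queue)
```

Raising from inside the hook keeps the decision next to the site-specific logic, so no separate database call is needed for links the customisation already knows it does not want.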
Then what should I do to include my sites customisation?
Simple as well!
Just install your code where you’re running darc, e.g. the Docker
container, remote server, etc.; then change the startup by injecting your
code before the entrypoint.
Say the structure of the working directory is as below:
.
|-- .venv/
| |-- lib/python3.8/site-packages
| | |-- darc/
| | | |-- ...
| | |-- ...
| |-- ...
|-- mysite.py
|-- ...
where .venv is the folder of the virtual environment with darc installed
and mysite.py is the file with your sites customisation.
Then you just need to extend your mysite.py with some additional lines
as below:
# mysite.py

import sys

from darc.__main__ import main
from darc.sites import BaseSite, register


class MySite(BaseSite):

    ...


# register sites
register(MySite)

if __name__ == '__main__':
    sys.exit(main())
And now, you can start darc through python mysite.py [...] instead of
python -m darc [...] with your sites customisation registered to the
system.