darc - Darkweb Crawler Project¶
Important

Starting from version 1.0.0, new features of the project will no longer be developed in this public repository. Only bugfixes and security patches will be applied in future releases.
darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields. It also bundles selenium to provide a fully rendered web page and a screenshot of that view.
- How to …
- Technical Documentation
- URL Utilities
  - Link
  - parse_link()
  - quote()
  - unquote()
  - urljoin()
  - urlparse()
  - urlsplit()
  - darc.parse.URL_PAT
  - darc.save._SAVE_LOCK
  - darc.db.BULK_SIZE
  - darc.db.LOCK_TIMEOUT
  - darc.db.MAX_POOL
  - darc.db.REDIS_LOCK
  - darc.db.RETRY_INTERVAL
  - darc.submit.PATH_API
  - darc.submit.SAVE_DB
  - darc.submit.API_RETRY
  - darc.submit.API_NEW_HOST
  - darc.submit.API_REQUESTS
  - darc.submit.API_SELENIUM
  - darc.selenium.BINARY_LOCATION
- Proxy Utilities
- Sites Customisation
- Module Constants
- Custom Exceptions
  - APIRequestFailed
  - DatabaseOperaionFailed
  - FreenetBootstrapFailed
  - HookExecutionFailed
  - I2PBootstrapFailed
  - LinkNoReturn
  - LockWarning
  - RedisCommandFailed
  - SiteNotFoundWarning
  - TorBootstrapFailed
  - TorRenewFailed
  - UnsupportedLink
  - UnsupportedPlatform
  - UnsupportedProxy
  - WorkerBreak
  - ZeroNetBootstrapFailed
  - _BaseException
  - _BaseWarning
- Data Models
- Configuration
- Customisations
- Docker Integration
- Web Backend Demo
- Data Models Demo
- Submission Data Schema
- Auxiliary Scripts
Rationale¶

There are two types of workers:

- crawler – runs darc.crawl.crawler() to provide a fresh view of a link and test its connectability
- loader – runs darc.crawl.loader() to provide an in-depth view of a link and more visual information
The general process for workers of the crawler type can be described as follows:

1. process_crawler(): obtain URLs from the requests link database (c.f. load_requests()), and feed such URLs to crawler().

2. crawler(): parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, crawl the URL with requests.

   If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract and save the links from the sitemaps (c.f. read_sitemap()) into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

   If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

   Note

   The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.

   At this point, darc will call the customised hook function from darc.sites to crawl and obtain the final response object. darc will save the session cookies and header information using save_headers().

   Note

   If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid() and further processing is dropped.

   If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.

   If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

   Finally, if the response status code is between 400 and 600, the URL will be saved back to the requests link database (c.f. save_requests()); if not, the URL will be saved into the selenium link database for the next steps (c.f. save_selenium()). A simplified sketch of this flow follows the list.
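The sketch below is a minimal, self-contained illustration of this crawler flow. It is not darc's actual code: the queue variables (requests_queue, selenium_queue), the robots_cache and the helper names are hypothetical stand-ins for darc's link databases and internals, and plain requests plus the standard library are used in place of the real per-site hooks.

```python
# Minimal sketch of the crawler flow described above -- NOT darc's
# implementation. Queue variables and helpers are hypothetical stand-ins.
import re
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests

requests_queue = set()   # stand-in for the requests link database
selenium_queue = set()   # stand-in for the selenium link database
robots_cache = {}        # per-host robots.txt, fetched on first visit

MIME_WHITE_LIST = ('text/html', 'application/xhtml+xml')
FORCE = False            # mirrors the FORCE switch described above


def robots_for(parts):
    """Fetch (once per host) and cache the host's robots.txt."""
    rp = robots_cache.get(parts.netloc)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(
            f'{parts.scheme}://{parts.netloc}/robots.txt')
        try:
            rp.read()           # the real project also saves robots.txt and sitemaps
        except OSError:
            rp.allow_all = True
        robots_cache[parts.netloc] = rp
    return rp


def crawl_once(url):
    """Roughly mirror a single pass of the crawler worker for one URL."""
    parts = urlparse(url)       # stands in for parse_link()

    # Respect robots.txt unless FORCE is set; the root path is always crawled.
    if not FORCE and parts.path not in ('', '/'):
        if not robots_for(parts).can_fetch('*', url):
            return

    # The site-specific hook from darc.sites would run here; this sketch
    # simply issues a plain GET (cookies and headers would also be saved).
    response = requests.get(url, timeout=60)

    # Filter by content type, then extract links from HTML documents.
    content_type = response.headers.get('Content-Type', '').split(';')[0].strip()
    if content_type in MIME_WHITE_LIST:
        for link in re.findall(r'href="([^"]+)"', response.text):
            requests_queue.add(urljoin(url, link))

    if 400 <= response.status_code < 600:
        requests_queue.add(url)      # failed -> back to the requests queue
    else:
        selenium_queue.add(url)      # succeeded -> hand over to the loader
```

In the real project the two queues live in the Redis or relational database backend, and the fetch step is delegated to the customised hook functions.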
The general process for workers of the loader type can be described as follows:

1. process_loader(): in the meanwhile, darc will obtain URLs from the selenium link database (c.f. load_selenium()), and feed such URLs to loader().

2. loader(): parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

   At this point, darc will call the customised hook function from darc.sites to load and return the original WebDriver object.

   If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.

   If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

   Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()). A simplified sketch of this flow is shown below.
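The following is a minimal sketch of the loader flow using selenium with headless Google Chrome. It is not darc's implementation: the file paths, the requests_queue stand-in and the window-resizing screenshot approach are illustrative assumptions only.

```python
# Minimal sketch of the loader flow described above -- NOT darc's
# implementation. Paths and queue names are illustrative assumptions.
import re
from urllib.parse import urljoin

from selenium import webdriver

requests_queue = set()   # stand-in for the requests link database


def load_once(url, html_path='page.html', screenshot_path='page.png'):
    """Render a URL, save its source HTML and a (roughly) full-page screenshot."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')

    driver = webdriver.Chrome(options=options)
    try:
        # The site-specific hook from darc.sites would drive the WebDriver here.
        driver.get(url)

        # Save the rendered source HTML document.
        with open(html_path, 'w', encoding='utf-8') as file:
            file.write(driver.page_source)

        # Approximate a full-page screenshot by growing the window to the
        # document height before capturing (one common approach; darc may
        # do this differently).
        height = driver.execute_script('return document.body.scrollHeight')
        driver.set_window_size(1920, max(height, 1080))
        driver.save_screenshot(screenshot_path)

        # extract_links() equivalent: feed links back to the requests queue.
        for link in re.findall(r'href="([^"]+)"', driver.page_source):
            requests_queue.add(urljoin(url, link))
    finally:
        driver.quit()
```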
Important
For more information about the hook functions, please refer to the customisation documentation.
Installation¶
Note
darc supports all Python versions 3.6 and above. Currently, it is only supported and tested on Linux (Ubuntu 18.04) and macOS (Catalina).

When installed on Python versions below 3.8, darc will use walrus to compile itself for backport compatibility.
pip install darc
Please make sure you have Google Chrome and the corresponding version of ChromeDriver installed on your system.
Important
Starting from version 0.3.0, we introduced Redis as the task queue database backend.

Since version 0.6.0, we have also introduced relational database storage (e.g. MySQL, SQLite, PostgreSQL) for the task queue, besides the Redis database, since Redis can become too memory-costly when the task queue grows very large.

Please make sure you have one of the backend databases installed, configured, and running when using the darc project.
However, the darc
project is shipped with Docker and Compose support.
Please see Docker Integration for more information.
Or, you may refer to and/or install from the Docker Hub repository:
docker pull jsnbzh/darc[:TAGNAME]
or from the GitHub Container Registry, which provides more up-to-date and comprehensive images:
docker pull ghcr.io/jarryshaw/darc[:TAGNAME]
# or the debug image
docker pull ghcr.io/jarryshaw/darc-debug[:TAGNAME]
Usage¶
Important
Though it provides a simple CLI, the darc project is mainly configured through environment variables. For more information, please refer to the environment variable configuration documentation.
The darc
project provides a simple CLI:
usage: darc [-h] [-v] -t {crawler,loader} [-f FILE] ...

the darkweb crawling swiss army knife

positional arguments:
  link                  links to crawl

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -t {crawler,loader}, --type {crawler,loader}
                        type of worker process
  -f FILE, --file FILE  read links from file
It can also be called through the module entrypoint:

python -m darc ...
Note
The link files can contain comment lines, which should start with #
.
Empty lines and comment lines will be ignored when loading.
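As an illustration (not darc's own loader code), a link file in this format could be parsed as follows; the file name and URLs in the comment are hypothetical.

```python
# Example helper that reads a link file in the format described above:
# comment lines (starting with '#') and empty lines are ignored.
def read_link_file(path):
    """Return the links from *path*, skipping comments and blank lines."""
    links = []
    with open(path, encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            links.append(line)
    return links


# A hypothetical link file might look like:
#
#     # seed links for the crawler
#     https://www.example.com/
#
#     https://example.onion/
```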