Technical Documentation
darc is designed as a Swiss Army knife for darkweb crawling. It integrates requests to collect HTTP request and response information, such as cookies and header fields. It also bundles selenium to provide a fully rendered web page and a screenshot of that view.
- URL Utilities
Link
parse_link()
quote()
unquote()
urljoin()
urlparse()
urlsplit()
darc.parse.URL_PAT
darc.save._SAVE_LOCK
darc.db.BULK_SIZE
darc.db.LOCK_TIMEOUT
darc.db.MAX_POOL
darc.db.REDIS_LOCK
darc.db.RETRY_INTERVAL
darc.submit.PATH_API
darc.submit.SAVE_DB
darc.submit.API_RETRY
darc.submit.API_NEW_HOST
darc.submit.API_REQUESTS
darc.submit.API_SELENIUM
darc.selenium.BINARY_LOCATION
- Proxy Utilities
darc.proxy.bitcoin.PATH
darc.proxy.bitcoin.LOCK
darc.proxy.data.PATH
darc.proxy.ed2k.PATH
darc.proxy.ed2k.LOCK
darc.proxy.ethereum.PATH
darc.proxy.ethereum.LOCK
darc.proxy.freenet.FREENET_PORT
darc.proxy.freenet.FREENET_RETRY
darc.proxy.freenet.BS_WAIT
darc.proxy.freenet.FREENET_PATH
darc.proxy.freenet.FREENET_ARGS
darc.proxy.freenet._MNG_FREENET
darc.proxy.freenet._FREENET_BS_FLAG
darc.proxy.freenet._FREENET_PROC
darc.proxy.freenet._FREENET_ARGS
darc.proxy.i2p.I2P_REQUESTS_PROXY
darc.proxy.i2p.I2P_SELENIUM_PROXY
darc.proxy.i2p.I2P_PORT
darc.proxy.i2p.I2P_RETRY
darc.proxy.i2p.BS_WAIT
darc.proxy.i2p.I2P_ARGS
darc.proxy.i2p._MNG_I2P
darc.proxy.i2p._I2P_BS_FLAG
darc.proxy.i2p._I2P_PROC
darc.proxy.i2p._I2P_ARGS
darc.proxy.irc.PATH
darc.proxy.irc.LOCK
darc.proxy.magnet.PATH
darc.proxy.magnet.LOCK
darc.proxy.mail.PATH
darc.proxy.mail.LOCK
darc.proxy.null.PATH
darc.proxy.null.LOCK
darc.proxy.script.PATH
darc.proxy.script.LOCK
darc.proxy.tel.PATH
darc.proxy.tel.LOCK
darc.proxy.tor.TOR_REQUESTS_PROXY
darc.proxy.tor.TOR_SELENIUM_PROXY
darc.proxy.tor.TOR_PORT
darc.proxy.tor.TOR_CTRL
darc.proxy.tor.TOR_PASS
darc.proxy.tor.TOR_RETRY
darc.proxy.tor.BS_WAIT
darc.proxy.tor.TOR_CFG
darc.proxy.tor._MNG_TOR
darc.proxy.tor._TOR_BS_FLAG
darc.proxy.tor._TOR_PROC
darc.proxy.tor._TOR_CTRL
darc.proxy.tor._TOR_CONFIG
darc.proxy.zeronet.ZERONET_PORT
darc.proxy.zeronet.ZERONET_RETRY
darc.proxy.zeronet.BS_WAIT
darc.proxy.zeronet.ZERONET_PATH
darc.proxy.zeronet.ZERONET_ARGS
darc.proxy.zeronet._MNG_ZERONET
darc.proxy.zeronet._ZERONET_BS_FLAG
darc.proxy.zeronet._ZERONET_PROC
darc.proxy.zeronet._ZERONET_ARGS
darc.proxy.LINK_MAP
- Sites Customisation
- Module Constants
- Custom Exceptions
APIRequestFailed
DatabaseOperaionFailed
FreenetBootstrapFailed
HookExecutionFailed
I2PBootstrapFailed
LinkNoReturn
LockWarning
RedisCommandFailed
SiteNotFoundWarning
TorBootstrapFailed
TorRenewFailed
UnsupportedLink
UnsupportedPlatform
UnsupportedProxy
WorkerBreak
ZeroNetBootstrapFailed
_BaseException
_BaseWarning
- Data Models
- Task Queues
- Submission Data Models
- Hostname Records
HostnameModel
- URL Records
URLModel
URLThroughModel
robots.txt
RecordsRobotsModel
sitemap.xml
RecordsSitemapModel
hosts.txt
RecordsHostsModel
- Crawler Records
RequestsHistoryModel
RequestsHistoryModel.DoesNotExist
RequestsHistoryModel.cookies
RequestsHistoryModel.document
RequestsHistoryModel.id
RequestsHistoryModel.index
RequestsHistoryModel.method
RequestsHistoryModel.model
RequestsHistoryModel.model_id
RequestsHistoryModel.reason
RequestsHistoryModel.request
RequestsHistoryModel.response
RequestsHistoryModel.status_code
RequestsHistoryModel.timestamp
RequestsHistoryModel.url
RequestsModel
RequestsModel.DoesNotExist
RequestsModel.cookies
RequestsModel.document
RequestsModel.history
RequestsModel.id
RequestsModel.is_html
RequestsModel.method
RequestsModel.mime_type
RequestsModel.reason
RequestsModel.request
RequestsModel.response
RequestsModel.session
RequestsModel.status_code
RequestsModel.timestamp
RequestsModel.url
RequestsModel.url_id
- Loader Records
SeleniumModel
- Base Model
BaseMeta
BaseMetaWeb
BaseModel
BaseModelWeb
- Miscellaneous Utilities
IPField
IntEnumField
JSONField
PickleField
Proxy
table_function()
Because websites can be troublesome to crawl, with anti-robot verification, login requirements, and so on, the darc project also provides hooks to customise crawling behaviour around both requests and selenium.
See also
Such customisations, called site hooks in the darc project, are site-specific: users can set up their own hooks for a particular site. See darc.sites for more information.
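To make the idea of a site-specific hook concrete, here is a minimal sketch of a hostname-keyed registry. It is not the darc API: `SITE_HOOKS`, `register_site_hook`, `default_hook`, `crawl` and `login_first` are hypothetical names invented for this illustration, and the hooks only return strings instead of driving requests or selenium. The actual interface is the one documented under darc.sites.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

# Hypothetical registry mapping a hostname to its crawling hook
# (illustration only, not part of darc).
Hook = Callable[[str], str]
SITE_HOOKS: Dict[str, Hook] = {}


def register_site_hook(hostname: str, hook: Hook) -> None:
    """Associate a hook with one hostname (stored in lower case)."""
    SITE_HOOKS[hostname.lower()] = hook


def default_hook(url: str) -> str:
    """Fallback behaviour for sites without a dedicated hook."""
    return f"default crawl of {url}"


def crawl(url: str) -> str:
    """Dispatch to the hook registered for the URL's hostname, if any."""
    # urlsplit().hostname is already lower-cased, or None when absent.
    hostname = urlsplit(url).hostname or ""
    hook = SITE_HOOKS.get(hostname, default_hook)
    return hook(url)


def login_first(url: str) -> str:
    """Example hook for a site that needs a login step before crawling."""
    return f"log in, then crawl {url}"


register_site_hook("example.onion", login_first)
```

With this registry, `crawl("http://example.onion/page")` goes through `login_first`, while any other hostname falls back to `default_hook`. Keying on the hostname keeps each site's special handling isolated from the generic crawling path.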
Still, since the network is a world full of mysteries and miracles, the speed of crawling depends heavily on the response speed of the target website. To speed things up while staying within system capacity, the darc project supports multiprocessing, multithreading and, as the slowest fallback, single-threaded execution when crawling.
Note
Rendering the target website with selenium, powered by the renowned Google Chrome, requires a lot of memory. Thus, the three solutions mentioned above only toggle the behaviour around the use of selenium.
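The choice between multiprocessing, multithreading and a single-threaded fallback can be sketched with the standard library alone. This is an assumption-laden illustration rather than darc's implementation: `run_tasks`, its `mode` and `workers` parameters, and the `fake_load` stand-in are all hypothetical names, and no browser is actually started.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_tasks(func: Callable[[str], str], items: Iterable[str],
              mode: str = "single", workers: int = 4) -> List[str]:
    """Apply ``func`` to every item using the chosen execution strategy."""
    items = list(items)
    if mode == "process":
        # Separate OS processes; ``func`` and the items must be picklable.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(func, items))
    if mode == "thread":
        # Threads suit I/O-bound work such as waiting on slow websites.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(func, items))
    if mode == "single":
        # Slowest option, but with the smallest memory footprint.
        return [func(item) for item in items]
    raise ValueError(f"unknown mode: {mode!r}")


def fake_load(url: str) -> str:
    """Stand-in for rendering a page; a real loader would drive a browser."""
    return f"loaded {url}"
```

For example, `run_tasks(fake_load, urls, mode="thread")` returns results in the same order as `urls`, because `Executor.map` preserves input order. On platforms that start worker processes with the spawn method, the `"process"` mode should be called from under an `if __name__ == "__main__":` guard.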
To keep the darc project a true Swiss Army knife, only the main entry point function darc.process.process() is exported in the global namespace (and renamed to darc.darc()).
The necessary hook registration functions are also exported to the global namespace.
For more information on the hooks, please refer to the customisation documentation.