Configuration¶
The darc project is configurable through numerous environment variables. Below is the full list of supported environment variables you may use to configure the behaviour of darc.
General Configurations¶
- DARC_REBOOT¶
-
Whether to exit the program after the first round, i.e. after crawling all links from the requests link database and loading all links from the selenium link database.
This can be useful especially when the capacity is limited and you wish to save some space before continuing with the next round. See Docker integration for more information.
- DARC_VERBOSE¶
-
Whether to run the program in verbose mode. If DARC_DEBUG is True, then verbose mode will always be enabled.
- DARC_CHECK¶
-
Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
If DARC_CHECK_CONTENT_TYPE is True, then this environment variable will always be set to True.
- DARC_CHECK_CONTENT_TYPE¶
-
Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).
- DARC_URL_PAT¶
-
Regular expression patterns to match all reasonable URLs.
The environment variable should be JSON encoded, as an array of three-element arrays. Each element contains one scheme (str) as the fallback default scheme for the matched URL, one Python regular expression string (str) as described in the builtin re module, and one numeric value (int) representing the flags as defined in the builtin re module as well.
Important
The patterns must have a named match group url, e.g. (?P<url>bitcoin:\w+), so that the function can extract matched URLs from the given pattern.
The regular expression will always be used in ASCII mode, i.e. compiled with the re.ASCII flag.
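As a rough illustration of the encoding described above, the following sketch builds a DARC_URL_PAT value and decodes it again. The two patterns, the sample text and the decoding loop are assumptions made for this example; how darc itself consumes the value is not shown here.

```python
import json
import re

# Hypothetical entries: [fallback scheme, regex with a named group `url`, re flags as int]
patterns = [
    ["bitcoin", r"(?P<url>bitcoin:\w+)", int(re.IGNORECASE)],
    ["mailto", r"(?P<url>mailto:[\w.+-]+@[\w.-]+)", 0],
]

# JSON-encode the list; this string is what the environment variable would hold
darc_url_pat = json.dumps(patterns)

# Decode and compile the way a consumer might (assumption), always adding re.ASCII
compiled = [
    (scheme, re.compile(pattern, flags | re.ASCII))
    for scheme, pattern, flags in json.loads(darc_url_pat)
]

# Extract the named group `url` from some sample text
sample = "donate at BITCOIN:abc123 or mailto:admin@example.com"
matches = [m.group("url") for _, regex in compiled for m in regex.finditer(sample)]
print(matches)
```

The flags are stored as plain integers because JSON has no representation for Python's flag objects; int(re.IGNORECASE) gives the numeric value to put in the array.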
- DARC_CPU¶
-
Number of concurrent processes. If not provided, then the number of system CPUs will be used.
Note
DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be enabled at the same time.
- DARC_USER¶
- Type:
- Default:
current login user (c.f. getpass.getuser())
Non-root user for proxies.
Data Storage¶
See also
See darc.save for more information about source saving.
See darc.db for more information about database integration.
- DB_URL¶
- Type:
str
(url)
URL to the RDS storage.
Important
The task queues will be saved to the darc database; the data submission will be saved to the darcweb database.
Thus, when providing this environment variable, please do NOT specify the database name.
- DARC_BULK_SIZE¶
- Type:
- Default:
100
Bulk size for updating databases.
See also
darc.db.save_requests()
darc.db.save_selenium()
- LOCK_TIMEOUT¶
- Type:
- Default:
10
Lock blocking timeout.
Note
If set to infinity (inf), no timeout will be applied.
See also
Get a lock from darc.db.get_lock().
- DARC_MAX_POOL¶
- Type:
- Default:
1_000
Maximum number of links loaded from the database.
Note
If set to infinity (inf), no limit will be applied.
See also
darc.db.load_requests()
darc.db.load_selenium()
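To make the inf notes for LOCK_TIMEOUT and DARC_MAX_POOL concrete, here is a minimal sketch of how such values could be read. It assumes the variables are parsed with Python's float(), which accepts the string inf; the helper read_limit is hypothetical and not part of darc.

```python
import math
import os


def read_limit(name: str, default: float) -> float:
    """Read a numeric environment variable; the string 'inf' parses to infinity."""
    raw = os.getenv(name)
    return float(default) if raw is None else float(raw)


os.environ["LOCK_TIMEOUT"] = "inf"     # example: disable the lock timeout
os.environ.pop("DARC_MAX_POOL", None)  # unset, so the documented default applies

lock_timeout = read_limit("LOCK_TIMEOUT", 10)
max_pool = read_limit("DARC_MAX_POOL", 1_000)

# A consumer could map infinity to "no timeout" (assumption)
timeout_arg = None if math.isinf(lock_timeout) else lock_timeout
print(lock_timeout, max_pool, timeout_arg)
```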
- REDIS_LOCK¶
-
Whether to use a Redis (Lua) lock to ensure process/thread-safe operations.
See also
Toggles the behaviour of darc.db.get_lock().
Web Crawlers¶
- DARC_WAIT¶
- Type:
- Default:
60
Time interval between each round when the requests and/or selenium databases are empty.
- DARC_SAVE¶
-
Whether to save processed links back to the database.
Note
If DARC_SAVE is True, then DARC_SAVE_REQUESTS and DARC_SAVE_SELENIUM will be forced to be True.
See also
See darc.db for more information about the link database.
- DARC_SAVE_REQUESTS¶
-
Whether to save links crawled by crawler() back to the requests database.
See also
See darc.db for more information about the link database.
- DARC_SAVE_SELENIUM¶
-
Whether to save links crawled by loader() back to the selenium database.
See also
See darc.db for more information about the link database.
- TIME_CACHE¶
- Type:
- Default:
60
Time delta for caches in seconds.
The darc project supports caching for fetched files. TIME_CACHE specifies for how long the fetched files will be cached and NOT fetched again.
Note
If TIME_CACHE is None, then the fetched files will be cached forever.
- SE_WAIT¶
- Type:
- Default:
60
Time to wait for selenium to finish loading pages.
Note
Internally, selenium will wait for the browser to finish loading the pages before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time running after the event.
White / Black Lists¶
- LINK_WHITE_LIST¶
- Type:
List[str] (JSON)
- Default:
[]
White list of hostnames that should be crawled.
Note
Regular expressions are supported.
- LINK_BLACK_LIST¶
- Type:
List[str] (JSON)
- Default:
[]
Black list of hostnames that should NOT be crawled.
Note
Regular expressions are supported.
- MIME_WHITE_LIST¶
- Type:
List[str] (JSON)
- Default:
[]
White list of content types that should be crawled.
Note
Regular expressions are supported.
- MIME_BLACK_LIST¶
- Type:
List[str] (JSON)
- Default:
[]
Black list of content types that should NOT be crawled.
Note
Regular expressions are supported.
- PROXY_WHITE_LIST¶
- Type:
List[str] (JSON)
- Default:
[]
White list of proxy types that should be crawled.
Note
The proxy types are case insensitive.
- PROXY_BLACK_LIST¶
- Type:
List[str] (JSON)
- Default:
[]
Black list of proxy types that should NOT be crawled.
Note
The proxy types are case insensitive.
Note
If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
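The following sketch shows one way the list variables could be encoded and read back. The hostnames, proxy names and the helper host_allowed are made up for illustration, and the use of re.fullmatch is an assumption; the matching semantics darc actually applies are not specified in this section.

```python
import json
import os
import re

# Example values only: a list of regular expressions and a list of proxy types
os.environ["LINK_WHITE_LIST"] = json.dumps([r".*\.onion", r"example\.com"])
os.environ["PROXY_BLACK_LIST"] = json.dumps(["I2P", "zeronet"])

# Decode the JSON strings; an unset variable falls back to the documented default []
link_white_list = json.loads(os.getenv("LINK_WHITE_LIST", "[]"))
# Proxy types are case insensitive, so normalise them before comparing
proxy_black_list = [p.casefold() for p in json.loads(os.getenv("PROXY_BLACK_LIST", "[]"))]


def host_allowed(host: str) -> bool:
    """Hypothetical check: does the hostname fully match any white list pattern?"""
    return any(re.fullmatch(p, host, re.IGNORECASE) for p in link_white_list)


print(host_allowed("abc.onion"), host_allowed("example.org"))
print("i2p" in proxy_black_list)
```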
Data Submission¶
Note
If API_NEW_HOST, API_REQUESTS or API_SELENIUM is None, the corresponding submit function will save the JSON data in the path specified by PATH_DATA.
Tor Proxy Configuration¶
- TOR_PASS¶
-
Tor controller authentication token.
Note
If not provided, it will be requested at runtime.
- TOR_WAIT¶
- Type:
- Default:
90
Time after which the attempt to start Tor is aborted.
Note
If not provided, there will be NO timeouts.
- TOR_CFG¶
- Type:
Dict[str, Any] (JSON)
- Default:
{}
Tor bootstrap configuration for stem.process.launch_tor_with_config().
Note
If provided, it should be a JSON encoded string.
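As a small sketch of the JSON encoding for TOR_CFG, the snippet below serialises a dictionary of Tor options and decodes it again. SocksPort and ControlPort are standard torrc options, but the values here are examples only, and what darc does with the decoded dictionary beyond handing it to stem is an assumption.

```python
import json

# Example torrc options; the port numbers are placeholders
tor_cfg = {"SocksPort": "9050", "ControlPort": "9051"}

# This string is what the TOR_CFG environment variable would hold
tor_cfg_env = json.dumps(tor_cfg)

# Decoding yields a plain dict again, suitable as a configuration mapping
decoded = json.loads(tor_cfg_env)
print(tor_cfg_env)
```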
I2P Proxy Configuration¶
- I2P_WAIT¶
- Type:
- Default:
90
Time after which the attempt to start I2P is aborted.
Note
If not provided, there will be NO timeouts.
- I2P_ARGS¶
- Type:
str (Shell)
- Default:
''
I2P bootstrap arguments for i2prouter start.
If provided, it will be parsed as command line arguments (c.f. shlex.split()).
Note
The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
ZeroNet Proxy Configuration¶
- ZERONET_WAIT¶
- Type:
- Default:
90
Time after which the attempt to start ZeroNet is aborted.
Note
If not provided, there will be NO timeouts.
- ZERONET_ARGS¶
- Type:
str (Shell)
- Default:
''
ZeroNet bootstrap arguments for ZeroNet.sh main.
Note
If provided, it will be parsed as command line arguments (c.f. shlex.split()).
Freenet Proxy Configuration¶
- FREENET_WAIT¶
- Type:
- Default:
90
Time after which the attempt to start Freenet is aborted.
Note
If not provided, there will be NO timeouts.
- FREENET_ARGS¶
- Type:
str (Shell)
- Default:
''
Freenet bootstrap arguments for run.sh start.
If provided, it will be parsed as command line arguments (c.f. shlex.split()).
Note
The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
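The shell-typed variables (I2P_ARGS, ZERONET_ARGS and FREENET_ARGS) are split with shell-like quoting rules. The sketch below shows what shlex.split() does with such a string; the flags are placeholders invented for this example and are not real options of any of these programs, and the way the final command is assembled is an assumption.

```python
import shlex

# Placeholder argument string; quoting keeps the path with a space as one argument
args_env = "--example-flag --example-path '/tmp/my data'"

# Split the string the way a POSIX shell would
args = shlex.split(args_env)

# A launcher might append the arguments to the base command (assumption)
command = ["i2prouter", "start", *args]
print(command)
```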