Configuration

The darc project is configured through environment variables. Below is the full list of supported environment variables you may use to configure the behaviour of darc.

General Configurations

DARC_REBOOT
Type:

bool (int)

Default:

0

Whether to exit the program after the first round, i.e. after crawling all links from the requests link database and loading all links from the selenium link database.

This is useful when storage capacity is limited and you wish to free some space before continuing with the next round. See Docker integration for more information.

DARC_DEBUG
Type:

bool (int)

Default:

0

Whether to run the program in debugging mode.

DARC_VERBOSE
Type:

bool (int)

Default:

0

Whether to run the program in verbose mode. If DARC_DEBUG is True, verbose mode is always enabled.
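As a rough illustration of the bool (int) convention used throughout this page, a flag can be read as an integer and converted to a boolean. The helper name env_flag is hypothetical and not part of darc; only the rule that DARC_DEBUG implies verbose mode comes from the description above.

```python
import os


def env_flag(name: str, default: int = 0) -> bool:
    """Read a ``bool (int)`` environment variable (hypothetical helper, not darc's API)."""
    return bool(int(os.getenv(name, str(default))))


# Simulate a shell in which only DARC_DEBUG has been exported.
os.environ['DARC_DEBUG'] = '1'
os.environ.pop('DARC_VERBOSE', None)

debug = env_flag('DARC_DEBUG')
# Verbose mode is forced on whenever debugging is enabled.
verbose = env_flag('DARC_VERBOSE') or debug
```

With this convention, 0 disables a flag and any non-zero integer enables it.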

DARC_FORCE
Type:

bool (int)

Default:

0

Whether to ignore robots.txt rules when crawling (c.f. crawler()).

DARC_CHECK
Type:

bool (int)

Default:

0

Whether to check the proxy and hostname before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

If DARC_CHECK_CONTENT_TYPE is True, this environment variable is always set to True.

DARC_CHECK_CONTENT_TYPE
Type:

bool (int)

Default:

0

Whether to check the content type through HEAD requests before crawling (when calling extract_links(), read_sitemap() and read_hosts()).

DARC_URL_PAT
Type:

List[Tuple[str, str, int]] (JSON)

Default:

[]

Regular expression patterns to match all reasonable URLs.

The environment variable should be JSON encoded, as an array of three-element tuples. Each tuple contains one scheme (str) used as the fallback default scheme for matched URLs, one Python regular expression string (str) as described in the builtin re module, and one numeric value (int) representing the flags as defined in the builtin re module.

Important

The patterns must have a named match group url, e.g. (?P<url>bitcoin:\w+) so that the function can extract matched URLs from the given pattern.

The regular expressions are always compiled in ASCII mode, i.e. with the re.ASCII flag.
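The encoding described above can be sketched as follows. This is a minimal example under stated assumptions: the variable names, the sample text and the choice of re.IGNORECASE are illustrative, and the decoding half only mirrors the description on this page rather than darc's actual implementation.

```python
import json
import re

# One entry: (fallback scheme, pattern with a named group ``url``, re flags as an int).
patterns = [
    ['bitcoin', r'(?P<url>bitcoin:\w+)', int(re.IGNORECASE)],
]
encoded = json.dumps(patterns)  # the string to export as DARC_URL_PAT

# Decoding side: compile each pattern, always adding re.ASCII.
compiled = [
    (scheme, re.compile(pattern, flags | re.ASCII))
    for scheme, pattern, flags in json.loads(encoded)
]

# Illustrative input text; the address is a placeholder.
text = 'donate to bitcoin:1BoatSLRHtKNngkdXEeobR76b53LETtpyT today'
found = [m.group('url') for _, rx in compiled for m in rx.finditer(text)]
```

Converting the flag to int before encoding keeps the JSON portable, since the flags are stored as plain numbers.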

DARC_CPU
Type:

int

Default:

None

Number of concurrent processes. If not provided, then the number of system CPUs will be used.

DARC_MULTIPROCESSING
Type:

bool (int)

Default:

1

Whether to enable multiprocessing support.

DARC_MULTITHREADING
Type:

bool (int)

Default:

0

Whether to enable multithreading support.

Note

DARC_MULTIPROCESSING and DARC_MULTITHREADING can NOT be enabled at the same time.
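The interaction between DARC_CPU, DARC_MULTIPROCESSING and DARC_MULTITHREADING might be resolved along these lines. The function resolve_concurrency is hypothetical, as are the returned mode names and the choice to raise ValueError on a conflict; reusing DARC_CPU as the thread count is also an assumption.

```python
import os


def resolve_concurrency(env):
    """Return (mode, workers) from a mapping of environment variables (hypothetical helper)."""
    multiprocessing = bool(int(env.get('DARC_MULTIPROCESSING', '1')))
    multithreading = bool(int(env.get('DARC_MULTITHREADING', '0')))
    if multiprocessing and multithreading:
        raise ValueError('DARC_MULTIPROCESSING and DARC_MULTITHREADING are mutually exclusive')

    # Fall back to the number of system CPUs when DARC_CPU is not provided.
    cpu = env.get('DARC_CPU')
    workers = int(cpu) if cpu is not None else (os.cpu_count() or 1)

    if multiprocessing:
        return 'process', workers
    if multithreading:
        return 'thread', workers  # assumption: the same worker count applies to threads
    return 'single', 1
```

Note that, because DARC_MULTIPROCESSING defaults to 1, enabling multithreading requires setting DARC_MULTIPROCESSING to 0 explicitly.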

DARC_USER
Type:

str

Default:

current login user (c.f. getpass.getuser())

Non-root user for proxies.

Data Storage

See also

See darc.save for more information about source saving.

See darc.db for more information about database integration.

PATH_DATA
Type:

str (path)

Default:

data

Path to data storage.

REDIS_URL
Type:

str (url)

Default:

redis://127.0.0.1

URL to the Redis database.

DB_URL
Type:

str (url)

URL to the RDS storage.

Important

The task queues will be saved to the darc database; the data submission will be saved to the darcweb database.

Thus, when providing this environment variable, please do NOT specify the database name.
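To see why the database name must be omitted, consider how two database URLs could be derived from a single DB_URL. This is only a sketch: the helper database_urls, the sample URL and the way the name is appended are assumptions, not darc's actual code.

```python
def database_urls(db_url: str) -> dict:
    """Derive per-database URLs from a DB_URL without a database name (hypothetical helper)."""
    base = db_url.rstrip('/')
    return {
        'darc': f'{base}/darc',        # task queues
        'darcweb': f'{base}/darcweb',  # data submission
    }


# Placeholder credentials and host, for illustration only.
urls = database_urls('mysql://user:secret@127.0.0.1:3306/')
```

If DB_URL already ended in a database name, appending a second name in this way would produce an invalid URL.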

DARC_BULK_SIZE
Type:

int

Default:

100

Bulk size for updating databases.

See also

  • darc.db.save_requests()

  • darc.db.save_selenium()

LOCK_TIMEOUT
Type:

float

Default:

10

Lock blocking timeout.

Note

If set to infinity (inf), no timeout will be applied.
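One way to handle an infinite value is to parse it with float() and map infinity to "no timeout". The helper parse_timeout and the use of None to mean "no timeout" are assumptions made for this sketch.

```python
import math


def parse_timeout(raw: str = '10'):
    """Convert a LOCK_TIMEOUT-style value to seconds, or None for no timeout (hypothetical helper)."""
    value = float(raw)  # float() accepts 'inf' as well as plain numbers
    return None if math.isinf(value) else value
```

For example, parse_timeout('inf') yields None, while parse_timeout('2.5') yields 2.5.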

See also

Get a lock from darc.db.get_lock().

DARC_MAX_POOL
Type:

int

Default:

1_000

Maximum number of links loaded from the database.

Note

If set to infinity (inf), no limit will be applied.

See also

  • darc.db.load_requests()

  • darc.db.load_selenium()

REDIS_LOCK
Type:

bool (int)

Default:

0

Whether to use a Redis (Lua) lock to ensure process-safe and thread-safe operations.

See also

Toggles the behaviour of darc.db.get_lock().

RETRY_INTERVAL
Type:

int

Default:

10

Interval between retries after a Redis command fails.

Note

If set to infinity (inf), no interval will be applied.

See also

Toggles the behaviour of darc.db.redis_command().

Web Crawlers

DARC_WAIT
Type:

float

Default:

60

Time interval between rounds when the requests and/or selenium databases are empty.

DARC_SAVE
Type:

bool (int)

Default:

0

Whether to save processed links back to the database.

Note

If DARC_SAVE is True, then DARC_SAVE_REQUESTS and DARC_SAVE_SELENIUM will be forced to be True.

See also

See darc.db for more information about link database.

DARC_SAVE_REQUESTS
Type:

bool (int)

Default:

0

Whether to save links crawled by crawler() back to the requests database.

See also

See darc.db for more information about link database.

DARC_SAVE_SELENIUM
Type:

bool (int)

Default:

0

Whether to save links loaded by loader() back to the selenium database.

See also

See darc.db for more information about link database.

TIME_CACHE
Type:

float

Default:

60

Time delta for caches in seconds.

The darc project supports caching for fetched files. TIME_CACHE specifies for how long the fetched files will be cached and NOT fetched again.

Note

If TIME_CACHE is None, then cached files are kept forever.
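The caching rule can be pictured with a small freshness check. This is a sketch only: the function is_cached, its signature and the use of timedelta are assumptions and do not reflect how darc stores its cache.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_cached(fetched_at: datetime, now: datetime, time_cache: Optional[timedelta]) -> bool:
    """Return True if a file fetched at ``fetched_at`` should NOT be fetched again."""
    if time_cache is None:  # no expiry: cached forever
        return True
    return now - fetched_at < time_cache


# With the default of 60 seconds, a file fetched 30 seconds ago is still cached.
start = datetime(2020, 1, 1, 12, 0, 0)
fresh = is_cached(start, start + timedelta(seconds=30), timedelta(seconds=60))
stale = is_cached(start, start + timedelta(seconds=90), timedelta(seconds=60))
```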

SE_WAIT
Type:

float

Default:

60

Time to wait for selenium to finish loading pages.

Note

Internally, selenium waits for the browser to finish loading the page before returning (i.e. the web API event DOMContentLoaded). However, some extra scripts may take more time to run after the event.

CHROME_BINARY_LOCATION
Type:

str

Default:

google-chrome

Path to the Google Chrome binary location.

Note

This environment variable is mandatory on systems other than macOS and Linux.

See also

See darc.selenium for more information.

White / Black Lists

LINK_WHITE_LIST
Type:

List[str] (JSON)

Default:

[]

White list of hostnames that should be crawled.

Note

Regular expressions are supported.

LINK_BLACK_LIST
Type:

List[str] (JSON)

Default:

[]

Black list of hostnames that should not be crawled.

Note

Regular expressions are supported.

LINK_FALLBACK
Type:

bool (int)

Default:

0

Fallback value for match_host().

MIME_WHITE_LIST
Type:

List[str] (JSON)

Default:

[]

White list of content types that should be crawled.

Note

Regular expressions are supported.

MIME_BLACK_LIST
Type:

List[str] (JSON)

Default:

[]

Black list of content types that should not be crawled.

Note

Regular expressions are supported.

MIME_FALLBACK
Type:

bool (int)

Default:

0

Fallback value for match_mime().

PROXY_WHITE_LIST
Type:

List[str] (JSON)

Default:

[]

White list of proxy types that should be crawled.

Note

The proxy types are case insensitive.

PROXY_BLACK_LIST
Type:

List[str] (JSON)

Default:

[]

Black list of proxy types that should not be crawled.

Note

The proxy types are case insensitive.

PROXY_FALLBACK
Type:

bool (int)

Default:

0

Fallback value for match_proxy().

Note

If provided, LINK_WHITE_LIST, LINK_BLACK_LIST, MIME_WHITE_LIST, MIME_BLACK_LIST, PROXY_WHITE_LIST and PROXY_BLACK_LIST should all be JSON encoded strings.
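A minimal sketch of how a white list, a black list and a fallback value can combine is given below. The function should_crawl is hypothetical and is not darc's match_host(); the order of evaluation (white list first, then black list, then fallback), the use of re.fullmatch and the meaning of the return value are all assumptions, so the real functions may follow a different convention.

```python
import json
import re


def should_crawl(host, white_list, black_list, fallback):
    """Decide whether to crawl ``host`` (hypothetical helper; not darc's match_host())."""
    if any(re.fullmatch(pattern, host, re.IGNORECASE) for pattern in white_list):
        return True
    if any(re.fullmatch(pattern, host, re.IGNORECASE) for pattern in black_list):
        return False
    return fallback


# The lists arrive as JSON encoded strings; backslashes must be escaped in JSON.
white = json.loads(r'[".*\\.onion"]')
black = json.loads(r'["ads\\..*"]')

onion = should_crawl('example.onion', white, black, False)    # white list hit
ads = should_crawl('ads.example.com', white, black, True)     # black list hit
other = should_crawl('example.org', white, black, False)      # neither: fallback
```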

Data Submission

SAVE_DB
Type:

bool

Default:

True

Save submitted data to database.

API_RETRY
Type:

int

Default:

3

Number of retries for API submission upon failure.

API_NEW_HOST
Type:

str

Default:

None

API URL for submit_new_host().

API_REQUESTS
Type:

str

Default:

None

API URL for submit_requests().

API_SELENIUM
Type:

str

Default:

None

API URL for submit_selenium().

Note

If any of API_NEW_HOST, API_REQUESTS and API_SELENIUM is None, the corresponding submit function will save the JSON data in the path specified by PATH_DATA.

Tor Proxy Configuration

DARC_TOR
Type:

bool (int)

Default:

1

Whether to manage the Tor proxy through darc.

TOR_PORT
Type:

int

Default:

9050

Port for Tor proxy connection.

TOR_CTRL
Type:

int

Default:

9051

Port for Tor controller connection.

TOR_PASS
Type:

str

Default:

None

Tor controller authentication token.

Note

If not provided, it will be requested at runtime.

TOR_RETRY
Type:

int

Default:

3

Number of retries for Tor bootstrap upon failure.

TOR_WAIT
Type:

float

Default:

90

Time after which the attempt to start Tor is aborted.

Note

If not provided, there will be NO timeouts.

TOR_CFG
Type:

Dict[str, Any] (JSON)

Default:

{}

Tor bootstrap configuration for stem.process.launch_tor_with_config().

Note

If provided, it should be a JSON encoded string.
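For instance, a TOR_CFG value can be produced and read back as shown here. The option names DataDirectory and Log are ordinary torrc options, but the values are placeholders chosen for illustration, and how darc merges this mapping with its own settings is not shown.

```python
import json

# Illustrative options only; the values are placeholders.
tor_cfg = {
    'DataDirectory': '/tmp/darc-tor',
    'Log': 'notice stdout',
}

encoded = json.dumps(tor_cfg)  # the string to export as TOR_CFG
decoded = json.loads(encoded)  # the mapping recovered on the consuming side
```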

I2P Proxy Configuration

DARC_I2P
Type:

bool (int)

Default:

1

Whether to manage the I2P proxy through darc.

I2P_PORT
Type:

int

Default:

4444

Port for I2P proxy connection.

I2P_RETRY
Type:

int

Default:

3

Number of retries for I2P bootstrap upon failure.

I2P_WAIT
Type:

float

Default:

90

Time after which the attempt to start I2P is aborted.

Note

If not provided, there will be NO timeouts.

I2P_ARGS
Type:

str (Shell)

Default:

''

I2P bootstrap arguments for i2prouter start.

If provided, it will be parsed as command line arguments (c.f. shlex.split()).

Note

The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.
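The effect of shlex.split() on a shell-style argument string can be sketched as follows. The option names below are placeholders and NOT real i2prouter options; the way the final command list is assembled is likewise an assumption.

```python
import shlex

# Placeholder arguments, not real i2prouter options.
i2p_args = "--example-option 'value with spaces' extra"

# Quoted substrings stay together as a single argument.
parsed = shlex.split(i2p_args)
command = ['i2prouter', 'start', *parsed]
```

The same parsing rule applies to the other variables of type str (Shell) on this page.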

ZeroNet Proxy Configuration

DARC_ZERONET
Type:

bool (int)

Default:

1

Whether to manage the ZeroNet proxy through darc.

ZERONET_PORT
Type:

int

Default:

4444

Port for ZeroNet proxy connection.

ZERONET_RETRY
Type:

int

Default:

3

Number of retries for ZeroNet bootstrap upon failure.

ZERONET_WAIT
Type:

float

Default:

90

Time after which the attempt to start ZeroNet is aborted.

Note

If not provided, there will be NO timeouts.

ZERONET_PATH
Type:

str (path)

Default:

/usr/local/src/zeronet

Path to the ZeroNet project.

ZERONET_ARGS
Type:

str (Shell)

Default:

''

ZeroNet bootstrap arguments for ZeroNet.sh main.

Note

If provided, it will be parsed as command line arguments (c.f. shlex.split()).

Freenet Proxy Configuration

DARC_FREENET
Type:

bool (int)

Default:

1

Whether to manage the Freenet proxy through darc.

FREENET_PORT
Type:

int

Default:

8888

Port for Freenet proxy connection.

FREENET_RETRY
Type:

int

Default:

3

Number of retries for Freenet bootstrap upon failure.

FREENET_WAIT
Type:

float

Default:

90

Time after which the attempt to start Freenet is aborted.

Note

If not provided, there will be NO timeouts.

FREENET_PATH
Type:

str (path)

Default:

/usr/local/src/freenet

Path to the Freenet project.

FREENET_ARGS
Type:

str (Shell)

Default:

''

Freenet bootstrap arguments for run.sh start.

If provided, it will be parsed as command line arguments (c.f. shlex.split()).

Note

The command will be run as DARC_USER if the current user (c.f. getpass.getuser()) is root.