How to gracefully deploy darc?

Important

It is NOT necessary to work at the darc repository folder directly. You can just use darc with your customised code somewhere as you wish.

However, for simplicity, all relative paths referred in this article is relative to the project root of the darc repository.

To deploy darc, there would generally be three basic steps:

  1. deploy the darc Docker image;

  2. setup the healthcheck watchdog service;

  3. install the upload cron job (optional)

To Start With

Per best practice, the system must have as least 2 GB RAM and 2 CPU cores to handle the loader worker properly. And the capacity of the RAM will heavily impact the performance of the selenium integration as Google Chrome is the renowned memory monster.

Note

Imma assume that you’re using *NIX systems, as I don’t believe a Windows user is gonna see this ;)

Firstly, you will need to clone the repository to your system:

git clone https://github.com/JarryShaw/darc.git
# change your working directory
cd darc

then set up the folders you need for the log files:

mkdir -p logs
mkdir -p logs/cron

And now, you will need to decide where you would like to store the data (documents crawled and saved by darc); let’s assume that you have a /data disk mounted on the system – since that’s what I have on mine xD – which would be big enough to use as a safe seperated storage place from the system so that darc will not crash your system by exhausting the storage,

mkdir /data/darc
# and make a shortcut
ln -s /data/darc data

therefore, you’re gonna save your data in /data/darc folder.

Software Dependency

After setting local systems, there’re some software dependencies you shall install:

  1. Docker

darc is exclusively deployed through Docker environment, even though it can also be deployed directly on a host machine, either Linux or macOS, and perhaps Windows but I had never tested.

  1. Database

darc needs database backend for the task queue management and other stuffs. It is highly recommended to deploy darc with Redis; but if you insist, you may use relationship database (e.g. MySQL, SQLite, PostgreSQL, etc.) instead.

Important

In this article, I will not discuss about the usage of relationship database as they’re just too miserable for darc in terms of availability anyway.

As per best practice, 4 GB RAM would be minimal requirement for the Redis database. It would be suggested to use directly a cloud provider hosted Redis database instead of running it on the same server as darc.

Deploy darc Docker Image

As discussed in Docker Integration, darc is exclusively integrated with Docker workflow. So basically, just pull the image from Docker Hub or GitHub Container Registry:

# Docker Hub
docker pull jsnbzh/darc:latest
# GitHub Container Registry
docker pull ghcr.io/jarryshaw/darc:latest

In cases where you would like to use a debug image, which changes the apt sources to China hosted and IPython and other auxiliaries installed, you call also pull such image instead:

# Docker Hub
docker pull jsnbzh/darc:debug
# GitHub Container Registry
docker pull ghcr.io/jarryshaw/darc-debug:latest

Then you will need to customise the docker-compose.yml based on your needs. Default values and descriptive help messages can be found in the file.

The rest of it is easy as just calling docker-compose command to manage the deployed containers, thus I shall not discuss further.

Deploy with Customisations

Important

I’ve made a sample customisation at demo/deploy/ folder, which can be used directly as a new repository to start with your customisation, please check it out before moving forwards.

As in the sample customisation, you can simply use the Dockerfile there as your Docker environment declration. And the entrypoint file market/run.py has the sites customisations registered and the CLI bundled.

Setup healthcheck Daemon Service

Since darc can be quite a burden to its host system, I introduced this healthcheck service as discussed in Auxiliary Scripts.

For a normal System V based service system, you can simply install the darc-healthcheck service to /etc/systemd/system/:

ln -s extra/healthcheck.service /etc/systemd/system/darc-healthcheck.service

then enable it to run at startup:

sudo systemctl enable darc-healthcheck.service

And from now on, you can simply manage the darc-healthcheck service through systemctl or service command as you prefer.

Install upload Cron Job

In certain cases, you might wish to upload the API submission JSON files to your FTP server which has much more space than the deploy server, then you can utilise the upload cron job as mentioned in Auxiliary Scripts.

Simply type the following command:

crontab -e

and add the cron job into the file opened:

10 0 * * * ( cd /path/to/darc/ && /path/to/python3 /path/to/darc/extra/upload.py --host ftp://hostname --user username:password ) >> /path/to/darc/logs/cron/darc-upload.log 2>&1

just remember to change the paths, hostname and credential respectively; and at last, to activate the new cron job:

sudo systemctl restart cron.service

Now, darc API submission JSON files will be uploaded to the target FTP server everyday at 0:10 am.

Bonus Tip

There is a Makefile at the project root. You can play and try to exploit it. A very useful command is that

make reload

when you wish to pull the remote repository and restart darc gracefully.