How to gracefully deploy darc?
Important
It is NOT necessary to work in the darc repository folder directly. You can use darc with your customised code anywhere you wish. However, for simplicity, all relative paths referred to in this article are relative to the project root of the darc repository.
To deploy darc, there are generally three basic steps:

1. deploy the darc Docker image;
2. set up the healthcheck watchdog service;
3. install the upload cron job (optional).
To Start With
Per best practice, the system must have at least 2 GB of RAM
and 2 CPU cores to handle the loader worker properly.
The amount of RAM will also heavily impact the performance
of the selenium integration, as Google Chrome is a
renowned memory monster.
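Before deploying, it may help to confirm the host actually meets these minimums; a quick check on a Linux box, using only standard tools:

```shell
# quick sanity check of host resources before deploying
nproc                                                             # CPU cores; 2 or more recommended
awk '/MemTotal/ {printf "%.1f GB\n", $2/1048576}' /proc/meminfo   # total RAM; 2 GB or more recommended
```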
Note
I'll assume that you're using a *NIX system, as I don't believe a Windows user is going to see this ;)
Firstly, you will need to clone the repository to your system:
git clone https://github.com/JarryShaw/darc.git
# change your working directory
cd darc
then set up the folders you need for the log files:
mkdir -p logs
mkdir -p logs/cron
And now, you will need to decide where you would like to store
the data (documents crawled and saved by darc). Let's assume
that you have a /data disk mounted on the system – since that's
what I have on mine xD – which is big enough to serve as a
safe, separate storage place, so that darc will not crash your
system by exhausting its storage:
mkdir /data/darc
# and make a shortcut
ln -s /data/darc data
Therefore, your data will be saved in the /data/darc folder.
Software Dependencies
After setting up the local system, there are some software dependencies you shall install.
darc is exclusively deployed through a Docker environment, even
though it can also be deployed directly on a host machine, either
Linux or macOS, and perhaps Windows, though I have never tested that.
Database
darc needs a database backend for task queue management and
other purposes. It is highly recommended to deploy darc with
Redis; but if you insist, you may use a
relational database (e.g. MySQL, SQLite, PostgreSQL,
etc.) instead.
Important
In this article, I will not discuss the usage of relational
databases, as they're just too miserable for darc in terms of
availability anyway.
As per best practice, 4 GB of RAM is the minimal requirement
for the Redis database. It is suggested to use a Redis database
hosted by a cloud provider directly, instead of running it on
the same server as darc.
Deploy darc Docker Image
As discussed in Docker Integration, darc
is exclusively
integrated with Docker workflow. So basically, just pull
the image from Docker Hub or GitHub Container Registry:
# Docker Hub
docker pull jsnbzh/darc:latest
# GitHub Container Registry
docker pull ghcr.io/jarryshaw/darc:latest
In cases where you would like to use a debug image, which
changes the apt sources to China-hosted mirrors and has
IPython and other auxiliaries installed, you can also pull
such an image instead:
# Docker Hub
docker pull jsnbzh/darc:debug
# GitHub Container Registry
docker pull ghcr.io/jarryshaw/darc-debug:latest
Then you will need to customise the docker-compose.yml
based on your needs. Default values and descriptive help
messages can be found in the file.
The rest is as easy as calling the docker-compose
command to manage the deployed containers, so I shall
not discuss it further.
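For reference, the typical lifecycle looks like this (run from the repository root where docker-compose.yml lives; the service name darc is an assumption here, so check your compose file):

```shell
docker-compose up -d          # start darc in the background
docker-compose logs -f darc   # follow the logs (service name may differ)
docker-compose restart darc   # restart after changing the configuration
docker-compose down           # stop and remove the containers
```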
Deploy with Customisations
Important
I've made a sample customisation in the demo/deploy/ folder,
which can be used directly as a new repository to start your
own customisation from; please check it out before moving forward.
As in the sample customisation, you can simply use the Dockerfile
there as your Docker environment declaration. And the entrypoint file
market/run.py has the site customisations registered and the
CLI bundled.
Set up healthcheck Daemon Service
Since darc can be quite a burden to its host system,
I introduced this healthcheck service, as discussed in
Auxiliary Scripts.
For a normal systemd-based system, you can
simply install the darc-healthcheck service to
/etc/systemd/system/:
# run from the repository root; the symlink target must be an absolute path
sudo ln -s "$(pwd)/extra/healthcheck.service" /etc/systemd/system/darc-healthcheck.service
then enable it to run at startup:
sudo systemctl enable darc-healthcheck.service
And from now on, you can simply manage the darc-healthcheck
service through the systemctl or service command,
as you prefer.
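For instance, starting the service and checking on it (standard systemd commands, nothing darc-specific):

```shell
sudo systemctl start darc-healthcheck.service    # start it right away
sudo systemctl status darc-healthcheck.service   # confirm it is active
sudo journalctl -u darc-healthcheck.service -f   # follow its logs
```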
Install upload Cron Job
In certain cases, you might wish to upload the API submission
JSON files to your FTP server, which has much more space than
the deploy server; you can then utilise the upload cron
job as mentioned in Auxiliary Scripts.
Simply type the following command:
crontab -e
and add the cron job into the file opened:
10 0 * * * ( cd /path/to/darc/ && /path/to/python3 /path/to/darc/extra/upload.py --host ftp://hostname --user username:password ) >> /path/to/darc/logs/cron/darc-upload.log 2>&1
Just remember to change the paths, hostname and credentials accordingly; and at last, to activate the new cron job:
sudo systemctl restart cron.service
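If you prefer to script the installation rather than edit the crontab by hand, the entry can be assembled from variables so the paths stay consistent (every value below is a placeholder, matching the template above):

```shell
# placeholders: point these at your actual installation
DARC_ROOT=/path/to/darc
PYTHON=/path/to/python3
FTP_HOST=ftp://hostname
FTP_USER=username:password
# assemble the cron entry (runs every day at 0:10)
CRON_LINE="10 0 * * * ( cd ${DARC_ROOT}/ && ${PYTHON} ${DARC_ROOT}/extra/upload.py --host ${FTP_HOST} --user ${FTP_USER} ) >> ${DARC_ROOT}/logs/cron/darc-upload.log 2>&1"
echo "${CRON_LINE}"
```

Appending it non-interactively with `( crontab -l 2>/dev/null; echo "${CRON_LINE}" ) | crontab -` achieves the same effect as editing via crontab -e.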
Now, darc API submission JSON files will be uploaded
to the target FTP server every day at 0:10 AM.
Bonus Tip
There is a Makefile at the project root. You can play with it
and try to exploit it. One very useful command is make reload,
for when you wish to pull the remote repository and restart
darc gracefully.