This is a dockerised version of NoSketch Engine, the open source version of Sketch Engine corpus manager and text analysis software developed by Lexical Computing Limited. NoSketch Engine lacks some features compared to Sketch Engine. See the comparison for details.
This repository is completely independent of both Lexical Computing Limited and the NoSketch Engine upstream. If you have any questions about the NoSketch Engine, please use the mailing list: https://groups.google.com/a/sketchengine.co.uk/g/noske . For questions about the docker image, please use the issue tracker.
This docker image is based on Debian 12 Bookworm and the NoSketch Engine build and installation process contains some additional hacks for convenient install and use. See Dockerfile for details.
- `git clone --depth=1 https://github.com/ELTE-DH/NoSketch-Engine-Docker.git`
- `make pull` – to download the docker image
- `make compile` – to compile sample corpora
- `make execute` – to execute a Sketch Engine command (`compilecorp`, `corpquery`, etc.) in the docker container (runs a test CLI query on `susanne` corpus by default)
- `make run` – to launch the docker container
- Navigate to `http://localhost:10070/` to try the WebUI
- Easy to add corpora (just add vertical file and registry file to the appropriate location, and compile the corpus with one command)
- CLI commands can be used directly (outside the docker image)
- Works on any domain without changing configuration (without HTTPS and Shibboleth)
- Two example corpora included: `susanne` (original NoSkE sample corpus) and `emagyardemo`
- (optional) Shibboleth SP (with eduid.hu)
- (optional) basic auth (updateable easily)
- (optional) HTTPS with Let's Encrypt (automatic renewal with traefik proxy)
Further info is available on how to analyse a plain text corpus with e-magyar and convert it to a format suitable for this system.
Corpus configuration recipes to aid compilation of large corpora can be found here.
- Either pull the prebuilt image from Dockerhub: `make pull` (or `docker pull eltedh/nosketch-engine:latest`)
- Or build your own image (the process can take 5 minutes or so): `make build IMAGE_NAME=myimage:latest` – be sure to name your image using the `IMAGE_NAME` parameter
- Put vert file(s) in the `corpora/CORPUS_NAME/vertical` directory (see examples in the `corpora/susanne/vertical` and `corpora/emagyardemo/vertical` directories)
- Put config in the `corpora/registry/CORPUS_NAME` file (see examples in `corpora/registry/susanne` and `corpora/registry/emagyardemo`)
- Compile all corpora listed in the `corpora/registry` directory using the docker image: `make compile`
    - To compile one corpus at a time (overwriting existing files), use the following command: `make execute CMD="compilecorp --no-ske --recompile-corpus CORPUS_REGISTRY_FILE"`
    - If you want to overwrite all existing indices automatically when running `make compile`, set any non-empty value for the `FORCE_RECOMPILE` env variable, e.g. `make compile FORCE_RECOMPILE=y`
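Before compiling, it can help to check that a new corpus follows the layout described above. The following is a minimal sketch and not part of this repository: `check_corpus_layout` is a hypothetical helper and `mycorpus` is a placeholder corpus name. It only checks that the files exist; it does not validate their contents.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): checks that a corpus
# has a non-empty vertical directory and a registry file under a corpora dir.
check_corpus_layout() (
    corpora_dir=$1
    corpus_name=$2
    vertical_dir="$corpora_dir/$corpus_name/vertical"
    registry_file="$corpora_dir/registry/$corpus_name"

    if [ ! -d "$vertical_dir" ] || [ -z "$(ls -A "$vertical_dir")" ]; then
        echo "missing or empty: $vertical_dir" >&2
        exit 1
    fi
    if [ ! -f "$registry_file" ]; then
        echo "missing: $registry_file" >&2
        exit 1
    fi
    echo "layout OK: $corpus_name"
)

# Example with a placeholder corpus name, run from the repository root:
# check_corpus_layout "${CORPORA_DIR:-$(pwd)/corpora}" mycorpus && make compile
```

The function body runs in a subshell, so its variables do not leak into the calling shell and `exit` only ends the check itself.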
(Optional, only recommended if variables are altered)
Customise the environment variables in secrets/env.sh (see secrets/env.sh.template
for example) and export them into the current shell with source secrets/env.sh
- Run docker container: `make run`
- Navigate to `http://SERVER_NAME:10070/` to use the WebUI
- `make execute`: runs NoSketch Engine CLI commands using the docker image. Specify the command to run in the `CMD` parameter. For example:
    - `make execute CMD='corpinfo -s susanne'` gives info about the `susanne` corpus
    - `make execute CMD='corpquery emagyardemo "[lemma=\"és\"]"'` runs the specified query on the `emagyardemo` corpus and gives 2 hits. Mind the use of quotation marks: `\"` inside `"` inside `'`.
- `make connect`: gives a shell to a running container
- `make stop`: stops the container
- `make clean`: stops the container, removes indices for all corpora and deletes docker image – use with caution!
- `make create-cert`: create self-signed certificate for Shibboleth (must restart a container to apply)
- `make remove-cert`: delete self-signed certificate files (must restart a container to apply)
- `make htpasswd`: generate strong password for htaccess authentication (must restart a container to apply; see details in Basic auth section)
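The nested quoting in the `corpquery` example can be inspected without docker. This is a minimal sketch, assuming the value of `CMD` is eventually parsed by a POSIX shell (how the Makefile passes it on is not shown here); `show_words` is a hypothetical helper that prints each word that shell would see.

```shell
#!/bin/sh
# Hypothetical helper: prints one line per word after shell parsing of $1.
show_words() (
    eval "set -- $1"
    for word in "$@"; do
        printf '<%s>\n' "$word"
    done
)

# Single quotes keep the whole string literal in the outer shell,
# so the inner double quotes and backslashes survive untouched.
CMD='corpquery emagyardemo "[lemma=\"és\"]"'
show_words "$CMD"
```

The third word comes out as `[lemma="és"]`, which is the query string with its inner quotes intact.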
By default,
- the name of the docker image (`IMAGE_NAME`) is `eltedh/nosketch-engine:latest`,
- the name of the docker container (`CONTAINER_NAME`) is `noske`,
- the directory where the corpora are stored (`CORPORA_DIR`) is `$(pwd)/corpora`,
- the port number which the docker container uses (`PORT`) is `10070`,
- the variable to force recompiling already indexed corpora (`FORCE_RECOMPILE`) is not set (empty or not set means false; any non-empty value means true),
- the citation link (`CITATION_LINK`) is `https://github.com/elte-dh/NoSketch-Engine-Docker`,
- the server name required for Let's Encrypt and/or Shibboleth (`SERVER_NAME`) is `https://sketchengine.company.com/` (mandatory for `docker-compose.yml`),
- the server alias required for Let's Encrypt and/or Shibboleth (`SERVER_ALIAS`) is `sketchengine.company.com` (mandatory for `docker-compose.yml`),
- the e-mail address required by Let's Encrypt (`LETS_ENCRYPT_EMAIL`) is not set (mandatory for Let's Encrypt and `docker-compose.yml`),
- the self-signed public and private keys (`PUBLIC_KEY`, `PRIVATE_KEY`) are loaded from `secrets/sp.for.eduid.service.hu-{cert,key}.crt` or empty if these files do not exist (mandatory for `docker-compose.yml`),
- the htaccess and htpasswd files (`HTACCESS`, `HTPASSWD`) are loaded from `secrets/{htaccess,htpasswd}` (see `secrets/{htaccess.template,htpasswd.template}` for examples) or empty if these files do not exist (mandatory for `docker-compose.yml`).
If there is a need to change these, set them as environment variables (e.g. export IMAGE_NAME=myimage:latest)
or supplement make commands with the appropriate values (e.g. make run PORT=8080).
E.g. export IMAGE_NAME=myimage:latest; make build builds an image called myimage:latest, and
make run IMAGE_NAME=myimage:latest CONTAINER_NAME=mycontainer PORT=12345 launches the image called myimage:latest in a container
called mycontainer which will use port 12345.
In the latter case the system will be available at http://SERVER_NAME:12345/.
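To see which values would be in effect for the current shell, the documented defaults can be mirrored with shell parameter expansion. This is only a sketch: `print_settings` is a hypothetical helper, it does not read the Makefile, and it covers only variables exported in the environment (not values passed as `make` arguments).

```shell
#!/bin/sh
# Sketch only: mirrors the documented defaults with ${VAR:-default}.
# print_settings is a hypothetical helper, not part of this repository.
print_settings() {
    echo "IMAGE_NAME=${IMAGE_NAME:-eltedh/nosketch-engine:latest}"
    echo "CONTAINER_NAME=${CONTAINER_NAME:-noske}"
    echo "CORPORA_DIR=${CORPORA_DIR:-$(pwd)/corpora}"
    echo "PORT=${PORT:-10070}"
    # Empty or unset means false; any non-empty value means true.
    if [ -n "${FORCE_RECOMPILE:-}" ]; then
        echo "FORCE_RECOMPILE=true"
    else
        echo "FORCE_RECOMPILE=false"
    fi
}

print_settings
```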
See the table below on which make command accepts which parameter:
| command | `IMAGE_NAME` | `CONTAINER_NAME` | `CORPORA_DIR` | `PORT` | `FORCE_RECOMPILE` | `USERNAME` | `PASSWORD` | The Other Variables |
|---|---|---|---|---|---|---|---|---|
| `make pull` | ✔ | . | . | . | . | . | . | . |
| `make build` | ✔ | . | . | . | . | . | . | . |
| `make compile` | ✔ | . | . | . | ✔ | . | . | . |
| `make execute` | ✔ | . | ✔ | . | ✔ | . | . | ✔ |
| `make run` | ✔ | ✔ | ✔ | ✔ | . | . | . | ✔ |
| `make connect` | . | ✔ | . | . | . | . | . | . |
| `make stop` | . | ✔ | . | . | . | . | . | . |
| `make clean` | ✔ | ✔ | ✔ | . | . | . | . | . |
| `make create-cert` | . | . | . | . | . | . | . | . |
| `make remove-cert` | . | . | . | . | . | . | . | . |
| `make htpasswd` | ✔ | . | . | . | . | ✔ | ✔ | . |
- The Other Variables are
    - `CITATION_LINK`
    - `SERVER_NAME` and `SERVER_ALIAS`
    - `PUBLIC_KEY` and `PRIVATE_KEY`
    - `HTACCESS` and `HTPASSWD`
- The `LETS_ENCRYPT_EMAIL` variable is only used in `docker-compose.yml`
In the rare case of multiple different docker images, be sure to name them differently (by using IMAGE_NAME).
In the more common case of multiple different docker containers running simultaneously,
be sure to name them differently (by using CONTAINER_NAME) and also be sure to use a different port for each of them
(by using PORT). To handle multiple different sets of corpora be sure to set the directory containing the corpora
(CORPORA_DIR) accordingly for each container.
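As an illustration of running several containers side by side, the sketch below prints (but does not run) one `make run` command per instance, each with its own container name, port and corpora directory. The instance names, the starting port and the `corpora-NAME` directory naming are hypothetical choices, and `print_run_commands` is not part of this repository.

```shell
#!/bin/sh
# Sketch: prints (does not run) one `make run` command per instance.
# Instance names, start port and corpora-NAME directories are hypothetical.
print_run_commands() (
    port=$1
    shift
    for name in "$@"; do
        echo "make run CONTAINER_NAME=$name PORT=$port CORPORA_DIR=$(pwd)/corpora-$name"
        port=$((port + 1))
    done
)

print_run_commands 10070 noske-a noske-b
```

Printing the commands first makes it easy to review the name and port assignments before launching anything.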
If you want to build your own docker image be sure to include the IMAGE_NAME parameter into the build command:
make build IMAGE_NAME=myimage:latest and also provide IMAGE_NAME=myimage:latest for every make command
which accepts this parameter.
A convenient solution for managing many environment variables in an easy and reproducible way
(e.g. for docker-compose.yml) is to customise and source secrets/env.sh (based on
secrets/env.sh.template) before running the actual command:
source secrets/env.sh; docker-compose up -d or source secrets/env.sh; make run.
See secrets/env.sh.template for example configuration.
Two types of authentication are supported: basic auth and Shibboleth.
- Copy and uncomment relevant config lines from `secrets/htaccess.template` into `secrets/htaccess` and set username and password in `secrets/htpasswd` (e.g. use the `make htpasswd USERNAME="USERNAME" PASSWORD="PASSWD" >> secrets/htpasswd` shortcut for running `htpasswd` from the `apache2-utils` package inside docker)
- Run or restart the container to apply or (re)build your custom image
Note: All users have the same privileges. Currently, there is no interface to manage users from the web UI.
To be able to use the container as a Shibboleth SP (with eduid.hu):
- Set the following environment variables:
    - `SERVER_NAME` e.g. `export SERVER_NAME="https://sketchengine.company.com/"`
    - `SERVER_ALIAS` e.g. `export SERVER_ALIAS="sketchengine.company.com"`
- Obtain a self-signed certificate:
    - `make create-cert` to create a new certificate
    - Or put your files to `secrets/sp.for.eduid.service.hu-cert.crt` and `secrets/sp.for.eduid.service.hu-key.crt` with appropriate permissions (`chmod 644 secrets/sp.for.eduid.service.hu-cert.crt secrets/sp.for.eduid.service.hu-key.crt`)
- Set up HTTPS
- Run or restart the container to apply or uncomment the relevant lines at the end of `Dockerfile` before (re)building your custom image
- Register your SP with your IdP
- Set (`export`) the environment variables (or set them in `secrets/env.sh` based on `secrets/env.sh.template` and `source secrets/env.sh`):
    - `CITATION_LINK` e.g. `export CITATION_LINK="https://github.com/elte-dh/NoSketch-Engine-Docker"`
    - `LETS_ENCRYPT_EMAIL` e.g. `export LETS_ENCRYPT_EMAIL="contact@company.com"`
    - `SERVER_NAME` e.g. `export SERVER_NAME="https://sketchengine.company.com/"`
    - `SERVER_ALIAS` e.g. `export SERVER_ALIAS="sketchengine.company.com"`
    - (optional) `IMAGE_NAME`, `PORT` and `CONTAINER_NAME`
    - `PRIVATE_KEY` e.g. `export PRIVATE_KEY="$(cat secrets/sp.for.eduid.service.hu-key.crt 2> /dev/null)"` or set as empty if basic auth is used: `export PRIVATE_KEY=""`
    - `PUBLIC_KEY` e.g. `export PUBLIC_KEY="$(cat secrets/sp.for.eduid.service.hu-cert.crt 2> /dev/null)"` or set as empty if basic auth is used: `export PUBLIC_KEY=""`
    - `HTACCESS` e.g. `export HTACCESS="$(cat secrets/htaccess 2> /dev/null)"` or set as empty if Shibboleth is used: `export HTACCESS=""`
    - `HTPASSWD` e.g. `export HTPASSWD="$(cat secrets/htpasswd 2> /dev/null)"` or set as empty if Shibboleth is used: `export HTPASSWD=""`
- Run `docker-compose up -d`
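Since several variables are described as mandatory for `docker-compose.yml`, a small pre-flight check can catch a forgotten `export`. The sketch below is hypothetical (`check_compose_env` is not part of this repository). It assumes that the server and e-mail variables must be non-empty, while the key and htaccess variables only need to be set and may be empty for the authentication method that is not in use.

```shell
#!/bin/sh
# Hypothetical pre-flight check to run before `docker-compose up -d`.
# Reports every variable that is not usable and fails if there is any.
check_compose_env() (
    status=0
    # Assumed to be required with a non-empty value.
    for name in SERVER_NAME SERVER_ALIAS LETS_ENCRYPT_EMAIL; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset or empty: $name" >&2
            status=1
        fi
    done
    # Must be set, but may be empty for the auth method not in use.
    for name in PRIVATE_KEY PUBLIC_KEY HTACCESS HTPASSWD; do
        eval "is_set=\${$name+yes}"
        if [ -z "$is_set" ]; then
            echo "unset: $name" >&2
            status=1
        fi
    done
    exit "$status"
)

# Possible usage, from the repository root:
# . secrets/env.sh && check_compose_env && docker-compose up -d
```

The `eval` lines only ever expand the fixed variable names listed in the loops, so no external input is evaluated.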
You can set a link to your publications which you require users to cite.
Set CITATION_LINK e.g. export CITATION_LINK="https://LINK_GOES_HERE" or in secrets/env.sh
(see secrets/env.sh.template for example).
The link is displayed in the lower-right corner of the main dashboard if any type of authentication is set.
Sketch Engine provides an API for accessing features programmatically. The same applies to NoSketch Engine. See the detailed documentation at https://www.sketchengine.eu/apidoc/ .
The base URL of the NoSketch Engine API for this container is as follows: http://****:10070/bonito/run.cgi
An example API call is shown below:
Request:
http://***:10070/bonito/run.cgi/wordlist?corpname=susanne&wlattr=word&wlpat=test.*&wlsort=frq&wlmaxitems=2&format=json
Response:
{"new_maxitems": 2, "wllimit": 0, "total": 4, "totalfrq": 52, "lastpage": 0, "Items": [{"str": "test", "frq": 26, "relfreq": 172.84246}, {"str": "tested", "frq": 13, "relfreq": 86.42123}], "wlattr_label": "word", "frtp": "frequency", "api_version": "open-5.71.15", "manatee_version": "2.36.7-open-2.225.8", "request": {"wlmaxitems": "2", "wlpat": "test.*", "format": "json", "corpname": "susanne", "wlsort": "frq", "wlattr": "word"}}
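A request like this can also be assembled in a script. The following is a minimal sketch: `wordlist_url` is a hypothetical helper, the base URL assumes a container running locally on the default port, and the pattern `test.*` is the one echoed in the `request` field of the response above. Values are inserted as-is without URL-encoding.

```shell
#!/bin/sh
# Hypothetical helper: builds a wordlist request URL like the example above.
# Values are inserted as-is (no URL-encoding), so keep them URL-safe.
wordlist_url() (
    base_url=$1   # e.g. http://localhost:10070/bonito/run.cgi
    corpname=$2
    wlpat=$3
    maxitems=$4
    printf '%s/wordlist?corpname=%s&wlattr=word&wlpat=%s&wlsort=frq&wlmaxitems=%s&format=json\n' \
        "$base_url" "$corpname" "$wlpat" "$maxitems"
)

url=$(wordlist_url "http://localhost:10070/bonito/run.cgi" susanne 'test.*' 2)
echo "$url"
# With the container running, the request could be sent with:
# curl -s "$url"
```

For patterns containing characters that are not URL-safe, letting curl encode the parameters (`curl -G --data-urlencode ...`) would be a more robust choice than string formatting.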
- Leipzig Corpora Collection
- Goethe University Frankfurt
- National Library of Latvia
- Dublin City University
- HUN-REN Hungarian Research Centre for Linguistics
The research was supported by the National Laboratory for Digital Heritage (Project no. 2022-2.1.1-NL-2022-00009). The project has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme.
The following files in this repository are from https://nlp.fi.muni.cz/trac/noske and have their own license:
- `noske_files/manatee-open-*.tar.gz` (GPLv2+)
- `noske_files/bonito-open-*.tar.gz` (GPLv2+)
- `noske_files/crystal-open-*.tar.gz` (GPLv3)
- `noske_files/gdex-*.tar.gz` (GPLv3)
- Susanne sample corpus: `data/corpora/susanne/vertical` and `data/registry/susanne`
The rest of the files are licensed under the GNU Lesser General Public License (LGPL) version 3 or any later version.
The authors acknowledge the support of the National Laboratory for Digital Heritage. Project no. 2022-2.1.1-NL-2022-00009 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme.