Paper
https://aidanhogan.com/docs/sparql-autocompletion.pdf
You will first need a Wikidata dump.

- Wikidata dump, latest truthy: https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz [~50GB] (the .nt.gz is larger, but faster for building the index.)
- For testing you can use a sample file included in the repository: SPARQLforHumans\Sample500.nt
Then some tools for compiling the source code:

- dotnet SDK: https://dotnet.microsoft.com/download/dotnet/thank-you/sdk-5.0.402-windows-x64-installer
- Git for Windows SDK: https://github.com/git-for-windows/build-extra/releases/latest
  - The Git for Windows SDK is needed since we require an updated version of gzip to sort the large output files.
- For the RDFExplorer client, we will also need node: https://nodejs.org/en/download/
  - If you are planning on running the benchmarks only, then node is not required.
On the Git SDK-64 console (Git for Windows SDK Console), install gzip via pacman:

$ pacman -S gzip

We'll now need to clone the repository and build the project.
On the Git SDK-64 console
$ git clone https://github.com/gabrieldelaparra/SPARQLforHumans.git
> Cloning into 'SPARQLforHumans'...
> [...]
$ cd SPARQLforHumans
$ dotnet build .
> [...]
> Build succeeded.
> 0 Warning(s)
> 0 Error(s)

Now we will run some tests to check that everything works.
$ cd SparqlForHumans.UnitTests/
$ dotnet test
> [...]
> Passed! - Failed: 0, Passed: 214, Skipped: 0, Total: 214, Duration: 9 s - SparqlForHumans.UnitTests.dll (netcoreapp3.1)

If any of the tests do not pass, you can create an issue and I will get in touch with you :)
Now we will run the Command Line Interface to filter and index our Wikidata dump.
$ cd ../SparqlForHumans.CLI/
$ dotnet run -- --version
> SparqlForHumans.CLI 1.0.0

For the following sections, a sample file Sample500.nt is provided in the root folder of the repository.
To build the complete index (production), latest-truthy.nt.gz should be used.
Note that filtering, sorting and indexing the latest-truthy.nt.gz will take ~40 hours, depending on your system.
Filters an input file:

- Keeps all subjects that start with http://www.wikidata.org/entity/
- Keeps all predicates that start with http://www.wikidata.org/prop/direct/
  - and the object starts with http://www.wikidata.org/entity/
- or the predicate is label, description or alt-label
  - and the object is a literal that ends with @en.

These can be changed in the SparqlForHumans.RDF\FilterReorderSort\TriplesFilterReorderSort.IsValidLine() method.
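As a rough illustration only (this is not the actual IsValidLine() implementation, and the predicate patterns below are simplifications), the criteria can be sketched in shell as:

```shell
# Rough sketch of the filter criteria; the real logic lives in
# TriplesFilterReorderSort.IsValidLine(). Predicate patterns are simplified.
is_valid_line() {
  case "$1" in
    '<http://www.wikidata.org/entity/'*)   # subject is a Wikidata entity
      case "$1" in
        # direct property with an entity object
        *'<http://www.wikidata.org/prop/direct/'*'<http://www.wikidata.org/entity/'*) return 0 ;;
        # label / description / alt-label with an @en literal object
        *'label>'*'@en .' | *'description>'*'@en .' | *'altLabel>'*'@en .') return 0 ;;
      esac ;;
  esac
  return 1
}

is_valid_line '<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .' \
  && echo keep || echo drop   # prints "keep"
```

A triple whose subject is outside http://www.wikidata.org/entity/, or whose label literal is not tagged @en, would print "drop" instead.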
To filter run:
$ dotnet run -- -i ../Sample500.nt -f

The command for sorting is given in the console after filtering.
It will add the .filterAll.gz suffix for the filtered output and .filterAll-Sorted.gz for the sorted output.
Filtering latest takes ~10 hours on my notebook computer (16GB RAM).
Sorting takes Sample500.filterAll.gz as input and outputs Sample500.filterAll-Sorted.gz.
The sorting command gives no notifications about its status.
Sorting latest takes ~5 hours and requires 3x the size of the filtered .gz file in free disk space (~40GB free for latest).
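To see what the full command below does, here is a miniature run of the same pipeline on three fake triples (the demo file names are throwaway examples):

```shell
# Miniature version of the sort step on fake data, showing the effect of
# LANG=C sort on a gzipped triples file. File names are throwaway examples.
printf '%s\n' '<b> <p> <o> .' '<a> <p> <o> .' '<a> <p> <n> .' | gzip > demo.filterAll.gz
gzip -dc demo.filterAll.gz | LANG=C sort | gzip > demo.filterAll-Sorted.gz
gzip -dc demo.filterAll-Sorted.gz   # triples now in byte order: subject, then predicate, then object
```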
$ gzip -dc Sample500.filterAll.gz | LANG=C sort -S 200M --parallel=4 -T tmp/ --compress-program=gzip | gzip > Sample500.filterAll-Sorted.gz

After filtering and sorting, we can now create our index. As a note, both "-e -p" can be given together for the sample file to generate both the Entities and Properties Index. For a large file, it is better to do them in 2 steps.
Entities Index will be created by default at %homepath%\SparqlForHumans\LuceneEntitiesIndex\

$ dotnet run -- -i Sample500.filterAll-Sorted.gz -e

Building the Entities Index takes ~30 hours to complete.

If -p was not used above, then we need to create the Properties Index.

$ dotnet run -- -i Sample500.filterAll-Sorted.gz -p

Properties Index will be created by default at %homepath%\SparqlForHumans\LucenePropertiesIndex\

Building the Properties Index takes ~2 hours to complete.

Now our index is ready.

- We can now run our backend via SparqlForHumans.Server/ using the RDFExplorer client.
- Or recreate the results from the paper via SparqlForHumans.Benchmark/.
The backend will listen to requests from a modified version of RDFExplorer. First we will need to get the server running:
$ cd ../SparqlForHumans.Server/
$ dotnet run

With the server running we can now start the client. We will need another console for this.
$ cd `path-for-the-client`
$ git clone https://github.com/gabrieldelaparra/RDFExplorer.git
$ cd RDFExplorer
$ npm install
$ npm start
Now browse to http://localhost:4200/
With the full index we can compare our results against the Wikidata Endpoint.
67 Properties ({Prop}) have been selected to run 4 types of queries (for a total of 268):

- ?var1 {Prop} ?var2 . ?var1 ?prop ?var3 .
- ?var1 {Prop} ?var2 . ?var3 ?prop ?var1 .
- ?var1 {Prop} ?var2 . ?var2 ?prop ?var3 .
- ?var1 {Prop} ?var2 . ?var3 ?prop ?var2 .
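For example, instantiating the first template with P31 (instance of) as {Prop} gives a query of roughly this shape, where both systems are asked to suggest bindings for ?prop (P31 here is only an illustration, not necessarily one of the 67 selected properties):

```sparql
SELECT DISTINCT ?prop WHERE {
  ?var1 <http://www.wikidata.org/prop/direct/P31> ?var2 .
  ?var1 ?prop ?var3 .
}
```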
268 queries are run against our Local Index and the Remote Endpoint.

- We will query for ?prop on both (Local and Remote) and compare the results.
- Running the benchmarks takes 2~3 hours, due to the 50 seconds timeout if the query cannot be completed on the Wikidata Endpoint.
- The details of the runs will be stored at benchmark.json.
- The time results will be summarized at results.txt.
- The time results, for each query, will be exported to a points.csv. Each row is a query. The Id of the query can be found in the benchmark.json file as HashCode.
- A qualitative comparison (precision, recall, f1), for each query, will be exported to metrics.csv. Each row is a query. This will only consider those queries that returned results on the Remote Wikidata Endpoint.
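Once the run below finishes, the CSVs can be post-processed with standard tools. A hypothetical awk sketch (the column layout used here is an assumption; check the actual header of your metrics.csv first):

```shell
# Hypothetical sketch of post-processing metrics.csv with awk.
# The column layout (id,precision,recall,f1) is an assumption; a small
# fake file stands in for the real benchmark output.
printf 'id,precision,recall,f1\nq1,1.0,0.6,0.75\nq2,0.5,0.8,0.62\n' > metrics-sample.csv
awk -F, 'NR > 1 { sum += $4; n++ } END { printf "mean f1 = %.3f\n", sum / n }' metrics-sample.csv
# prints "mean f1 = 0.685"
```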
$ cd ../SparqlForHumans.Benchmark/
$ dotnet run