Framework for extracting features from Wikipedia XML dumps.
This project has been tested with Python 3.5.0 and Python 3.8.5.
You need to install dependencies first, as usual.
```bash
pip install -r requirements.txt
```

First of all, download the Wikipedia dumps:
```bash
./download.sh
```

Then run the extractor:
```bash
python -m wikidump [PROGRAM_OPTIONS] FILE [FILE ...] OUTPUT_DIR [PROGRAM_OPTIONS] FUNCTION [FUNCTION_OPTIONS]
```

You can also run the program using the Makefile and GNU Make (edit the file in order to run the program with the desired parameters).
For example, you can run the program on the English dumps by typing:
```bash
make run-en
```

If you are interested in extracting the languages known by the Catalan Wikipedia's users, you can type:
```bash
python -m wikidump --output-compression gzip dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output extract-known-languages --only-pages-with-languages --only-revisions-with-languages --only-last-revision
```

To retrieve the wikibreaks and similar templates associated with users within their user page and user talk page, you can type:
```bash
python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_wikibreaks --output-compression gzip extract-wikibreaks --only-pages-with-wikibreaks
```

The examples above consider the Catalan Wikipedia.
To retrieve transcluded user warning templates and their associated parameters within user talk pages, you can run the following command:
```bash
python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_transcluded --output-compression gzip extract-user-warnings --only-pages-with-user-warnings
```

The example shown above illustrates the template extraction considering the Catalan Wikipedia.
This command aims to produce regular expressions that detect a substituted user warning template (one inserted with the subst function) within user talk pages.
Unfortunately, for the sake of simplicity, the subst-chain is not handled by this Python code.
To run the script, use the following command:
```bash
python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_regex --output-compression gzip extract-user-warnings-templates --esclude-template-repetition --set-interval '1 week'
```

The example above shows the regular expressions produced considering the Catalan Wikipedia.
The previous command ignores revisions in which the template has not changed. The script groups the changes by week: within each week, only the latest of the revisions made in those seven days is kept.
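The interval-based grouping can be sketched as follows (a simplified illustration, not the project's actual implementation; revisions are assumed to be `(timestamp, text)` pairs sorted by timestamp):

```python
from datetime import datetime, timedelta

def latest_per_interval(revisions, interval=timedelta(weeks=1)):
    """Keep only the latest revision in each time bucket.

    revisions: list of (timestamp, text) pairs, sorted by timestamp.
    Buckets are counted starting from the first revision's timestamp.
    """
    if not revisions:
        return []
    origin = revisions[0][0]
    buckets = {}
    for timestamp, text in revisions:
        # later revisions in the same bucket overwrite earlier ones
        buckets[(timestamp - origin) // interval] = (timestamp, text)
    return [buckets[key] for key in sorted(buckets)]
```

With three revisions on January 1st, 3rd and 10th, the first two fall in the same week, so only the revision of the 3rd and the one of the 10th are kept.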
Please note: the regular expressions have not been tested, since doing so would have been a long and demanding task; therefore, the correctness of the output cannot be fully guaranteed.
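As a rough illustration of the idea (hypothetical code, not the project's implementation): a regex for a substituted template can be derived from the template's wikitext by escaping the literal text and wildcarding the `{{{...}}}` parameter placeholders.

```python
import re

def template_to_regex(template_wikitext):
    """Hypothetical sketch: build a regex matching the substituted
    (expanded) form of a template, treating {{{...}}} parameter
    placeholders as wildcards."""
    pattern = re.escape(template_wikitext)
    # the escaped placeholders look like \{\{\{...\}\}\}; wildcard them
    pattern = re.sub(r"\\\{\\\{\\\{[^{}]*\\\}\\\}\\\}", r".*?", pattern)
    return re.compile(pattern, re.DOTALL)
```

For instance, a template reading `Hi {{{1}}}!` would yield a pattern matching `Hi Bob!` in a talk page.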
To find the most salient words, which best characterize the user warning templates, you can run the following command:
```bash
python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_tokens --output-compression gzip extract-user-warnings-templates-tokens --esclude-template-repetition --set-interval '1 week' --language catalan
```

The example above shows the extraction of the most salient words considering the Catalan Wikipedia.
The previous command ignores revisions in which the template has not changed. The script groups the changes by week: within each week, only the latest of the revisions made in those seven days is kept.
First of all, punctuation and symbols are removed from each template.
Secondly, the stopwords of the chosen language are removed. Subsequently, if the appropriate flag is set, every remaining word is stemmed.
Finally, the tf-idf value of each word is computed over all the selected revisions.
The corpus consists of the template texts of the revisions selected for that template.
At this point, we define N as the number of words making up a revision of the template and X as the number of documents in the corpus.
We consider 2*X documents per template: X further elements are taken at random from other templates of the same language, to prevent the idf value from being too small (0 in the worst case) for templates that change infrequently.
The K words with the highest tf-idf values are then selected for each revision, where K varies from revision to revision and is equal to N/2.
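The word-selection step can be sketched as follows (a simplified, pure-Python illustration with a hypothetical tiny stopword list; the actual code works per language and per revision interval):

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to"}  # hypothetical tiny list

def tokenize(text):
    # remove punctuation and symbols, lowercase, drop stopwords
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def salient_words(revision_text, corpus_texts):
    """Pick the N/2 words of the revision with the highest tf-idf,
    where the idf is computed over the given corpus of documents."""
    tokens = tokenize(revision_text)
    n = len(tokens)
    term_freq = Counter(tokens)
    docs = [set(tokenize(t)) for t in corpus_texts]
    scores = {}
    for word, count in term_freq.items():
        doc_freq = sum(1 for d in docs if word in d)
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1  # smoothed idf
        scores[word] = (count / n) * idf
    k = max(1, n // 2)
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

The smoothed idf variant is an assumption made for this sketch; it keeps the score non-zero even for words appearing in every document.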
To find substituted user warnings in a probabilistic way, with the possibility of false positives, you can run the following command:
```bash
python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_probabilistic --output-compression gzip extract-user-warnings-templates-probabilistic --only-pages-with-user-warnings --language catalan output_tokens/cawiki-20210201-pages-meta-history.xml.7z.features.json.gz --only-last-revision
```

The example above uses the words extracted by the extract-user-warnings-templates-tokens command, passing its output files as a parameter.
The objective is to find all the salient words of a template within the user talk page; if this succeeds, the template is marked as found and the salient words that were matched are printed.
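The matching criterion can be sketched as follows (a simplified, hypothetical illustration):

```python
def template_found(page_text, salient_words):
    """Probabilistic detection sketch: a template counts as found when
    every one of its salient words occurs in the page text. False
    positives are possible, since the words may appear for other reasons."""
    text = page_text.lower()
    matched = [w for w in salient_words if w.lower() in text]
    return len(matched) == len(salient_words), matched
```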
Firstly, you need to find the Wikidata item code of the template; for example, the code for the wikibreak template is Q5652064 (retrieved from the corresponding Wikidata page).
Secondly, you need to install the development dependencies:

```bash
pip install -r requirements.dev.txt
```

Finally, run the following Python command, giving it the template code:

```bash
python utils/get_template_names.py WIKIDATA-TEMPLATE-CODE
```

The documentation regarding the produced data and the refactored data is available in the data documentation.
To merge all the fragments into which the dump is divided, and to make the produced file more manageable, you can use the Python scripts in the utils/dataset_handler folder, running them in sequence.
As in the previous case, it is possible and recommended to use a Makefile;
after editing it, you can simply type:

```bash
make run
```

utils/dataset_handler also contains some scripts to upload some metrics to a Postgres database; to produce them, run the following command:
```bash
python utils/metrics_loader/..metrics.py DATASET_LOCATION DATABASE_NAME POSTGRES_USER POSTGRES_USER_PASSWORD POSTGRES_PORT
```

- DATASET_LOCATION refers to the path where the compressed JSON file is stored. Make sure you pass the correct dataset for the metrics you want to compute.
- DATABASE_NAME refers to the name of the Postgres database you want to use.
- POSTGRES_USER refers to the name of the Postgres user you want to use.
- POSTGRES_USER_PASSWORD refers to the password of the previously defined user.
- POSTGRES_PORT refers to the port of the Postgres process.
The produced metrics will be the following:
- Number of wikibreak templates registered in a given month, and the cumulative amount up to that point
- Number of user warnings templates registered in a given month, and the cumulative amount up to that point
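The monthly and cumulative amounts can be computed with logic along these lines (illustrative sketch only; the real scripts read the compressed JSON datasets and write to Postgres):

```python
from collections import Counter

def monthly_metrics(events):
    """events: iterable of (year, month) pairs, one per detected template.
    Returns rows of (year, month, amount, cumulative_amount), sorted by date."""
    counts = Counter(events)
    rows, cumulative = [], 0
    for year, month in sorted(counts):
        amount = counts[(year, month)]
        cumulative += amount
        rows.append((year, month, amount, cumulative))
    return rows
```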
They will have the following schema:
| id | name | year | month | category | uw_category | wikibreak_category1 | wikibreak_category2 | wikibreak_subcategory | amount | cumulative_amount |
|---|---|---|---|---|---|---|---|---|---|---|
| PK SERIAL | TEXT | INT | INT | TEXT | TEXT | TEXT | TEXT | TEXT | INT | INT |
To run all the scripts on the whole Wikipedia dump, you can use the following script:

```bash
./run.sh
```

First of all, be sure you have modified all the readonly variables to fit your needs; feel free to change whatever you want.
The dependencies of the previously defined script are
To run the entire program in a Docker container, a Dockerfile has been provided.
First, you need to change the content of the run.sh file to fit your requirements, such as the files' locations and which operations the script should carry out.
Additionally, make sure you have given the correct reference if you want to install the dump directly within the Docker image using wikidump-download-tools.
Then, you can build the Docker image by typing:
```bash
docker build -t wikidump .
```

Lastly, run the Docker image:
```bash
docker run wikidump
```

This library was created by Alessio Bogon and then expanded by Cristian Consonni.
The project presented here is implemented on top of that pre-existing structure.