Variant Annotation
CellBase uses its integrated data to provide a rich, high-performance variant annotator. The variant annotation tool is part of the CellBase code and can be accessed in two different ways:
- Using the RESTful web services: both GET and POST annotation web services are available (see http://bioinfodev.hpc.cam.ac.uk/cellbase/webservices/). Results are returned as JSON objects; support for returning VEP-formatted output is being implemented.
- Using the command line interface (CLI): the CLI fetches annotation data remotely (by default from the web services available at the University of Cambridge) and therefore requires neither local database downloads nor installation. Avoiding a local installation of the knowledge base spares users from moving and storing hundreds of gigabytes (about 500 GB in the current release, v4), and means they always access up-to-date data without having to download and re-install knowledge base updates. Moreover, despite accessing the knowledge base remotely, the CellBase client provides a lightweight, efficient, multi-threaded implementation which outperforms a VEP installation with locally cached databases (Figure TIMECOMPARISON).
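
As a rough illustration of the GET route, the sketch below only builds the request URL for a short list of variants; it does not contact the server. The host is taken from the model link on this page, but the `genomic/variant/{variants}/annotation` path and the `chromosome:position:reference:alternate` variant syntax are assumptions that should be checked against the web services documentation. `build_annotation_url` is a hypothetical helper, not part of CellBase.

```python
from urllib.parse import quote

# Assumption: host taken from the model URL quoted on this page.
BASE_URL = "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest"


def build_annotation_url(variants, version="v4", species="hsapiens", base_url=BASE_URL):
    """Build a GET URL for annotating a short list of variants.

    Each variant is a (chromosome, position, reference, alternate) tuple.
    The endpoint path used here is an assumed layout, not a confirmed one.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    ids = ",".join(f"{chrom}:{pos}:{ref}:{alt}" for chrom, pos, ref, alt in variants)
    # Leave ':' and ',' readable; percent-encode anything else.
    encoded = quote(ids, safe=":,")
    return f"{base_url}/{version}/{species}/genomic/variant/{encoded}/annotation"


url = build_annotation_url([("19", 45411941, "T", "C"), ("1", 69511, "A", "G")])
print(url)
```

The resulting URL could then be passed to any HTTP client; long variant lists are better suited to the POST service.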
The typical input for the CellBase variant annotator is a VCF file, although the CLI can also take a short list of variants as an argument for fast annotation. The annotator can currently generate two output formats: a .json file with a list of VariantAnnotation objects (see the Variant and VariantAnnotation models at http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/feature/variant/model), or a tab-separated values file in the VEP output format.
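
To make the relationship between the two input styles concrete, here is a minimal sketch that turns VCF records into a short list of variant identifiers, one per alternate allele. The `chrom:pos:ref:alt` identifier syntax is an assumption about what the annotator accepts, and `vcf_to_variant_ids` is a hypothetical helper written for this example.

```python
def vcf_to_variant_ids(lines):
    """Yield 'chrom:pos:ref:alt' strings from VCF text lines.

    Header lines (starting with '#') are skipped and multi-allelic
    records produce one identifier per alternate allele.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"malformed VCF record: {line!r}")
        chrom, pos, _id, ref, alt = fields[:5]
        for allele in alt.split(","):
            if allele == ".":  # no alternate allele, nothing to annotate
                continue
            yield f"{chrom}:{pos}:{ref}:{allele}"


# Toy records, made up for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t69511\t.\tA\tG\t.\tPASS\t.",
    "2\t1000\trs1\tC\tT,G\t.\tPASS\t.",
]
variant_ids = list(vcf_to_variant_ids(example_vcf))
print(variant_ids)
```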
The data provided by the variant annotator result from integrating most of the annotations available in the CellBase knowledge base:
- ENSEMBL's core transcript annotation, such as location, id, strand, biotype, etc.
- Protein annotation provided by UniProt, InterPro, SIFT and PolyPhen.
- Population frequencies provided by the European Variation Archive for The 1000 Genomes Project Phase 3, The Exome Server Project (EVS), The Exome Aggregation Consortium v3 (ExAC) and The Genomes of the Netherlands (GoNL).
- Sequence conservation from PhastCons and PhyloP.
- Gene expression values from The Genome Expression Atlas and The Genotype-Tissue Expression project (GTEx).
- Gene drug interaction data from The Drug Gene Interaction Database (DGIdb) and the Human Phenotype Ontology database (HPO).
- Clinical variant annotation from ClinVar, The Genome-Wide Association Studies catalog (GWAS) and The Catalogue of Somatic Mutations in Cancer (COSMIC).

Sequence effect prediction is also calculated on the fly and described by Sequence Ontology (SO) terms.

| Annotation | Homo sapiens GRCh37 | Homo sapiens GRCh38 | Others |
|---|---|---|---|
| Consequence Types <sup>1</sup> | ✔️ | ✔️ | ✔️ |
| Conservation Scores <sup>2</sup> | ✔️ | ✖️ | ✖️ |
| Protein Subs. Scores <sup>3</sup> | ✔️ | ✔️ | ✖️ |

<sup>1</sup> More info at http://www.ensembl.org/info/genome/variation/predicted_data.html

<sup>2</sup> PhastCons, PhyloP and GERP

An exhaustive comparison of sequence effect predictions was made against VEP (78) results for the whole 1000 Genomes Phase 1 variant set (XX million variants, XXX million effect predictions), yielding 99.999% concordance with Ensembl VEP consequence types (see https://github.com/opencb/cellbase/wiki/Variant-Annotation for a detailed report on known differences).
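
One plausible way to compute such a concordance figure is sketched below. It assumes that each tool's output has been reduced to a mapping from a (variant, transcript) pair to a set of SO terms, and that two predictions agree only when the sets are identical; the page does not state how the comparison was actually scored, so both the definition and the `concordance` helper are assumptions. The transcript names and predictions are toy values.

```python
def concordance(predictions_a, predictions_b):
    """Percentage of shared (variant, transcript) keys with identical SO term sets."""
    shared = predictions_a.keys() & predictions_b.keys()
    if not shared:
        raise ValueError("no predictions in common")
    agree = sum(
        1 for key in shared if set(predictions_a[key]) == set(predictions_b[key])
    )
    return 100.0 * agree / len(shared)


# Toy predictions, made up for illustration only.
tool_a = {
    ("1:69511:A:G", "TRANSCRIPT_1"): {"missense_variant"},
    ("2:1000:C:T", "TRANSCRIPT_2"): {"intron_variant"},
    ("2:1000:C:G", "TRANSCRIPT_2"): {"intron_variant"},
    ("3:500:G:A", "TRANSCRIPT_3"): {"synonymous_variant"},
}
tool_b = {
    ("1:69511:A:G", "TRANSCRIPT_1"): {"missense_variant"},
    ("2:1000:C:T", "TRANSCRIPT_2"): {"intron_variant"},
    ("2:1000:C:G", "TRANSCRIPT_2"): {"intron_variant", "splice_region_variant"},
    ("3:500:G:A", "TRANSCRIPT_3"): {"synonymous_variant"},
}
percent = concordance(tool_a, tool_b)
print(f"{percent:.1f}% concordance")
```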
CellBase-provided annotation can be complemented with custom annotations supplied by the user. These custom annotations must be provided in a VCF file and are read from the INFO column.
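
The sketch below shows how custom annotations might be pulled from the INFO column of a VCF record, following the standard VCF convention of semicolon-separated `KEY=value` entries and bare flags. It illustrates the file format only and is not CellBase's own reader; `parse_info`, `custom_annotation` and the INFO keys in the example are hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict.

    'KEY=value' entries map to strings (or lists for comma-separated
    values); bare flags map to True.
    """
    result = {}
    if info in ("", "."):
        return result
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        if not sep:
            result[key] = True  # flag without a value
        elif "," in value:
            result[key] = value.split(",")
        else:
            result[key] = value
    return result


def custom_annotation(record_line, wanted_keys):
    """Return the requested INFO fields of one tab-separated VCF record."""
    fields = record_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("VCF record has no INFO column")
    info = parse_info(fields[7])
    return {key: info[key] for key in wanted_keys if key in info}


# Toy record with made-up INFO keys.
record = "1\t69511\t.\tA\tG\t.\tPASS\tMY_SCORE=0.93;IN_PANEL;SOURCES=labA,labB"
annotation = custom_annotation(record, ["MY_SCORE", "IN_PANEL", "MISSING"])
print(annotation)
```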