Feature Curation
Introduction
Feature curation refers to the genome-wide annotation of structural features. This covers everything from protein-coding regions and exon/intron boundaries to untranslated regions and non-coding RNAs. These features are typically identified by model-based algorithms, backed up with experimental data (e.g. RNA-seq, EST libraries). Repeat annotation and masking are necessary at this stage.
Model-based algorithms typically use a hidden Markov model (HMM) to capture known gene features from related organisms. This ‘trained’ model is then used to predict features in the new genome. However, the sensitivity and specificity of these algorithms can be limited by the availability of well-annotated genomes from related species. It is therefore crucial to also obtain transcriptomic data from a representative set of tissues, time points and/or developmental stages.
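To make the HMM idea concrete, the following is a toy sketch of Viterbi decoding with two hidden states, "N" (non-coding) and "C" (coding). All names and probabilities here are invented for illustration: the emission and transition values are assumptions rather than trained parameters, and real gene finders use far richer models (codon structure, splice-site states, variable-length exons).

```python
import math

# Hypothetical toy parameters: "C" (coding) favours G/C, "N" (non-coding) favours A/T.
STATES = ["N", "C"]
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
    "C": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45},
}


def viterbi(sequence, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Return the most probable hidden-state path as a string of state labels."""
    if not sequence:
        return ""
    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][sequence[0]]) for s in states}]
    back = [{}]  # back[i][s]: predecessor of state s at position i on that best path
    for symbol in sequence[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][symbol])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAAGGGGGG"))  # prints NNNNNNCCCCCC
```

In a real annotation run the parameters would be estimated from annotated genes in a related species, which is why the quality of those reference annotations limits prediction accuracy.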
Most feature curation efforts utilise multiple sources of evidence and combine the resulting predictions into a consensus prediction. This is followed by global quality control before any manual curation begins. The quality control includes sanity checks such as flagging ambiguous predictions, confirming that coding sequences run from a start codon to a stop codon, verifying splicing rules, checking for missing universal orthologs and ensuring agreement with RNA-seq data. The results can form the basis of a priority list for subsequent manual curation efforts.
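Two of these sanity checks are simple enough to sketch in a few lines. The example below is a minimal illustration, not a standard tool: the function names `check_cds` and `check_intron` are hypothetical, the standard genetic code is assumed, and only the canonical GT...AG intron boundaries are accepted. Genuine introns occasionally use other boundaries (e.g. GC...AG), so a flag here is a prompt for manual review rather than proof of an error.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def check_cds(cds):
    """Return a list of problems found in a spliced coding sequence (start to stop)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("missing ATG start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame internal stop codon")
    return problems


def check_intron(intron):
    """Flag introns that lack the canonical GT...AG splice sites."""
    intron = intron.upper()
    if intron.startswith("GT") and intron.endswith("AG"):
        return []
    return ["non-canonical splice sites"]


print(check_cds("ATGAAATAA"))      # prints []
print(check_cds("ATGTAAAAATAA"))   # prints ['in-frame internal stop codon']
print(check_intron("GCAAGTCCAG"))  # prints ['non-canonical splice sites']
```

Gene models can then be ranked by the number of problems reported, giving a simple first pass at a priority list for manual curation.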