EMPOP mtDNA database

EDNAP forensic mtDNA population database (EMPOP)

The Innsbruck Institute of Legal Medicine took responsibility for setting up the EDNAP forensic mtDNA population database (EMPOP), a new concept that addresses the quality criteria required for data generation, analysis, transfer and quality control.

Mitochondrial DNA (mtDNA) is a multi-copy target molecule under strict uniparental (maternal) inheritance, characteristics that make it an informative marker for forensic, population and medical genetic investigations. Research on mtDNA at the Institute of Legal Medicine Innsbruck goes back to the mid-nineties, when basic protocols and population studies in the human field [Parson 1998] and species-specific identification methods [Parson 2000] were introduced.

Innsbruck EMPOP Team 2011

Towards forensic standards

At the 1999 ISFH (now: ISFG) conference in San Francisco, USA, the European DNA Profiling (EDNAP) group first expressed its commitment to establish an mtDNA database that should meet high forensic standards in DNA sequencing.

High quality standards are required for forensic applications, where a sequencing result has to be legally defensible. While a false match is generally considered less probable, a sequencing error could easily lead to a wrong exclusion of a suspect. In the typical casework scenario, the triangle stain-victim-suspect serves as the final control for correct data interpretation.

96 well PCR plate (Photo: H. Niederstätter)

MtDNA databases meeting high standards are necessary for calculating the probability of a chance match by determining the frequency of the given haplotype in a (sub)population dataset. Thereby, a certain weight is given to a match or exclusion. Erroneous databases will lower the bounds for a frequency estimate of a given haplotype.
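As a rough illustration of the frequency estimate described above, the following Python sketch counts how often a queried haplotype occurs in a toy dataset and shows how a single clerical error changes the result. The function names, the toy haplotypes and the choice of a simple counting estimate with a one-sided 95% upper bound for unobserved haplotypes are assumptions made for this example; they are not taken from EMPOP.

```python
from collections import Counter


def haplotype_frequency(database, query):
    """Counting estimate: share of database entries identical to the query."""
    counts = Counter(database)
    return counts[query] / len(database)


def upper_bound_unobserved(n, alpha=0.05):
    """One-sided upper confidence limit for a haplotype seen 0 times in n samples."""
    return 1.0 - alpha ** (1.0 / n)


# Toy database (illustrative only): each haplotype is the set of
# differences to the reference sequence.
database = [
    frozenset({"16519C", "263G"}),
    frozenset({"16519C", "263G"}),
    frozenset({"16311C", "263G"}),
    frozenset({"73G", "263G"}),
]
query = frozenset({"16519C", "263G"})

freq = haplotype_frequency(database, query)

# One clerical error (263G recorded as 263C) hides a true match.
corrupted = [frozenset({"16519C", "263C"})] + database[1:]
freq_corrupted = haplotype_frequency(corrupted, query)

print(freq, freq_corrupted)
print(round(upper_bound_unobserved(100), 4))
```

In this toy case a single clerical error halves the estimated frequency of the queried haplotype, from 0.5 to 0.25.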

In common mtDNA applications such as phylogeography and population genetics, large numbers of samples have to be analysed, which increases complexity and the chance of introducing errors. In addition, there is no “external control”, as samples are analysed anonymously without a possibility to be double-checked or correlated to additional samples.

The EDNAP group chose to address these topics. The question to be answered was whether analyses performed in different laboratories using different strategies and technical equipment would yield the same results. In order to find a standard procedure as a common ground for harmonised mtDNA analysis and sequence data interpretation, a collaborative exercise was performed according to existing principles of forensic quality control. A set of samples was sent to participating laboratories. The results were compared and used to identify potential sources of error in mtDNA analysis [Parson 2004]. Four major classes of mtDNA sequencing errors were identified in the collaborative exercise [Parson 2004, Bandelt 2004]:

clerical errors – mistakes in manual data transfer causing wrong sequence results (while raw data are correct)
sample mix-up or artificial recombination – wrong assembly of sequence data for samples analysed in separate reactions (while raw data are correct)
contamination
nomenclature inconsistencies – a sequence string can be aligned in varying ways to a reference sequence, leading to differing motifs while the nucleotide string itself is the same. If more than one alignment approach is applied to a dataset, matches can be missed and haplotypes may appear rarer than they are. The phylogenetic approach [Parson 2007a, Bandelt 2008] respects the origin of a sequence and its signature mutations. It points out the similarity of closely related samples, as subsequently acquired (younger) mutations will not affect the alignment motif of the common (older) background. This approach is favoured by EDNAP. Difficulties in nomenclature also appear when length or point heteroplasmies in a sample are detected and called to varying extents.
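The nomenclature problem in the last item can be made concrete with a small Python sketch. It rebuilds a sequence from a reference and a list of differences, and shows that two different placements of an insertion in a homopolymer stretch yield the same nucleotide string. The toy reference, the tuple notation for differences and the helper name apply_motif are assumptions made for this example and do not reflect any EMPOP file format.

```python
def apply_motif(reference, motif):
    """Rebuild a sequence from a reference and a list of differences.

    Each difference is (position, kind, base): kind "sub" replaces the base
    at the 1-based position, kind "ins" inserts the base directly after it.
    """
    bases = {i + 1: [b] for i, b in enumerate(reference)}
    for position, kind, base in motif:
        if kind == "sub":
            bases[position][0] = base
        elif kind == "ins":
            bases[position].append(base)
        else:
            raise ValueError(f"unknown difference type: {kind}")
    return "".join("".join(bases[p]) for p in sorted(bases))


# Toy reference with a run of four C's at positions 4-7 (illustrative only).
reference = "GATCCCCA"

# Two analysts describe the same extra C with different alignments.
motif_a = [(7, "ins", "C")]   # insertion placed at the end of the C run
motif_b = [(3, "ins", "C")]   # insertion placed in front of the C run

string_a = apply_motif(reference, motif_a)
string_b = apply_motif(reference, motif_b)

motifs_match = set(motif_a) == set(motif_b)
strings_match = string_a == string_b

print(string_a, string_b)
print(motifs_match, strings_match)
```

Comparing the two motifs reports a mismatch, whereas comparing the reconstructed strings reports a match, which is why mixing alignment conventions in one dataset can make a haplotype look rarer than it is.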

Assuring high quality data for a mtDNA database

As a result of the collaborative exercise, quality control mechanisms in both sequence data generation and interpretation were set up as requirements. They include avoidance of manual data transcription, double evaluation and final inspection of sequence, usage of standard mtDNA nomenclature, raw data storage, consideration of the phylogeographic background of a sample (for haplogroup assignment and to avoid artificial recombination), external control through collaborative exercises, and finally storage of the original sample for a second analysis – a very basic forensic principle. The EMPOP protocol of mtDNA control region sequence data generation is now well established in the scientific community [Brandstätter 2007]. New protocols for more reliable mtDNA typing of population [Parson 2007a] and casework samples [Niederstätter 2007, Eichmann 2008] have been established.

The EMPOP mtDNA database

On October 16, 2006, release I of the EMPOP database went online [Parson 2007b]. It is an IT-based, open platform for comparison and storage of mtDNA sequence data and comprises 5173 mtDNA sequences from worldwide populations contributed by laboratories that had successfully participated in collaborative exercises. Of these, 4527 sequences are forensic data (high quality sequences) and 646 are validated sequence data from publications for which raw data are not available. Literature-derived sequences have been carefully inspected with several methods of phylogenetic evaluation. The majority of the 5173 sequences derive from Western Eurasian populations, with smaller datasets from East Asian, South East Asian and sub-Saharan African (meta)populations. Ongoing sampling and analysis are continuously increasing the number of samples and worldwide regions covered. Three PhD students at the Innsbruck Institute of Legal Medicine work in this field, investigating populations that are currently underrepresented, and not only in the EMPOP database.

EMPOP tools for quality control of mtDNA data

In collaboration with the Institute of Mathematics, further tools were developed for the quality control of mtDNA data [Parson 2007b]. In the EMPOP database, sequences are subject to a posteriori analysis. A quasi median network analysis of the dataset shows errors that are not visible in the data table, especially to the untrained eye. The visualisation of mtDNA data is performed by the NETWORK software package that is freely available on the EMPOP homepage. It is a valuable means to pinpoint errors in sequencing, data transcription and/or interpretation and leads to a better understanding of homoplasy and potential artefacts in the table.

The EMPOP homepage gives hands-on advice on how to use the database functions (currently sequence search and network analysis). Several search and analysis options can be chosen throughout the process. In the query function, the range, data type and (sub)population to be queried can be selected. A sequence dataset can be uploaded to the EMPOP platform as a motif list in .emp-format for a posteriori analysis. The input file is checked for correct format and plausibility (this step points out errors such as an rCRS base given as difference, a nucleotide position number exceeding 16569, a second insertion without a first insertion, mutations mentioned twice, contradictory mutations, non-IUPAC nucleotide codes…). In network analysis, the most valuable option is the choice of the filter for calculation of the quasi median network. Applying a filter to a dataset highlights special positions that should be inspected. Three types of filters can be applied:

unfiltered – all differences in the dataset are displayed in the network – recommended for short sequence strings
EMPOPspeedy – this filter disregards highly recurrent mutations in the EMPOP dataset of 3830 West Eurasian sequences and is recommended for datasets of 50 – 300 samples
EMPOPall – this filter disregards all mutations in the database. Only unobserved mutations are displayed. This option is recommended for datasets exceeding 300 sequences.
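The plausibility checks on the input file described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the token notation ("263G" for a substitution, "315.1C" for an insertion), the function name check_motif and the small reference lookup are hypothetical and chosen for illustration only. The .emp format itself is not reproduced here, and deletions are not modelled.

```python
import re

MT_GENOME_LENGTH = 16569                 # position limit named in the text
IUPAC_CODES = set("ACGTRYSWKMBDHVN")     # IUPAC nucleotide codes
TOKEN = re.compile(r"^(\d+)(?:\.(\d+))?([A-Za-z])$")


def check_motif(motif, reference_bases):
    """Return a list of plausibility problems for one sample's motif.

    motif: difference tokens such as "263G" or "315.1C" (hypothetical notation).
    reference_bases: dict mapping position -> reference base (toy lookup).
    """
    problems = []
    seen = {}  # (position, insertion index) -> base
    for token in motif:
        m = TOKEN.match(token)
        if not m:
            problems.append(f"{token}: unreadable token")
            continue
        position = int(m.group(1))
        ins = int(m.group(2) or 0)       # 0 means "not an insertion"
        base = m.group(3).upper()
        if base not in IUPAC_CODES:
            problems.append(f"{token}: non-IUPAC nucleotide code")
        if position > MT_GENOME_LENGTH:
            problems.append(f"{token}: position exceeds {MT_GENOME_LENGTH}")
        if ins == 0 and reference_bases.get(position) == base:
            problems.append(f"{token}: reference base given as difference")
        key = (position, ins)
        if key in seen:
            kind = "mentioned twice" if seen[key] == base else "contradicts earlier entry"
            problems.append(f"{token}: {kind}")
        seen[key] = base
    for position, ins in seen:
        if ins > 1 and (position, ins - 1) not in seen:
            problems.append(f"{position}.{ins}: insertion without {position}.{ins - 1}")
    return problems


# Toy reference lookup and a deliberately faulty motif (illustrative only).
reference_bases = {73: "A", 263: "A"}
faulty = ["263G", "73A", "315.2C", "16600T", "263G", "152Z"]
problems = check_motif(faulty, reference_bases)
for p in problems:
    print(p)
```

Each message corresponds to one of the error types listed in the text; a clean motif such as ["263G", "315.1C", "315.2C"] returns an empty list.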

The output files give a summary of the input data and chosen options, of the results and of relevant information on the quasi median network analysis. The more intuitive graphic, which can be displayed with free software, shows the network of samples (and quasi medians) stemming from a root sequence. The network can be displayed in two ways: the full network depicts all mutations in the dataset that have not been filtered, while in the torso network dependent subtrees are collapsed into their stem haplotype. Mutations are indicated in the network. Transversions in particular should be double-checked, as they are expected to be rare events. Reticulations in the network need to be inspected, as they are often the result of idiosyncrasies (potential errors) rather than unfiltered homoplasies in the dataset. All reticulations will be displayed in the torso as well.
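The advice to double-check transversions can be illustrated with a short Python sketch that classifies a substitution as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion (purine to pyrimidine or vice versa). The function names and the toy list of observed substitutions are assumptions made for this example and are not part of the NETWORK output.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_type(ref_base, new_base):
    """Classify a point substitution as 'transition' or 'transversion'."""
    pair = {ref_base.upper(), new_base.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases out of A, C, G, T")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def flag_transversions(substitutions):
    """Return the (position, ref, new) entries that are transversions."""
    return [s for s in substitutions
            if substitution_type(s[1], s[2]) == "transversion"]


# Toy list of substitutions (illustrative only): position, reference base, new base.
observed = [(73, "A", "G"), (16519, "T", "C"), (16111, "C", "A")]
to_recheck = flag_transversions(observed)
print(to_recheck)
```

Only the C to A change is flagged here; the two transitions pass without comment.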

The NETWORK program provided on the EMPOP homepage pinpoints “unusual mutations” in a dataset, which could be sequencing artefacts, transcription errors, notation errors etc., and therefore constitutes a valuable tool for the scientific community compiling validated mtDNA sequences in this database. Besides network analysis, a tool for phylogenetic analysis of datasets is in progress and will be presented elsewhere.

The board of the International Society for Forensic Genetics (ISFG) and the editor of Forensic Science International: Genetics invited EMPOP to logistically organize and perform quality control (QC) of mtDNA sequences in the course of manuscript preparations for the journal Forensic Science International: Genetics.

EMPOP release II is planned to be launched in 2009, offering a database of >10000 mtDNA sequence entries and a new tool: a string based search function as presented [Röck 2009].

International EMPOP Meetings

Four biennial international EMPOP meetings have been organized. They had great impact on the mtDNA scientific community and led to valuable discussions.

1st EMPOP Meeting in conjunction with “Haploid DNA Markers in Forensic Genetics”, Berlin, Germany, November 18-20, 2004.
2nd EMPOP Meeting in conjunction with “DNA in Forensics 2006”, Innsbruck, Austria, September 28-30, 2006.
3rd EMPOP Meeting in conjunction with “DNA in Forensics 2008”, Ancona, Italy, May 27-30, 2008.
4th EMPOP Meeting in conjunction with “Haploid DNA Markers in Forensic Genetics”, Berlin, Germany, April 22-24, 2010.

Literature cited

Parson W 1998 Int J Legal Med 111(3):124
Parson W 2000 Int J Legal Med 114(1-2):23
Bandelt HJ 2004 Rechtsmedizin 14:251
Parson W 2004 Forensic Sci Int 139(2-3):215
Brandstätter A 2007 Forensic Sci Int 166(2-3):164
Niederstätter H 2007 Forensic Sci Int.:Genetics 1:29
Parson W 2007a Forensic Sci Int.:Genetics 1(1):13
Parson W 2007b Forensic Sci Int.:Genetics 1:88
Bandelt HJ 2008 Int J Legal Med 122(1):11
Eichmann C 2008 Int J Legal Med 122:385
Röck A 2009 PLoS ONE, submitted

Collaborators

EMPOP is a collaborative project with many forensic partner laboratories worldwide. The Institute of Mathematics, University of Innsbruck, collaborates closely with EMPOP. The main collaborator is the Armed Forces DNA Identification Laboratory, Armed Forces Institute of Pathology, Rockville, MD 20850, USA. The following scientific groups work together with EMPOP: EDNAP; Ge.F.I.; GEP-ISFG.

Funding

The EMPOP project is funded by FWF Translational Research project L397.