Rapid assignment of bacterial pathogens into predefined populations is an important first step for epidemiological tracking. For clonal species, a single allele can theoretically define a population. For non-clonal species such as Burkholderia pseudomallei, however, shared allelic states between distantly related isolates make it more difficult to identify population defining characteristics. Two distinct B. pseudomallei populations have been previously identified using multilocus sequence typing (MLST). These populations correlate with the major foci of endemicity (Australia and Southeast Asia). Here, we use multiple Bayesian approaches to evaluate the compositional robustness of these populations, and provide assignment results for MLST sequence types (STs). Our goal was to provide a reference for assigning STs to an established population without the need for further computational analyses. We also provide allele frequency results for each population to enable estimation of population assignment even when novel STs are discovered. The ability for humans and potentially contaminated goods to move rapidly across the globe complicates the task of identifying the source of an infection or outbreak. Population genetic dynamics of B. pseudomallei are particularly complicated relative to other bacterial pathogens, but the work here provides the ability for broad scale population assignment. As there is currently no independent empirical measure of successful population assignment, we provide comprehensive analytical details of our comparisons to enable the reader to evaluate the robustness of population designations and assignments as they pertain to individual research questions. Finer scale subdivision and verification of current population compositions will likely be possible with genotyping data that more comprehensively samples the genome. The approach used here may be valuable for other non-clonal pathogens that lack simple group-defining genetic characteristics and provides a rapid reference for epidemiologists wishing to track the origin of infection without the need to compile population data and learn population assignment algorithms.
Burkholderia pseudomallei is a soil-dwelling bacterium that can infect a large range of hosts. In humans, B. pseudomallei causes melioidosis, and typical routes of entry include open wounds, inhalation, or ingestion. Clinical features are diverse, although pneumonia and abscess formation are common. High rates of recombination within the genome of this bacterium have confounded attempts to match clinical samples to geographically defined populations. Here we provide a reference that simplifies source attribution issues. We applied population assignment software to previously generated sequence data from seven B. pseudomallei genes to define the major geographic populations within this species. We evaluated the robustness of our results by comparison with two additional population assignment programs. We present the likelihood that each variant is assigned to a particular geographic population. This information can be used to assign novel B. pseudomallei isolates to a geographic population without needing to learn and run cumbersome population assignment applications. This method can also be used for other bacteria that are difficult to source-attribute due to high levels of genomic variation and recombination.
Citation: Dale J, Price EP, Hornstra H, Busch JD, Mayo M, et al. (2011) Epidemiological Tracking and Population Assignment of the Non-Clonal Bacterium, Burkholderia pseudomallei. PLoS Negl Trop Dis 5(12): e1381. doi:10.1371/journal.pntd.0001381
Editor: Gary Simon, George Washington University, United States of America
Received: May 30, 2011; Accepted: September 16, 2011; Published: December 13, 2011
Copyright: © 2011 Dale et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the U.S. Department of Homeland Security S&T CB Division Bioforensics R&D Program. VW, SJP, BGS and DG are funded by The Wellcome Trust. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Burkholderia pseudomallei, the etiologic agent of melioidosis, is commonly isolated from soil and water in many tropical regions of the world. Endemic foci of B. pseudomallei predominantly include Southeast Asia (particularly Thailand) and northern Australia, although this organism is found sporadically in other equatorial regions such as South and Central America, Africa, and the Indian subcontinent . Since infections are most commonly acquired from the environment, genetic differentiation is expected to occur, leading to geographic substructure within the bacterial population. Previous studies have demonstrated that B. pseudomallei populations from the melioidosis-endemic regions in Southeast Asia and Australia are not only geographically distinct but exhibit differences in clinical presentation and genetic features , , . For example, differences in clinical manifestations include parotid abscesses, which are much more prevalent in Thailand (15%) than Australia (4%). In contrast, genitourinary infections and brainstem encephalitis are more commonly seen in Australia than Thailand (15% vs. 2% and 2% vs. <0.2%, respectively) , . Differences in mortality rates also differ greatly between the two endemic regions, with mortality rates of approximately 50% in Thailand compared with <20% in Australia . The difference in mortality rates could reflect differences in virulence but is probably more likely to be due to differences in intensive care provisions between the two regions . Despite these marked differences, none are fully diagnostic for isolates from specific geographical regions.
Multilocus sequence typing (MLST)  is a bacterial genotyping method that involves the comparison of ~450 bp-long nucleotide sequences from seven housekeeping genes. An MLST scheme has been developed for B. pseudomallei  and 699 sequence types (STs) from isolates and multiple species (as of November 6th, 2010) populate the public database (http://bpseudomallei.mlst.net/). These data have shed light on the population structure of this species. It has been previously observed that B. pseudomallei STs from Australia and Southeast Asia are mutually exclusive as phylogenetic analyses show geographically correlated clusters of STs, although these analyses failed to group all samples from either region together  . Due to relatively low levels of sequence diversity and high levels of lateral gene transfer among B. pseudomallei isolates , , sequence data from only seven genes are insufficient for robust phylogenetic discrimination , . Pearson et al. therefore used a population genetics approach to determine that B. pseudomallei STs form two distinct populations, conforming to the geographic regions of Southeast Asia and Australia . Despite the phylogenetic limitations of MLST data, this large public database shows potential for population assignment using population genetic analyses.
We further evaluate and update the previous population assignments  by comparing these results with commonly used assignment algorithms. The program Structure  is a Bayesian-based clustering algorithm that has been used to infer population structure within genetically diverse bacteria such as Helicobacter pylori . Comparison of Structure with other population assignment software allowed us to assess the robustness of our population assignments. The B. pseudomallei population assignment results that we provide, along with a probability estimation of each assignment, can be used as a practical and immediate reference for melioidosis researchers interested in identifying geographic origins of B. pseudomallei STs and may serve as a model for other weakly clonal species.
The data used to define populations and evaluate the robustness of population assignments were downloaded from the B. pseudomallei MLST database (http://bpseudomallei.mlst.net/) on January 15th, 2009. The database consisted of 641 B. pseudomallei STs from 1802 isolates collected over 89 years from 35 countries. Approximately 44% of these isolates were collected in Southeast Asia and 53% in Australia and Papua New Guinea. The data were downloaded again on November 9th, 2010, in order to provide more updated population assignments and population allele frequencies for all currently known STs. These most recent data consist of 664 STs from 1829 isolates, where 44% of these isolates were collected in Southeast Asia and 53% of the isolates were collected in Australia and Papua New Guinea. More detailed information on the geographical sources of isolates representing each ST can be found in the profiles datasheet in the MLST database.
Population analyses using Structure
The program Structure  (versions 2.2–2.3.1 due to software updates over the course of this study) was used to analyze allelic profile data from the original 641 B. pseudomallei STs. Briefly, Structure uses MLST datasets and a Bayesian approach to identify population structure and to assign individuals to populations without a priori population descriptions. A Markov Chain Monte Carlo simulation of 100,000 iterations with a burn-in period of 30,000 was run to determine the posterior probability of the number of populations (K). Where K = 2–4, Structure analyses were repeated eight times and the posterior probabilities from each run were averaged. For populations of K = 5–17, Structure analyses were repeated three times and the posterior probabilities averaged. Fewer repetitions were carried out for these higher K values as previous work suggests that more populations are not well supported . The most statistically supported K value was selected to represent the number of populations among the STs based on the estimated log (ln) of the probability of the data (ln P(D)), and the variance exhibited by each K value. All simulations were carried out using both the “no admixture”  and “admixture” models  (comparison between these two models is shown in Supplemental Data Figure S1). The posterior probability of the data (ln P(D)) for a given value of K might be expected to peak at the true value of K, however, in our runs there was no definite peak as ln P(D) increased slightly with an increase in K. This pattern, along with an increase in the variance of ln P(D) is common and has been reported by Evanno and colleagues  who suggest that measuring the changes in likelihood is a more accurate method for estimating the true value of K. We therefore used ΔK to determine the optimal K value of the B. pseudomallei populations. The ΔK value corresponds with the second order rate of change of all K values divided by the standard deviations from each K . Calculation of ΔK is shown in Supplemental Data Text S1.
Population analyses using BAPS
We used both BAPS and Structure results to assess population assignments . BAPS (version 4) is another free software package for Bayesian inference of genetic structure within a given dataset , , , . Using the “clustering of linked loci” module, BAPS determines the log likelihood in 10% increments of different population divisions and subsequently calculates the most likely K value. Thus, unlike with Structure, K is not selected a priori. The likelihood of population assignment for each ST is also calculated by BAPS. For BAPS analyses, we used sequence data from the seven B. pseudomallei MLST loci. The codon linkage model and an upward bound of 20 populations were chosen for the “clustering of linked loci” module. As with Structure, eight iterations were run where K = 2–4 and three iterations were run where K = 5–17.
Assessment of Structure and BAPS population assignments
As there is no empirical measure of determining the accuracy of population assignments, we further assessed Structure and BAPS assignments of B. pseudomallei using MLST data, by comparing individual ST assignments made by Structure and BAPS to the geographic information listed in the MLST database and to the likelihood of assignment into each population as calculated by Genetic Analysis in Excel (GenAlEx) v.6 . We also used GenAlEx to measure the degree of population differentiation among populations defined by Structure and BAPS.
GenAlEx is a free Microsoft Excel add-in where datasets can be analyzed and manipulated without the requirement for multiple programs. We used the population assignment method in GenAlEx to determine the likelihood of inclusion in each population for each ST. Unlike Structure and BAPS, GenAlEx requires a priori population designations to define population allele frequencies and subsequently calculate the likelihood of population assignment for each ST. We compared the population assignment results from our Structure and BAPS results to the likelihood of population assignment calculated by GenAlEx. Also, for population defined by Structure and BAPS, we performed analyses of molecular variance (AMOVA) to calculate the degree and statistical significance of population differentiation.
Characterization of B. pseudomallei populations
The number of populations supported by Structure and BAPS are two and three respectively. We therefore used the results from the Structure run with the highest likelihood score at K = 2 and the BAPs run with the highest likelihood score at K = 3 to infer population assignments for each ST. To show the extent of genetic differentiation among these populations, we used GenAlEx  to calculate ΦPT, using 999 permutations . In assessing assignment results, we categorized STs according to the likelihood of assignment of each ST into a population by Structure or BAPS, allowing us to evaluate the effect of assignment confidence on discrepancies among programs. To be conservative in our assignment of STs to a population, we suggest that a ST only be considered to be from a given population if Structure or BAPS assigned it to that population ≥95% of the time. As BAPS measures likelihoods in 10% intervals, this threshold is effectively 100% for BAPS. STs assigned to either population <95% of the time were considered “undefined” even though studies using simulated datasets suggest that in some situations, assignment probabilities of >50% may be accurate .
Construction of allele frequency charts
We wished to provide researchers interested in B. pseudomallei population genetics with a tool for population assignment in instances where novel STs not included in this study are encountered. To achieve this goal, the frequencies of alleles belonging to STs from each population for >95% of the runs were determined. We also enumerated alleles for STs assigned to a population between 50 and 95% of the time as this measure can be useful for indicating the reliability of an allele for population assignment.
Definition of B. pseudomallei populations with single-nucleotide polymorphisms
Performing MLST on large bacterial collections is a time-consuming task; however, single nucleotide polymorphism (SNP) genotyping provides a streamlined way to characterize MLST populations even for recombining species , , . We predicted that SNPs within MLST loci could be used to distinguish between the major B. pseudomallei ST populations. The program ‘Minimum SNPs’ , with incorporated Not-N algorithm , was used to search for a set of highly informative characters among the MLST alignments that could be used to distinguish between a predefined ‘ingroup’ and the remaining ‘outgroup’ population. The 566 B. pseudomallei STs determined by Structure to be assigned to one of the two populations in ≥95% of iterations were tested using the Not-N algorithm, where each population was alternately considered the ‘ingroup’ and all other STs the ‘outgroup’. Similarly, the 607 B. pseudomallei STs identified by BAPS as belonging to any of the three populations in ≥90% of iterations were tested (BAPS measures likelihood in 10% increments). In an attempt to increase the likelihood of finding a small set of population-defining SNPs, a second ‘Minimum SNPs’ analysis including only the 413 STs assigned to a population in 100% of Structure runs and a third analysis with the 560 STs assigned to a population in 100% of BAPS runs were carried out.
Results and Discussion
Population assignment of B. pseudomallei STs using Structure
Structure was used to identify and characterize B. pseudomallei populations using MLST allelic profile data from 641 STs. The existence of two B. pseudomallei populations (K = 2) was first proposed by Pearson and coworkers  as higher values of K did not break apart the two main populations and subdivisions were inconsistent between runs. Here, we confirm that when using Structure, two populations (K = 2) garners the most statistical support when compared to other numbers of putative populations (K = 1, and 3 through 17). This support is based on three criteria that have been used in other studies to justify selected K values. First, higher values of K retained the two populations (Figure 1) . Second, the selected K value has the lowest variance of ln P(D) after K = 1 (Figure S2) . Lastly, the ΔK shows a peak at the selected K value (Figure S2) . We also tested both ‘admixture’ and ‘no admixture’ analyses and obtained the same results regarding the size of K and similar results regarding population assignments for individual STs. However, the ‘no admixture’ method provided more consistent results than the ‘admixture’ approach, yielding lower variances. The results presented here are from the “no admixture model” (see Figure S1 for a comparison of these tests).
Figure 1. Estimated population assignments of B. pseudomallei genotypes based on multilocus sequence typing data and Structure.
Each thin vertical line represents one sequence type (ST) and is divided into K portions (based on color) that represent the likelihood of assignment into K populations. STs are sorted by probability of assignment into Population 1 (predominantly Australian STs) when K = 2. Two black vertical lines show thresholds of 95% probability of assignment. We considered STs with assignment probabilities below these thresholds to be “undefined”. The pie charts indicate the geographical sources of STs that comprise each group. Rest-of-the-world (ROW, shown in light grey) is composed of STs that were isolated from regions other than Australia (illustrated as the white slice) or Southeast Asia (shown as the dark grey slice) according to the public MLST database (http://bpseudomallei.mlst.net/).doi:10.1371/journal.pntd.0001381.g001
Using a K = 2 with Structure, the two populations were significantly distinct (ΦPT = 0.123; P = 0.001). Structure assigned 88.3% of STs to either Population 1 or Population 2 with ≥95% probability of assignment, with 44% and 44.3% of STs assigned to Population 1 and 2, respectively (Figure 1). Population 1 is comprised of 95% Australian (Australia and Papua New Guinea), 3% Southeast Asian, and 2% STs from the other parts of the world. In contrast, 89% of STs in Population 2 are from Southeast Asia, 1% from Australia, and 10% from the rest of the world (Figure 1). Only 11.7% of STs were not assigned to a given population based on a 95% probability of assignment threshold. This “undefined” group is comprised of STs from Southeast Asia (59%), Australia (25%), and the rest of the world (16%).
Population assignment of B. pseudomallei STs using BAPS
We also used the population-clustering program BAPS for determining the number of B. pseudomallei populations and for assigning STs to each population. Unlike Structure we used concatenated MLST sequence data rather than the allelic data used in Structure. In BAPS, the estimated number of populations with the most statistical support was K = 3 rather than K = 2 determined by Structure. This third population defined by BAPS appears to be a sub-population of the previously identified Population 2; however, other than a mostly Asian origin, we found no geographic or epidemiological correlation among these subdivided Population 2 STs. We therefore refer to these two BAPS Asian populations as Population 2a and Population 2b. Evidence of this population subdivision was also observed in Structure when K = 3 (Figure 1); however in Structure, both Population 1 and Population 2 were alternately subdivided depending on the run and assignments of STs to either sub-population were inconsistent. In BAPS however, Population 2 is consistently subdivided and ST assignments are consistent among runs. Therefore, it is possible that further sub-structure exists in the B. pseudomallei populations, but remain unresolved due to the limitation of having only seven MLST loci, which may not provide the genetic resolution to detect further subdivision.
Comparing Structure and BAPS population assignments
We compared the population assignments made by the run with the highest likelihood from Structure (K = 2) and BAPS (K = 3) (Figure 2). As BAPS Populations 2a and 2b are essentially subpopulations of Structure Population 2, we searched for discrepant STs assigned to Population 1 with >50% likelihood by one program and Population 2 with >50% likelihood by the other. Of the 29 discrepancies (Figure 2B), 16 were assigned by either program with a confidence level ≥95% (one ST was assigned by both programs with a confidence level ≥95%). As a further measure of assignment accuracy, we compared these 16 discrepant STs to the geographical data listed in the MLST database. Eight of the nine discrepancies assigned to a population ≥95% using Structure matched the geographical data listed in the MLST database. For the discrepancies assigned to a population ≥95% with BAPS, 3/8 originated from the geographical region of the population assigned by BAPS. Even though the listed geographic source of a ST is not a perfect indicator of population, it is possible that both programs make assignment errors even when confidence values are >95%, however such errors are probably rare. The geographic sources of STs that comprise each BAPS population are shown in Figure 2C.
Figure 2. Estimated population assignments and comparisons using BAPS and Structure simulations.
(A) Likelihood of ST assignment into two populations by Structure (top) and three populations by BAPS (bottom). The order of STs in both plots are the same and sorted by probability of assignment into Population 1 by Structure. Each thin vertical line represents one ST and is divided into two and three portions (for Structure and BAPS respectively) that represent the likelihood of assignment into each population. (B) A comparison of Structure and BAPS results. STs placed by both programs into Population 1 are shown in red and Population 2 (Populations 2a and 2b given by BAPS) are represented in green. The discrepant assignments by the two programs are shown as circles where a red interior denotes assignment into Population 1 by Structure and a green interior denotes assignment into Population 2 by Structure. (C) A breakdown of BAPS Populations 1, 2a, and 2b according to BAPS results and source data on the MLST database. The white region denotes Australian STs, the light grey region represents the ROW STs, and the dark grey color represents the Southeast Asian STs.doi:10.1371/journal.pntd.0001381.g002
Comparison of Structure and BAPS population assignments with GenAlEx
To further evaluate Structure and BAPS assignments, we used GenAlEx to calculate the likelihood of assignment of each ST in each population. When STs with high probabilities of assignment using either Structure or BAPS were analyzed with GenAlEx, a more distinct differentiation of populations could be seen (Figures 3 & 4) and the likelihood calculations from GenAlEx placed only a few STs in a different population than Structure or BAPS. As expected, differentiation among populations eroded (reflected in a decline of ΦPT values) and the number of discrepancies between either Structure or BAPS and GenAlEx increased as STs with lower assignment probabilities from Structure or BAPS were analyzed with GenAlEx (Figures 3 & 4).
Figure 3. Population assignments of STs using GenAlEx and assignment discrepancies with Structure.
Charts (A–C) represent the log likelihood of assignment of each ST by GenAlEx. A priori population designations were made with Structure and those STs assigned to a population in 100% of iterations (A), ≥95% of iterations (B), and ≥50% of iterations (C). STs with a priori designation as Population 1 are shown in red while those designated as from Population 2 are shown in green. STs with a log likelihood of assignment as calculated by GenAlEx that was in disagreement with Structure assignments are outlined in black. See text for a discussion on ST339 and ST660 indicated in A and C. (D) The relationship between % confidence and discrepancies between Structure and GenAlEx, between Structure results and published origin in the MLST database, and with the estimate of the population genetic differentiation (ΦPT).doi:10.1371/journal.pntd.0001381.g003
Figure 4. Population assignments of STs using GenAlEx and assignment discrepancies with BAPS.
STs from the three (K = 3) populations identified by BAPs were assigned to three populations in GenAlEx. Charts (A–C) represent the log likelihood of assignment of each ST by GenAlEx. A priori population designations were made with BAPS and those STs assigned to a population in 100% of iterations (A), ≥95% of iterations (B), and ≥50% of iterations (C). STs with a priori designation as Population 1 are shown in red, Population 2a STs are shown in green, and Population 2b are shown in yellow. STs with a log likelihood of assignment as calculated by GenAlEx that are in disagreement with BAPS assignments are outlined in black. Some discrepancies may not be visible due to the three-dimensional structure of the figure. (D) The relationship between % confidence and discrepancies between BAPS and GenAlEx, discrepancies between BAPS results and published origin in the MLST database, and with the estimate of the genetic differentiation between populations (ΦPT).doi:10.1371/journal.pntd.0001381.g004
When only STs with 100% probability of assignment in Structure were analyzed with GenAlEx, there was only one discrepancy (ST339). We confirmed that ST 399 is an environmental isolate from the Darwin region of the Northern Territory, Australia. Structure assigned this ST to Population 1, as expected, but was given a higher likelihood of belonging to Population 2 by GenAlEx (Figure 3A & 3D). When STs with ≥95% probabilities of assignment with Structure were analyzed with GenAlEx, there were eight discrepancies. These discrepant STs clustered with STs from Population 1, despite log likelihood values from GenAlEx that suggested they belonged in Population 2, albeit with little difference in log likelihood values (Figure 3B). The geographic sources of these eight discrepancies suggest that only one ST may have been erroneously assigned by Structure; specifically, ST660 is from rain water in Hong Kong and would be expected to be in Population 2, whereas the other seven were from Northern Australia which is consistent with their position within Population 1. As STs with decreasing probabilities of assignment with Structure were analyzed with GenAlEx, the number of discrepancies increased slightly, except for a large increase when all STs were analyzed (Figure 3).
Interestingly, more discrepancies occurred with Population 1 than Population 2. It has been previously observed that the Southeast Asian B. pseudomallei population (i.e. Population 2) has high levels of recombination but low allelic diversity, due to a monophyletic introduction of B. pseudomallei into Southeast Asia. In contrast, the Australian population appears to be paraphyletic with greater allelic diversity in spite of lower recombination between STs . Therefore, the greater diversity of Australian alleles may make Bayesian assignment of STs into Population 1 more complex than Population 2. Our cut-off value of ≥95% is likely to result in very few erroneous assignments using Structure. Indeed, ST660 is the only potentially inaccurate assignment that we identified at this cut-off value.
When GenAlEx was compared against the BAPS K = 3 dataset, there were 14 discrepancies when only STs with 100% probability of assignment were analyzed with GenAlEx (Figure 4A & 4D). For only one of these discrepancies (ST514), the assignment by BAPS into Population 2a is not consistent with the geographic origin listed in the MLST database (Australia), representing a potentially erroneous assignment by BAPS. Four STs were assigned to Population 2a by one program and Population 2b by the other. As geographic correlates for these two populations are unknown, it is impossible to determine which assignment is more likely. For the remaining nine discrepancies between BAPS and GenAlEx, the geographic origin listed in the MLST database is consistent with the BAPs population assignment. When STs at the ≥90% assignment probability with BAPS were analyzed with GenAlEx, there were 19 discrepancies. Two of these discrepancies (ST 514 and ST 660) are likely erroneous assignments by BAPS into Populations 2a and 1 respectively as their geographic origins as listed in the MLST database are Australia and Hong Kong, respectively. The number of discrepancies continues to rise as more STs are analyzed and the threshold for inclusion drops to ≥50% assignment probability with BAPS. At all levels of assignment probability by BAPS, most discrepancies involved assignments by BAPS into Population 1 while few discrepancies occurred with STs assigned by BAPS into Population 2b. This is similar to the pattern of discrepancies found with Structure assignments. This observation suggests that assignments into Population 1 are the most challenging while assignments into Population 2b are least difficult and probably more robust. In comparison to the Structure-GenAlEx comparisons, there were more overall discrepancies for GenAlEx and BAPS; however, this was expected as BAPS is splitting STs into three populations rather than just two.
In addition to evolutionary dynamics and computer algorithms, discrepant population assignment of certain STs can occasionally be attributed to database errors. Indeed, it has been shown that the listed origins for some B. pseudomallei STs are not always accurate due to curation difficulties or by not being able to account for patient travel histories. For example, several isolates recovered in the USA were likely from infections acquired during travel in Southeast Asia . Using our population assignment data, we have identified and corrected some database errors, however, it is possible that other errors remain. There are discrepancies between Structure and BAPS assignments and the listed origin of a ST in the MLST database. We therefore paid particular attention to those STs where both GenAlEx and the MLST database suggested a different population assignment than Structure (Figure 3D) or BAPS (Figure 4D). At the 95% likelihood level for Structure, only one such discrepancy (ST660) exists. Although erroneous attribution must always be considered, it is possible that this ST is derived from a recent, but ecologically established introduction into Hong Kong. Another possibility is that this ST was erroneously assigned by Structure to Population 1. However, BAPS similarly assigned ST660 to Population 1 albeit with 82% likelihood. At or above the 95% likelihood level, we could therefore find only one potential example of an inappropriate assignment by Structure.
At ≥90% likelihood level for BAPS, we found one potential discrepancy when compared to Structure, the MLST database and GenAlEx. Sequence type 514 was assigned by Structure at 100% confidence in Population 1. However, BAPS assigned ST514 with 100%confidence into Population 2a. The MLST database lists ST514 as being collected from a human source in Australia. Unfortunately, this information does not confirm the origin since travel between Thailand and Australia is prevalent. Whole genome sequencing of this ST will help resolve uncertainties regarding Australian and Southeast Asian population assignments as phylogenetic analyses can be expected to reflect population subdivisions as they have for the Australian and Southeast Asian populations .
Of the discrepancies between Structure and BAPS versus GenAlEx, the Structure results were most closely aligned with the geographical origin of STs as listed in the MLST database. However, both BAPS and GenAlEx were able to identify instances where Structure population assignments were inconsistent with the epidemiological data, indicating that no single program was 100% effective in B. pseudomallei ST population assignment. Therefore, we suggest, where possible, that Structure and BAPS are used in concert with large epidemiological datasets for highly recombinant organisms to make the most robust population assignments. The addition of more loci and more thoroughly sampling isolates not assigned to either population with high confidence will likely lead to a better understanding of the intricacies of B. pseudomallei population structure.
The search for population-defining SNPs
Given the genetic delineation of up to three populations using population assignment software, we hypothesized that a combination of SNPs might be identified that readily differentiate between these B. pseudomallei populations. We used the program ‘Minimum SNPs’, with incorporated Not-N algorithm , to find population-specific SNPs from both Structure and BAPS defined populations. Using STs with ≥95% population assignment from Structure, we identified a set of 25 SNPs that were needed to discriminate STs from Population 2 from all other STs, albeit with a confidence of only 92.5%. In other words, even with a set of 25 SNPs, only 92.5% of the Population 2 STs could be distinguished from the Population 1 STs. No addition SNPs could be added by the algorithm to increase the percentage of Population 2 STs that could be distinguished from Population 1 STs. In order to increase the likelihood of identifying a smaller number of SNPs for population differentiation, we narrowed down our population definition by including only STs assigned to each population in 100% of Structure runs. Our results showed that a set of 16 SNPs were needed to separate the Population 2 STs from the Population 1 STs at a confidence level of 97.6%. As inaccurately assigned STs can hamper the ability of ‘Minimum SNPs’ to find population specific SNPs, we also used the BAPS population designations at both the ≥90% and 100% thresholds for population assignment. For STs assigned to each population in 90% of BAPS runs, the Not-N algorithm identified a set of 26 SNPs that discriminated Populations 2a and 2b apart from Population 1 with a confidence of 81.1%. For STs assigned to a given population in 100% of BAPS runs, a set of 26 SNPs discriminated Populations 2a and 2b apart from Population 1 with 84.3% confidence. A set of 21 SNPs discriminated Population 2a apart from Populations 1 and 2b with 95.5% confidence while a set of 13 SNPs discriminated population 2b from the others with 99.2% confidence. Finally, by analyzing only the Population 2 STs identified at the 100% threshold with BAPS, we found a single SNP (at position 192 in the narK locus) that distinguishes all STs in Population 2b (C nucleotide) from all STs in Population 2a (G or T nucleotide). These results suggest that complete population identification of all members of all populations by a combination of SNPs from MLST data is not possible.
Development of reference tools for population assignment
A more recent version of the MLST database was downloaded and used to repeat our Structure and BAPS analyses. Once the analyses on the updated database were complete (November 6th, 2010) these data were compared to the database originally downloaded for this study (January 15th, 2009). This comparison verified the consistency of Structure and BAPS results between the temporal datasets. Of note, however, is the identification by BAPS of a fourth population consisting of three STs, two of which were included in the original database and were formerly placed in Population 1. The third ST in this new population (ST698) is novel and is a human isolate from the USA. Because this population appears to be part of the Australian population, we refer to it as Population 1b and the other Australian population as Population 1a.
Population assignments and likelihood values for each ST based on the updated MLST database are shown in Table 1. This table provides a resource that can be used by researchers interested in determining the geographic source population of B. pseudomallei STs. Comparisons with other population assignment methods and with geographic source information listed on the MLST database suggest that the risk of assignment by Structure and BAPS into the incorrect population is low when a high percentage of iterations result in the same assignment. In addition, there appear to be fewer potential errors with STs assigned to Population 2 by Structure and 2a or 2b by BAPS. We therefore suggest that a cut off value of ≥95% (≥90% for BAPS) assignment probability can serve as a conservative threshold above which assignment errors are not likely and which include a large proportion (~90%) of the entire ST populations. The threshold used by different investigators does not need to be universal, and our recommendation of ≥95% is solely intended as a conservative guide. Indeed, for STs assigned to Population 2 (or 2a/2b), which is a monophyletic population, it is likely that a lower threshold of even ≥60% assignment probability is not likely to result in erroneous assignments.
Table 1. Population assignment for each ST and likelihood of assignment.doi:10.1371/journal.pntd.0001381.t001
While we present here a list of STs and the likelihood of assignment into each population, we recognize that new STs will be found with future sampling, limiting the long-term utility of our analyses. However, due to the relatively low diversity and high recombination rates relative to mutation in B. pseudomallei , it is likely that many new STs will not contain novel alleles, but rather will comprise new combinations of characterized alleles. As population assignments with Structure are based on allele frequencies in a population, we include this information here with the expectation that this resource will continue to be useful even as novel STs are discovered (Figure S3 and Table S1). We suggest that alleles that are predominantly associated with population 1 or population 2 can be used to estimate population assignment for novel STs. Of 50 randomly selected STs, all but three could be assigned based on the presence of alleles predominantly associated with one population (≥95% of their occurrence is attributed to one population). These three STs do not have a high affinity to either populations as all were originally assigned with <95% confidence by Structure and BAPS. Of the 664 B. pseudomallei STs, 80% have alleles that are exclusively found in one of the two main populations and 93% have alleles that are associated with one of these populations in ≥95% of their occurrences. Thus for new STs, allele frequency data can shed light on appropriate population assignments.
As lateral gene transfer is increasingly found to play an important role in the population dynamics of a range of bacterial species, population genetics tools such as Structure and BAPS will become more widely used by epidemiologists. The approach described here facilitates rapid assignment of isolates to established populations without needing to compile data, or learn and run a new application. Population assignment is one of the first steps in epidemiological tracking of disease and can be used to identify and track bacterial introductions into new regions. We have expanded on our previous work  by rigorously exploring the composition of the two major populations of B. pseudomallei. Our results suggest that the programs Structure and BAPS are both sensitive and accurate for population assignment of B. pseudomallei using MLST data, as the two programs provide similar results. The relative rate of recombination to mutation at MLST loci for B. pseudomallei is higher than for any other bacterial species yet reported , meaning that allele frequency differences among populations is an appropriate method for determining population structure. Examining allele frequencies when deciphering population structure is standard for eukaryotes, where high recombination rates cause allelic frequency differences among populations through genetic drift .
Population assignment is an important aspect of epidemiological and forensic attribution. As knowledge of population dynamics and geographical distribution of a species increases, attribution can be attempted at an increasingly fine scale, allowing investigators to focus their attention on a very small and well-defined population and geographic region. For B. pseudomallei, little is currently known about population dynamics, evolution and even geographical distribution. High relative rates of recombination to mutation complicate attempts to discern population structure for this species using strictly phylogenetic approaches. MLST analyses are popular for bacterial pathogens and the large data set collected for B. pseudomallei has allowed for the robust identification of two main populations that correspond to the endemic geographical regions of Southeast Asia and Australia. While substructure within these two populations likely exists, such as the third population identified by BAPS, the seven MLST genes and the current set of STs do not provide enough resolution for further robust differentiation among subpopulations. Genotype interrogation at more loci or great numbers of STs will increase our knowledge of subpopulation dynamics, but in the meantime our current ability to differentiate between the two or three major populations is an important first step for epidemiological attribution. Increasing knowledge of the geographic distribution and population structure of B. pseudomallei STs form the foundation for future work on the evolution, population dynamics and geographical distribution of subpopulations of this bacterium.
Comparison of K values from Structure using both ‘admixture’ and ‘no admixture’ models. (A) Log likelihood and average within run variance associated with different values of K, (B) Log likelihood of different values of K, (C) ΔK for different values of K. Importantly, in this figure it should be noted that the most likely value is K = 2.
Comparison of K values from Structure runs using the ‘no admixture’ model. (A) Log likelihood and average variance associated with different values of K from Structure along with results from calculating ΔK from these values and (B) plot showing the change in ΔK for each population of K. Analyses using both models show that K = 2 (i.e. two populations) is the most supported K value in the B. pseudomallei MLST dataset using Structure.
Allele frequencies across 664 STs in each B. pseudomallei population. The frequencies of alleles from STs assigned to each population based on Structure and BAPS are shown as a stacked bar graph. For BAPS data, Population 1a and Population 1b were combined as Population 1b only consisted of three STs. The red bar represents alleles placed in Population 1 (predominantly Australian STs) with ≥95% probability of assignment, the green bar represents alleles placed in Population 2 (Structure) and 2a (BAPS) (predominantly Southeast Asian STs) with ≥95% probability of assignment, the yellow bar represents alleles placed in Population 2b with probability of assignment by BAPS, and the error bars represents the number of alleles placed in a population with a probability of assignment <95%.
Allele frequencies across 664 STs in each B. pseudomallei population.
Calculating Δ K (adapted from Evanno et al.  ).
We thank the melioidosis research community for collecting and sequencing the many Burkholderia isolates that populate the public MLST database.
Analyzed the data: JD EPP HH JDB MM DG JTF DMW AT BGS BJC TP. Contributed reagents/materials/analysis tools: MM DG VW AB JW BGS SJP BJC. Wrote the paper: JD EPP TP. Interpreted results: JD EPP JDB DG JTF DMW AT BGS BJC PK TP.
- 1. Currie BJ, Dance DAB, Cheng AC (2008) The global distribution of Burkholderia pseudomallei and melioidosis: an update. Trans R Soc Trop Med Hyg 102: S1–S4.
- 2. Aldhous P (2005) Tropical medicine: melioidosis? Never heard of it. Nature 434: 692–693.
- 3. Tuanyok A, Auerbach RK, Brettin TS, Bruce DC, Munk AC, et al. (2007) A horizontal gene transfer event defines two distinct groups within Burkholderia pseudomallei that have dissimilar geographic distributions. J Bacteriol 189: 9044–9049.
- 4. White NJ (2003) Melioidosis. Lancet 361: 1715–1722.
- 5. Cheng AC, Currie BJ (2005) Melioidosis: epidemiology, pathophysiology, and management. Clin Microbiol Rev 18: 383–416.
- 6. Dance DA (2002) Melioidosis. Curr Opin Infect Dis 15: 127–132.
- 7. Maiden MC (2006) Multilocus sequence typing of bacteria. Annu Rev Microbiol 60: 561–588.
- 8. Godoy D, Randle G, Simpson AJ, Aanensen DM, Pitt TL, et al. (2003) Multilocus sequence typing and evolutionary relationships among the causative agents of melioidosis and glanders, Burkholderia pseudomallei and Burkholderia mallei. J Clin Microbiol 41: 2068–2079.
- 9. Vesaratchavest M, Tumapa S, Day NP, Wuthiekanun V, Chierakul W, et al. (2006) Nonrandom distribution of Burkholderia pseudomallei clones in relation to geographical location and virulence. J Clin Microbiol 44: 2553–2557.
- 10. Currie BJ, Thomas AD, Godoy D, Dance DA, Cheng AC, et al. (2007) Australian and Thai isolates of Burkholderia pseudomallei are distinct by multilocus sequence typing: revision of a case of mistaken identity. J Clin Microbiol 45: 3828–3829.
- 11. Pearson T, Giffard P, Beckstrom-Sternberg S, Auerbach R, Hornstra H, et al. (2009) Phylogeographic reconstruction of a bacterial species with high levels of lateral gene transfer. BMC Biol 7: 78.
- 12. Turner SL, Young JP (2000) The glutamine synthetases of rhizobia: phylogenetics and evolutionary implications. Mol Biol Evol 17: 309–319.
- 13. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
- 14. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, et al. (2003) Traces of human migrations in Helicobacter pylori populations. Science 299: 1582–1585.
- 15. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
- 16. Duan J, Mu J, Thera MA, Joy D, Kosakovsky Pond SL, et al. (2008) Population structure of the genes encoding the polymorphic Plasmodium falciparum apical membrane antigen 1: implications for vaccine design. Proc Natl Acad Sci U S A 105: 7857–7862.
- 17. Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14: 2611–2620.
- 18. Latch E, Dharmarajan G, Glaubitz J, Rhodes OJ (2006) Relative performance of Bayesian clustering software for inferring population substructure and individual assignment at low levels of population differentiation. Conservation Genetics 7: 295–302.
- 19. Corander J, Gyllenberg M, Koski T (2007) Random partition models and exchangeability for Bayesian identification of population structure. Bull Math Biol 69: 797–815.
- 20. Corander J, Marttinen P (2006) Bayesian identification of admixture events using multilocus molecular markers. Mol Ecol 15: 2833–2843.
- 21. Corander J, Marttinen P, Siren J, Tang J (2008) Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics 9: 539.
- 22. Corrander J, Waldmann P, Marttinen P, Sillanpaa MJ (2004) BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics 20:
- 23. Peakall RaSPE (2006) GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes 288–295.
- 24. Smouse PE, Peakall R (1999) Spatial autocorrelation analysis of individual multiallele and multilocus genetic structure. Heredity 82(Pt 5): 561–573.
- 25. Price EP, Thiruvenkataswamy V, Mickan L, Unicomb L, Rios RE, et al. (2006) Genotyping of Campylobacter jejuni using seven single-nucleotide polymorphisms in combination with flaA short variable region sequencing. J Med Microbiol 55: 1061–1070.
- 26. Robertson GA, Thiruvenkataswamy V, Shilling H, Price EP, Huygens F, et al. (2004) Identification and interrogation of highly informative single nucleotide polymorphism sets defined by bacterial multilocus sequence typing databases. J Med Microbiol 53: 35–45.
- 27. Stephens AJ, Huygens F, Inman-Bamber J, Price EP, Nimmo GR, et al. (2006) Methicillin-resistant Staphylococcus aureus genotyping using a small set of polymorphisms. J Med Microbiol 55: 43–51.
- 28. Price EP, Inman-Bamber J, Thiruvenkataswamy V, Huygens F, Giffard PM (2007) Computer-aided identification of polymorphism sets diagnostic for groups of bacterial and viral genetic variants. BMC Bioinformatics 8: 278.
- 29. Hanage WP, Fraser C, Spratt BG (2006) The impact of homologous recombination on the generation of diversity in bacteria. J Theor Biol 239: 210–219.