Benchmark

XspecT is a tool designed for fast and accurate species classification of genome assemblies and simulated reads. To evaluate its classification accuracy, we conducted a benchmark using a set of Acinetobacter genomes.

For the benchmark, all available Acinetobacter genomes were first downloaded from GenBank, keeping only those with a passed ("OK") taxonomy check status. Genomes assigned to strain IDs were remapped to their respective species IDs, and genomes whose species IDs are not contained in XspecT's Acinetobacter model were then removed. The remaining genomes were classified as assemblies and also served as the source for simulated reads.

For read simulation, only genomes that were not part of the training data and that NCBI categorizes as "complete" were considered, with up to three genomes selected per species. Reads were simulated from the longest contig of each genome (assumed to be the chromosome) using a custom Python script: 100 000 reads per genome, each 100 bp long, with no simulated sequencing errors. The reads were then classified using XspecT, taking the maximum-scoring species as the prediction.
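The read simulation and the maximum-scoring prediction rule described above can be sketched in a few lines of Python. This is not the project's actual script, only an illustration under stated assumptions: read start positions are drawn uniformly at random, reads come from the forward strand only, the contig is treated as linear, and the function names, the `seed` parameter, and the shape of the `scores` dictionary are all hypothetical.

```python
import random


def longest_contig(fasta_text):
    """Return the sequence of the longest record in a FASTA-formatted string."""
    contigs, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header starts a new record; store the previous one if any.
            if current:
                contigs.append("".join(current))
            current = []
        elif line.strip():
            current.append(line.strip())
    if current:
        contigs.append("".join(current))
    return max(contigs, key=len)


def simulate_reads(sequence, n_reads=100_000, read_length=100, seed=0):
    """Draw error-free reads as substrings of `sequence`.

    Assumption: start positions are uniform over the contig; the real
    script may sample differently.
    """
    last_start = len(sequence) - read_length
    if last_start < 0:
        raise ValueError("sequence is shorter than the requested read length")
    rng = random.Random(seed)
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [sequence[s:s + read_length] for s in starts]


def predict_species(scores):
    """Return the key with the highest score (maximum-scoring species).

    `scores` is a hypothetical mapping of species ID to score.
    """
    return max(scores, key=scores.get)


# Toy input: the second record is the longest and stands in for the chromosome.
toy_fasta = ">plasmid\nACGTAC\n>chromosome\nACGTTGCAAC\nGGTACCATGA\n"
chromosome = longest_contig(toy_fasta)
reads = simulate_reads(chromosome, n_reads=5, read_length=8, seed=42)
```

Because no sequencing errors are introduced, every simulated read is an exact substring of its source contig, which is what the benchmark setup describes.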

Benchmark Results

The benchmark results show that XspecT achieves high classification accuracy, with an overall accuracy of 99.94% for whole genomes and 81.81% for simulated reads. However, the low macro-average F1 score (0.41) for the read dataset, compared with a weighted-average F1 score of 0.87, highlights a substantial class imbalance.

Dataset	Total Samples	Matches	Mismatches	Match Rate	Mismatch Rate	Accuracy	Macro Avg F1	Weighted Avg F1
Assembly	44,905	44,879	26	99.94%	0.06%	≈1.00	0.95	1.00
Reads	9,200,000	7,526,902	1,673,098	81.81%	18.19%	0.82	0.41	0.87
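To make the table easier to check, the sketch below recomputes the match rates from the reported counts and shows how macro-average and weighted-average F1 differ on a small toy example. How the benchmark itself computes these scores is not stated here; the definitions below follow the common convention (macro: unweighted mean of per-class F1, weighted: mean weighted by the number of true samples per class), and the function names and toy labels are illustrative only.

```python
def match_rate(matches, total):
    """Fraction of samples whose predicted species equals the expected one."""
    return matches / total


def f1_averages(y_true, y_pred):
    """Return (macro_f1, weighted_f1) over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_by_label, support = {}, {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_by_label[label] = 2 * tp / denom if denom else 0.0
        support[label] = tp + fn  # number of true samples of this class
    macro = sum(f1_by_label.values()) / len(labels)
    weighted = sum(f1_by_label[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Match rates recomputed from the counts in the table above.
assembly_rate = match_rate(44_879, 44_905)
read_rate = match_rate(7_526_902, 9_200_000)

# Toy example: a large class "A" and a small class "B" that is half wrong.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
macro_f1, weighted_f1 = f1_averages(y_true, y_pred)
```

In the toy example the weighted average stays closer to the F1 score of the large class, while the macro average is pulled down by the small class. A gap between the two averages, as seen for the read dataset, therefore points to uneven performance across classes of different sizes.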

Running the benchmark yourself

To benchmark XspecT yourself, you can use the Nextflow workflow provided in the scripts/benchmark directory. This workflow runs XspecT on a set of samples and measures species classification accuracy on both genome assemblies and simulated reads.

Before running the benchmark, you need to download the benchmarking data to the data directory, for example from NCBI. To do so, you can use the bash script in scripts/benchmark-data, which downloads the data using the NCBI Datasets CLI. The CLI needs to be installed first. The script downloads all available Acinetobacter genomes, as well as taxonomic data.

To run the benchmark, install Nextflow and run the following command:

nextflow run scripts/benchmark

This will execute the benchmark workflow, which will classify the samples, as well as reads generated from them, using XspecT. The results will be saved in the results directory:

results/classifications.tsv for the classifications of the assemblies
results/read_classifications.tsv for the classifications of the simulated reads
results/confusion_matrix.png for the confusion matrix of genome assembly classifications
results/mismatches_confusion_matrix.png for a confusion matrix filtered on mismatches of genome assembly classifications
results/stats.txt for the statistics of the benchmark run
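
For a quick look at the raw classification tables, a minimal Python sketch such as the following can count matches in a tab-separated file. The column layout of results/classifications.tsv is not documented here, so the column names "expected_species" and "predicted_species" are purely hypothetical placeholders, and the in-memory example stands in for the real file.

```python
import csv
import io


def match_rate_from_tsv(handle, truth_col, pred_col):
    """Return (matches, total, rate) for a tab-separated classification table.

    `truth_col` and `pred_col` name the columns holding the expected and the
    predicted species; check the file header for the real names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    matches = total = 0
    for row in reader:
        total += 1
        matches += row[truth_col] == row[pred_col]
    rate = matches / total if total else 0.0
    return matches, total, rate


# Tiny in-memory stand-in for a results file (hypothetical columns and IDs).
example = io.StringIO(
    "sample\texpected_species\tpredicted_species\n"
    "g1\tsp1\tsp1\n"
    "g2\tsp1\tsp2\n"
    "g3\tsp3\tsp3\n"
)
matches, total, rate = match_rate_from_tsv(
    example, "expected_species", "predicted_species"
)
```

To apply it to an actual run, open the file with `open("results/classifications.tsv", newline="")` and pass the column names found in its header.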