Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices

Matthew Kellom; Jason Raymond

doi:10.14440/jbm.2017.153

Matthew Kellom

School of Earth and Space Exploration, Arizona State University

Jason Raymond

School of Earth and Space Exploration, Arizona State University

Keywords

clustering, cluster, centroid-linkage, distance matrix, aggregate

Abstract

Sequence clustering is a fundamental tool of molecular biology that is being challenged by increasing dataset sizes from high-throughput sequencing. The agglomerative algorithms that have been relied upon for their accuracy require the construction of computationally costly distance matrices, which can overwhelm the personal computers used for basic research. Alternative algorithms, such as centroid-linkage, circumvent large memory requirements, but their results often depend on input order. We present a method for bootstrapping the results of many centroid-linkage clustering iterations into an aggregate set of clusters, increasing cluster accuracy without a distance matrix. This method ranks cluster edges by conservation across iterations and reconstructs aggregate clusters from the resulting ranked edge list, pruning low-frequency cluster edges that may have resulted from a specific sequence input order. Aggregating centroid-linkage clustering iterations can help researchers using personal computers obtain more reliable clustering results without increasing memory resources.
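
The edge-counting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that an "edge" is any pair of sequences placed in the same cluster in one iteration, that each iteration is supplied as a list of sets of sequence IDs, and that pruning uses a simple fraction-of-iterations cutoff. The names `count_edges`, `aggregate_clusters`, and `min_fraction` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_edges(iterations):
    """Count how many iterations place each pair of sequences in one cluster.

    Assumption: an edge is an unordered pair of co-clustered sequence IDs.
    """
    counts = Counter()
    for clusters in iterations:
        for cluster in clusters:
            for a, b in combinations(sorted(cluster), 2):
                counts[(a, b)] += 1
    return counts


def aggregate_clusters(iterations, min_fraction=0.5):
    """Rebuild clusters from edges conserved in at least min_fraction of iterations."""
    counts = count_edges(iterations)
    threshold = min_fraction * len(iterations)

    # Every sequence starts as its own cluster (union-find forest).
    members = {s for clusters in iterations for c in clusters for s in c}
    parent = {s: s for s in members}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # most_common() yields edges ranked from most to least conserved,
    # so everything after the first low-frequency edge is pruned.
    for (a, b), n in counts.most_common():
        if n < threshold:
            break
        parent[find(a)] = find(b)

    groups = {}
    for s in members:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(g) for g in groups.values())


# Three hypothetical clustering runs over the same five sequences,
# as might result from shuffling the input order.
runs = [
    [{"s1", "s2", "s3"}, {"s4", "s5"}],
    [{"s1", "s2"}, {"s3", "s4", "s5"}],
    [{"s1", "s2", "s3"}, {"s4", "s5"}],
]
print(aggregate_clusters(runs))  # [['s1', 's2', 's3'], ['s4', 's5']]
```

In this toy input, the edges joining `s3` to `s4` and `s5` appear in only one of three runs and are pruned, while the edges joining `s3` to `s1` and `s2` appear in two and are kept. Enumerating all pairs within a cluster is quadratic in cluster size, so a practical tool might count only centroid-to-member edges instead; that choice is left open here.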

Published

Mar 16, 2017

DOI https://doi.org/10.14440/jbm.2017.153

How to Cite

Kellom M, Raymond J. Using cluster edge counting to aggregate iterations of centroid-linkage clustering results and avoid large distance matrices. J Biol Methods [Internet]. 2017 Mar. 16 [cited 2024 Apr. 19];4(1):e68. Available from: https://jbmethods.org/index.php/jbm/article/view/153

Issue

Vol. 4 No. 1 (2017)

Section

Articles

Authors who publish with JBM agree to the following terms:

Authors retain copyright and grant JBM right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
