Presentation Information
[SS16-04] Deep distributed computing for clustering extremely large datasets
*Nozomu Yachie 1,2 (1. The University of British Columbia SBME (Canada), 2. Osaka University PRIMe (Japan))
Keywords:
Deep distributed computing, Large data clustering, Single-cell RNA sequencing, Cell lineage tracing
Biology is moving toward large-scale data generation and integration. CRISPR cell lineage tracing is expected to produce extremely large datasets of mutated sequences from single cells in vertebrate bodies (the adult mouse consists of 1 billion nucleated cells). Single-cell gene expression datasets are also accumulating rapidly, totaling nearly 100 million single cells as of February 2025. However, no fast and efficient clustering approach has been proposed for datasets of this size, which hinders their integrative analysis. In 2022, we introduced FRACTAL, a deep distributed computing framework for CRISPR cell lineage and evolutionary tree reconstruction. The framework first roughly reconstructs an upstream hierarchy (or lineage tree) of data clusters and then recursively applies the same procedure to each downstream cluster on independent computing nodes. In this symposium, I will share an updated architecture of this distributed computing strategy.
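The recursive strategy described in the abstract can be illustrated with a minimal sketch. This is not the FRACTAL implementation: it assumes one-dimensional numeric data in place of sequences, uses a simple two-means split on a random subsample as a stand-in for the rough upstream reconstruction, and runs the recursion in a single process where a real system would dispatch each downstream cluster to an independent computing node. The names `rough_split`, `build_hierarchy`, and `collect_leaves` are hypothetical.

```python
import random


def rough_split(points, sample_size, rng):
    """Roughly split points into two clusters using only a small subsample."""
    sample = rng.sample(points, min(sample_size, len(points)))
    c0, c1 = min(sample), max(sample)
    if c0 == c1:
        return [points]  # the subsample carries no signal to split on
    # A few two-means iterations on the subsample only (cheap, approximate).
    for _ in range(5):
        g0 = [x for x in sample if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in sample if abs(x - c0) > abs(x - c1)]
        if not g0 or not g1:
            break
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    # Assign every point, sampled or not, to the nearest rough cluster.
    left = [x for x in points if abs(x - c0) <= abs(x - c1)]
    right = [x for x in points if abs(x - c0) > abs(x - c1)]
    if not left or not right:
        return [points]
    return [left, right]


def build_hierarchy(points, leaf_size, sample_size, rng):
    """Return a nested hierarchy: tuples are internal nodes, lists are leaves."""
    if len(points) <= leaf_size:
        return sorted(points)
    parts = rough_split(points, sample_size, rng)
    if len(parts) == 1:
        return sorted(points)
    # Each child is independent of its siblings, so in a distributed setting
    # every call below could run on a separate computing node.
    return tuple(build_hierarchy(p, leaf_size, sample_size, rng) for p in parts)


def collect_leaves(tree):
    """Flatten the hierarchy into a list of leaf clusters."""
    if isinstance(tree, list):
        return [tree]
    return [leaf for child in tree for leaf in collect_leaves(child)]


rng = random.Random(0)
data = list(range(0, 50)) + list(range(1000, 1050))
tree = build_hierarchy(data, leaf_size=8, sample_size=10, rng=rng)
leaves = collect_leaves(tree)
```

Because only a subsample is clustered at each level and the downstream clusters never need to communicate, the work per node stays small even when the full dataset is very large, which is the property the abstract attributes to the distributed strategy.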