HYBRID EVENT
Exact and approximate algorithm for accurate clustering of mid-sized networked datasets
Samin Aref (University of Toronto)
Max Planck Institute for Demographic Research (MPIDR), May 03, 2023
Abstract
Accurate computational methods are critical in data science for several reasons. They provide reliable insights and allow data-driven discovery to remain trustworthy. Not all commonly used data science methods are ideally accurate. In fact, there are commonly used methods for generic data science tasks which have substantial inaccuracies. Clustering networked data seems to be one such task. Clustering networked data, also known as community detection, is a fundamental task in data science with extensive applications in different fields. Among numerous approaches, the most common method is modularity maximization. Modularity maximization algorithms are designed to maximize a utility function, modularity, across all possible ways that the nodes of the input network can be partitioned into clusters (communities). Despite their design philosophy and wide adoption, heuristic modularity maximization algorithms rarely return an optimal (a maximum-modularity) partition or anything similar (arxiv.org/pdf/2302.14698). We propose a specialized algorithm, Bayan, which returns partitions with a guarantee of either optimality or proximity to an optimal partition. At the core of the Bayan algorithm is a branch-and-cut scheme that solves an integer programming formulation of the problem to optimality or approximate it within a factor (arxiv.org/pdf/2209.04562). We demonstrate Bayan's distinctive accuracy and stability over 21 other algorithms in retrieving ground-truth communities in synthetic benchmarks and node labels in real networks. Bayan is several times faster than open-source and commercial solvers for modularity maximization making it capable of finding optimal partitions for instances that cannot be optimized by any other existing method. Overall, our assessments point to Bayan as a suitable choice for exact maximization of modularity in networks with up to 3000 edges (in their largest connected component) and approximating maximum modularity in larger networks on ordinary computers. A Python implementation of the Bayan algorithm (bayanpy) is publicly available through the package installer for Python (pip) which is demonstrated in https://colab.research.google.com/drive/1_Clxp5FcEYJVn_w7Q6zm8bX0t1TEg7m_?usp=sharing
About
Samin Aref is an assistant professor at the University of Toronto Department of Mechanical & Industrial Engineering. He obtained his PhD degree in Computer Science from the University of Auckland (New Zealand) and worked at Max Planck Institute for Demographic Research, Laboratory of Digital and Computational Demography a research scientist and research area chair between 2018-2021. He holds an MSc degree in Industrial Engineering