May
03

HYBRID EVENT

Exact and approximate algorithm for accurate clustering of mid-sized networked datasets

Samin Aref (University of Toronto)
Max Planck Institute for Demographic Research (MPIDR), May 03, 2023

Abstract

Accurate computational methods are critical in data science for several reasons. They provide reliable insights and allow data-driven discovery to remain trustworthy. Not all commonly used data science methods are ideally accurate. In fact, there are commonly used methods for generic data science tasks which have substantial inaccuracies. Clustering networked data seems to be one such task. Clustering networked data, also known as community detection, is a fundamental task in data science with extensive applications in different fields. Among numerous approaches, the most common method is modularity maximization. Modularity maximization algorithms are designed to maximize a utility function, modularity, across all possible ways that the nodes of the input network can be partitioned into clusters (communities). Despite their design philosophy and wide adoption, heuristic modularity maximization algorithms rarely return an optimal (a maximum-modularity) partition or anything similar (arxiv.org/pdf/2302.14698). We propose a specialized algorithm, Bayan, which returns partitions with a guarantee of either optimality or proximity to an optimal partition. At the core of the Bayan algorithm is a branch-and-cut scheme that solves an integer programming formulation of the problem to optimality or approximate it within a factor (arxiv.org/pdf/2209.04562). We demonstrate Bayan's distinctive accuracy and stability over 21 other algorithms in retrieving ground-truth communities in synthetic benchmarks and node labels in real networks. Bayan is several times faster than open-source and commercial solvers for modularity maximization making it capable of finding optimal partitions for instances that cannot be optimized by any other existing method. Overall, our assessments point to Bayan as a suitable choice for exact maximization of modularity in networks with up to 3000 edges (in their largest connected component) and approximating maximum modularity in larger networks on ordinary computers. A Python implementation of the Bayan algorithm (bayanpy) is publicly available through the package installer for Python (pip) which is demonstrated in https://colab.research.google.com/drive/1_Clxp5FcEYJVn_w7Q6zm8bX0t1TEg7m_?usp=sharing

About

Samin Aref is an assistant professor at the University of Toronto Department of Mechanical & Industrial Engineering. He obtained his PhD degree in Computer Science from the University of Auckland (New Zealand) and worked at Max Planck Institute for Demographic Research, Laboratory of Digital and Computational Demography a research scientist and  research area chair between 2018-2021. He holds an MSc degree in Industrial Engineering

The Max Planck Institute for Demographic Research (MPIDR) in Rostock is one of the leading demographic research centers in the world. It's part of the Max Planck Society, the internationally renowned German research society.