Towards an Unsupervised Federated Learning Approach for Speaker Diarization on IoT-style Audio Devices

Amit Kumar Bhuyan; Hrishikesh Dutta; Subir Biswas

doi:10.36227/techrxiv.170492258.87092583/v1

Abstract

This paper presents a computationally efficient, distributed speaker diarization framework suitable for a network of IoT-style audio devices. It proposes a Federated Learning based speaker identification model that can identify the participants in a conversation without requiring a large audio database for training. An unsupervised online update mechanism, based on the cosine similarity of speaker embeddings, is proposed for the Federated Learning model. Moreover, the proposed diarization system addresses speaker change detection via unsupervised segmentation techniques using Hotelling's T-squared statistic and the Bayesian Information Criterion (BIC). In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the tradeoff between the missed detection and false detection rates. Additionally, the computational overhead of frame-by-frame speaker identification is reduced via unsupervised clustering of speech segments. The results show the effectiveness of the proposed training method in the presence of non-IID speech data. They also show a considerable reduction in false and missed detections at the segmentation stage, together with a reduction in computational overhead. The improved accuracy and reduced computational cost of the proposed mechanism make it appropriate for real-time speaker diarization, especially in a distributed IoT network.
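As a rough illustration of the two change-detection statistics named in the abstract, the following minimal sketch scores a candidate boundary between two windows of per-frame feature vectors (for example MFCCs) using the standard two-sample Hotelling's T-squared statistic and the classic ΔBIC criterion. It assumes full-covariance Gaussian models for each window; the function names, the penalty weight `lam`, and the small diagonal `ridge` are illustrative assumptions, not the authors' implementation, and the quasi-silence biasing is not modelled.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mean(xs: Sequence[Sequence[float]]) -> List[float]:
    n, d = len(xs), len(xs[0])
    return [sum(x[j] for x in xs) / n for j in range(d)]


def covariance(xs, ddof: int = 0, ridge: float = 1e-6) -> Matrix:
    """Sample covariance: ddof=0 gives the ML estimate, ddof=1 the unbiased one.
    A small ridge (an assumption) is added to the diagonal to keep it invertible."""
    mu = mean(xs)
    n, d = len(xs), len(mu)
    cov = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / (n - ddof)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge
    return cov


def solve_logdet(a: Matrix, b: Sequence[float]) -> Tuple[List[float], float]:
    """Solve a @ x = b and return (x, log|det a|) by Gaussian elimination
    with partial pivoting. Assumes a is non-singular."""
    d = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    logdet = 0.0
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]  # row swap only flips the sign of det
        logdet += math.log(abs(m[c][c]))
        for r in range(c + 1, d):
            f = m[r][c] / m[c][c]
            for k in range(c, d + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * d
    for r in range(d - 1, -1, -1):  # back substitution
        x[r] = (m[r][d] - sum(m[r][k] * x[k] for k in range(r + 1, d))) / m[r][r]
    return x, logdet


def hotelling_t2(left, right) -> float:
    """Two-sample Hotelling's T^2 with a pooled (unbiased) covariance."""
    n1, n2 = len(left), len(right)
    mu1, mu2 = mean(left), mean(right)
    s1, s2 = covariance(left, ddof=1), covariance(right, ddof=1)
    d = len(mu1)
    pooled = [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
               for j in range(d)] for i in range(d)]
    diff = [u - v for u, v in zip(mu1, mu2)]
    x, _ = solve_logdet(pooled, diff)  # x = pooled^{-1} (mu1 - mu2)
    return n1 * n2 / (n1 + n2) * sum(di * xi for di, xi in zip(diff, x))


def delta_bic(left, right, lam: float = 1.0) -> float:
    """Delta-BIC for one Gaussian over both windows vs. one per window.
    Positive values favour two models, i.e. a speaker change at the boundary."""
    both = list(left) + list(right)
    n1, n2, n = len(left), len(right), len(both)
    d = len(left[0])
    zeros = [0.0] * d
    ld = solve_logdet(covariance(both), zeros)[1]
    ld1 = solve_logdet(covariance(left), zeros)[1]
    ld2 = solve_logdet(covariance(right), zeros)[1]
    penalty = 0.5 * (d + d * (d + 1) / 2) * math.log(n)
    return 0.5 * (n * ld - n1 * ld1 - n2 * ld2) - lam * penalty


if __name__ == "__main__":
    rng = random.Random(0)
    spk_a = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
    spk_b = [[rng.gauss(3.0, 1.0), rng.gauss(-2.0, 1.0)] for _ in range(200)]
    more_a = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
    print("change:", hotelling_t2(spk_a, spk_b), delta_bic(spk_a, spk_b))
    print("same:  ", hotelling_t2(spk_a, more_a), delta_bic(spk_a, more_a))
```

On synthetic 2-D data like the demo above, a boundary between two different "speakers" should give a large T-squared value and a positive ΔBIC, while two windows from the same distribution should give a small T-squared value and a negative ΔBIC. In practice the T-squared threshold and `lam` would have to be tuned on real speech features.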

05 Jan 2024: Submitted to TechRxiv

10 Jan 2024: Published in TechRxiv