Generating Synthetic Clinical Speech Data through Simulated ASR Deletion Error

Abstract

Training classification models on clinical speech is a time-saving and effective solution for many healthcare challenges, such as screening for Alzheimer’s Disease over the phone. One of the primary factors limiting the success of artificial intelligence (AI) solutions is the amount of relevant data available. Clinical data is expensive to collect, rarely available in quantities sufficient for large-scale machine learning or neural methods, and often cannot be shared between institutions because of data protection laws. With the increasing demand for AI in health systems, generating synthetic clinical data that maintains the nuance of underlying patient pathology is the next pressing task. Previous work has shown that automated evaluation of clinical speech tasks via automatic speech recognition (ASR) is comparable to manually annotated results in diagnostic scenarios, even though ASR systems produce errors during the transcription process. In this work, we propose to generate synthetic clinical data by simulating ASR deletion errors on the transcripts. To test the feasibility of the proposed method, we compare the synthetic data to the real data using traditional machine learning methods. Using a dataset of 50 cognitively impaired and 50 control Dutch speakers, ten additional data points are synthetically generated for each subject, increasing the training size from 100 to 1000 training points. We find that models trained only on synthetic data (AUC=0.77) perform consistently and comparably to models trained on real data (AUC=0.77) in a variety of traditional machine learning scenarios. Additionally, linear models are not able to distinguish between real and synthetic data.
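To make the idea of simulating ASR deletion errors concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each word of a transcript is dropped independently with a fixed probability; the abstract does not state the deletion model or rate, so the function names (`simulate_deletions`, `augment`), the `deletion_rate` default of 0.1, and the Dutch example words are all hypothetical.

```python
import random


def simulate_deletions(transcript, deletion_rate=0.1, rng=None):
    """Return a copy of the transcript with some words removed.

    Assumption: every word is deleted independently with probability
    `deletion_rate`. The actual error model used in the paper may differ.
    """
    rng = rng or random.Random()
    words = transcript.split()
    kept = [w for w in words if rng.random() >= deletion_rate]
    # Avoid returning an empty transcript when the input was non-empty.
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)


def augment(transcript, n_copies=10, deletion_rate=0.1, seed=0):
    """Generate `n_copies` synthetic variants of a single transcript."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    return [simulate_deletions(transcript, deletion_rate, rng)
            for _ in range(n_copies)]


if __name__ == "__main__":
    # Hypothetical response to a verbal fluency task (animal naming).
    real = "hond kat paard koe schaap geit kip eend"
    for variant in augment(real, n_copies=10, deletion_rate=0.2, seed=42):
        print(variant)
```

With `n_copies=10`, each real transcript yields ten synthetic ones, matching the ten additional data points per subject described above. Features would then be extracted from the synthetic transcripts in the same way as from the real ones.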

ki:elements Detects Alzheimer’s Pathology via Automated Phone Call: Study Validates Speech Biomarker Across Five European Cohorts

SAARBRÜCKEN, Germany–(BUSINESS WIRE)–New peer-reviewed research published by ki:elements and the PROSPECT-AD consortium demonstrates that the company’s Speech Biomarker for Cognition (SB-C) can reliably detect cognitive impairment and

Speech-based digital cognitive assessment for clinical trials: Detecting cognitive impairment stages and AD biomarker relations across European cohorts

König et al., 2026.

Enhancing Recruitment in Alzheimer’s Trials Using Speech-Based Cognitive Biomarkers: Preliminary Findings from the RETAIN Study

Kyani et al., 2026.
