Article #1006
Issue: MathAI 2026 Selected Papers (Special Issue)
Received: 02 Apr 2026
Accepted: 15 May 2026
Published: 22 May 2026
Deep Learning for Educational Video Analysis: Benchmarking ASR Systems and Pipeline Optimization
Deep Learning
Automatic Speech Recognition (ASR)
Pipeline Optimization
Cost Optimization
Educational Video
Transcription
Abstract
We present a comparative analysis of eight managed commercial speech recognition services, in which preprocessing, segmentation, and serving are handled on the provider side, for educational video transcription and enrichment. The evaluation covers more than 700 lecture recordings (over 900 hours) across disciplines. The Fireworks whisper-v3-turbo endpoint offers a favorable cost–quality–latency trade-off relative to the surveyed alternatives. Audio preprocessing reduces billed duration by 10–25% with negligible accuracy loss, and a prompt-based “Video Vocabulary” reduces terminology errors without fine-tuning. We implement a parallel pipeline that cuts end-to-end turnaround from over 30 minutes of manual effort per recording to under two minutes, supports up to 50 concurrent jobs, and achieves a roughly 22× speedup. At list prices, transcription plus pedagogical enrichment (summaries, chapter topics, self-check questions) costs about $0.075 per hour of content. The system is deployed in production.
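To make the cost and billing figures above concrete, the following is a minimal arithmetic sketch rather than the authors' implementation. The function names and the 900-hour corpus size used in the example are illustrative assumptions; only the $0.075 per content hour rate and the 10–25% reduction range come from the abstract. The abstract does not state whether the quoted rate already reflects preprocessing savings, so the two calculations are kept separate.

```python
# Illustrative arithmetic only; names are hypothetical, not from the paper.
LIST_PRICE_PER_HOUR = 0.075  # USD per hour of content, as quoted in the abstract


def total_cost(content_hours: float, rate_per_hour: float = LIST_PRICE_PER_HOUR) -> float:
    """Return the cost in USD of processing `content_hours` at a flat hourly rate."""
    if content_hours < 0 or rate_per_hour < 0:
        raise ValueError("hours and rate must be non-negative")
    return content_hours * rate_per_hour


def billed_hours(raw_hours: float, reduction: float) -> float:
    """Return the billed duration after preprocessing removes a fraction of the audio.

    `reduction` is a fraction in [0, 1); the abstract reports 0.10 to 0.25.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return raw_hours * (1.0 - reduction)


if __name__ == "__main__":
    corpus_hours = 900.0  # assumed round figure for the "900+ hours" corpus
    print(f"Cost at list price: ${total_cost(corpus_hours):.2f}")
    for r in (0.10, 0.25):
        print(f"Billed hours at {r:.0%} reduction: {billed_hours(corpus_hours, r):.1f}")
```

Under these assumptions, a flat per-hour rate means cost scales linearly with corpus size, and any reduction in billed duration translates directly into a proportional saving on the duration-billed part of the pipeline.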
Cite this article
Zuev, G.; Kantonistova, E. Deep Learning for Educational Video Analysis: Benchmarking ASR Systems and Pipeline Optimization. Mathematics & AI 2026, 1, 6. https://enigma.ist/j/mathematics-ai/1/2/6