Implementation of the Bidirectional Encoder Representations from Transformers Algorithm in Speech-to-Text for Meeting Minutes

Authors

  • Abdullah Abdullah, Universitas Islam Negeri Sunan Gunung Djati Bandung
  • Jumadi, Universitas Islam Negeri Sunan Gunung Djati Bandung
  • Deden Firdaus, Universitas Islam Negeri Sunan Gunung Djati Bandung

DOI:

https://doi.org/10.32664/smatika.v15i02.1725

Keywords:

Speech-to-Text, BERT, Meeting Transcription, NLP, Automation.

Abstract

Meeting transcription is a crucial process for organizations, yet it often consumes significant time and resources because discussions must be recorded, understood, and documented accurately by hand. In the digital era, advances in speech processing and natural language understanding provide an opportunity to automate this process. This research focuses on the implementation of the Bidirectional Encoder Representations from Transformers (BERT) algorithm in a Speech-to-Text (STT) system to enhance the accuracy and efficiency of meeting transcription. The study integrates BERT, a deep learning model capable of comprehending bidirectional contextual information, into the transcription pipeline to improve the handling of complex conversational contexts. The research follows a systematic methodology that proceeds from data preprocessing through model training to performance evaluation. Results show that the proposed system achieves high transcription accuracy, demonstrating significant potential for real-world applications in organizational environments. The research also highlights the importance of advanced NLP technologies such as BERT in overcoming the challenges of transcription in multilingual and noisy environments. The developed system reduces manual effort and improves access to meeting documentation, making it a valuable tool for productivity enhancement.
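The pipeline the abstract outlines, an ASR front end followed by BERT-based contextual post-processing, can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: it assumes the Hugging Face transformers library, the openai/whisper-small and bert-base-uncased checkpoints, a hypothetical input file meeting_recording.wav, and an illustrative rescore_token helper. Whisper and BERT are both cited in the reference list, which is why they anchor the sketch.

```python
# Minimal sketch of an STT + BERT pipeline. Illustrative assumptions:
# the model checkpoints, file name, and rescoring strategy are not
# taken from the paper.
from transformers import pipeline

# 1) Speech-to-text front end: a pretrained Whisper checkpoint
#    transcribes the meeting audio into raw text.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
raw_transcript = asr("meeting_recording.wav")["text"]  # hypothetical file

# 2) BERT post-processing: a masked-language-model head rescores an
#    ambiguous word using bidirectional context from the transcript.
mlm = pipeline("fill-mask", model="bert-base-uncased")

def rescore_token(words, idx, top_k=5):
    """Mask the word at position idx and return BERT's context-aware
    candidates with their probabilities."""
    masked = list(words)
    masked[idx] = mlm.tokenizer.mask_token
    # Note: BERT accepts at most 512 tokens; long transcripts should be
    # windowed around idx before rescoring.
    return mlm(" ".join(masked), top_k=top_k)

words = raw_transcript.split()
if len(words) > 3:
    for cand in rescore_token(words, 3):  # inspect the 4th word
        print(cand["token_str"], round(cand["score"], 3))
```

In practice the BERT stage would more plausibly be fine-tuned for punctuation restoration or domain-specific error correction rather than used zero-shot; the masked rescoring above simply shows where bidirectional context enters the pipeline.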

References

D. Karanja, S. Belongie, and S. Soatto, “Audio-Visual Object Detection in Videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

J. Liu et al., “Recent Advances in Speech-to-Text Systems: Challenges and Future Directions,” IEEE Trans. Audio, Speech, Lang. Process., vol. 28, no. 3, pp. 1234–1245, 2023.

D. Xu et al., “Two-Stream Encoders for Semantic Speech Recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1587–1599, 2022.

L. Deng, “Speech Recognition and Understanding: Recent Progress and Future Challenges,” IEEE Signal Process. Mag., vol. 32, no. 2, pp. 20–31, 2015.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Stroudsburg, PA, USA: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.

K. H. Lee, J. Nam, and B. H. Juang, “A Study of Deep Learning Frameworks for Speaker Diarization,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 1322–1334, 2020.

Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.

C. Chiu and B. Chen, “State-of-the-Art Automatic Speech Recognition with Sequence-to-Sequence Models,” International Journal of Speech Technology, vol. 22, no. 4, pp. 503–512, 2019, doi: 10.1007/s10772-019-09573-5.

H. Hadian, M. Hossein, and D. Povey, “Improving Speech Recognition with BERT Embeddings for Acoustic Modeling,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018. doi: 10.1109/ICASSP.2018.8461502.

Z. Zhang, X. Li, and P. Yu, “Transformer-Based Speech Recognition: A Survey,” Journal of Machine Learning Research, vol. 23, no. 1, pp. 1–25, 2022.

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” arXiv preprint arXiv:2212.04356, 2022.

OpenAI, “Whisper, a new ASR engine,” 2023.

T. Kudo and J. Richardson, “SentencePiece: A Simple and Language-Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 66–71. doi: 10.18653/v1/D18-2012.

J. Li, Y. Gong, and L. Jiang, “Audio-Visual Fusion for Object Detection in Videos,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 3, pp. 1026–1039, 2021.

C. Schröer, F. Kruse, and J. M. Gómez, “A systematic literature review on applying CRISP-DM process model,” Procedia Comput. Sci., vol. 181, pp. 526–534, 2021, doi: 10.1016/j.procs.2021.01.199.

H. Xu, W. Li, and Z. Tan, “Speech-to-Text Transcription Based on Deep Learning Models: A Comparative Study,” Journal of Artificial Intelligence Research, vol. 75, pp. 253–271, 2021, doi: 10.1613/jair.1.12345.

W. Li and X. Han, “Adapting BERT for End-to-End Automatic Speech Recognition Tasks,” IEEE Access, vol. 8, pp. 191580–191589, 2020, doi: 10.1109/ACCESS.2020.3031779.

Published

2025-12-17