Building an Indonesian Text-to-Audiovisual System Based on a Syllable-Based Speech Database to Support Indonesian Pronunciation Learning

  • Arifin Arifin, Dian Nuswantoro University
  • Surya Sumpeno, Institut Teknologi Sepuluh Nopember
  • Mochamad Hariadi, Institut Teknologi Sepuluh Nopember
  • Arry Maulana Syarif, Dian Nuswantoro University
Keywords: Syllable-Based Speech Database, Indonesian Sentences, Indonesian Text-to-Audiovisual System, Viseme (Visual Phoneme)

Abstract

This paper develops an Indonesian text-to-audiovisual system, built on a syllable-based speech database, to support the learning of Indonesian pronunciation. The system visualizes the pronunciation of Indonesian sentences in synchrony with the speech signal. The research proceeds in several stages: forming the Indonesian viseme models, creating the syllable-based speech database, converting the input text into syllables, and synchronization. The synchronization stage combines the viseme models and the speech signal according to the input text. The system was evaluated by 30 respondents, who rated it on the basis of lip-reading. Each respondent assessed 10 Indonesian sentences for how well the visualized syllable pronunciation matched the spoken audio generated from the text input. The Mean Opinion Score (MOS) method was used to average the respondents' ratings. The resulting MOS is 4.24, which indicates good conformity between the visualized syllable pronunciation and the spoken voice.
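
The MOS averaging step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a 1 to 5 rating scale (the abstract does not state the scale), the function name `mean_opinion_score` is hypothetical, and the sample ratings are invented placeholders, not data from the study.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Average every rating given by every respondent.

    ratings: one list per respondent, holding that respondent's score
             for each evaluated sentence.
    low/high: assumed bounds of the rating scale (1 to 5 is an assumption).
    """
    # Flatten the respondent-by-sentence table into a single list of scores.
    all_scores = [score for respondent in ratings for score in respondent]
    if not all_scores:
        raise ValueError("no ratings supplied")
    if any(not low <= score <= high for score in all_scores):
        raise ValueError("rating outside the assumed scale")
    return mean(all_scores)


# Invented example: 3 respondents rating 2 sentences each.
# The study itself used 30 respondents and 10 sentences.
sample_ratings = [
    [4, 5],
    [4, 4],
    [5, 3],
]
sample_mos = mean_opinion_score(sample_ratings)
```

Because every respondent rates the same number of sentences, averaging all scores at once gives the same value as averaging the per-respondent means.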


Published
2018-02-15
Section
Articles