
Author: Lin, Pei-Chia (林蓓佳)
Thesis title: An Analysis of Prosody in the Poems Produced by Human and Machine (人與機器在現代詩裡所呈現之韻律研究)
Advisor: Ning, Li-Hsin (甯俐馨)
Oral defense committee: Chang, Yung-Hsiang (張詠翔); Chen, Cheng-Hsien (陳正賢); Ning, Li-Hsin (甯俐馨)
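Oral defense date: 2022/07/26
Degree: Master
Department: Department of English
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: English
Number of pages: 67
Chinese keywords: 韻律特徵、現代詩朗讀、停頓、文字轉語音 (prosodic features, modern poetry reading, pause, text-to-speech)
English keywords: prosodic features, poem reading, durational features, pause, Text-to-speech synthesis
Research method: Experimental design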
DOI URL: http://doi.org/10.6345/NTNU202200971
Thesis type: Academic thesis
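  • This study examines the prosody that humans and machines exhibit when reading and generating speech for modern Chinese poems. Previous work has mostly examined the prosodic features of Tang poetry or classical metrical verse; this study analyzes prosody in the reading of modern Chinese poems and further examines how text layout and gender affect prosodic features. We focus on the temporal features of prosody: the number of prosodic units in a recording, prosodic unit duration, the number of pauses, pause location, and pause duration.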

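    To study the prosodic differences between human and machine speech, this study used one text-to-speech system to generate speech and recruited a group of native Mandarin speakers to read aloud for comparison. Two modern poems with different structures served as reading materials, and each was presented to the readers in four different text layouts. After the readers' recordings and the speech files generated by the online speech system were collected and downloaded, they were analyzed according to the prosodic unit labeling principles.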

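    The results show that human speakers produced more and shorter prosodic units, more variable pause locations, and longer pauses. The machine produced fewer but longer prosodic units, predictable pause locations, and shorter pauses. The study also found effects of text layout and gender on prosodic features. Compared with text without punctuation, both human and machine speech showed more pauses in text with punctuation. In text containing separate stanzas or paragraphs, however, only the machine showed longer pauses. The effect of gender on pausing strategy appeared as more pauses in female readers' speech and the omission of pauses in male readers' speech.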

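    The results provide prosodic features that can be applied to speech systems reading modern poetry, and confirm that text layout and gender are closely tied to the prosody of human speech. To improve the prosodic performance of speech synthesis systems across more text types and speaking styles, the features examined in this experiment, such as pause location and pause duration, may serve as important factors in speech system development.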

    Keywords: prosodic features, modern poetry reading, pause, text-to-speech

    This study compares the prosody of human speech and machine-generated speech in reading Mandarin poems. Previous research has described prosodic features in Tang poetry and in poems with classical rhythmic formats. This study instead examines prosody in contemporary Mandarin poems and extends the investigation to the effects of text layout and gender. We focused on the durational features of prosody, namely the number of prosodic units (PUs), PU duration, the number of pauses, pause location, and pause duration in each speech type.

    To examine the prosodic differences between human and machine-generated speech, we used one Text-to-Speech (TTS) system and recruited a group of native Mandarin speakers to read the poems. Two poems with different structures were selected, and each was presented in four different text layouts as reading material. We downloaded the machine-generated speech from the online TTS website and recorded the human speech, then analyzed each speech file with the PU-labeling principles.

    Regarding the differences between human and machine, human speakers produced more and shorter PUs, more flexible pause locations, and longer pauses, while the machine produced fewer and longer PUs, predictable pause locations, and shorter pauses. Text layout also had a clear effect. Both human and machine speech showed more pauses in layouts with punctuation than in text without punctuation, and the machine-generated speech showed longer pauses in text with stanza breaks. Additionally, a gender effect was observed in pausing strategy: female speakers produced more pauses, and male speakers omitted more pauses.

    These findings shed light on how prosodic features can be applied to TTS systems for a poetry reading style, and demonstrate that text layout and gender differences are encoded in the prosody of human speech. For TTS development across more text types and speaking styles, durational features such as pause location and pause duration may be influential factors in improving machine speech.

    Keywords: prosodic features, poem reading, durational features, pause, Text-to-speech synthesis
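
    As an illustration only, the five durational measures named in the abstract could be computed from a labeled recording roughly as follows. This is a minimal sketch, not the thesis's procedure: the `Interval` type, the "PU" and "PAUSE" labels, and the `durational_features` function are hypothetical names, and pause location is approximated here by pause onset time, whereas the thesis may define location relative to the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Interval:
    """One labeled stretch of a recording; times are in seconds."""
    start: float
    end: float
    label: str  # hypothetical labels: "PU" (prosodic unit) or "PAUSE"

    @property
    def duration(self) -> float:
        return self.end - self.start


def durational_features(intervals):
    """Summarize PU and pause counts, durations, and pause onsets."""
    pus = [iv for iv in intervals if iv.label == "PU"]
    pauses = [iv for iv in intervals if iv.label == "PAUSE"]
    return {
        "pu_count": len(pus),
        "mean_pu_duration": mean(iv.duration for iv in pus) if pus else 0.0,
        "pause_count": len(pauses),
        # Assumption: location is stood in for by the pause onset time.
        "pause_locations": [iv.start for iv in pauses],
        "mean_pause_duration": mean(iv.duration for iv in pauses) if pauses else 0.0,
    }


if __name__ == "__main__":
    # Made-up annotation of a five-second recording, for demonstration only.
    sample = [
        Interval(0.0, 1.5, "PU"),
        Interval(1.5, 2.0, "PAUSE"),
        Interval(2.0, 3.0, "PU"),
        Interval(3.0, 3.25, "PAUSE"),
        Interval(3.25, 5.25, "PU"),
    ]
    print(durational_features(sample))
```

    Running the same summary over a human recording and a TTS recording of the same poem and layout would give directly comparable numbers for each measure.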

    ACKNOWLEDGEMENT
    CHINESE ABSTRACT
    ABSTRACT
    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    Chapter 1 Introduction
        1.1 Research Background and Motivation
        1.2 Organization of the Study
    Chapter 2 Literature Review
        2.1 The Importance of Prosody in Text-to-Speech Development and in Speech Communication
            2.1.1 Current Development and Future Perspectives of TTS and Prosody Models
            2.1.2 Manipulations to Achieve Fluent Prosody in Speech Communication
        2.2 Speech Prosody Model for Mandarin
            2.2.1 S. C. Tseng's Principle for Labeling Prosodic Units in Mandarin Spontaneous Speech (2008)
            2.2.2 Implementing the Framework to More Text Types
        2.3 Reading Poetry
            2.3.1 The Role of Prosody in Poetry
            2.3.2 The Manifestation of Prosody in Poems
        2.4 Research Questions
    Chapter 3 Methodology
        3.1 Collecting Speech Data
            3.1.1 Human Speakers and the TTS System
            3.1.2 Materials
            3.1.3 Procedures
        3.2 Data Analysis
    Chapter 4 Results
        4.1 Prosodic Unit (PU)
            4.1.1 The Number of PU
            4.1.2 The Average Duration of PU
        4.2 Pause
            4.2.1 The Number of Pauses
            4.2.2 Pause Location
            4.2.3 The Average Pause Duration
        4.3 Summary
    Chapter 5 Discussion
        5.1 The Differences Between Human and Machine Speech
        5.2 The Effect of Layouts
        5.3 Differences Between Female and Male Speakers
        5.4 Limitations and Future Study
    Chapter 6 Conclusion
    References

    Arias, Juan P., Carlos Busso, and Nestor B. Yoma. 2014. Shape-based modeling of the fundamental frequency contour for emotion detection in speech. Computer Speech and Language 28, 278–294.
    Amazon. (2019, November). Use New Alexa Emotions and Speaking Styles to Create a More Natural and Intuitive Voice Experience. Retrieved January 27, 2022, from https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2019/11/new-alexa-emotions-and-speaking-styles?tag=theverge02-20
    Amazon. (2020, November). Alexa Speaking Styles and Emotions Now Available in Additional Languages. Retrieved February 20, 2022, from https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2020/11/alexa-speaking-styles-emotions-now-available-additional-languages
    Brown, G., K. Currie, and J. Kenworthy. 1980. Questions of Intonation. University Park Press, Baltimore.
    Bolinger, D., 1989. Intonation and its Uses: Melody in Grammar and Discourse. Stanford University Press, Stanford, CA.
    Bartkova, K., Haffner, P., & Larreur, D. 1993. Intensity prediction for speech synthesis in French. In ESCA Workshop on Prosody.
    Black, A. W. 2003. Unit selection and emotional speech. In Interspeech.
    Bänziger, T., and Scherer, K. R. 2005. The role of intonation in emotional expressions. Speech communication, 46(3-4), 252-267.
    Basu, Tulika, and Arup Saha. 2013. Evaluation of prosody in text-to-speech synthesis system of Bangla. In 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation, pp. 1-6. IEEE.
    Brockmann, M., Drinnan, M. J., Storck, C., & Carding, P. N. 2011. Reliable jitter and shimmer measurements in voice clinics: the relevance of vowel, gender, vocal intensity, and fundamental frequency effects in a typical clinical task. Journal of voice, 25(1), 44-53.
    Coleman, R. O. 1976. A comparison of the contributions of two voice quality characteristics to the perception of maleness and femaleness in the voice. Journal of Speech & Hearing Research, 19(1), 168–180.
    Cahn, J. E. 1989. Generating expression in synthesized speech (Doctoral dissertation, Massachusetts Institute of Technology).
    Chen, S. H., Lai, W. H., & Wang, Y. R. 2003. A new duration modeling approach for Mandarin speech. IEEE Transactions on Speech and Audio Processing, 11(4), 308-320.
    Campbell, Nick. 2007. Evaluation of speech synthesis. In Evaluation of text and speech systems (pp. 29-64). Springer, Dordrecht.
    Clopper, C. G., & Smiljanic, R. 2011. Effects of gender and regional dialect on prosodic patterns in American English. Journal of phonetics, 39(2), 237-245.
    Dowhower, S. L. 1991. Speaking of prosody: Fluency's unattended bedfellow. Theory into practice, 30(3), 165-175.
    Dillon, G. L. 1976. Clause, pause, and punctuation in poetry. Linguistics, 169, 5-20.
    Dyson, M. C. 2004. How physical text layout affects reading from screen. Behaviour & information technology, 23(6), 377-393.
    Den Ouden, H., Noordman, L., & Terken, J. 2009. Prosodic realizations of global and local structure and rhetorical relations in read aloud news reports. Speech Communication, 51(2), 116-129.
    Fackrell, Justin, et al. 2000. Prosodic variation with text type. Sixth International Conference on Spoken Language Processing.
    Fant, G., Kruckenberg, A., Ferreira, J. B. 2003. Individual variation in pausing. A study of read speech. PHONUM 9, 193-196.
    Granström, B., & Nord, L. 1992. Neglected dimensions in speech synthesis. Speech Communication, 11(4-5), 459-462.
    Gross, H. S., & McDowell, R. 1996. Sound and form in modern poetry. University of Michigan Press.
    Gustafson-Capkova, S., & Megyesi, B. 2001. A comparative study of pauses in dialogues and read speech. In Seventh European Conference on Speech Communication and Technology.
    Gelfer, M. P., & Mikos, V. A. 2005. The relative contributions of speaking fundamental frequency and formant frequencies to gender identification based on isolated vowels. Journal of voice, 19(4), 544-554.
    Hieke, A. E., Kowal, S., & O'Connell, D. C. 1983. The trouble with “articulatory” pauses. Language and speech, 26(3), 203-214.
    Howell, P., & Kadi-Hanifi, K. 1991. Comparison of prosodic properties between read and spontaneous speech material. Speech communication, 10(2), 163-169.
    Hirschberg, J., Nakatani, C., 1996. A prosodic analysis of discourse segments in direction-giving monologues. In: Proceedings of 34th Annual Meeting––Assoc. Comp. Ling. 286–293.
    Hanauer, David. 1998. The genre-specific hypothesis of reading: Reading poetry and encyclopedic items. Poetics, 26(2), 63-80.
    House, David., Bell, Linda., Gustafson, Kjell., and Johansson, Linn. 1999. Child-directed speech synthesis: Evaluation of prosodic variation for an educational computer program. In Sixth European Conference on Speech Communication and Technology.
    Herman, R., 2000. Phonetic markers of global discourse structures in English. Journal of Phonetics 28, 466–493.
    Hsu, Sheng-Hsiung, and Kuo-Chen Huang. 2000. Interword spacing in Chinese text layout. Perceptual and Motor Skills 91, 2, 355–365.
    Johnson, W. L., Narayanan, S., Whitney, R., Das, R., Bulut, M., & LaBore, C. 2002. Limited domain synthesis of expressive military speech for animated characters. In Proceedings of 2002 IEEE Workshop on Speech Synthesis, pp. 163-166. IEEE.
    Jacewicz, E., Fox, R. A., O'Neill, C., & Salmons, J. 2009. Articulation rate across dialect, age, and gender. Language variation and change, 21(2), 233-256.
    Jacewicz, Ewa, Robert Allen Fox, and Lai Wei. 2010. Between-speaker and within-speaker variation in speech tempo of American English. The Journal of the Acoustical Society of America 128.2: 839-850.
    Kuhn, M. R., & Stahl, S. A. 2003. Fluency: A review of developmental and remedial practices. Journal of Educational Psychology, 95, 3–21.
    Koolagudi, Shashidhar G. and K. Sreenivasa Rao. 2012. Emotion recognition from speech: a review. International Journal of Speech Technology 15, 2, 99–117.
    Lieberman, P., Michaels, S.B., 1962. Some aspects of fundamental frequency and envelope amplitude as related to the emotional content of speech. Journal of the Acoustical Society of America 34, 922–927.
    Lehiste, I., Olive, J. P., & Streeter, L. A. 1975. Role of duration in disambiguating syntactically ambiguous sentences. The Journal of the Acoustical Society of America, 57(S1), S47-S47.
    Lehiste, I. 1979. The perception of duration within sequences of four intervals. Journal of phonetics, 7(4), 313-316.
    Liu, Y.-F. and S.-C. Tseng. 2009. Linguistic patterns detected through a prosodic segmentation in spontaneous Taiwan Mandarin speech. In Tseng, S.-C. (ed.), Linguistic Patterns in Spontaneous Speech, 147-166. Taipei: Institute of Linguistics, Academia Sinica.
    Li, Y., Tao, J., Lai, W., & Xu, X. 2017. Quantitative intonation modeling of interrogative sentences for Mandarin speech synthesis. Speech Communication, 89, 92-102.
    Murray, I. R., & Arnott, J. L. 1993. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. The Journal of the Acoustical Society of America, 93(2), 1097-1108.
    Murray, I. R., & Arnott, J. L. 1995. Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16(4), 369-390.
    Maeda, S., 1976. A Characterization of American English Intonation. Ph.D. thesis, MIT.
    Montaño, R., & Alías, F. 2017. The role of prosody and voice quality in indirect storytelling speech: A cross-narrator perspective in four European languages. Speech Communication, 88, 1-16.
    Microsoft. (2020, April). Introducing new voice styles in Azure Cognitive Services. Retrieved February 5, 2022, from https://techcommunity.microsoft.com/t5/azure-ai-blog/introducing-new-voice-styles-in-azure-cognitive-services/ba-p/1248368
    Pierre-Yves, O. 2003. The production and recognition of emotions in speech: features and algorithms. International Journal of Human-Computer Studies, 59(1-2), 157-183.
    Pfitzinger, H. R. 2006. Five dimensions of prosody: Intensity, intonation, timing, voice quality, and degree of reduction. In Speech Prosody, 40, 6-9.
    Patel, R., & McNab, C. 2011. Displaying prosodic text to enhance expressive oral reading. Speech Communication, 53(3), 431-441.
    Parlikar, A., & Black, A. W. 2012. Modeling pause-duration for style-specific speech synthesis. In Thirteenth Annual Conference of the International Speech Communication Association.
    Pépiot, E. 2014. Male and female speech: a study of mean f0, f0 range, phonation type and speech rate in Parisian French and American English speakers. In Speech Prosody. 305-309.
    Prévot, L., S.-C. Tseng, K. Peshkov, and A. C.-H. Chen. 2015. Processing units in conversation: A comparative study of French and Mandarin data. Language and Linguistics 16(1), 69-92.
    Prateek, N., Łajszczak, M., Barra-Chicote, R., Drugman, T., Lorenzo-Trueba, J., Merritt, T., ... & Wood, T. 2019. In other news: A bi-style text-to-speech model for synthesizing newscaster voice with limited data. arXiv preprint arXiv:1904.02790.
    Rebordao, A. R. F., Shaikh, M. A. M., Hirose, K., & Minematsu, N. 2009. How to improve TTS systems for emotional expressivity. In Tenth Annual Conference of the International Speech Communication Association.
    Roekhaut, S., Goldman, J. P., Simon, A. C. 2010. A model for varying speaking style in TTS systems. In Speech Prosody 2010-Fifth International Conference.
    Sorin, Christel. 1981. Functions, roles, and treatments of intensity in speech. Journal of Phonetics, 9(4), 359-374.
    Swerts, M., Strangert, E., & Heldner, M. 1996. F0 declination in spontaneous and read-aloud speech. In Proceedings of ICSLP, Philadelphia (3), 1501-1504.
    Shih, C. 1997. Declination in Mandarin. In Intonation: Theory, Models and Applications.
    Silva, A., Vala, M., Paiva, A., & Redol, R. A. 2001. The Storyteller: Building a synthetic character that tells stories. In Proc. Workshop Multimodal Communication and Context in Embodied Agents. 53-58.
    Schwanenflugel, P. J., Hamilton, A. M., Kuhn, M. R., Wisenbaker, J. M., & Stahl, S. A. 2004. Becoming a fluent reader: reading skill and prosodic features in the oral reading of young readers. Journal of educational psychology, 96(1), 119.
    Smith, C. L. 2004. Topic transitions and durational prosody in reading aloud: production and modeling. Speech Communication, 42(3-4), 247-270.
    Stevens, C., Lees, N., Vonwiller, J., & Burnham, D. 2005. On-line experimental methods to evaluate text-to-speech (TTS) synthesis: effects of voice gender and signal quality on intelligibility, naturalness and preference. Computer speech & language, 19(2), 129-146.
    Strangert, E. 2005. Prosody in public speech: analyses of a news announcement and a political interview. In Ninth European Conference on Speech Communication and Technology.
    Tseng, C., Pin, S., Lee, Y., 2004. Speech prosody: issues, approaches and implications. From Traditional Phonology to Mandarin Speech Processing, Foreign Language Teaching and Research Press. 417–438.
    Tseng, C. Y., & Lee, Y. L. 2004. Speech rate and prosody units: Evidence of interaction from Mandarin Chinese. In Speech Prosody 2004, International Conference.
    Tseng, Chiu-yu, et al. 2005. Fluent speech prosody: Framework and modeling. Speech communication 46.3-4: 284-309.
    Theune, M., Meijs, K., Heylen, D., & Ordelman, R. 2006. Generating expressive speech for storytelling applications. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1137-1144.
    Tseng, Shu-Chuan. 2006. Linguistic markings of units in spontaneous Mandarin. In Q. Huo et al. (ed.), Lecture Notes in Artificial Intelligence 4272, 43-54. Springer Verlag: Berlin-Heidelberg.
    Tseng, Shu-Chuan. 2008. Spoken corpora and analysis of natural speech. Taiwan Journal of Linguistics 6(2), 1-26.
    Tamuri, K. 2015. Fundamental frequency in Estonian emotional read-out speech. Eesti ja soome-ugri keeleteaduse ajakiri. Journal of Estonian and Finno-Ugric Linguistics, 6(1), 9-21.
    Van de Water, D. A., & O'Connell, D. C. 1985. In and about the poetic line. Bulletin of the Psychonomic Society, 23(4), 397-400.
    Van Donzel, M. 1999. Prosodic aspects of information structure in discourse. Netherlands Graduate School of Linguistics.
    Williams, C. E., & Stevens, K. N. 1972. Emotions and speech: Some acoustical correlates. The Journal of the Acoustical Society of America, 52(4B), 1238-1250.
    Whiteside, S. P. 1996. Temporal-based acoustic-phonetic patterns in read speech: Some evidence for speaker sex differences. Journal of the International Phonetic Association, 26(1), 23-40.
    Wang, Q., Wang, X., Liu, W., & Chen, G. 2021. Predicting the Chinese Poetry Prosodic Based on a Developed BERT Model. In 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), 583-586. IEEE.
    Xu, Y., Cao, S., Ji, J., Xiao, Q., Wu, A., & Wang, X. 2020, December. Differentiated Prosodic Adaption of Chinese and English Poetry: An Acoustic Approach to Reading of Chinese Tang Poetry and Shakespearean Sonnets. In 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 211-215. IEEE.
    Xue, L., Pan, S., He, L., Xie, L., & Soong, F. K. 2021. Cycle consistent network for end-to-end style transfer TTS training. Neural Networks, 140, 223-236.
    Yuan, J., & Liberman, M. 2014. F0 declination in English and Mandarin broadcast news speech. Speech Communication, 65, 67-74.
    Yuan, J., Xu, X., Lai, W., & Liberman, M. 2016. Pauses and pause fillers in Mandarin monologue speech: The effects of sex and proficiency. Proceedings of Speech Prosody 2016, 1167-1170.
    Zvonik, E., & Cummins, F. 2002. Pause duration and variability in read texts. In Seventh International Conference on Spoken Language Processing.
