Author (graduate student): 車信璋 (Che, Sin-Jhang)
Thesis title: Potentiality of Prosody in Multimodal Classification of Speech Genres on YouTube (韻律特徵於YouTube言語體裁多模態分類中之潛力)
Advisor: 陳正賢 (Chen, Cheng-Hsien)
Oral defense committee: 甯俐馨 (Ning, Li-Hsin); 郭貞秀 (Kuo, Chen-Hsiu); 陳正賢 (Chen, Cheng-Hsien)
Oral defense date: 2023/10/06
Degree: Master
Department: Department of English
Year of publication: 2023
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 144
Keywords (Chinese): 韻律, 言語體裁, 多模態分類, 體裁分類, 台灣華語YouTube創作內容
Keywords (English): prosody, speech genre, multimodal classification, genre classification, Taiwan Mandarin YouTube content
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202301790
Thesis type: Academic thesis

    This study investigates the prosodic properties of two speech genres (entertaining and informative) in Taiwan Mandarin YouTube content, and the effectiveness of different feature modes in automatic speech genre classification models. We built a corpus of 5,049 utterances, each an inter-pausal unit of speech. Each utterance was recorded with its transcript, its speech genre, four durational features (utterance duration, pause duration, speech rate, and duration-based pairwise variability index [PVI]), and three pitch-related features (f0 mean, f0 range, and f0-based PVI). We also derived textual features from each transcript using a plain TF-IDF weighting. All analyses took the utterance as the unit of analysis. First, we fitted a binary logistic regression model on the seven durational and pitch-related features to identify which prosodic features characterize entertaining and informative speech. Second, we built three types of automatic speech genre classification models (prosody-mode, text-mode, and multimodal, i.e., prosody + text) to examine the potential of prosodic features in speech genre classification and whether combining prosodic and textual features further improves classification performance.
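    The pairwise variability index mentioned above can be illustrated with a short sketch. The abstract does not say which PVI variant the thesis uses, so the code below assumes the normalized form described by Grabe and Low (2002): the mean absolute difference between successive intervals, each divided by the mean of the pair, multiplied by 100. The function name `normalized_pvi` and the sample values are hypothetical. Applying the same formula to per-syllable f0 values is likewise only an assumption about how an f0-based PVI could be computed.

```python
from typing import Sequence


def normalized_pvi(values: Sequence[float]) -> float:
    """Normalized PVI: 100 * mean(|a - b| / ((a + b) / 2)) over successive pairs.

    `values` are positive measurements taken in order within one utterance,
    for example syllable durations in seconds (assumed input format).
    """
    if len(values) < 2:
        raise ValueError("need at least two successive values")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(values, values[1:])]
    return 100 * sum(terms) / len(terms)


# Hypothetical syllable durations (seconds) for two utterances.
even_rhythm = [0.20, 0.20, 0.20, 0.20]    # identical intervals
uneven_rhythm = [0.10, 0.30, 0.10, 0.30]  # alternating short and long

print(normalized_pvi(even_rhythm))    # identical intervals give 0.0
print(normalized_pvi(uneven_rhythm))  # larger value: less regular timing
```

    Under this definition a lower value means that neighbouring intervals are more alike, which is the sense in which a rhythm is described as more isochronous.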
    In the logistic model, six of the seven prosodic features (utterance duration, duration-based PVI, speech rate, f0 mean, f0 range, and f0-based PVI) were statistically significant; pause duration was not. The analysis suggests that, compared with entertaining speech, informative speech tends to have longer utterances, a slower speech rate, a more isochronous rhythm, and more pronounced intonational variation at a lower pitch level. We then used the seven prosodic features to train the prosody-mode and multimodal classification models. The results showed that (1) the prosody-mode model reached an accuracy of 0.733, demonstrating the potential of prosodic features in speech genre classification, and (2) the multimodal (prosody + text) model outperformed both single-mode models, reaching an accuracy of 0.846. We conclude that prosodic features can supply information that textual features lack or underrepresent, and can further improve already high-performing text-mode classification. The inherent multimodality of speech makes it necessary to include both prosodic and textual features in speech genre classification.
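    One way to picture the multimodal (prosody + text) input is as a single feature vector per utterance. The sketch below assumes a textbook TF-IDF weighting (term frequency times log(N / document frequency)) and assumes that the two modes are combined by simple concatenation. The abstract confirms neither detail, and the helper names, tokens, and prosodic values are all hypothetical. Mandarin transcripts would also need word segmentation first, so the example starts from pre-tokenized utterances.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with weights tf * log(N / df).

    `docs` is a list of token lists, one per utterance (assumed pre-tokenized).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of utterances containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid division by zero for empty utterances
        rows.append([(counts[tok] / length) * math.log(n_docs / df[tok])
                     for tok in vocab])
    return vocab, rows


def fuse(prosody_rows, text_rows):
    """Concatenate prosodic and textual features row by row (assumed fusion scheme)."""
    if len(prosody_rows) != len(text_rows):
        raise ValueError("both modes must describe the same utterances")
    return [list(p) + list(t) for p, t in zip(prosody_rows, text_rows)]


# Hypothetical pre-tokenized transcripts for two utterances.
transcripts = [["today", "we", "explain", "tones"],
               ["wow", "we", "love", "this"]]

# Hypothetical values for the seven prosodic features, in an assumed order:
# utterance duration, pause duration, speech rate, duration-based PVI,
# f0 mean, f0 range, f0-based PVI.
prosody_rows = [[2.4, 0.35, 4.1, 38.0, 165.0, 90.0, 12.5],
                [1.1, 0.20, 6.3, 55.0, 210.0, 70.0, 18.0]]

vocab, text_rows = tfidf_vectors(transcripts)
fused = fuse(prosody_rows, text_rows)
print(len(vocab), len(fused[0]))  # vocabulary size, fused vector length
```

    Each fused row could then be passed to any standard classifier. In practice the prosodic columns would normally be standardized first, because their scales differ widely from TF-IDF weights.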

    Chapter 1 Introduction
        1.1 Background
        1.2 Speech Genres on YouTube
        1.3 Research Gaps
        1.4 Research Questions
        1.5 Organization of the Study
    Chapter 2 Literature Review
        2.1 Multimodality of Speech
        2.2 Prosody and Discourse
        2.3 Prosodic Variations in Genres
        2.4 Genre Analysis
            2.4.1 Text-based Analysis
            2.4.2 Acoustic-prosodic Based Analysis
        2.5 Computational Approach to Genre Analysis
            2.5.1 Automatic Classification Models in NLP
            2.5.2 Automatic Genre Classification Models
    Chapter 3 Methodology
        3.1 Corpus Data
        3.2 Acoustic-prosodic Features
            3.2.1 Durational Features
            3.2.2 Pitch-related Features
            3.2.3 Interim Summary
        3.3 Textual Features
        3.4 Experimental Design
            3.4.1 Statistical Analysis on Acoustic-prosodic Features
            3.4.2 Classification Model Training
            3.4.3 Classification Model Evaluation
        3.5 Interim Summary
    Chapter 4 Results
        4.1 Prosodic Features Characterizing Distinct Speech Genres
            4.1.1 Descriptive Statistics
            4.1.2 Binary Logistic Models
        4.2 Speech Genre Classification Using Features of Different Modes
            4.2.1 Prosody-mode Models
            4.2.2 Text-mode Models
            4.2.3 Multimodal Model
            4.2.4 Comparisons of Model Performance
        4.3 Interim Summary
    Chapter 5 Discussion
        5.1 Speech Genres on YouTube from a Prosodic Perspective
            5.1.1 Durational Patterns of Speech Genres on YouTube
            5.1.2 Pitch Patterns of Speech Genres on YouTube
        5.2 Prosodic Features in Multimodal Speech Genre Classification
    Chapter 6 Conclusion
        6.1 Summary
        6.2 Limitations and Future Research
    References
    Appendices

    Wolfson, N. (1982). CHP: The conversational historical present in American English narrative. De Gruyter Mouton.
    Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
    Wolters, M., & Kirsten, M. (1999). Exploring the use of linguistic features in domain and genre classification. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, 142-149.
    Xu, J., Ding, Y., Wang, X., & Wu, Y. (2008). Genre identification of Chinese finance text using machine learning method. Proceedings of the 2008 IEEE International Conference on Systems, Man and Cybernetics, 455-459.
    Xu, Y. (2011). Speech prosody: A methodological review. Journal of Speech Sciences, 1(1), 85-115.
    Xu, Y., Cao, S., Ji, J., Xiao, Q., Wu, A., & Wang, X. (2020). Differentiated prosodic adaption of Chinese and English poetry: An acoustic approach to reading of Chinese Tang poetry and Shakespearean sonnets. Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 211-215.
    Xu, Z., Liu, L., Song, W., & Du, C. (2017). Text genre classification research. Proceedings of the 2017 International Conference on Computer, Information and Telecommunication Systems, 175-178.
    Yang, M., & Ma, W.-Y. (2023). CKIP Transformers (Version 0.3.4). Retrieved from: https://github.com/ckiplab/ckip-transformers
    Yang, Z., Huynh, J., Tabata, R., Cestero, N., Aharoni, T., & Hirschberg, J. (2020). What makes a speaker charismatic? Producing and perceiving charismatic speech. Proceedings of the 10th International Conference on Speech Prosody, 685-689.
    Yule, G. (1980). Speakers' topics and major paratones. Lingua, 52, 33-47.
    Zhang, J. (2018). A comparison of tone normalization methods for language variation research. Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, 823-831.
    Zillmann, D., & Vorderer, P. (2000). Media entertainment: The psychology of its appeal. Routledge.
    Zu Eissen, S. M., & Stein, B. (2004). Genre classification of web pages. Proceedings of the 27th Annual German Conference in AI, 256-269.
