Abstract as per original application (English/Chinese):
Sound is arguably the primary (and often only) medium by which spoken language is conveyed. This allows communication to proceed when the speaker is obscured, whether over the phone, in the dark, at a distance, or when wearing a face mask. At the same time, vision and other types of non-auditory perception are also important. Spoken language is often accompanied by facial expressions and manual gestures, and the ability to see a speaker’s face and mouth is known to influence how speech sounds are recognized. Sign languages, moreover, are transmitted through vision alone. This demonstrates that neither sight nor sound is strictly required for language, nor is one modality linguistically superior. Rather, human communication is inherently multimodal: Speakers and listeners maximally exploit auditory and visual information to perceive and produce language, enhancing its robustness. Determining how these cues trade off against or reinforce one another is essential for understanding how language is optimized for efficient communication, how speech sounds are organized in the mind, and how phonological systems change over time.
Prior audiovisual speech research has focused mostly on the listener, testing how non-auditory information influences the identification of speech sounds, and often relies on incongruous perceptual illusions that cannot occur in actual speech. The novel approach taken here is to examine how speakers actively modify their speech to enhance its visibility and how the resulting array of (real) visual speech cues affects listener perception. Three experiments explore the production and perception of English sounds that are optionally produced with visible tongue or lip gestures, including extreme variants of /l/ that have not been systematically studied. In a production experiment, speakers will interact with a virtual speaking partner in clear, audio-degraded, and video-degraded conditions, testing whether hyperarticulated speech arises specifically to benefit the listener or instead reflects greater overall speaking effort. Simultaneous acoustic, ultrasound tongue imaging, and 3D motion capture data will be collected to understand how the auditory-acoustic and visual-articulatory characteristics of these sounds are altered by audiovisual enhancement. Participants in two types of perception experiments will then identify sounds produced with varying types of visible and non-visible gestures, in clear and noisy conditions. In one experiment, eye-tracking will reveal whether listeners anticipate visible gestures by attending more closely to mouth movements for potentially confusable sounds. Together, these experiments will inform theories of clear speech and hyperarticulation, sound change and the maintenance of phonological contrast, and adaptive communication.
		
聲音可謂口語傳遞主要的且往往是唯一的媒介。這使得即使在說話者被掩蔽的情況下，無論是在電話中、黑暗中、遠處還是佩戴口罩時，交流也能夠繼續。與此同時，視覺及其他非聽覺的感知也同樣重要。口語經常伴隨著表情和手勢，且研究表明，聽話者是否能看到說話者的面部和口部將影響語音的識別。此外，手語則完全通過視覺進行傳播。這表明，視覺和聲音並非語言的必要條件，就語言本質而言，並沒有哪一個模態更加優越。相反，人類的交流本質上是多模態的：說話者和聽話者會最大程度地利用聽覺和視覺的信息來感知和產出語言，以增強其穩健性。釐清這些線索如何互相制衡與彼此強化，對理解以下三個問題至關重要：1）語言如何優化以實現高效交流？2）語音如何在大腦中進行組織？3）音位系統如何隨時間變化？
此前的視聽言語研究大多聚焦於聽話者，主要探究非聽覺信息如何影響語音識別，且經常依賴於現實言語中不可能出現的非協調感知錯覺。本項目的創新之處在於，分析說話者如何主動調整發音來提高視覺辨識度，並探究由此產生的一系列真實的視覺言語線索如何影響聽話者的感知。三項實驗將探究特定英語音素的產出和感知，這些音素可以選擇性地通過可見的舌頭和嘴唇姿勢來發出，其中包括了未被系統研究過的/l/音極端變體。在產出實驗中，說話者將與虛擬的交流對象，在清晰、音頻降質和視頻降質的三種條件下互動，以此探究發音增強言語的產生，是專為裨益聽話者，還是源於自身整體說話努力程度的提高。實驗將同步收集聲學、超聲舌成像和3D動作捕捉數據，以理解這些聲音的聽覺-聲學和視覺-發音特性將如何在視聽增強的作用下改變。隨後，兩項感知實驗的被試將在清晰及嘈雜環境中，識別由不同類型的可見及不可見發音姿勢產出的聲音。其中一個實驗將使用眼動追蹤，以揭示對於可能混淆的聲音，聽話者是否會通過更密切地關注嘴部運動來預判可見的發音姿勢。這些實驗將共同為清晰言語、發音增強、音變、音位對立保持和自適應交流作出理論貢獻。