Modern Information Technologies / Computer Engineering and Programming

Kameshova S.S., Master of Natural Sciences.

A. Baytursynov Kostanay State University, Kostanay, Kazakhstan.

AUDIOVISUAL SPEECH SYNTHESIZERS

Visual speech information, used in addition to acoustic information, is very important for better perception and understanding of a speaker's utterances. Obviously, when we look at the face of an interlocutor, it is easier for us to understand his speech. The signals from the visual and acoustic channels duplicate and complement each other, which helps to perceive speech correctly in many complicated situations, for example, under dynamic acoustic noise or when several people speak simultaneously. It is also known that hard-of-hearing and elderly people, as well as non-native speakers of a language, rely to a greater degree on visual information, expressed by the articulation of the lips and facial organs, than on acoustic information. In the complex process of speech understanding, the organs of hearing (the ears) perceive sounds, while the organs of sight (the eyes) see the movements of the lips and facial organs of the speaker, and all this information is united in the human brain into a single representation of the meaning of the utterance. Besides, emotion and intonation in speech can be conveyed by changes in the fundamental frequency of the speech, by raising the eyebrows, by movements or nods of the head, by gestures, or by their combinations. Moreover, the worldwide tendency toward the dynamic development of speech technologies for different languages points to the relevance of organically including visual information as an additional channel for perceiving computer-synthesized speech. Audiovisual (bimodal) speech synthesizers are subdivided into two main types by mode of operation (a simplified code sketch of both pipelines follows the list):

1) Synthesis of audiovisual speech from input text in a given language: printed text (written speech) in a given language is fed to the system input, for example from the keyboard, and is then analyzed by the system and transformed into synthesized audio and video information.

2) Synthesis of the video modality from an audio signal (sometimes also called online speech animation): the user feeds oral speech in any language to the system input, for example via a microphone; the acoustic parameters of this speech are analyzed by the system on the fly and transformed (with a small delay) into video information.
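The two types can be contrasted in a minimal sketch. This is not the implementation described in the paper: the phoneme-to-viseme table and all processing steps below are toy placeholders introduced purely for illustration.

```python
# Toy sketch of the two synthesizer types; the phoneme-to-viseme table and
# the "audio units" are illustrative placeholders, not a real TTS back end.

PHONEME_TO_VISEME = {"p": "bilabial", "b": "bilabial", "a": "open",
                     "o": "rounded", "s": "spread"}  # hypothetical viseme table

def text_driven_synthesis(text):
    """Type 1: printed text -> synchronized audio and video streams."""
    phonemes = [ch for ch in text.lower() if ch in PHONEME_TO_VISEME]
    audio = [("tone", ph) for ph in phonemes]            # stand-in for synthesized audio units
    video = [PHONEME_TO_VISEME[ph] for ph in phonemes]   # viseme sequence for the talking head
    return audio, video

def audio_driven_animation(audio_frames):
    """Type 2: incoming audio frames -> visemes online, with a small delay."""
    for frame in audio_frames:
        ph = frame[-1]                                   # stand-in for acoustic phoneme recognition
        yield PHONEME_TO_VISEME.get(ph, "neutral")

audio, video = text_driven_synthesis("pasta")
print(video)                                  # visemes derived from the text
print(list(audio_driven_animation(audio)))    # the same visemes recovered from the "audio"
```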

Investigations with potential users consisted of two main interdependent parts [1][2]: 1. Analysis and evaluation of the naturalness of the synthesized speech. 2. Analysis and evaluation of the intelligibility of the speech (both unimodal and multimodal) under various acoustic noises. For the sounding-speech experiment, 20 phonetically balanced phrases were selected and afterwards presented to the volunteer participants of the experiment (the informants) in arbitrary order. Each phrase consisted of 4-6 continuously pronounced words; all of the words were well known, but they did not form semantic links with one another, so that each phrase as a whole was meaningless or had only partial sense. This was done so that during testing the informants would manage without a priori semantic knowledge and would rely only on their sense organs: hearing and sight. At the first stage of testing, each informant was asked to listen to a synthesized speech phrase, after which they had to enter the sequence of words they recognized by ear. The subjects then had to perceive the same phrase, but pronounced by the computer system. At this stage the informants also had to rate the naturalness of the synthesis and the quality of the synchronization of the audiovisual signals on a 5-point scale (MOS, mean opinion score) for four methods of synchronization (or of simulating asynchrony) of the audiovisual speech modalities (sketched in code after the list):

1.     With completely synchronous phoneme and viseme streams (the baseline synchronous method).

2.     With the proposed method of simulating the asynchrony of audiovisual speech (the asynchrony modeling method).

3.     With a constant delay of the audio signal relative to the video signal of 150 ms (the B150A method).

4.     With a 150 ms delay of the visual signal relative to the audio signal (the A150B method).
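These four conditions can be expressed as simple timing transforms on the two event streams. The sketch below assumes each modality is a list of (time_ms, unit) events; the 150 ms offsets follow the B150A and A150B definitions above, while the per-event offset in the asynchrony-modeling branch is an invented illustrative value, not the one from the study.

```python
# Sketch of the four synchronization conditions; a modality is assumed to be
# a list of (time_ms, unit) events. The -40 ms offset is illustrative only.

def shift(stream, delay_ms):
    """Delay every event of a modality by delay_ms milliseconds."""
    return [(t + delay_ms, unit) for t, unit in stream]

def apply_method(audio, video, method):
    if method == "synchronous":        # 1: fully aligned phoneme and viseme streams
        return audio, video
    if method == "asynchrony_model":   # 2: the video naturally leads the audio slightly
        return audio, shift(video, -40)
    if method == "B150A":              # 3: audio delayed 150 ms relative to video
        return shift(audio, 150), video
    if method == "A150B":              # 4: video delayed 150 ms relative to audio
        return audio, shift(video, 150)
    raise ValueError(f"unknown method: {method}")

audio = [(0, "p"), (120, "a")]
video = [(0, "bilabial"), (120, "open")]
print(apply_method(audio, video, "B150A"))
```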

Further, the informants had to test the audiovisual speech synthesis with the four methods of synchronization and to estimate the quality and naturalness (the similarity to human-to-human communication) of the synchronization of the audiovisual signals of the synthesized speech (whether or not they are synchronous), using a 5-point scale (Mean Opinion Score, MOS; the highest mark "5" means that the modalities are perfectly synchronized). The informants also had to write down the sequence of words that they distinguished. At the last stage of testing, the testers were asked to listen to the same recorded phrase pronounced by a real voice. Such a cycle with various phrases was repeated 20 times for each examinee. It should also be noted that additive noise (white noise, or the noise of a crowd in which many people speak at the same time) with varying intensity was added to the clean acoustic signal (the signal-to-noise ratio varied from 5 to 25 dB). In total, 10 volunteers from 16 to 35 years old with normal hearing and sight took part in the experiments; before the tests began, the informants were given some time to adapt to the synthesized voice. The test session for each person lasted 30 minutes on average. In total, 800 user estimates of the naturalness of the synthesis and 600 estimates of the intelligibility of the speech were collected. Picture 1 shows the distributions of the user estimates of the four methods of synchronization (on a 5-point scale), averaged over all test phrases for each of the 10 testers. It should be noted that the informants were asked to put down different marks only if they noticed a difference between the methods. Some testers did not use the mark "5" or "2" at all. It was found that all subjects identified the mistiming of the audio and visual speech for the A150B method; two persons out of 10 did not feel a difference in synchronization between the baseline method, the proposed asynchronous method, and B150A; two other informants did not distinguish between B150A and the asynchronous method. The remaining people said that they could tell a difference between all four methods of synchronization.
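Mixing additive white noise into a clean signal at a given signal-to-noise ratio is a standard computation; a minimal sketch follows. The exact noise recordings (for example, the crowd noise) and speech signals used in the study are not reproduced here.

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Mix white noise into `signal` so the mixture has the given SNR in dB."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)                    # signal power
    p_noise = np.mean(noise ** 2)                      # raw noise power
    # scale so that 10 * log10(p_signal / scaled_noise_power) == snr_db
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s, 440 Hz test tone
for snr in (5, 15, 25):                                       # the tested SNR range
    noisy = add_noise(clean, snr)
    measured = 10 * np.log10(np.mean(clean**2) / np.mean((noisy - clean)**2))
    print(f"target {snr} dB -> measured {measured:.1f} dB")
```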

Picture 1 — Distributions of average user estimates of the naturalness of the speech for the 4 methods of synchronization of the modalities of the system

It turned out, quite unexpectedly, that users gave low marks to the baseline method of synchronization, which took only third place in this testing; the majority of informants preferred the proposed method of modeling asynchrony, and one user preferred B150A. Moreover, for all informants the B150A method was much more acceptable than A150B. This confirms the hypothesis that users are tolerant of a natural lead of the video signal but immediately notice a lead of the audio signal. Picture 2 shows the distributions of the user estimates of the four synchronization methods, averaged over all speakers, depending on the signal-to-noise ratio of the audio signal. It is easy to notice that as the signal-to-noise ratio decreases, the user estimates also decrease; this holds for all methods except A150B, which already received very low marks. When users understand the synthesized speech worse, they show a general tendency to give lower marks. Also, at low signal-to-noise ratios the distances between the elements of the presented histogram shrink, which means that users perceive the difference between the synchronization methods less clearly in noisy signals. Possibly this is caused by the fact that under noise it is hard to identify the moments of the beginning and the end of a phrase. The advantage of the proposed method of modeling the asynchrony of the speech modalities under a clean, noise-free signal is obvious.
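The averages behind Pictures 1 and 2 amount to grouping the 5-point marks by method (and by SNR condition) and taking means. A sketch follows, assuming each trial is logged as (tester_id, method, snr_db, mark); the trial records here are invented solely to show the computation and are not the study's data.

```python
from collections import defaultdict
from statistics import mean

# Invented example records, purely to demonstrate the aggregation.
trials = [(1, "synchronous", 25, 4), (1, "A150B", 25, 2),
          (2, "asynchrony_model", 5, 3), (2, "A150B", 5, 1)]

by_method = defaultdict(list)        # pooled over phrases and testers (Picture 1)
by_method_snr = defaultdict(list)    # additionally split by SNR (Picture 2)
for tester, method, snr, mark in trials:
    by_method[method].append(mark)
    by_method_snr[(method, snr)].append(mark)

print({m: mean(v) for m, v in by_method.items()})
print({k: mean(v) for k, v in sorted(by_method_snr.items())})
```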

Picture 2 — Average user estimates of naturalness for the 4 methods of synchronization of the audio and video modalities of the speech, depending on the audio signal-to-noise ratio

Today, audiovisual speech interfaces built with the use of systems for automatic speech recognition and synthesis are relevant in many practical applications, as they increase the intelligibility and naturalness of the synthesized speech and also make it accessible to people with limited sensory capabilities (impairments of sight, speech production, and hearing).

Besides, it is also proposed to use the studied audiovisual speech interfaces in directory systems and self-service terminals (booths) [4][5]. The latter can provide universal access to reference information in municipal organizations of a social orientation (for example, medical institutions: polyclinics, rehabilitation centers; government agencies: departments of social protection of the population, public service centers; educational institutions: libraries for blind and visually impaired people; help systems for obtaining information on railway or air tickets and transport schedules), intended both for ordinary users and for people with disabilities (in particular, blind and visually impaired people). So, for example, a blind person who has come to a polyclinic can use the speech interface to learn the necessary information from a self-service directory terminal. Such a terminal automatically detects the presence of a client in front of it by means of a video detection system and verbally welcomes the person: "Hello! How can I help you?". Then, in response to the client's request (for example, "I need the surgeon"), the terminal will be able to give a detailed voice answer about the doctor's working schedule and office number, as well as navigation information on how to find this office (for example, "The surgeon Ivanov Ivan Ivanovich sees patients in office number 213. You need to turn left, go forward to the end of the wall, turn right, and go forward to the second door"). The speech interface of such a booth must be able to recognize simple client commands and must be intuitive and clear to an untrained user.
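As an illustration of such a dialog, the request-to-answer step of the kiosk could look like the sketch below; the DIRECTORY contents and the handle_client helper are hypothetical, and the actual speech recognition and synthesis components are left out.

```python
# Hypothetical directory data and request handler for the kiosk scenario;
# speech recognition and synthesis front ends are assumed to exist elsewhere.
DIRECTORY = {
    "surgeon": ("Ivanov Ivan Ivanovich", 213,
                "turn left, go forward to the end of the wall, "
                "turn right, and go forward to the second door"),
}

def handle_client(request: str) -> str:
    """Map a recognized client request to a spoken answer."""
    for keyword, (doctor, office, route) in DIRECTORY.items():
        if keyword in request.lower():
            return (f"Doctor {doctor} sees patients in office number {office}. "
                    f"To find the office: {route}.")
    return "Sorry, I did not understand. Please repeat your request."

print(handle_client("I need the surgeon"))
```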

Literature:

1. Karpov, A. Audio-Visual Speech Asynchrony Modeling in a Talking Head / A. Karpov, L. Tsirulnik, Z. Krnoul, A. Ronzhin, B. Lobanov, M. Zelezny // In Proc. 10th International Conference Interspeech'2009, Brighton, UK, 2009. pp. 2911-2914.

2. McGurk, H. Hearing Lips and Seeing Voices / H. McGurk, J. MacDonald // Nature, № 264, 1976. pp. 746-748.