Modern Information Technologies / Computer Engineering and Programming
Kameshova S.S., Master of Natural Sciences
A. Baytursynov Kostanay State University, Kostanay, Kazakhstan
AUDIOVISUAL SPEECH SYNTHESIZERS
Visual speech information, used in addition to acoustic information, is very important for better perception and understanding of a speaker's utterances. Obviously, when looking at the interlocutor's face, it is easier for us to understand his speech. The signals from the visual and acoustic channels duplicate and complement each other, which helps to perceive speech correctly in many complicated situations, for example, under dynamic acoustic noise or when several people speak simultaneously. It is also known that hard-of-hearing and elderly people, as well as non-native speakers of a language, rely to a large degree on visual information, expressed by the articulation of the lips and facial organs, rather than on sound information. In the complex process of speech understanding, a person's organs of hearing (the ears) perceive the sounds, while the organs of sight (the eyes) see the movements of the speaker's lips and face, and all this information is united in the human brain into a single representation of the meaning. Besides that, emotion and intonation in speech can be conveyed by changes in the fundamental frequency of the voice, raised eyebrows, head movements or nods, gestures, or combinations of these. Moreover, the worldwide tendency toward dynamic development of speech technologies for different languages points to the relevance of the organic inclusion of visual information as an additional channel for perceiving computer-synthesized speech. Audiovisual (bimodal) speech synthesizers are subdivided into two main types according to their mode of operation (a toy code sketch of both types follows the list):
1) Synthesis of audiovisual speech from input text in a given language: the system receives printed text (written speech) in a given language, for example from the keyboard, which is then analyzed by the system and transformed into synthesized audio and video information.
2) Synthesis of the video modality from an audio signal (sometimes also called online speech animation): the user supplies oral speech in any language, for example through a microphone, whose acoustic parameters are analyzed by the system on the fly and transformed (with a small delay) into video information.
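A minimal sketch of these two operation modes; every function name and data format here is hypothetical, since the paper does not specify implementations:

```python
def synthesize_from_text(text: str) -> tuple[list[str], list[str]]:
    """Type 1: text in a given language -> synchronized audio and video.

    A real system performs full linguistic analysis; here each word is
    mapped to a dummy phoneme and a matching viseme just to show that
    both output streams are derived from the same text analysis.
    """
    phonemes = [f"ph({w})" for w in text.split()]  # stand-in acoustic units
    visemes = [f"vi({w})" for w in text.split()]   # stand-in visual units
    return phonemes, visemes

def animate_from_audio(audio_frames: list[float]) -> list[str]:
    """Type 2: live audio in any language -> video modality (online
    speech animation), produced with a small processing delay.
    """
    # Stand-in "acoustic analysis": louder frames open the mouth wider.
    return ["open" if f > 0.5 else "closed" for f in audio_frames]

print(synthesize_from_text("hello world"))
print(animate_from_audio([0.1, 0.7, 0.9, 0.2]))
```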
Investigations with potential users were carried out, consisting of two main interdependent parts [1][2]: 1) analysis and evaluation of the naturalness of the synthesized speech; 2) analysis and evaluation of the intelligibility of the speech (both unimodal and multimodal) under various acoustic noise conditions. For the experiment with sounding speech, 20 phonetically balanced phrases were selected and then presented to the volunteer participants of the experiment (the informants) in arbitrary order. Each phrase consisted of 4-6 continuously spoken words, all of them well known but forming no semantic links with one another, so that the phrase as a whole was meaningless or carried only partial sense. This was done so that during testing the informants could not rely on a priori semantic knowledge and were guided only by their senses: hearing and sight.
At the first stage of testing, each informant was asked to listen to a synthesized speech phrase, after which they had to enter the sequence of words they had recognized by ear. The subjects then had to perceive the same phrase again as pronounced by the computer system. At this stage the informants also had to rate the naturalness of the synthesis and the quality of synchronization of the audiovisual signals on a 5-point scale (MOS, mean opinion score) for four methods of synchronizing (or simulating the asynchrony of) the audiovisual speech modalities (a code sketch of these timing conditions follows the list):
1. Completely synchronous streams of phonemes and visemes (the baseline synchronous method).
2. The proposed method of modeling the asynchrony of audiovisual speech (the asynchrony modeling method).
3. A method with a fixed delay of the audio signal relative to the video signal by 150 ms (the B150A method).
4. A method with a delay of the visual signal relative to the audio signal by 150 ms (the A150B method).
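A minimal sketch of the four timing conditions, assuming frame-indexed audio and video streams at a hypothetical 100 frames per second (so 150 ms = 15 frames); the actual asynchrony model is described in [1], so the second branch is only a placeholder:

```python
DELAY = 15  # 150 ms at an assumed 100 frames per second

def apply_condition(audio: list, video: list, method: str) -> tuple[list, list]:
    """Return the (audio, video) streams time-shifted per synchronization method."""
    if method == "sync":         # 1) fully synchronous phoneme and viseme streams
        return audio, video
    if method == "async_model":  # 2) modeled natural asynchrony; the real model
        # from [1] assigns context-dependent offsets, so here we merely advance
        # the video stream by a small constant as a stand-in
        return audio, video[2:] + video[-2:]
    if method == "B150A":        # 3) audio delayed by 150 ms relative to video
        return [None] * DELAY + audio, video + [None] * DELAY
    if method == "A150B":        # 4) video delayed by 150 ms relative to audio
        return audio + [None] * DELAY, [None] * DELAY + video
    raise ValueError(f"unknown method: {method}")

a, v = apply_condition([1, 2, 3], ["x", "y", "z"], "B150A")  # audio lags video
```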
Further, the informants had to test the audiovisual speech synthesis with the four synchronization methods and rate the quality and naturalness (similarity to human-to-human communication) of the synchronization of the audiovisual signals (whether the modalities are synchronous or not) of the synthesized speech, using the 5-point MOS scale (the highest mark "5" means that the modalities are perfectly synchronized). The informants also had to write down the sequence of words they made out. At the last stage of testing, the testers were asked to listen to the same phrase spoken by a real voice. Such a cycle was repeated 20 times for each subject with various phrases. It should also be noted that additive noise (white noise, or the babble of a crowd of many people speaking at once) of varying intensity was added to the clean acoustic signal (the signal-to-noise ratio varied from 5 to 25 dB; a sketch of such mixing follows this paragraph). In total, 10 volunteers aged from 16 to 35 with normal hearing and sight took part in the experiments; before the tests began, the informants were given some time to adapt to the synthesized voice.
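The additive noise mixing can be illustrated with standard signal-to-noise scaling, sketched below with numpy and a synthetic tone standing in for a speech recording (the paper's actual stimuli are not reproduced here):

```python
import numpy as np

def add_noise(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals snr_db, then mix."""
    noise = noise[: len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy "speech" at 16 kHz
white = rng.standard_normal(16000)                          # white-noise masker
noisy = {snr: add_noise(clean, white, snr) for snr in (5, 10, 15, 20, 25)}
```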
A test session lasted on average 30 minutes per person. In total, 800 user ratings of the naturalness of the synthesis and 600 ratings of the intelligibility of the speech were collected. Figure 1 shows the distributions of the user ratings of the four synchronization methods (on the 5-point scale), averaged over all test phrases, for each of the 10 testers (a small aggregation sketch is given after the figure caption). It should be noted that the informants were asked to give different marks only if they noticed a difference between the methods. Some testers did not use the mark "5" or "2" at all. It was found that all subjects identified the mistiming of the audio and visual speech in the A150B method; two of the 10 subjects felt no difference in synchronization between the baseline method, the proposed asynchronous method, and B150A; two other informants did not distinguish B150A from the asynchronous method. The remaining subjects said that they could tell all four synchronization methods apart.
Figure 1 — Distributions of the average user ratings of speech naturalness for the four methods of synchronizing the system's modalities
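The per-tester averages behind Figure 1 amount to a mean over the 20 phrase scores for each tester and method; a sketch with illustrative data (not the paper's actual ratings):

```python
from statistics import mean

METHODS = ("sync", "async_model", "B150A", "A150B")

# scores[tester][method] -> 5-point MOS marks, one per test phrase
scores = {
    "tester01": {"sync": [3, 4, 3], "async_model": [4, 5, 4],
                 "B150A": [4, 4, 3], "A150B": [2, 1, 2]},
}

for tester, marks in scores.items():
    print(tester, {m: round(mean(marks[m]), 2) for m in METHODS})
```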
It turned out, quite unexpectedly, that users gave low marks to the baseline synchronization method, which took only third place in this testing; the majority of informants preferred the proposed asynchrony modeling method, and one user preferred B150A. Moreover, for all informants the B150A method was much more acceptable than A150B. This confirms the hypothesis that users are tolerant of the video signal naturally leading, but immediately notice when the audio signal leads. Figure 2 shows the distributions of the user ratings of the four synchronization methods, averaged over all speakers, as a function of the signal-to-noise ratio of the audio signal. It is easy to see that as the signal-to-noise ratio decreases, the user ratings decrease as well; this holds for all methods except A150B, which had already received very low scores. When users understand the synthesized speech worse, they show a general tendency to give lower marks. Also, at low signal-to-noise ratios the distances between the bars of the histogram shrink, which means that users perceive the difference between the synchronization methods less clearly in noisy signals. This is probably because in noisy conditions it is hard to identify the moments when a phrase begins and ends. The advantage of the proposed method of modeling the asynchrony of the speech modalities under a clean, noise-free signal is obvious.
Figure 2 — Average user ratings of naturalness for the four methods of synchronizing the audio and video modalities of speech, as a function of the audio signal-to-noise ratio
Nowadays, audiovisual speech interfaces built on systems for automatic speech recognition and synthesis are relevant in many practical applications, as they increase the intelligibility and naturalness of the synthesized speech and also make it accessible to people with limited sensory capabilities (impairments of sight, speech production, or hearing).
Besides, it is proposed to use the studied audiovisual speech interfaces in information systems and self-service kiosks (terminals) [4][5]. The latter can provide universal access to reference information in urban organizations of a social orientation (for example, medical institutions: polyclinics, rehabilitation centers; government agencies: social protection departments, public service centers; educational institutions: libraries for blind and visually impaired people; help systems for obtaining information on railway or air tickets and transport schedules), intended both for ordinary users and for people with disabilities (in particular, blind and visually impaired people). So, for example, a blind person who has come to a polyclinic can use the speech interface to obtain the necessary information from a self-service information kiosk. Such a kiosk automatically detects the presence of a client in front of it by means of a video detection system and verbally welcomes the person: "Hello! How can I help you?". Then, in response to the client's request (for example, "I need the surgeon"), the kiosk can give a detailed voice answer about the doctor's working hours and office number, as well as navigation information on how to find the office (for example, "The surgeon Ivanov Ivan Ivanovich sees patients in office number 213. You need to turn left, go forward to the end of the wall, turn right, and go forward to the second door"). The speech interface of such a kiosk has to be able to recognize simple client commands and be intuitive and clear to an unprepared user (a keyword-matching sketch of such command handling follows).
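A minimal sketch of the kiosk's command handling as keyword matching over already-recognized text; a real kiosk would sit behind ASR and TTS front ends, and the single directory entry below is just the example from the text:

```python
DIRECTORY = {
    "surgeon": ("The surgeon Ivanov Ivan Ivanovich sees patients in office "
                "number 213. Turn left, go forward to the end of the wall, "
                "turn right, and go forward to the second door."),
}

def answer(recognized_request: str) -> str:
    """Match a recognized client utterance against the known simple commands."""
    text = recognized_request.lower()
    for keyword, reply in DIRECTORY.items():
        if keyword in text:
            return reply
    return "Sorry, I did not understand you. Could you repeat your request?"

print("Hello! How can I help you?")
print(answer("I need the surgeon"))
```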
Literature:
1. Karpov A., Tsirulnik L., Krnoul Z., Ronzhin A., Lobanov B., Zelezny M. Audio-Visual Speech Asynchrony Modeling in a Talking Head // Proc. 10th International Conference Interspeech'2009, Brighton, UK, 2009, pp. 2911-2914.
2. McGurk H., MacDonald J. Hearing Lips and Seeing Voices // Nature, No. 264, 1976, pp. 746-748.