Modern information technologies /
Computer engineering and programming
Kameshova S.S., Master of Natural Sciences
Rauyl Olzhas, first-year student in the specialty "Informatics"
Kostanay State University named after A. Baytursynov, Kostanay, Kazakhstan
Speech signal recognition technology
The introduction of highly complex but highly intelligent information and computer technologies into human activity requires changing how automated systems are controlled, so that they become more convenient and efficient to use. This need is driven most strongly by specific areas of computing in which voice commands are the most suitable means of control. Examples include telephone access to self-service systems and control of a remote computer or a mobile handheld device while driving.
Creating full-fledged speech interfaces that support a spoken "user-computer" dialogue is a very promising but difficult direction in modern computer systems.
Two key problems of speech recognition remain unresolved despite almost half a century of work on them: achieving absolute accuracy on a limited set of commands for at least one speaker's voice, and speaker-independent recognition of continuous speech with any acceptable quality.
There are doubts as to whether both problems can be solved in principle, because even people cannot always fully understand the speech of their interlocutor. While until recently speech was treated as a signal in the range from about 300 to 3500 Hz with certain characteristic properties (for example, pauses between words), from the standpoint of modern technology it is, first and foremost, a signal.
What is speech recognition? You say a phrase, and the technical system responds to it adequately: the machine either executes the command contained in the phrase, types the dictated text, or otherwise makes use of the information extracted from the phrase. Which of these happens depends on the particular implementation.
What, then, is speech? When talking about speech, we must distinguish between such concepts as "speech", "sound speech", "sound signal", "message" and "text".
In our case, as applied to the recognition problem, the concepts "speech" and "sound speech" mean the same thing: a voice message generated by a person, which can be objectively recorded, measured, stored, processed and reproduced by means of instruments and algorithms. The term "message" here can stand for any information useful to the recipient, not just text.
Text, as is well known, consists of letters, words and sentences; it is discrete. Ordinary sound, by contrast, is continuous. Human speech, unlike text, does not consist of letters. If we record the sound of each letter on tape or disk and then try to assemble speech from these sounds, nothing will come of it.
A speech recognition system consists of two parts: acoustic and linguistic. The latter is not linguistic in the strict sense: in general, it may include phonetic, phonological, morphological, lexical, syntactic and semantic models of the language. The acoustic model is responsible for representing the speech signal or, more precisely, for converting it (from the traditional time-domain waveform) into a form in which the information about the content of the spoken message is present more explicitly.
The linguistic model interprets the information received from the acoustic model and is responsible for presenting the recognition result to the consumer, which may be not only a person but also a technical system controlled by speech.
It is difficult to choose a suitable quality indicator for a speech recognition system. The indicator is defined most simply for command systems. During testing, all possible commands are spoken in random order a sufficiently large number of times. The number of correctly recognized commands is counted and divided by the total number of spoken commands.
The result is an estimate of the probability of correct command recognition in the given experiment under the given acoustic conditions. For dictation systems, a similar quality score can be calculated while some test text is dictated. Obviously, this is not always a convenient quality indicator: in practice we are confronted with a wide variety of acoustic situations.
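The command-accuracy estimate described above can be illustrated with a short Python sketch. The function name `command_accuracy` and the test log are hypothetical and serve only as an illustration; the estimate itself is simply the ratio of correctly recognized commands to all spoken commands.

```python
def command_accuracy(spoken, recognized):
    """Estimate the probability of correct recognition for one experiment.

    spoken     -- commands actually pronounced, in test order
    recognized -- what the system returned for each of them
    """
    if len(spoken) != len(recognized):
        raise ValueError("both lists must have the same length")
    if not spoken:
        raise ValueError("at least one test utterance is required")
    correct = sum(1 for s, r in zip(spoken, recognized) if s == r)
    return correct / len(spoken)


# Hypothetical test log: what was said vs. what the system answered.
spoken = ["start", "stop", "left", "right", "stop", "start"]
recognized = ["start", "stop", "left", "left", "stop", "unknown"]

accuracy = command_accuracy(spoken, recognized)
print(f"estimated recognition accuracy: {accuracy:.2f}")
```

Such an estimate is valid only for the speaker and acoustic conditions under which the log was collected, which is why a single number is rarely sufficient.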
And what about a change of speaker and the accompanying retraining of the system? Different systems may require different amounts of tuning, which greatly affects ease of use. The standard way out is to use a multi-criteria, so-called comprehensive, quality index.
As an example, consider a simple command speech recognition system. The operation of the system is based on the hypothesis that the spectral and temporal characteristics of command words vary only slightly for a single speaker.
The acoustic model of the system is a converter of the speech signal into a spectral-time matrix. In the simplest case, commands are located in time by the pauses in the speech signal. The linguistic block is able to detect a limited number of commands plus one additional class, which stands for all other words unknown to the system.
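The acoustic part described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular system: the frame length, hop size, energy threshold, minimum pause length and the naive DFT are all assumptions chosen for brevity, and the test signal is synthetic (two tone bursts separated by silence).

```python
import cmath
import math

SAMPLE_RATE = 8000  # assumed sampling rate, Hz


def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one Hann-windowed frame (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectral_time_matrix(samples, frame_len=64, hop=32):
    """Rows are time frames, columns are frequency bins."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


def find_commands(samples, frame_len=64, hop=32, threshold=0.01, min_pause=3):
    """Return (start_frame, end_frame) pairs, end exclusive.

    A frame is silent when its mean energy is below threshold; a command
    ends once min_pause consecutive silent frames have been seen.
    """
    energies = [sum(x * x for x in f) / frame_len
                for f in frame_signal(samples, frame_len, hop)]
    segments, start, silent = [], None, 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_pause:
                segments.append((start, i - silent + 1))
                start, silent = None, 0
    if start is not None:  # signal ended before a long enough pause
        segments.append((start, len(energies) - silent))
    return segments


def tone(n, freq=1000.0, amp=0.5):
    """Synthetic stand-in for a spoken command: a pure sine burst."""
    return [amp * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


silence = [0.0] * 640
signal = silence + tone(640) + silence + tone(640) + silence

matrix = spectral_time_matrix(signal)
segments = find_commands(signal)
print("frames x bins:", len(matrix), "x", len(matrix[0]))
print("command segments (frames):", segments)
```

The rows of `matrix` between a segment's start and end frames form the spectral-time matrix of one command, which is what the linguistic block then receives as its input sample.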
As a rule, the linguistic model is constructed as an algorithm that searches for the maximum of some functional of the input sample and the samples of the entire "vocabulary" of the system.
Often this functional is the ordinary two-dimensional correlation, although the choice of the dimensionality of the description and of its metric may vary widely from developer to developer. The linguistic blocks of modern systems implement complex models of natural language.
Sometimes such a model is based on the mathematical apparatus of hidden Markov models, and sometimes it uses the latest neural network technology.
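For the simple command system considered above, the search for the best match between the input sample and the vocabulary can be sketched in Python. This is an illustration under stated assumptions: the similarity measure is the normalized two-dimensional correlation coefficient, all spectral-time matrices are assumed to have already been brought to the same size, and the names `correlation_2d` and `recognize`, the rejection threshold of 0.8 and the tiny matrices are hypothetical.

```python
import math


def correlation_2d(a, b):
    """Normalized correlation coefficient of two equal-sized matrices."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys):
        raise ValueError("matrices must have the same shape")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0


def recognize(sample, vocabulary, reject_below=0.8):
    """Return the best-matching command, or "unknown" if none is close enough."""
    scores = {word: correlation_2d(sample, template)
              for word, template in vocabulary.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= reject_below else "unknown"


# Hypothetical vocabulary: one reference matrix per command.
vocabulary = {
    "start": [[1, 5, 1, 0], [1, 6, 1, 0], [0, 5, 2, 0]],
    "stop": [[0, 1, 5, 1], [0, 1, 1, 6], [5, 1, 0, 0]],
}

# A slightly distorted "start" and a pattern that matches nothing.
spoken_start = [[1.2, 4.8, 1.1, 0.1], [0.9, 6.3, 0.8, 0.0], [0.2, 5.1, 1.9, 0.1]]
other_word = [[3, 0, 0, 3], [0, 3, 3, 0], [3, 0, 0, 3]]

print(recognize(spoken_start, vocabulary))
print(recognize(other_word, vocabulary))
```

The "unknown" result plays the role of the extra class for words outside the vocabulary. The threshold trades false acceptances against false rejections and would have to be tuned on real recordings.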
REFERENCES
1. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, no. 2, February 1989.
2. Rabiner, L.R., Juang, B.H. Fundamentals of speech recognition, 1993.