Modern Information Technologies /
Computer Engineering and Programming
Kameshova S.S., Master of Natural Sciences,
Kostanay State University named after A. Baytursynov, Kostanay, Kazakhstan
THE POSSIBILITY OF USING NEURAL NETWORKS TO BUILD A SPEECH RECOGNITION SYSTEM
What is meant by speech recognition? Speech-to-text conversion, execution of voice commands, and extraction of various speech characteristics (e.g., speaker identification, determination of the speaker's emotional state, gender, age, etc.) may all fall under this definition in different sources. In this paper, speech recognition means the assignment of speech sounds or their sequences (phonemes, letters, words) to some class. This class can then be associated with alphanumeric characters, giving a speech-to-text system, or with certain actions, giving a voice-command system. More generally, this method of processing speech information can serve as the first level of a system with a much more complex structure, and the effectiveness of the classifier will determine the performance of the system as a whole.

What problems arise in building a speech recognition system? The main feature of the speech signal is that it varies greatly in many parameters: duration, tempo, voice pitch, the distortions introduced by the large variability of the human vocal tract, the speaker's emotional state, and the strong differences between the voices of different people. Two time-domain representations of the same speech sound, even from the same person recorded at almost the same time, will not match. We must therefore look for parameters of the speech signal that describe it fully (i.e., allow one speech sound to be distinguished from another) but are to some degree invariant with respect to the variations described above. The parameters so obtained must then be compared with stored samples, and this cannot be a simple test for exact coincidence; it has to be a best-fit search, which forces us to look for the nearest stored pattern in the parameter space.
Furthermore, the amount of information a system can store is not unlimited. Recall the almost infinite number of variations of speech signals: obviously, we cannot do without some form of statistical averaging. Another problem is the speed of searching the database: the larger it is, the slower the search. This statement is true, however, only for ordinary sequential computers. What kind of machine, then, can solve all of the above problems? A neural network. Classification is one of the "favorite" applications of neural networks. Moreover, a neural network can perform classification even during unsupervised training (the classes formed this way carry no meaning by themselves, but nothing prevents us from later associating them with other classes representing a different kind of information, in effect giving them meaning). Any speech signal can be represented as a vector in some parameter space, and this vector can then be stored in the neural network.
One model of an unsupervised neural network is the Kohonen self-organizing feature map. In it, ensembles of neurons are formed that represent the variety of input signals. This algorithm performs statistical averaging, which solves the problem of speech variability. Like many other neural network algorithms, it processes information in parallel: all neurons work simultaneously. This solves the problem of recognition speed, since the network usually operates within a few iterations. Furthermore, multilevel hierarchical structures are easily built from neural networks while retaining their transparency (the possibility of analyzing each level separately). Since speech is itself compositional (divided into phrases, words, letters, sounds), it is logical to build the speech recognition system hierarchically. Finally, another important property of neural networks (and, in my opinion, their most promising property) is their flexible architecture. By this not entirely precise term I mean that the algorithm a neural network computes is in fact defined by its architecture. Automatic creation of algorithms has been a dream for decades, but creating algorithms in programming languages is still done only by humans. Special languages have of course been created that allow some automatic generation of algorithms, but they do not make the task much easier. In a neural network, a new algorithm is obtained simply by changing the architecture, and it is thus possible to obtain a completely new solution to a problem.
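The Kohonen training procedure described above can be sketched in a few lines of Python. This is an illustrative one-dimensional map, not the network actually used in this work; the function name, map size, and decay schedules are assumptions made for the example.

```python
import math
import random

def train_som(data, num_neurons, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D Kohonen self-organizing map on `data` (list of vectors).

    Each step finds the best-matching unit (BMU) for an input and pulls
    the BMU and its map neighbours toward that input.  This is the
    'statistical averaging' the text refers to: every neuron ends up
    representing a cluster of similar inputs."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or num_neurons / 2.0
    # Initialize weights with random values in [0, 1].
    weights = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(num_neurons)]
    for epoch in range(epochs):
        frac = epoch / float(epochs)
        lr = lr0 * (1.0 - frac)                    # learning rate decays over time
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks
        for x in data:
            # Best-matching unit: neuron whose weight vector is closest to x.
            bmu = min(range(num_neurons),
                      key=lambda j: sum((weights[j][k] - x[k]) ** 2
                                        for k in range(dim)))
            for j in range(num_neurons):
                d = abs(j - bmu)  # distance on the 1-D map
                h = math.exp(-(d * d) / (2 * radius * radius))
                for k in range(dim):
                    weights[j][k] += lr * h * (x[k] - weights[j][k])
    return weights
```

Trained on vectors drawn from two clusters, some neurons settle near each cluster center, which is exactly the averaging over speech variations that the text relies on.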
By introducing a selection rule that determines whether a new neural network solves the problem better or worse, together with rules for modifying the network, one can eventually obtain a neural network that solves the problem correctly. Neural network models combined with such a paradigm form the family of genetic algorithms. The correlation between genetic algorithms and evolutionary theory is very clear (hence the typical terms: population, genes, parent and child, crossover, mutation). Thus it becomes possible to create neural networks that have not been explored by researchers, or are not amenable to analytical study, but nonetheless successfully solve the problem.

What distinguishes the work done by robots from work a human can perform? Robots can have qualities far superior to human abilities: high precision, strength, reaction speed, absence of fatigue. But at the same time they remain merely tools in human hands. There are jobs that can only be done by a human and cannot be performed by robots (or would require unreasonably complex robots). The main difference between a human and a robot is the ability to adapt to changing conditions. Of course, almost all robots can operate in several modes and handle exceptional situations, but all of this was designed into them in advance by a human. Thus the main drawbacks of robots are the lack of autonomy (human control is required) and the lack of adaptation to changing conditions (all possible situations must be anticipated at design time). The problem of creating systems free of these drawbacks is therefore urgent. One way to create an autonomous system capable of adaptation is to give it the ability to learn. Unlike conventional robots created with precalculated properties, such systems will possess a certain degree of universality.
Attempts to create such systems have been made by many researchers, including with the use of neural networks. One example is the transport autonomous integrated robot (TAIR), a prototype built at the Kiev Institute of Cybernetics in the 1970s [1]. This robot was trained to find its way around a certain area and could then be used as a vehicle. In my opinion, such a system should have the following properties:
1. The development of the system consists only in constructing its architecture. In creating the system, the developer builds only its functional part and does not fill it with information (or fills it only minimally); the system acquires the main part of its information during learning.
2. The ability to monitor its own actions with subsequent correction. This principle implies the need for an [Action] - [Result] - [Correction] feedback loop in the system. Such chains are very common in complex biological organisms and are used at all levels, from the control of muscle contraction at the lowest level to the management of complex behavioral mechanisms.
3. The ability to accumulate knowledge about the objects of its workspace. Knowledge about an object is the ability to manipulate its image in memory; that is, the amount of knowledge about an object is determined not only by the set of its properties but also by information about its interaction with other objects, its behavior under different actions, its appearance in different states, and so on, i.e., by its behavior in the external environment (for example, knowledge of a geometric object presumes the ability to predict the form of its perspective projection under any rotation and lighting). This property gives the system the ability to abstract from real objects, i.e., to analyze an object in its absence, opening up new possibilities in learning.

Autonomous systems
When supplied with an integrated set of actions it is able to perform and a set of sensors that monitor its actions and the environment, a system endowed with the above properties will be able to interact with the outside world at a fairly sophisticated level, i.e., respond adequately to changes in the external environment (provided, of course, that such changes were covered during the training phase). The ability to adjust its behavior depending on external conditions will partially or completely eliminate the need for external control; that is, the system will become autonomous. To study the features of machine learning, models of speech recognition and speech synthesis were combined into a single system, which made it possible to give it some properties of self-learning systems.
This combination is one of the key properties of the new model. What was the reason for it? Firstly, the system gains the ability both to perform actions (synthesis) and to analyze them (recognition), i.e., property (2). Secondly, property (1) holds, because no information is built into the system during development; the ability to recognize and synthesize speech sounds is entirely a result of learning. The advantage of our model is the ability to learn synthesis automatically; the mechanism of this training is described below. Another very important feature is the ability to transfer a memorized image into a new parameter space of much smaller dimension. This feature has not yet been implemented in the developed system and has not been verified in practice, but I will nevertheless try to outline it using speech recognition as an example. Suppose the input signal is given by a vector of primary features in an N-dimensional space; storing such a signal requires N elements. At the design stage we do not know the specifics of the signal, or it is so complicated that it is difficult to take into account, so the representation we use for the signal is redundant. Next, suppose we are able to synthesize the same signals (i.e., synthesize speech), but the synthesized signal is a function of a parameter vector in an M-dimensional space with M << N (indeed, the number of parameters in a speech synthesis model is much smaller than the number of primary features in a speech recognition model). Then we can store the input signal not via its primary features in the N-dimensional space but via the parameters of the synthesis model in the M-dimensional space. The question arises: how do we translate the signal from one parameter space to the other? There is every reason to believe that this transformation can be carried out by a fairly simple neural network. Moreover, in my opinion, such a storage mechanism operates in real biological systems, in particular in humans.

The following describes the model of automatic speech recognition and synthesis: the mechanism of sound input into the neural network, the speech synthesis model, the neural network model itself, and the problems encountered in the model. Sound input is performed in real time through the sound card or from Microsoft Wave files encoded in PCM (16 bit, 22050 Hz sample rate). Working with files is preferable because it allows processing by the neural network to be repeated many times, which is especially important during training. Before sound can be fed to the input of the neural network, it must undergo a number of transformations.
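The proposed re-encoding from N-dimensional primary features to M-dimensional synthesis parameters can be illustrated with the simplest candidate for the "fairly simple neural network" mentioned above: a single linear layer with bias, trained by gradient descent on (feature, parameter) pairs. This is a hypothetical sketch; the text does not specify the network, and every name and hyperparameter here is an illustrative assumption.

```python
import random

def train_linear_map(xs, ys, epochs=200, lr=0.1, seed=0):
    """Fit a linear map W (plus bias b) from N-dim inputs to M-dim outputs
    by per-sample gradient descent on squared error -- a minimal stand-in
    for the network that re-expresses an N-dimensional feature vector in
    the M-dimensional synthesis-parameter space (M << N)."""
    rng = random.Random(seed)
    n, m = len(xs[0]), len(ys[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
    b = [0.0] * m
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = [sum(W[i][j] * x[j] for j in range(n)) + b[i] for i in range(m)]
            err = [pred[i] - y[i] for i in range(m)]
            for i in range(m):
                b[i] -= lr * err[i]
                for j in range(n):
                    W[i][j] -= lr * err[i] * x[j]
    return W, b

def apply_map(W, b, x):
    """Translate one feature vector into the smaller parameter space."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(W))]
```

In a real system the pairs would come from running the synthesis model and analyzing its output, closing the recognition-synthesis loop the paper describes.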
Obviously, the representation of sound in the time domain is inefficient: it does not reflect the characteristic features of the audio signal. A spectral representation of speech is much more informative. To obtain the spectrum, either a bank of bandpass filters tuned to different frequencies or the discrete Fourier transform is used. The resulting spectrum is then subjected to various transformations, such as logarithmic scaling (in amplitude and in frequency). This makes it possible to account for certain features of the speech signal: the lower information content of the high-frequency parts of the spectrum, the logarithmic sensitivity of the human ear, and so on.

As a rule, the spectrum alone does not describe the speech signal completely. Along with the spectral information, information about the dynamics of speech is also needed. To obtain it, delta parameters are used, which are the time derivatives of the basic parameters. The parameters of the speech signal obtained in this way are considered the primary features, and they represent the signal at the subsequent levels of processing. The sound input process is shown in Fig. 1.

Figure 1. Sound input
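The delta parameters just mentioned, being time derivatives of the basic parameters, can be approximated by differences over neighboring analysis windows. The text does not specify the exact derivative estimator, so this sketch assumes one feature vector per window and a simple central difference.

```python
def delta_parameters(frames):
    """Approximate time derivatives of the primary features.

    `frames` is a list of equal-length feature vectors, one per analysis
    window.  Interior frames use a central difference
    (next - previous) / 2; edge frames fall back to a one-sided
    difference by clamping the index."""
    deltas = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        deltas.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return deltas
```

The delta vectors are simply appended to the spectral vectors to form the full primary-feature representation.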
When a file is processed, an input window of size equal to the window of the discrete Fourier transform (DFT) moves along it; the offset relative to the previous window position can be adjusted. At each position the window is filled with data (the system only works with sound in which each sample is encoded with 16 bits). When sound is entered in real time, blocks of the same size are recorded. After the data are entered, and before the DFT is computed, a Hamming smoothing window is superimposed on them:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), n = 0, ..., N - 1,

where N is the size of the DFT window. The Hamming window slightly reduces the contrast of the spectrum, but it eliminates the sharp frequency sidelobes caused by the abrupt window edges (Figure 2), so that the harmonic structure of speech shows up particularly well.
Figure 2. The effect of Hamming window smoothing (logarithmic scale): spectrum without a smoothing window and with a Hamming window
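The windowing and transform steps can be sketched as follows. The Hamming formula is the standard one quoted above; for clarity the sketch computes a plain DFT rather than the FFT the system uses (the result is identical, only slower).

```python
import cmath
import math

def hamming(N):
    """Standard Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def dft(samples):
    """Plain discrete Fourier transform; an FFT gives the same result faster."""
    N = len(samples)
    return [sum(samples[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

def windowed_spectrum(samples):
    """Apply the Hamming window to one frame, then compute its DFT."""
    w = hamming(len(samples))
    return dft([s * wn for s, wn in zip(samples, w)])
```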
After that, the discrete Fourier transform is computed using the fast Fourier transform algorithm ([XX]). The result is a set of real and imaginary coefficients carrying amplitude and phase information. The phase information is discarded and the energy spectrum is calculated:

E[i] = Re(X[i])^2 + Im(X[i])^2

Since the processed data contain no imaginary part, by the properties of the DFT the result is symmetric, i.e., E[i] = E[N - i]. Thus the size of the informative part of the spectrum, NS, is equal to N/2. All neural network calculations are performed in floating-point numbers, and most signals are limited to the range [0.0, 1.0], so the resulting spectrum is normalized to unit length: each component of the vector is divided by the vector's length:

E[i] -> E[i] / sqrt(E[0]^2 + ... + E[NS-1]^2)
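The two operations just described, discarding phase to obtain the energy spectrum and normalizing the vector to unit length, look like this in Python (function names are illustrative):

```python
import math

def energy_spectrum(X):
    """E[i] = Re(X[i])^2 + Im(X[i])^2.  Only the first N/2 bins are kept,
    since for a real-valued input E[i] = E[N-i]."""
    N = len(X)
    return [x.real ** 2 + x.imag ** 2 for x in X[:N // 2]]

def normalize(vec):
    """Divide every component by the vector's Euclidean length, so the
    non-negative energy components all land in [0.0, 1.0]."""
    length = math.sqrt(sum(v * v for v in vec))
    if length == 0.0:
        return list(vec)
    return [v / length for v in vec]
```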
The informativeness of different parts of the spectrum is not the same: the low-frequency region carries more information than the high-frequency region. Therefore, to avoid wasting neural network inputs, the number of elements receiving information from the high-frequency region must be reduced or, equivalently, the high-frequency region of the spectrum must be compressed along the frequency axis. The most common method (owing to its simplicity) is logarithmic compression (see [2], "Non-linear frequency scales"):

m = f(frequency), with f logarithmic in frequency,

where f is the frequency of the original spectrum in Hz and m is the frequency in the new, compressed frequency space.
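The exact compression formula was lost from the text. Assuming it is the widely used mel scale, one of the non-linear frequency scales described in [2], the conversion and its inverse are:

```python
import math

def hz_to_mel(f):
    """Common mel-scale formula: m = 2595 * log10(1 + f / 700).
    Roughly linear below ~1 kHz and logarithmic above, so the
    high-frequency region is compressed."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```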
After normalization and compression, the spectrum is fed to the input of the neural network. The inputs of the neural network do not perform any processing; they only pass the signals on into the network. Choosing the number of inputs is a difficult task: with a small input vector, information important for recognition may be lost, while with a large one the computational complexity grows significantly (this is true only when simulating on a PC; in real neural networks all elements work in parallel, so this is not an issue). With a high resolution (many inputs), the harmonic structure of speech can be resolved and, as a consequence, the voice pitch can be determined. With a low resolution (few inputs), only the formant structure can be determined. As further study of this problem showed, information about the formant structure alone is already sufficient for recognition. Indeed, a person recognizes normal speech and whispered speech equally well, although the latter has no voice source. The voice source provides additional information in the form of intonation (the change of pitch over the utterance), and this information is very important at the higher levels of speech processing. But in a first approximation we can confine ourselves to the formant structure, and for this purpose, taking into account the compression of the uninformative part of the spectrum, a number of inputs in the range of 50 to 100 is sufficient.
The neural network has a fairly simple structure and consists of three layers: an input layer, a symbol layer, and an effector layer (Fig. 4). Each neuron of a subsequent layer is connected to all neurons of the previous layer. The transfer function in all layers is linear, and competition is simulated in the input layer.
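With linear transfer functions, the three-layer structure just described amounts to two matrix-vector products. The sketch below shows only the feed-forward pass; the competition mechanism and the actual weights are not specified here, so the function and parameter names are illustrative.

```python
def layer(inputs, weights):
    """Fully connected linear layer: each output neuron is a weighted sum
    over all inputs of the previous layer (linear transfer function)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def forward(spectrum, w_symbol, w_effector):
    """Feed-forward pass through the described structure.  The input
    layer passes the normalized spectrum through unchanged; the symbol
    and effector layers are fully connected linear layers."""
    symbols = layer(spectrum, w_symbol)
    return layer(symbols, w_effector)
```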
1. Amosov N.M. et al. Neurocomputers and Intelligent Robots. Kiev: Naukova Dumka, 1991.
2. Speech Analysis FAQ - http://svr-www.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html