Modern Information Technologies /

Computer Engineering and Programming

Kameshova S.S., Master of Natural Sciences

Kostanay State University named after A. Baytursynov, Kostanay, Kazakhstan.

THE POSSIBILITY OF USING NEURAL NETWORKS FOR BUILDING A SPEECH RECOGNITION SYSTEM

What is meant by speech recognition? It can be speech-to-text conversion and command execution, or the extraction of various speech characteristics (for example, speaker identification, determination of the speaker's emotional state, gender, age, and so on); in different sources all of these may fall under this definition. In this paper, speech recognition means assigning speech sounds or their sequences (phonemes, letters, words) to some class. This class can then be associated with alphanumeric characters, giving a speech-to-text system, or with certain actions, giving a system that executes voice commands. In general, this way of processing speech information can serve as the first level of a system with a much more complex structure, and the effectiveness of the classifier will determine the performance of the system as a whole.

What problems arise when building a speech recognition system? The main feature of the speech signal is that it varies greatly in many parameters: duration, tempo, pitch of the voice, distortions introduced by the large variability of the human vocal tract, different emotional states of the speaker, and strong differences between the voices of different people. Even two time-domain representations of the same speech sound, produced by the same person and recorded almost at the same time, will not coincide. We must therefore look for parameters of the speech signal that describe it fully (that is, allow one speech sound to be distinguished from another) but remain, to some degree, invariant with respect to the variations of speech described above. The parameters obtained in this way must then be compared with stored samples, and this should not be a simple check for exact coincidence but a search for the best match. This forces us to search for the desired pattern by distance in the parameter space.

Furthermore, the amount of information that the system can store is not unlimited. Recall the almost infinite number of variations of speech signals: obviously, we cannot do without some form of statistical averaging. Another problem is the speed of searching the database: the larger it is, the slower the search. This statement is true, but only for ordinary sequential computers. What kind of machine can solve all of the problems listed above? A neural network. Classification is one of the "favourite" applications of neural networks. Moreover, a neural network can perform classification even when trained without a teacher (the classes formed in this case carry no meaning by themselves, but nothing prevents us from later associating them with other classes representing a different kind of information, that is, giving them meaning).

Any speech signal can be represented as a vector in some parameter space, and this vector can then be stored in a neural network. One neural network model that learns without a teacher is the Kohonen self-organizing feature map (sketched below). In it, ensembles of neurons are formed for the various input signals and come to represent those signals. This algorithm performs statistical averaging, which solves the problem of speech variability. Like many other neural network algorithms, it processes information in parallel, that is, all neurons work simultaneously; this solves the problem of recognition speed, since a neural network usually needs only a few iterations. Furthermore, multilevel hierarchical structures can easily be built from neural networks while remaining transparent (each level can be analysed separately). Since speech is in fact compositional, that is, divided into phrases, words, letters and sounds, it is logical to build the speech recognition system as a hierarchy.

Finally, another important property of neural networks (in my opinion, their most promising property) is their flexible architecture. By this not entirely precise term I mean that the algorithm computed by a neural network is in fact defined by its architecture. Automatic creation of algorithms has been a dream for decades, but the creation of algorithms in programming languages is still done only by humans. Special languages that allow automatic generation of algorithms have been created, but they do not make the task much easier. In a neural network, a new algorithm is generated simply by changing its architecture, and a completely new solution to a problem can be obtained. By introducing a correct selection rule, which determines whether a new neural network solves the problem better or worse, and a rule for modifying the neural network, one can eventually obtain a neural network that solves the problem correctly. Neural network models combined with such a paradigm form the family of genetic algorithms, and the correlation with evolutionary theory is very clear (hence the typical terms: population, genes, parent and child, crossover, mutation). Thus it becomes possible to create neural networks that have not been explored by researchers, or that do not lend themselves to analytical study, but nonetheless successfully solve problems.
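To make the Kohonen self-organizing feature map mentioned above more concrete, here is a minimal sketch of one training step. It is a hypothetical illustration in Python/NumPy, not the code of the described system; the one-dimensional grid of units, the learning rate and the neighbourhood radius are all assumptions made for the example.

```python
import numpy as np

class KohonenMap:
    """Minimal 1-D self-organizing feature map: each unit stores a prototype
    vector, and similar inputs come to be represented by neighbouring units."""

    def __init__(self, n_units, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.random((n_units, dim))  # prototype vectors

    def train_step(self, x, lr=0.1, radius=2.0):
        # 1. Find the best-matching unit (the "winner" of the competition).
        distances = np.linalg.norm(self.weights - x, axis=1)
        winner = int(np.argmin(distances))
        # 2. Move the winner and its neighbours towards the input; this is the
        #    statistical averaging that smooths out the variability of speech.
        grid_dist = np.abs(np.arange(len(self.weights)) - winner)
        h = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
        self.weights += lr * h[:, None] * (x - self.weights)
        return winner
```

Repeating such steps over many input vectors gradually forms the neuronal ensembles described in the text: classes emerge without a teacher and can later be associated with letters, words or actions.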
What distinguishes the work that robots can do from the work that only a human can perform? Robots can have qualities far superior to human abilities: high precision, strength, reaction speed, absence of fatigue. But at the same time they are merely tools in human hands. There are jobs that can only be done by a human and cannot be performed by robots (or would require unreasonably complex robots). The main difference between a human and a robot is the ability to adapt to changing conditions. Of course, almost all robots can operate in several modes and handle exceptional situations, but all of this was originally built in by a human. Thus, the main drawbacks of robots are the lack of autonomy (human control is required) and the lack of adaptation to changing conditions (all possible situations must be anticipated when the robot is designed). Hence the urgency of the problem of creating systems that possess these properties. One way to create an autonomous system capable of adaptation is to give it the ability to learn. Unlike conventional robots, which are created with precomputed properties, such systems will possess a certain degree of universality.

Attempts to create such systems have been made by many researchers, including with the use of neural networks. One example is the prototype of the autonomous integral transport robot (TAIR), created at the Kiev Institute of Cybernetics in the 1970s [1]. This robot was trained to find its way around a certain area and could then be used as a vehicle. In my opinion, such a system should have the following properties:

1. Development of the system consists only in the construction of its architecture. In the process of creating the system, the developer builds only its functional part and does not fill the system with information (or fills it only to a minimal extent); the system acquires the bulk of its information during learning.

2. The ability to control its own actions with subsequent correction. This principle implies the need for a feedback loop [Action] - [Result] - [Correction] in the system. Such chains are very widespread in complex biological organisms and are used at all levels, from the control of muscle contraction at the lowest level to the control of complex behavioural mechanisms.

3. The ability to accumulate knowledge about the objects of the working space. Knowledge about an object is the ability to manipulate its image in memory; that is, the amount of knowledge about an object is determined not only by the set of its properties, but also by information about its interaction with other objects, its behaviour under different actions, its appearance in different states, and so on, in other words by its behaviour in the external environment (for example, knowledge of a geometric object presupposes the ability to predict the shape of its perspective projection for any rotation and lighting). This property gives the system the ability to abstract from real objects, that is, to analyse an object in its absence, and thereby opens up new possibilities for learning.

Autonomous systems

When an integrated set of actions that the system is able to perform is combined with a set of sensors that monitor these actions and the environment, a system endowed with the properties listed above will be able to interact with the outside world at a fairly complex level, that is, to respond adequately to changes in the external environment (provided, of course, that this was included in the system during the training phase). The ability to adjust its behaviour depending on external conditions will partially or completely remove the need for external control, and the system will become autonomous.

To study the features of machine learning, the speech recognition and speech synthesis models were combined into a single system, which made it possible to give it some properties of self-learning systems. This combination is one of the key properties of the new model. What was the reason for it? Firstly, the system acquires the ability to perform actions (synthesis) and to analyse them (recognition), that is, property (2). Secondly, property (1) holds, since no information is built into the system during development; the ability to recognize and synthesize speech sounds is the result of learning. The advantage of our model is the ability to learn synthesis automatically; the mechanism of this training is described below.

Another very important feature is the ability to transfer a memorized image into a new parameter space of much smaller dimension. This feature has not yet been implemented in the developed system and has not been verified in practice, but I will nevertheless try to outline it using speech recognition as an example. Suppose the input signal is given by a vector of primary features in an N-dimensional space. To store such a signal, N elements are required. At the design stage we either do not know the specific structure of the signal, or it is so complicated that it is difficult to take into account; as a result, the representation we use for the signal is redundant. Next, suppose that we are able to synthesize the same signals (that is, to synthesize speech), but the synthesized signal is a function of a parameter vector in an M-dimensional space, with M << N (indeed, the number of parameters of a speech synthesis model is much smaller than the number of primary features in speech recognition models). Then we can memorize the input signal not by its primary features in the N-dimensional space, but by the parameters of the synthesis model in the M-dimensional space. The question arises: how do we translate the signal from one parameter space into the other? There is every reason to believe that this transformation can be carried out by a fairly simple neural network. Moreover, in my opinion, such a memorization mechanism operates in real biological systems, in particular in humans.

The following describes the model of automatic speech recognition and synthesis: the mechanism of sound input into the neural network, the speech synthesis model, the neural network model, and the problems encountered in the model. Sound is input either in real time through the sound card or from Microsoft Wave files encoded in PCM (16 bits, 22050 Hz sampling rate). Working with files is preferable because it allows the neural network to repeat the processing many times, which is especially important during training. Before sound can be fed to the input of the neural network, it must undergo several transformations. Obviously, the time-domain representation of sound is inefficient: it does not reflect the characteristic features of the audio signal. The spectral representation of speech is much more informative. To obtain the spectrum, either a bank of bandpass filters tuned to different frequencies or the discrete Fourier transform is used. The resulting spectrum is then subjected to various transformations, for example a change to a logarithmic scale (both in amplitude and in frequency). This makes it possible to take into account some features of the speech signal: the lower information content of the high-frequency parts of the spectrum, the logarithmic sensitivity of the human ear, and so on.
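As an illustration of the sound-input step described above, the following sketch reads samples from a 16-bit PCM Wave file using the standard Python wave module. The use of Python, the helper name and the file name are assumptions made for the example; the original system reads data through the sound card or from Microsoft Wave files.

```python
import wave
import numpy as np

def read_pcm_wave(path):
    """Read a mono 16-bit PCM Wave file and return samples scaled to [-1.0, 1.0]."""
    with wave.open(path, 'rb') as wav:
        assert wav.getsampwidth() == 2, "the system works only with 16-bit sound"
        rate = wav.getframerate()            # expected to be 22050 Hz
        raw = wav.readframes(wav.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64) / 32768.0
    return samples, rate

# samples, rate = read_pcm_wave("speech.wav")  # hypothetical file name
```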

As a rule, the spectrum alone is not enough for a complete description of the speech signal. Along with the spectral information, information about the dynamics of speech is also needed. To obtain it, delta-parameters are used, which are the time derivatives of the basic parameters. The parameters of the speech signal obtained in this way are considered the primary features, and they represent the signal at the further levels of processing. The process of sound input is shown in Fig. 1:

Figure 1. The sound input process
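To make the delta-parameters described above concrete, here is a minimal sketch of estimating the time derivative of each basic parameter over a sequence of feature vectors. The regression window of plus or minus two frames is an assumption for the example, not a value taken from the paper.

```python
import numpy as np

def delta_parameters(features, k=2):
    """Estimate the time derivative of each basic parameter by linear regression
    over a window of +/- k frames. features: array of shape (n_frames, n_params)."""
    n = len(features)
    padded = np.pad(features, ((k, k), (0, 0)), mode='edge')  # repeat edge frames
    numerator = np.zeros((n, features.shape[1]), dtype=np.float64)
    for t in range(1, k + 1):
        numerator += t * (padded[k + t:k + t + n] - padded[k - t:k - t + n])
    denominator = 2.0 * sum(t * t for t in range(1, k + 1))
    return numerator / denominator
```

The delta vectors are simply appended to the spectral features to form the primary feature vector of each frame.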

When a file is processed, an input window whose size equals the size of the discrete Fourier transform (DFT) window moves along it. The offset of the window relative to its previous position can be adjusted. At each position the window is filled with data (the system works only with sound in which each sample is encoded with 16 bits). When sound is input in real time, it is recorded in blocks of the same size. After the data have been entered into the window, and before the DFT is computed, the Hamming smoothing window is applied to it:

w(i) = 0.54 - 0.46\cos\left(\frac{2\pi i}{N-1}\right), \quad i = 0, \ldots, N-1,   (1)

where N is the size of the DFT window.

Applying the Hamming window slightly reduces the contrast of the spectrum, but it makes it possible to eliminate sharp frequency side lobes (Figure 2); in particular, the harmonic structure of speech is revealed especially well.

Figure 2. The effect of the smoothing Hamming window (logarithmic scale): left, without a smoothing window; right, with the Hamming smoothing window
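A sketch of the framing and windowing step described above, applying the Hamming window of equation (1) to each position of the moving input window. The window size and offset (hop) are adjustable, as the text says; the values below are assumptions made only for the example.

```python
import numpy as np

def hamming_frames(samples, n_fft=512, hop=256):
    """Slide a window of size n_fft (the DFT window) over the signal with an
    adjustable offset (hop) and multiply each frame by the Hamming window (1)."""
    i = np.arange(n_fft)
    hamming = 0.54 - 0.46 * np.cos(2.0 * np.pi * i / (n_fft - 1))
    frames = [samples[s:s + n_fft] * hamming
              for s in range(0, len(samples) - n_fft + 1, hop)]
    return np.array(frames)
```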

After that, the discrete Fourier transform is computed using the fast Fourier transform algorithm ([XX]). From the resulting real and imaginary coefficients the amplitude spectrum and the phase information are obtained. The phase information is discarded, and the energy spectrum is calculated:

E[i] = \mathrm{Re}^2 X[i] + \mathrm{Im}^2 X[i]   (2)

Since the processed data contain no imaginary part, by the properties of the DFT the result is symmetric, that is, E[i] = E[N - i]. Thus, the size of the informative part of the spectrum, N_S, is equal to N/2. All calculations in the neural network are performed with floating-point numbers, and most signals are limited to the range [0.0, 1.0], so the resulting spectrum is normalized to 1.0: each component of the vector is divided by the vector's length:

\|E\| = \sqrt{\sum_{i=0}^{N_S - 1} E[i]^2},   (3)

E'[i] = \frac{E[i]}{\|E\|}, \quad i = 0, \ldots, N_S - 1.   (4)
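Putting equations (2)-(4) together, here is a minimal sketch of the processing of one windowed frame. Reading (3)-(4) as normalization by the Euclidean length of the energy vector is an interpretation of the text, not a formula quoted from it.

```python
import numpy as np

def normalized_energy_spectrum(frame):
    """FFT of a windowed frame -> energy spectrum (2) -> keep the informative
    half N_S = N/2 -> normalize to unit length, equations (3)-(4)."""
    X = np.fft.fft(frame)
    energy = X.real ** 2 + X.imag ** 2        # equation (2), phase discarded
    energy = energy[: len(frame) // 2]        # symmetric for real input, keep N/2 bins
    length = np.sqrt(np.sum(energy ** 2))     # vector length, equation (3)
    return energy / length if length > 0 else energy   # equation (4)
```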

The informativeness of different parts of the spectrum is not the same: the low-frequency region carries more information than the high-frequency region. Therefore, to avoid wasting the inputs of the neural network, the number of elements receiving information from the high-frequency region must be reduced or, equivalently, the high-frequency region of the spectrum must be compressed along the frequency axis. The most common method (owing to its simplicity) is logarithmic compression ([2], "Non-linear frequency scales"):

m = 1127 \ln\left(1 + \frac{f}{700}\right),   (5)

f - frequency in the original spectrum, Hz,

m - frequency in the new, compressed frequency space.

After normalization and compression, the spectrum is fed to the inputs of the neural network. The inputs of the neural network do not perform any significant function; they only pass the signals on into the network. Choosing the number of inputs is a non-trivial task: with a small input vector, information important for recognition may be lost, while with a large one the computational complexity increases considerably (this is true only when simulating on a PC; in real neural networks it is not, because all elements work in parallel). With a high resolution (a large number of inputs), the harmonic structure of speech can be resolved and, as a consequence, the pitch of the voice can be determined. With a low resolution (a small number of inputs), only the formant structure can be determined. As further study of this problem showed, information about the formant structure alone is already sufficient for recognition. Indeed, a person recognizes normal speech and whispered speech equally well, although the latter contains no voice source. The voice source provides additional information in the form of intonation (the change of pitch over the utterance), and this information is very important at the higher levels of speech processing. But as a first approximation we can limit ourselves to extracting the formant structure, and for this purpose, taking into account the compression of the uninformative part of the spectrum, a number of inputs in the range of 50 to 100 is sufficient.
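A sketch of compressing the high-frequency part of the spectrum onto a small number of network inputs. The mel-style warping 1127*ln(1 + f/700) is one common "non-linear frequency scale" from [2] and is used here only as an assumption consistent with equation (5); the number of inputs (64) is likewise just an example within the 50-100 range mentioned above.

```python
import numpy as np

def compress_to_inputs(energy, sample_rate=22050, n_inputs=64):
    """Average the linear-frequency energy spectrum into n_inputs bands that are
    equally spaced on the logarithmically compressed frequency scale (5)."""
    n_bins = len(energy)
    freqs = np.arange(n_bins) * (sample_rate / 2.0) / n_bins   # bin index -> Hz
    warped = 1127.0 * np.log1p(freqs / 700.0)                  # Hz -> compressed scale
    edges = np.linspace(0.0, warped[-1] + 1e-9, n_inputs + 1)
    inputs = np.zeros(n_inputs)
    for j in range(n_inputs):
        band = (warped >= edges[j]) & (warped < edges[j + 1])
        if band.any():
            inputs[j] = energy[band].mean()   # one network input per band
    return inputs
```

Because the warped scale grows slowly at high frequencies, each high-frequency band covers many spectral bins, which is exactly the compression of the less informative region described above.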

The neural network has a fairly simple structure and consists of three layers: an input layer, a symbol layer and an effector layer (Fig. 4). Each neuron of a subsequent layer is connected to all neurons of the previous layer. The transfer function in all layers is linear; in the symbol layer, competition between neurons is simulated.
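A minimal sketch of the three-layer structure described above: fully connected layers with linear transfer functions and winner-take-all competition modelled in the symbol layer. The placement of the competition, the weight initialization and the layer sizes are assumptions; this is an illustration of the described architecture, not the author's implementation.

```python
import numpy as np

class ThreeLayerNet:
    """Input layer -> symbol layer -> effector layer; each neuron is connected
    to every neuron of the previous layer, with linear transfer functions."""

    def __init__(self, n_inputs, n_symbols, n_effectors, seed=0):
        rng = np.random.default_rng(seed)
        self.w_symbol = rng.normal(scale=0.1, size=(n_symbols, n_inputs))
        self.w_effector = rng.normal(scale=0.1, size=(n_effectors, n_symbols))

    def forward(self, x):
        symbols = self.w_symbol @ x           # linear transfer into the symbol layer
        winner = int(np.argmax(symbols))      # competition: the strongest neuron wins
        activity = np.zeros_like(symbols)
        activity[winner] = symbols[winner]    # only the winner passes its signal on
        effectors = self.w_effector @ activity  # linear transfer into the effector layer
        return winner, effectors
```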

Literature:

1. Amosov N.M. et al. Neurocomputers and Intelligent Robots. Kiev: Naukova Dumka, 1991.

2. Speech Analysis FAQ. http://svr-www.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html