Masich I.S.

Siberian State Aerospace University

named after academician M. F. Reshetnev, Russia

 

Pattern search in data for solving practical

problems of recognition

 

À logical method for solving  recognition problem is considered in the paper. A decision making model consists of logical rules that describe patterns in the studied phenomenon or system. The main task is to find these patterns and to make it suitable for decision making support.

Creation and use of logical algorithms for recognition are based on identifying patterns in the original data. A set of these patterns forms a decision function. Search of patterns can be considered as a problem of combinatorial optimization. For more effective solutions the choice of optimization algorithm should be made on the basis of the characteristic features specific to the considered optimization problem. In this paper some properties of optimization problems that solved for finding logical patterns in the data are considered.

Consider the recognition problem for objects described by binary attributes and divided into two classes . The object  is describe by a binary vector  and can be presented as a point in the subcube of binary attribute space .

Denote by  a pattern that cover some object . The variables that fixed in  are equal to the corresponding values ​​of the attributes of the object a. To define the pattern  introduce binary variables :

 

 

Patterns are the elementary building blocks for logical recognition algorithms. Most useful for this purpose are the patterns with the highest coverage (maximum pattern), that is, those for which  is maximum.

The problem of finding the maximum patterns can be written as the problem of finding such values , so that the resulting pattern  covers as many points  as possible and does not cover any points  [1]:

 

                                      (1)

 for all                                       (2)

 

This problem is a problem of constrained pseudo-Boolean optimization, i.e. an optimization problem in which the objective function and the functions in the left side of the constraints are pseudo-Boolean functions – real functions of Boolean variables [2, 3].

Research results show that the search space has sets of constancy of the objective function. That complicates work of optimization algorithms which start search from a feasible point and make search among the neighboring points, since the calculation of the objective function in the system of neighborhoods consisting of neighboring points provides no information on best direction of the search. When solving practical problems of higher dimensions a set of constancy may be such that it covers most of the points of the feasible region. That makes it difficult or impossible to work for such algorithms as the genetic algorithm, multi-start of local search.

The main difficulty in the presence of sets of constancy, that is connected sets, in which the function takes the same value, is the absence of information about what direction should look for the optimal or suboptimal solutions. This concerns the behavior of the objective function (1) and the constraint function (2) in the space of binary variables. One way to improve the situation is use the information not only about the pattern covers objects of the sample, but also use data on the distance to objects uncovered yet.

An experimental comparison of the two algorithms for search rules is made on a number of problems of recognition [4, 5].

The experimental results show that the use of closeness of sampling objects to the pattern allows us to overcome the difficulties that are associated with the characteristic features of the solved optimization problem and manifest themselves in the presence of sets of constancy. That makes possible to find better patterns in the data to be used in solving the problems of recognition.

 

Literature

1.       Hammer P.L., Bonates T. Logical Analysis of Data: From Combinatorial Optimization to Medical Applications. RUTCOR Research Report 10-2005, 2005.

2.       A.N. Antamoshkin, I.S. Masich. Identification of pseudo-Boolean function conditional optimization. Engineering & automation problems. N. 2, 2007, 66-69.

3.       A.N. Antamoshkin, I.S. Masich. Heuristic search algorithms for monotone pseudo-boolean function conditional optimization. Engineering & automation problems. N. 3, 2007, 41-45.

4.       UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/index.html.

5.       Complications of myocardial infarction: a database for testing systems of recognition and forecasting / S.E. Golovenkin, A.N. Gorban, V.A. Shulman and others. Krasnoyarsk, Computing Center of RAS, Preprint 6, 1997.