Masich I.S.

Siberian State Aerospace University

named after academician M. F. Reshetnev, Russia

 

Search algorithms for combinatorial optimization in diagnosis and prognosis problems

 

A data classification method based on combinatorics and optimization is considered in the paper. The decision rule bases on the model received by solving combinatorial optimization problems. Search algorithms of pseudo-Boolean optimization are designed and investigated for solving these problems.

The main object of study in this work is the use of algorithms for combinatorial optimization problems in data classification. One of the most promising directions in classification are logical classification algorithms, the principle of which is to identify patterns in data and formalize them into a set of logical rules. The process of formation of decision rules is implemented through solving problems of choosing the best alternative according to some criterion. In some logic algorithms, including decision trees and decision lists, this is done implicitly, using some heuristic procedures. Formalization of this process as a series of combinatorial optimization problems forms a flexible and efficient algorithm of logical analysis for data classification. Construction of effective rules and classification models is a difficult combinatorial problem. The results of the solution are determined by the type of the criteria and constraints that are formed, and by used optimization algorithms. Their effectiveness depends on accuracy and laboriousness of the method of classification.

The investigated data classification algorithm consists of steps, each of which requires decision a series of combinatorial optimization problems [1]. Criteria and constraints in the problems are defined by pseudo-Boolean functions, characterized by the properties of unimodality and monotonicity that distinguish them in a special class of problems where the feasible set is connected. In the general case these functions are defined algorithmically, i.e. evaluated through a specific sequence of operations.

Therefore, the most suitable for solving this problem are the so-called search optimization algorithms that do not require the assignment of functions in explicit form, using algebraic expressions, and use calculations of the functions in points.

For solving this problem a regular pseudo-Boolean optimization algorithm was previously developed [2], which finds the exact optimal solution for a limited time; it is shown that this algorithm implements the informational complexity of this class of problems.

At the same time for multiple solutions of this problem with a large number of variables the most appropriate is using of approximation algorithms [3], designed specifically for this class of problems and based on the behavior of monotone functions, which are the criteria and constraints, in the space of Boolean variables. These optimization algorithms are based on finding the boundary points of the feasible region.

Experimental study of the developed procedure of classification and its comparison with other methods of data classification was carried out on the problem of predicting complications of myocardial infarction (MI) [4]. The challenge is to predict for patients at the hospital the number of complications: atrial fibrillation (AF), ventricular fibrillation (VF), pulmonary edema (PE), heart failure (HF). For this purpose the data sample consisting of 1700 objects, data of which lies in 117 characters, is used.

Comparison of the developed logical classification algorithm (LCA) was carried out with different data classification algorithms: naïve Bayesian classifier, nearest neighbor (IB1), support vector machine (SMO), decision trees (J48), algorithm for constructing 1-rule (1-R) and other classification algorithms. Solving by these algorithms was carried out in the data analysis system WEKA. The results for VF are presented in Table 1.

Table 1

                         Method

Indicator

NaiveBayes

IB1

SMO

1-R

J48

LCA

The number of correctly classified negative objects (%)

89

89

78

78

78

94

The number of correctly classified posirive objects (%)

69

46

77

92

92

100

 

Realized classification procedure is not inferior in accuracy to other algorithms for data classification on the practical problems of forecasting. But it has several important advantages in practical use. First of all, explicitly known the rules by which a decision is made about belonging to a class. This makes the method is interpreted and provides an opportunity to apply it to solve those problems in which the loss from making the wrong decisions can be large, and the decision itself must be justified.

Many methods of pattern recognition (e.g., neural networks, support vector machine), although find a solution with good accuracy, but do not give any explanation about them. However, when using systems of diagnosis and prognosis in practical problems (e.g., problems of medical diagnosis and prognosis), sometimes you want to know why some new observation belongs to a class, how far it is from the “boundary” between the classes, how this decision is justified. In such cases, the method of data classification, which in addition to the solutions of the problem provides an explicit decision rule, that is, detects knowledge from available data.

Literature

1.       Hammer P.L., Bonates T. Logical Analysis of Data: From Combinatorial Optimization to Medical Applications. RUTCOR Research Report 10-2005, 2005.

2.       A.N. Antamoshkin, I.S. Masich. Unimprovable algorithm for monotone pseudo-Boolean function conditional optimization. Engineering & automation problems. V. 6, N. 1, 2008, 71-75.

3.       A.N. Antamoshkin, I.S. Masich. Heuristic search algorithms for monotone pseudo-boolean function conditional optimization. Engineering & automation problems. N. 3, 2007, 41-45.

4.       Complications of myocardial infarction: a database for testing systems of recognition and forecasting / S.E. Golovenkin, A.N. Gorban, V.A. Shulman and others. Krasnoyarsk, Computing Center of RAS, Preprint 6, 1997.