Modern information technology /2. Computers and Programming

Abylkhassenova D.K.

Almaty University of Energy and Communications

The "Big Data" problem and the prospects for its solution.

The Big Data problem is eternal and elusive. Throughout the history of data management technology there have always been important data that we would like to store and process efficiently, but whose volume makes this a daunting task for existing data management systems. The eternity and elusiveness of the problem are related not only to the constant growth of data volumes, but also to the emergence of needs to store and process new kinds of data for which existing systems are poorly suited (or do not work at all). Because the problem is eternal and elusive, one can hardly count on its complete and final solution. This is bad for users and application developers, but it guarantees permanent employment for future researchers and developers of data management systems.

It is generally considered that database technology (meaning traditional SQL-oriented DBMSs, colloquially called relational) has made it possible to manage transactional and analytical databases efficiently. Transactional databases are designed to support operational transactional applications (various reservation systems, trading systems, etc.). A transactional database mainly contains operational data that reflect the current state of a business or some other area of activity; these data are updated quickly and frequently as various aspects of this activity are carried out. An analytical database contains historical data related to the activities of a particular enterprise, business, or scientific field. These data arrive in the database from different sources, one of which is the corresponding transactional database.

The Big Data problem affects both categories of databases. The volumes of transactional databases grow as the operational needs of users, businesses, or science develop. The volumes of analytical databases grow primarily because of their very nature: data in them only accumulate and are never deleted. Another major reason for the growth of analytical databases is the need of business analysts to draw on new sources of data.

For transactional databases, the particular case of the Big Data problem can be stated as follows: a relatively inexpensive technology is needed for scaling databases and transactional applications that makes it possible to maintain the required speed of transaction processing as the volume of data and the number of concurrently executed transactions grow. For analytical databases, the particular case of the problem looks roughly like this: a relatively inexpensive technology is needed for scaling databases and analytical applications that allows analysts (a) to increase the volume of the database without degrading the performance of analytical queries and (b) to provide effective on-line analytical processing as data volumes grow.

In the first decade of this century, researchers led by Michael Stonebraker, one of the pioneers of database technology, managed to find solutions for both of these particular cases of the problem. Both solutions rest on the following general principles:

1. Move computations as close as possible to the data. This principle means that the database itself and the database applications are organized so as to minimize the transfer of data over the network connecting the nodes of the computing system. Obviously, the importance of this principle increases as data volumes grow. A consequence of the first principle is the need to port database applications (in part or in full) to the server side;

2. Use a shared-nothing architecture. This principle enables genuine parallelization of the DBMS and its applications, since the absence of resources shared between the computing nodes of the system (in practice, a cluster architecture is used) reduces the likelihood of conflicts between the parts of the system and the applications running on different nodes;

3. Partition the data effectively across the nodes of the computing system, with the possibility of replicating them on several nodes. This principle is what provides efficient parallel transaction processing or effective support for on-line analytical processing (a minimal sketch of such partitioning is given after this list).
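To make the second and third principles concrete, here is a minimal sketch, in Python and purely for illustration, of how rows might be hash-partitioned across the nodes of a shared-nothing cluster and replicated on a neighbouring node. The node names, the replication factor, and the helper functions are assumptions of this example, not part of any particular DBMS.

from collections import defaultdict

NODES = ["node0", "node1", "node2", "node3"]   # hypothetical cluster nodes
REPLICAS = 2                                   # primary copy plus one replica

def place(key, nodes=NODES, replicas=REPLICAS):
    # Choose the primary node by hashing the partitioning key,
    # then replicate on the next node(s) around a ring.
    primary = hash(key) % len(nodes)
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]

def distribute(rows, key_attr):
    # Assign every row to its primary node and to its replica node(s).
    placement = defaultdict(list)
    for row in rows:
        for node in place(row[key_attr]):
            placement[node].append(row)
    return placement

orders = [{"customer": "alice", "amount": 10},
          {"customer": "bob", "amount": 25},
          {"customer": "carol", "amount": 7}]
for node, part in sorted(distribute(orders, "customer").items()):
    print(node, part)

Because each node holds only its own partitions, a query or a transaction that touches a single key can be executed entirely on one node, while the replica allows the data to survive the loss of that node.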

It should be said at once that none of the three principles is new; their appearance can rather be dated to the 1980s. For example, the popular and highly effective Teradata parallel DBMS, which has been used successfully for several decades, is based on the shared-nothing principle. However, it is in our time that these three principles have been successfully applied to create truly scalable parallel transactional and analytical DBMSs.

Applying these principles is necessary but not sufficient for implementing either type of system. In each case some additional ideas must be applied. In particular, a parallel transactional DBMS benefits from building on the long-known ideas of in-memory database management and from using advanced replication to ensure data reliability. An analytical system, by contrast, benefits from storing tabular data in external memory column by column, together with support for a variety of redundant data structures.
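The advantage of column-wise storage for analytics can be illustrated with a toy Python example (again only a sketch; the data and layout are invented and not tied to any product): an aggregate over one attribute scans a single contiguous array instead of reading every field of every row.

# Row-oriented layout: each record is stored as a whole.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 45.5},
]

# Column-oriented layout: each attribute is stored as its own array.
columns = {
    "order_id": [1, 2, 3],
    "region":   ["EU", "US", "EU"],
    "amount":   [120.0, 80.0, 45.5],
}

# A query such as SUM(amount) must read whole rows in the row store...
total_from_rows = sum(r["amount"] for r in rows)

# ...but only the "amount" column in the column store.
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns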

So we may assume that solutions to the problem of large transactional and analytical data have taken shape. This does not mean that even these particular kinds of the problem have been fully solved. For example, a parallel transactional DBMS works effectively only with a data distribution that minimizes the number of distributed transactions in the current workload; when the workload changes, the data must be redistributed. A parallel analytical DBMS copes with complex analytical queries only when the partitioning of the data matches the specifics of those queries.
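This dependence on the data distribution can be seen in the following sketch (hypothetical workloads and a made-up hash partitioning): the same partitioning that makes one workload purely local turns another workload into mostly distributed transactions, which is exactly why a change of workload forces a redistribution of the data.

def node_of(key, num_nodes=4):
    # Made-up partitioning rule: hash the key onto one of num_nodes nodes.
    return hash(key) % num_nodes

def distributed_share(transactions, num_nodes=4):
    # Fraction of transactions whose keys land on more than one node.
    distributed = sum(1 for keys in transactions
                      if len({node_of(k, num_nodes) for k in keys}) > 1)
    return distributed / len(transactions)

# Workload A: each transaction touches one customer, so it is always local.
workload_a = [["cust%d" % i] for i in range(1000)]

# Workload B: transfers between pairs of customers, often crossing nodes.
workload_b = [["cust%d" % i, "cust%d" % (i + 500)] for i in range(1000)]

print("single-key workload:", distributed_share(workload_a))
print("pairwise workload:  ", distributed_share(workload_b))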

As already stated, the concept of Big Data is relative. In particular, large transactional data are several orders of magnitude smaller in volume than large analytical data.

It follows from the foregoing that the database community has learned reasonably well how to build horizontally scalable parallel analytical DBMSs capable of supporting the efficient execution of standard analytical queries (for simplicity, we ignore the data redistribution problem noted above). But let us return to the first basic principle, moving computations closer to the data. If the database server provides only basic analytics, then, like it or not, any more or less serious analytical application will have to pull large volumes of analytical data to workstations or, at best, to an intermediate analysis server.

The only way to eliminate this defect is to allow server-side analytics to be extended with new analytical functions supplied by business analysts. At first glance, the corresponding facilities are provided by SQL, which allows users to define their own functions, procedures, and even data types. But SQL does not provide for the parallelization of such programs, whereas pieces of these programs need to be executed in the vicinity of the corresponding pieces of data.

The only common method of parallel programming in a cluster environment is the MPI (Message Passing Interface). In this case, the programmer decides how to arrange the parallel execution of the separate parts of the program on the cluster nodes and how to ensure their interaction to produce the final result. But programming with MPI is very difficult even for professionally trained programmers, let alone business analysts, who are much closer to mathematical statistics than to parallel programming techniques. Most likely, a typical analyst asked to solve a problem in this way would prefer to use analytical software packages on a workstation, which fundamentally ruins the whole idea of horizontally scalable systems. At first glance, the problem seems insoluble, just as the more general problem of providing convenient and efficient parallel programming tools for generalist programmers seems insoluble.
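To show what this difficulty looks like in practice, here is a minimal MPI sketch in Python using the mpi4py package (assuming mpi4py and an MPI runtime are installed; run, for example, with mpiexec -n 4 python sum_mpi.py, where the script name is arbitrary). Even a trivial parallel sum forces the programmer to reason about ranks, explicit scattering of the data, and explicit collection of partial results.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 owns the full data set and must split it into per-process chunks itself.
if rank == 0:
    values = list(range(1000))
    chunks = [values[i::size] for i in range(size)]
else:
    chunks = None

# Every process receives its chunk and computes a partial result...
local_values = comm.scatter(chunks, root=0)
local_sum = sum(local_values)

# ...and the partial results are explicitly combined on rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print("total =", total)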

However, not so long ago an approach was found to solving this problem, at least its first part (credit here is due to the developers of the parallel DBMSs Greenplum and Aster Data). This solution is based on the use of MapReduce technology.

Let us recall that MapReduce technology appeared inside Google as a replacement for a parallel analytical DBMS for solving the company's own analytical problems. The technology quickly gained popularity among practitioners, especially young ones, and at first caused deep resentment in the database community. Authoritative experts in the field claimed that MapReduce was a return to prehistoric times, when solving data management problems required explicit programming, and reproached MapReduce proponents for ignorance and an unreasonable denial of the serious results of previous decades. Most likely, these arguments were and remain correct: MapReduce technology cannot and should not substitute for database technology. But it turned out that this technology can be very useful if it is applied inside a parallel analytical DBMS to support the parallel programming and execution of analytical functions supplied by users.

MapReduce is conceptually much simpler than MPI. The programmer needs to grasp only one idea: the data are first distributed over the cluster nodes and then processed; the result of this processing can again be distributed over the cluster nodes and processed again, and so on. The application programmer need only provide the code of two functions: one that determines how the data are partitioned across the cluster nodes, and one that processes the data of the resulting partitions. Such a programming paradigm is certainly much easier for professional programmers than MPI, but, more importantly, it is also conceptually close to analysts.
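The contrast with the MPI example above can be shown with a small single-process sketch of the MapReduce model (an illustration only; in a real system the framework or the DBMS runs the map and reduce functions in parallel on the nodes that hold the data, which is not modelled here). The analyst supplies just the two functions.

from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Emit (key, value) pairs; here: the sales amount keyed by region.
    yield record["region"], record["amount"]

def reduce_fn(key, values):
    # Combine all values emitted for one key; here: the total per region.
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase (in a real system each node maps its own data partition).
    pairs = [pair for rec in records for pair in map_fn(rec)]
    # Shuffle phase: group the intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase (each key group could be processed on a different node).
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

sales = [{"region": "EU", "amount": 120.0},
         {"region": "US", "amount": 80.0},
         {"region": "EU", "amount": 45.5}]
print(run_mapreduce(sales, map_fn, reduce_fn))   # [('EU', 165.5), ('US', 80.0)]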

It seems that support for MapReduce technology in a parallel analytical DBMS should fully meet the needs of analysts, and that future analytical applications will be server-side applications running in parallel in the vicinity of the data they address. All this means that the horizontal scalability of future analytical systems will be ensured, and thus the Big Data problem can be solved for them as well.

In fact, the new generation of analytical parallel DBMSs provides a capability for parallel programming of analytical applications that is simpler and clearer for non-specialists than the traditional MPI interface. In other words, a more general problem, parallel programming for supercomputers, has in fact been partly solved. The question is: does this approach not deserve wider use than as an analytical extension of a parallel database server? Should we not try to apply MapReduce technology to parallel programming tasks that require the processing of large volumes of data?