Modern information technology / 2. Computers and Programming
Abylkhassenova D.K.
Almaty University of Energy and Communications
The problem of "Big Data" and the prospects for its solution
The Big Data problem is eternal and elusive. Throughout the history of data management technology there has always been important data that we would like to store and process efficiently, but whose volume makes this a daunting task for existing data management systems. The problem is eternal and elusive not only because data volumes grow constantly, but also because new types of data keep appearing that need to be stored and processed and for which existing systems are ill-suited (or do not work at all). Being eternal and elusive, the problem can hardly count on a full and final solution. This is bad for users and application developers, but it guarantees permanent employment for researchers and developers of data management systems.
It is generally accepted that database technology (meaning traditional SQL-oriented, colloquially called relational, DBMSs) has made it possible to manage transactional and analytical databases efficiently. A transactional database is designed to support operational transactional applications (booking systems, trading systems, etc.). It mainly contains operational data that reflects the current state of a business or other area of activity and that is quickly and frequently updated to keep pace with various aspects of that activity. An analytical database contains historical data related to the activities of a particular enterprise, some business, or some scientific field. This data comes into the database from different sources, one of which is the corresponding transactional database.
The Big Data problem affects both categories of databases. The volumes of transactional databases grow because of the evolving operational needs of users, business, or science. The volumes of analytical databases grow primarily because of their very nature: data in them only ever accumulates and is never deleted. Another major reason for the growth of analytical databases is the need of business analysts to bring in new sources of data.
For transactional databases the particular case of the Big Data problem can be stated as follows: a relatively inexpensive technology is needed for scaling databases and transactional applications that makes it possible to maintain the required transaction processing speed as the volume of data and the number of simultaneously executed transactions grow. For analytical databases the particular case of the problem looks roughly like this: a relatively inexpensive technology is needed for scaling databases and analytical applications that allows analysts (a) to expand the database server's capacity for executing analytic queries and (b) to obtain effective online analytical processing of the data as its volume grows.
In the first decade of the new century, researchers led by Michael Stonebraker, one of the pioneers of database technology, managed to find ways to solve both particular cases of the problem. Both solutions rest on the following general principles:
1. Move computation as close to the data as possible. This principle means that the DBMS itself and the database applications are organized so as to minimize the transfer of data over the network connecting the nodes of the computing system. Obviously, the importance of this principle increases as data volumes grow. A consequence of this first principle is the need to port database applications (in part or in full) to the server side;
2. Use an architecture without any shared resources (shared nothing). This principle makes genuine parallelization of the DBMS and of applications possible, since the absence of resources shared between the computing nodes of the system (in practice a cluster architecture is used) reduces the likelihood of conflicts between parts of the system and between applications running on different nodes;
3. Partition the data effectively across the nodes of the computing system, with the possibility of replicating it on several nodes. This principle provides efficient parallel processing of transactions and effective support for online analytical processing (a simple partitioning sketch is given after this list).
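To make the second and third principles more concrete, here is a minimal sketch in Python, with hypothetical node names and a toy "orders" table chosen purely for illustration, showing how rows might be hash-partitioned across shared-nothing nodes and replicated on a neighbouring node, so that each node can process its own partition without touching any shared resource:

    import hashlib

    NODES = ["node0", "node1", "node2", "node3"]   # hypothetical cluster nodes
    REPLICAS = 2                                   # each row is kept on two nodes

    def place(key):
        """Choose the primary node by hashing the partitioning key,
        and the next node in ring order as the replica."""
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        primary = h % len(NODES)
        return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]

    # A toy "orders" table, partitioned by customer id:
    orders = [{"customer": "c17", "amount": 250},
              {"customer": "c42", "amount": 90}]
    placement = {row["customer"]: place(row["customer"]) for row in orders}
    print(placement)   # each customer's rows land on two specific nodes

In a real parallel DBMS the partitioning function and the replication factor would of course be chosen to match the expected workload; the sketch only illustrates the idea that placement is deterministic and requires no shared state.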
It should be said right away that none of these three principles is new; their appearance can be dated to the 1980s. For example, the shared-nothing principle underlies the popular and highly effective parallel DBMS Teradata, which has been used successfully for several decades. However, it is in our time that these three principles have been successfully applied to create truly scalable parallel transactional and analytical DBMSs.
The application of these principles is necessary but not sufficient for building systems of both types; in each case some additional ideas have to be applied. In particular, a parallel transactional DBMS benefits from relying on the long-known ideas of in-main-memory database management and from using advanced replication to ensure data reliability. For an analytical system it is more advantageous to store tabular data in external memory column by column, together with support for a variety of redundant data structures.
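As a minimal, purely illustrative sketch (plain Python, toy data), the column-wise layout can be contrasted with the row-wise one as follows: an aggregate over a single attribute only needs to read that attribute's column rather than scan entire rows, which is exactly what makes column stores attractive for analytical workloads.

    # Row-wise layout: every query touches entire rows.
    rows = [
        {"order_id": 1, "customer": "c17", "amount": 250, "region": "EU"},
        {"order_id": 2, "customer": "c42", "amount": 90,  "region": "US"},
    ]

    # Column-wise layout: each attribute is stored contiguously.
    columns = {
        "order_id": [1, 2],
        "customer": ["c17", "c42"],
        "amount":   [250, 90],
        "region":   ["EU", "US"],
    }

    # SUM(amount): the column store reads only the "amount" column,
    # a small fraction of the table's total width.
    total_row_store    = sum(r["amount"] for r in rows)
    total_column_store = sum(columns["amount"])
    assert total_row_store == total_column_store == 340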
So one may assume that solutions to the Big Data problem for transactional and analytical data have been outlined. This does not mean that even these particular cases of the problem are completely solved. For example, a parallel transactional DBMS works effectively only with a data distribution that minimizes the number of distributed transactions in the current workload; when the workload changes, the data has to be redistributed. An analytical parallel DBMS copes with complex analytic queries only when the partitioning of the data matches the specifics of those queries.
As already noted, the notion of Big Data is relative. In particular, the volumes of large transactional data are several orders of magnitude smaller than the volumes of analytical data.
It follows from the above that the database community has learned reasonably well how to build horizontally scalable parallel analytical DBMSs that support the efficient execution of standard analytical queries (for simplicity we ignore the data redistribution problem noted above). But let us return to the first basic principle, moving computation close to the data. If the database server provides only basic analytics, then, like it or not, any more or less serious analytical application will have to pull large volumes of analytical data to workstations, or at best to an intermediate analysis server.
The only way to eliminate this defect is to allow server-side analytics to be extended with new analytic functions supplied by business analysts. At first glance, the necessary features are already provided by the SQL facilities that allow users to define their own functions, procedures, and even data types. But SQL says nothing about how such programs are to be parallelized, whereas pieces of these programs will have to be executed in the vicinity of the corresponding pieces of data.
The only widely used method of parallel programming in a cluster environment is the MPI (Message Passing Interface) interface. In this case the programmer decides how to organize the separate parallel execution of parts of the program on the cluster nodes and how to arrange their interaction so as to produce the final result. But programming with the MPI interface is very difficult even for professionally trained programmers, not to mention business analysts, who are much closer to mathematical statistics than to parallel programming techniques. Most likely, a typical analyst asked to solve a problem this way would prefer to use an analytical software package on his or her workstation, which fundamentally ruins the whole idea of horizontal scalability. At first glance the problem seems insoluble, just as the more general problem of providing convenient and efficient parallel programming tools for general-purpose programmers seems insoluble.
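To give an idea of what MPI-style programming demands, here is a minimal sketch using the mpi4py binding (assuming an MPI installation is available; the data and the summation task are made up purely for illustration). The programmer must explicitly decide how the data is split, what each process computes, and how the partial results are combined:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Rank 0 decides by hand how to split the data across processes.
    if rank == 0:
        data = list(range(1000))
        chunks = [data[i::size] for i in range(size)]
    else:
        chunks = None

    local_chunk = comm.scatter(chunks, root=0)           # explicit data distribution
    local_sum = sum(local_chunk)                          # each rank computes its share
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)    # explicit result assembly

    if rank == 0:
        print("total =", total)

Even in this tiny example the distribution, the per-node computation, and the gathering of results all have to be written out by hand, and the program would typically be launched with something like "mpiexec -n 4 python script.py".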
However, not so long ago an approach to solving this problem, at least its first part, was found (here the developers of the parallel DBMSs Greenplum and Aster Data deserve credit). The solution is based on the use of MapReduce technology.
Recall that MapReduce technology appeared inside Google as a substitute for a parallel analytical DBMS for solving the company's own analytical problems. The technology quickly gained popularity among practitioners, especially younger ones, and at first provoked deep resentment in the database community. Authoritative experts in the field claimed that MapReduce was a return to prehistoric times, when solving data management problems required explicit programming, and reproached MapReduce proponents for ignorance and for an unjustified rejection of the serious results of previous decades. Most likely, those arguments were and remain correct: MapReduce technology cannot and should not replace database technology. But it turned out that this technology can be very useful when it is applied inside a parallel analytical DBMS to support the parallel programming and execution of analytic functions supplied by users.
MapReduce is conceptually much simpler than MPI. The programmer needs to grasp only one idea: the data are first distributed across the cluster nodes and then processed; the result of that processing can be redistributed across the nodes and processed again, and so on. The application programmer only has to supply the code of two functions: one that determines how the data is partitioned among the cluster nodes, and one that processes the partition a node receives. Certainly, such a programming paradigm is much easier for professional programmers than MPI, but, more importantly, it is also conceptually close to analysts.
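As a minimal sketch (plain Python, single process, toy data), the following mimics the MapReduce pattern: a map function emits key/value pairs, the shuffle step stands in for redistributing data by key across nodes, and a reduce function processes each resulting partition. In a parallel analytical DBMS with MapReduce support, these phases would run on the nodes that actually hold the data:

    from collections import defaultdict

    # Toy input: (region, amount) sale records, purely illustrative.
    sales = [("EU", 250), ("US", 90), ("EU", 40), ("US", 300)]

    def map_fn(record):
        region, amount = record
        yield region, amount                  # emit key/value pairs

    def reduce_fn(key, values):
        return key, sum(values)               # process one partition

    # Shuffle: group the emitted pairs by key (in a real system this is
    # where data is redistributed across cluster nodes).
    partitions = defaultdict(list)
    for record in sales:
        for key, value in map_fn(record):
            partitions[key].append(value)

    result = dict(reduce_fn(k, v) for k, v in partitions.items())
    print(result)    # {'EU': 290, 'US': 390}

The analyst only writes map_fn and reduce_fn; everything concerning distribution and parallel execution is the responsibility of the framework, which is exactly what makes the paradigm accessible to non-specialists in parallel programming.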
It seems that support for MapReduce technology in a parallel analytical DBMS should fully satisfy the needs of analysts: future analytical applications will be server-side applications executed in parallel in the vicinity of the data they address. All this means that the horizontal scalability of future analytical systems will be ensured, and thus the Big Data problem can be solved for them as well.
In effect, the new generation of analytical parallel DBMSs provides a means of parallel programming of analytic applications that is simpler and clearer for non-specialists than the traditional MPI interface. That is, a more general problem, parallel programming for supercomputers, is in fact being partly solved. The question is: does this approach not deserve wider use than as an analytic extension of a parallel database server? Should we not try to apply MapReduce technology to those parallel programming tasks that require the processing of large amounts of data?