Modern Information Technologies / 2. Computer Engineering and Programming
Ponomarenko E.A., Sorochak N.V., Zharlykasov B.Zh., Nurmagambetov A.S.
Kostanay State University named after A. Baitursynov, Kazakhstan
Using CUDA hardware and software solutions for non-graphical tasks
Today, in a period of rapid development of computer technology, the number of tasks associated with processing large data sets is growing. To handle large amounts of data, graphics processors are used alongside the standard processing tools. The graphics processor was originally designed for building scenes from a variety of graphics primitives, and the objects in a scene are constructed independently of one another, which makes it possible to carry out these construction tasks in parallel. For this purpose the GPU (Graphics Processing Unit) uses many so-called stream processors, numbering up to several hundred in modern GPUs. Carrying out calculations on the graphics device frees up the CPU (Central Processing Unit).
Many processes run many times faster when parallelized. New CPUs have from 1 to 8 cores and support parallelism. Graphics cards of the new generation carry more than 300 graphics cores, so parallelizing a process on the video card is much more profitable.
PC performance is directly related to the CPU clock frequency. Tracking the dynamics of CPU frequency growth shows that in recent years the growth rate has slowed markedly, and a new trend has emerged: building multi-core processors and increasing the number of cores per processor. The speed of computing devices is now raised by increasing the number of cores working in parallel, that is, through parallelism.
According to Amdahl's law, the maximum speedup S that can be obtained by parallelizing a program across N processes (cores) is

S = 1 / ((1 - P) + P / N),

where P is the fraction of the program's run time that can be parallelized across the N processes. Note that as the number of processes N grows, the speedup tends to the maximum gain 1 / (1 - P).
Advantages of the technology
1. The CUDA Application Programming Interface (CUDA API) is based on the standard C language with some restrictions. This simplifies learning the CUDA architecture and smooths the transition to it.
2. The memory shared among threads (shared memory), 16 KB in size, can be used as a user-managed cache with a wider bandwidth than ordinary texture fetches.
3. More efficient memory transactions between CPU memory and video memory.
4. Complete hardware support for integer and bitwise operations [2].
Programs are written in an "extended" C: their parallel part (the kernel) is executed on the GPU, while the regular part runs on the CPU. CUDA automatically handles separating the two parts and controlling their launch.
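As a minimal sketch of this split (illustrative names only; error checking and host-to-device data transfer are omitted for brevity), the parallel part is marked __global__ and launched from ordinary host code:

```cuda
#include <cuda_runtime.h>

// Parallel part ("kernel"): runs on the GPU, one thread per element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1024;
    float *d;                                   // regular part: runs on the CPU
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n); // launch the kernel on the GPU
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Such a file is compiled with nvcc, which itself splits the code into the GPU and CPU parts.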
Suppose we have two square matrices A and B of size N × N (we assume that N is a multiple of 16). The simplest approach uses one thread per element of the resulting matrix C: the thread retrieves all the necessary elements from global memory and performs the required calculations.
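The approach just described can be sketched as a CUDA kernel (a sketch with illustrative names, not the authors' code; it assumes row-major storage and relies on N being a multiple of 16 so that a grid of 16 × 16 blocks covers C exactly, with no bounds checks needed):

```cuda
// One thread computes one element of C, reading all 2*N values
// it needs directly from slow global memory.
__global__ void matmul_naive(const float *A, const float *B,
                             float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}

// Host-side launch: an (N/16) x (N/16) grid of 16 x 16 thread blocks.
// dim3 threads(16, 16);
// dim3 blocks(N / 16, N / 16);
// matmul_naive<<<blocks, threads>>>(dA, dB, dC, N);
```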
Calculating a single element of the matrix product requires 2 × N arithmetic operations and 2 × N reads from global memory. Clearly, in this case the main limiting factor is the speed of access to global memory, which is very low.
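A standard remedy (again a sketch, not the authors' code) is to stage 16 × 16 tiles of A and B in the fast shared memory listed among the advantages above, so each global-memory value is read once per block rather than once per thread:

```cuda
#define TILE 16  // matches the 16 x 16 thread block; N is a multiple of 16

__global__ void matmul_tiled(const float *A, const float *B,
                             float *C, int N) {
    __shared__ float As[TILE][TILE];   // tiles cached in shared memory
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the whole tile is loaded
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tile
    }
    C[row * N + col] = sum;
}
```

This cuts global-memory reads per element from 2 × N to 2 × N / TILE, shifting the bottleneck away from global memory.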
The sequential algorithm for multiplying two square matrices is represented by three nested loops and computes the rows of the resulting matrix C one after another.
These indicators are ideal. In practice, the measured performance usually falls short of the ideal because of the cost of transferring data between the processors and other overheads [3].
The experimental results of multiplying two matrices are shown in Table 1; they show that the speedup grows as the size of the matrix increases. Figure 1 presents the same dependence as a chart.
Table 1. Experimental results of multiplication of two matrices

| Order of the matrix | Sequential algorithm: runtime on CPU, ms | Parallel algorithm: runtime on GPU, ms | Acceleration |
| 100  | 0.32  | 0.29 | 1.103 |
| 300  | 0.73  | 0.64 | 1.140 |
| 500  | 3.32  | 1.59 | 2.088 |
| 800  | 13.68 | 3.30 | 4.14  |
| 1000 | 26.70 | 5.20 | 5.13  |
Figure 1. Dependence of the program run time on the order of the matrix
As can be seen from Figure 1, the computation time grows with the dimension of the matrix, but it grows differently for the two algorithms. Over the measured range of matrix dimensions, the run time of the parallelized program increases only about 20-fold, while that of the sequential algorithm increases about 100-fold.
Summarizing, it should be noted that working with the graphics processor through CUDA requires knowing the technology and the additional programming-language extensions that CUDA introduces.
The matrix-by-matrix multiplication algorithm considered here showed that using a parallel algorithm on the graphics card gives a significant advantage in program execution time. In both cases the run time grows with the matrix dimension, but the parallelized program on the graphics device runs much faster. For matrices of dimension 1000 × 1000, execution was approximately 5 times faster than on the central processor.
The downside of parallel computation on the video card is that at matrix dimensions of 3000 and above the video card driver cannot withstand the load and shuts down. A solution to this problem is to use graphics cards specialized for non-graphical general-purpose calculations, such as NVidia Tesla, NVidia Quadro and others.
Using graphics cards for computing offloads the CPU and allows it to be used in parallel for other tasks.
General-purpose non-graphical computation on graphics processors has great prospects given the rapid development of graphics-core architecture; as the quality and number of graphics cores in video cards grow, computing on such devices for non-graphical general-purpose tasks will develop as well.
Literature:
1. Boreskov A.V., Kharlamov A.A., Osnovy raboty s tekhnologiyami CUDA (Fundamentals of Working with CUDA Technologies), Moscow, 2010, 234 pp.
2. Antonov A.S., Vvedenie v parallelnye vychisleniya (Introduction to Parallel Computing), methodical manual, Faculty of Physics, Moscow State University, Moscow, 2002, 70 pp.
3. Voevodin V.V., Parallelnye vychisleniya (Parallel Computing), BHV-Petersburg, St. Petersburg, 2002, 608 pp.