Modern Information Technologies / 2. Computer Engineering and Programming

 

Ponomarenko E.A., Sorochak N.V., Zharlykasov B.Zh., Nurmagambetov A.S.

A. Baitursynov Kostanay State University, Kazakhstan

Using CUDA hardware and software solutions for non-graphical tasks

Today is a period of rapid development of computer technology and of a growing number of tasks associated with processing large data sets. To handle large amounts of data, graphics processors are used alongside the standard data-processing tools. The graphics processor was originally designed to build scenes from a variety of graphics primitives, and the objects of a scene were constructed independently of one another, which made it possible to carry out these construction tasks in parallel. For this purpose the GPU (Graphics Processing Unit) uses many so-called stream processors, whose number in modern GPUs reaches several hundred. Carrying out calculations on the graphics device frees up the CPU (Central Processing Unit).

Many processes run many times faster when parallelized. New CPUs have from 1 to 8 cores and support parallelism, while graphics cards of the new generation carry more than 300 graphics cores, so parallelizing a process on the video card is much more advantageous.

PC performance is directly related to the CPU clock frequency. If one tracks the dynamics of CPU frequency growth, it can be seen that in recent years the growth rate has slowed markedly, and a new trend has emerged: the creation of multi-core processors and systems with an increasing number of cores per processor. The speed of computing devices is thus raised by increasing the number of cores working in parallel, that is, through parallelism.

According to Amdahl's law, the maximum speedup that can be obtained by parallelizing a program across N processes (cores) is

$S(N) = \dfrac{1}{(1 - P) + P/N}$   (1)

where P is the fraction of the program's running time that can be parallelized across N processes. Note that as the number of processes N increases, the speedup tends to the maximum gain $1/(1-P)$. Thus, if we parallelize ¾ of the entire program, the maximum gain is a factor of 4. That is why it is important to use well-parallelizable algorithms and methods [1].
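For example, in the limit of infinitely many cores formula (1) gives

$\lim_{N \to \infty} S(N) = \dfrac{1}{1 - P}; \qquad P = \tfrac{3}{4} \;\Rightarrow\; S_{\max} = \dfrac{1}{1 - 3/4} = 4.$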

Advantages of CUDA technology

1. The CUDA Application Programming Interface (CUDA API) is based on the standard C++ programming language with some restrictions. This simplifies and smooths the process of learning the CUDA architecture.

2. Memory shared among threads (shared memory), 16 KB in size, can be used as a user-managed cache with wider bandwidth than ordinary texture fetches (a minimal sketch of such use appears after this list).

3. More efficient memory transactions between CPU memory and video memory.

4. Full hardware support for integer and bitwise operations [2].
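As a minimal sketch of point 2 (illustrative only; the kernel name and tile size are assumptions, not from the original): a kernel first stages a tile of input data in shared memory and then works on that fast on-chip copy.

#define TILE 256   /* assumed block size; launch with blocks of TILE threads */

__global__ void scale(const float *in, float *out, float k, int n)
{
    __shared__ float cache[TILE];          /* resides in on-chip shared memory   */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        cache[threadIdx.x] = in[i];        /* one coalesced read from global memory */
    __syncthreads();                       /* all threads of the block see the tile */
    if (i < n)
        out[i] = k * cache[threadIdx.x];   /* further accesses hit fast shared memory */
}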

Programs are written in an "extended" C: their parallel part (the kernel) is executed on the GPU, while the ordinary part runs on the CPU. CUDA automatically handles this division and controls the launch of both parts.
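A minimal sketch of this structure (illustrative names; error handling omitted): the __global__ function is the kernel executed on the GPU, while main() is ordinary C code executed on the CPU, including the copying of data between CPU memory and video memory mentioned above.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index      */
    if (i < n)
        c[i] = a[i] + b[i];                          /* parallel part, on the GPU */
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float ha[1024], hb[1024], hc[1024];              /* host (CPU) arrays */
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                             /* device (GPU) arrays */
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);  /* CPU -> video memory */
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);    /* launch kernel on the GPU */

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  /* video memory -> CPU */
    printf("c[0] = %f\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}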

Suppose we have two square matrices A and B of size N × N (we assume N is a multiple of 16). The simplest approach uses one thread for each element of the resulting matrix C: the thread fetches all the necessary elements from global memory and performs the required calculations.

Computing a single element of the matrix product requires 2·N arithmetic operations and 2·N reads from global memory. Clearly, in this case the main limiting factor is the speed of access to global memory, which is very low.
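A sketch of this simplest approach (the kernel name and launch configuration are assumptions for illustration): each thread computes one element of C, performing 2·N reads from global memory in its inner loop.

__global__ void matMulNaive(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)            /* 2*N reads from slow global memory */
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;                /* one thread per element of C */
    }
}

/* Launch with 16x16 thread blocks (N is assumed to be a multiple of 16):
   dim3 block(16, 16);
   dim3 grid(N / 16, N / 16);
   matMulNaive<<<grid, block>>>(dA, dB, dC, N);                                 */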

The sequential algorithm for multiplying two square matrices is expressed as three nested loops and computes the rows of the resulting matrix C one after another.
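For comparison, the sequential CPU version is the classic triple loop (a sketch assuming row-major storage):

void matMulSeq(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; ++i)            /* row of C                            */
        for (int j = 0; j < N; ++j) {      /* column of C                         */
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)    /* inner product of row i and column j */
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;            /* rows of C computed one after another */
        }
}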

These figures are ideal. In practice the relative performance obtained usually falls short of the ideal because of the cost of data-transfer operations between processors and other overhead [3].

The experimental results of multiplying two matrices are shown in Table 1; they show that the efficiency gain of the computation grows with the size of the matrix. Figure 1 presents the same dependence in the form of a chart.

Table 1. Experimental results of multiplication of two matrices

Order of the matrix | Runtime on CPU, ms (sequential algorithm) | Runtime on GPU, ms (parallel algorithm) | Acceleration
 100 |  0.32 | 0.29 | 1.103
 300 |  0.73 | 0.64 | 1.140
 500 |  3.32 | 1.59 | 2.088
 800 | 13.68 | 3.30 | 4.14
1000 | 26.70 | 5.20 | 5.13

 

Figure 1. Dependence of the program's execution time on the order of the matrix

 

As can be seen from Figure 1, the computation time grows in proportion to the dimension of the matrix, but it grows at different rates for the two algorithms. With the parallel algorithm the computation time increased only about 20-fold over the tested range of dimensions, while with the sequential algorithm it increased about 100-fold.

Summarizing, it should be noted that to work with the graphics processor using CUDA technology, one must know the technology itself and the additional programming-language extensions that CUDA introduces.

The example considered, multiplication of one matrix by another, showed that using a parallel algorithm on the graphics card gives a significant advantage in program execution time. In both cases the running time grows with the dimension of the matrices, but the parallelized program on the graphics device runs much faster: for matrices of dimension 1000 × 1000 the execution speed was approximately 5 times higher than on the central processor.

A downside of parallel computation on the video card is that at matrix dimensions of 3000 and above the video card driver cannot withstand the load and shuts down. A solution to this problem is to use graphics cards specialized for general-purpose non-graphical calculations, such as NVIDIA Tesla, NVIDIA Quadro and others.

Using graphics cards for computation offloads the CPU and allows it to be used in parallel for other tasks.

General-purpose non-graphical computation on graphics processors has great prospects in connection with the rapid development of the graphics-core architecture: as the quality and number of graphics cores in video cards grow, computation on such devices for general-purpose non-graphical tasks will develop as well.

Literature:

1. Boreskov A.V., Kharlamov A.A. Osnovy raboty s tekhnologiyami CUDA [Fundamentals of Working with CUDA Technology]. Moscow, 2010. 234 pp.

2. Antonov A.S. Vvedenie v parallel'nye vychisleniya [Introduction to Parallel Computing]: study guide. Faculty of Physics, Moscow State University, Moscow, 2002. 70 pp.

3. Voevodin V.V. Parallel'nye vychisleniya [Parallel Computing]. BHV-Petersburg, St. Petersburg, 2002. 608 pp.