Baymurzin C.K., Ponomarenko E.A., Zharlykasov B.J.
A. Baitursynov Kostanai State University, Kazakhstan
The NVIDIA GPU hardware architecture and its comparison with the CPU architecture
For quite a long time it has been observed that video cards demonstrate high performance on the tasks for which they were created, namely rendering three-dimensional graphics on the computer screen. Enthusiasts performed computations on graphics cards using shader languages, but only with the advent of CUDA in 2006, together with the GeForce 8 series, did it become possible to use the resources of the GPU (Graphics Processing Unit) directly; this is exactly when the term GPGPU (General-Purpose computing on GPU), meaning general-purpose computation on the GPU, appeared.
CUDA is a software and hardware architecture for GPUs from NVIDIA [1]; it includes a description of the architectural features of the graphics cards as well as methods and approaches for programming them. At the moment a CUDA ecosystem is actively developing, which includes many libraries, both commercial and free development tools, diagnostic tools, and usage examples.
There are several product lines of NVIDIA GPUs: entertainment (GeForce cards), professional graphics (Quadro cards), and high-performance computing (Tesla cards). The families differ in the number of cores, the amount of memory, and memory bandwidth, but in terms of general architecture all models of the same generation are identical: if a program runs on an inexpensive GeForce card, it will also run on a top Tesla card, and vice versa, differing only in execution time.
Modern CPUs are characterized by a small number (2 to 16) of cores. For the efficient operation of these cores a complex hierarchical cache structure (fast memory located on the CPU chip) is provided; today the amount of cache memory reaches tens of megabytes. It should also be noted that memory accesses are handled separately for each core, giving the cores complete independence. The basic building block of a graphics card is the streaming multiprocessor (SM or SMX). A streaming multiprocessor can be viewed as analogous to a CPU core. In the Kepler architecture a single SM contains 192 CUDA cores, and it is these cores that execute code in parallel [2]. Apart from the cores, a streaming multiprocessor includes many additional units, among them a considerable amount of register memory (since it is needed by all cores at once) and small caches. A single chip contains from 1 to 15 multiprocessors, so on the best cards at the moment one can get more than 2,500 cores on a single chip. In addition to the SMs, the chip includes memory controllers with a wide memory bus, as well as a small second-level cache shared by all multiprocessors. The card's interface supports PCI-E 3.0 [3] as well as its previous versions. Comparing die photographs of a CPU and a GPU, one can see that the latter is much more uniform: this is because the GPU comprises a large number of identical structural units (CUDA cores), whereas the CPU consists of a few units of complicated structure.
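The per-device figures discussed above (number of multiprocessors, register file, L2 cache, bus width) can be queried at runtime through the CUDA runtime API. The following is a minimal sketch, not taken from the paper; it assumes device 0 is the card of interest:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("Name: %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("L2 cache: %d bytes\n", prop.l2CacheSize);
    printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
    // Compute capability 3.x (Kepler) corresponds to 192 CUDA cores per SM.
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

Compiling this with nvcc and running it on a Kepler-generation Tesla card would report the 1-to-15 SM counts mentioned above.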
Thus the main differences of the GPU from the CPU are the large number of cores, the small caches, and memory with a wide data bus optimized for shared access. While the CPU is busy computing, the next portion of data can already be loaded into its large, fast caches; the absence of a large cache on the GPU means that a different mechanism is needed to deal with this problem. The solution lies in the thread mechanism described below.
When a large number of execution threads is launched, the wait for data in some threads is covered by the work of others; therefore, for efficient loading of the graphics card (that is, so that the computing resources are not idle while waiting for the required data), significantly more execution threads must be run than there are CUDA cores. Usually a graphics card reaches optimal performance when running several tens of thousands of threads. It is worth noting that for the CPU the situation is quite different: there the optimal number of threads is equal to the number of execution cores.
The mechanisms of nested parallelism, thread launching, the particulars of memory, hardware configuration, and so on each deserve separate discussion. As for search algorithms, one thing can definitely be pointed out: CUDA is effective in those algorithms where the CPU is forced to repeat an iteration many times, while CUDA can do the same work in different threads. In other words, when migrating an algorithm from the CPU to the GPU, we get rid of the loop and move it into threads.
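This loop-to-threads migration can be sketched on element-wise vector addition (an illustrative example, not from the paper): the CPU version iterates over indices, while in the CUDA version each thread derives its own index and the loop disappears.

```cuda
#include <cuda_runtime.h>

// CPU version: an explicit loop over all elements.
//   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
//
// GPU version: each thread handles exactly one element.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partial last block
        c[i] = a[i] + b[i];
}

// Launch: enough 256-thread blocks to cover all n elements.
// add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

The loop counter i of the CPU version becomes the thread's position in the grid, so all iterations can proceed in parallel.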
It is also worth noting the common observation that CUDA is effective in applications where the input is a large array of data and the output is likewise a large array of data.
References:
1. Sanders J., Kandrot E. CUDA by Example: An Introduction to Programming Graphics Processors. Trans. from English by A.A. Slinkin, scientific editor A.V. Boreskov. Moscow: DMK Press, 2011. 232 p.
3. Kirk D.B., Wen-mei W.H. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010. 280 p.
4. Cook Sh. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Morgan Kaufmann, 2012. 600 p.