Baymurzin C.K., Ponomarenko E.A., Zharlykasov B.J.
A. Baitursynov Kostanai State University, Kazakhstan
The NVIDIA GPU hardware architecture and its comparison with the CPU architecture
For quite a long time it has been observed that video cards demonstrate high performance on the tasks for which they were created, namely rendering three-dimensional graphics on the computer screen. Enthusiasts performed computations on graphics cards using shader languages, but only with the advent of CUDA in 2006, together with the GeForce 8 series, did it become possible to use the resources of the GPU (Graphics Processing Unit) directly; this is exactly when the term GPGPU (General-Purpose computing on GPU), meaning general-purpose computation on the GPU, appeared.
CUDA is a software and hardware architecture for GPUs from NVIDIA [1]; it includes a description of the architectural features of the graphics cards as well as methods and approaches for programming them. At the moment a CUDA ecosystem is actively developing, which includes many libraries, both commercial and free development tools, diagnostic tools, and usage examples.
There are several product lines of NVIDIA GPUs: entertainment (GeForce cards), professional graphics (Quadro cards), and high-performance computing (Tesla cards). The families differ in the number of cores, the amount of memory, and memory bandwidth, but in terms of general architecture all models of the same generation are identical: if a program runs on an inexpensive GeForce card, it will also run on a top Tesla card, and vice versa, differing only in execution time.
Modern CPUs are characterized by a small number (2 to 16) of cores. For the efficient operation of these cores a complex hierarchical cache structure (fast memory located on the CPU chip) is provided; today the amount of cache memory reaches tens of megabytes. It should also be noted that memory accesses are handled separately for each core, giving the cores complete independence. The basic building block of a graphics card is the streaming multiprocessor (SM or SMX). A streaming multiprocessor can be viewed as analogous to a CPU core. In the Kepler architecture a single SM contains 192 CUDA cores, and it is these cores that execute code in parallel [2]. Apart from the cores, a streaming multiprocessor includes many additional units, among them a considerable amount of register memory (since it is needed by all cores at once) and small caches. A single chip contains from 1 to 15 multiprocessors, so on the best cards at the moment one can get more than 2,500 cores on a single chip. In addition to the SMs, the chip includes memory controllers with a wide memory bus, as well as a small second-level cache shared by all multiprocessors. The card's interface supports PCI-E 3.0 [3] as well as its previous versions. Comparing die photographs of a CPU and a GPU, one can see that the latter is much more uniform: this is because the GPU comprises a large number of identical structural units (CUDA cores), whereas the CPU consists of a few units of complicated structure.
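The per-device figures discussed above (number of multiprocessors, register file, L2 cache, bus width) can be queried at runtime through the CUDA runtime API. The following is a minimal sketch, not taken from the paper; it assumes device 0 is the card of interest:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("Name: %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Registers per SM: %d\n", prop.regsPerMultiprocessor);
    printf("L2 cache: %d bytes\n", prop.l2CacheSize);
    printf("Memory bus width: %d bits\n", prop.memoryBusWidth);
    // Compute capability 3.x (Kepler) corresponds to 192 CUDA cores per SM.
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

Compiling this with nvcc and running it on a Kepler-generation Tesla card would report the 1-to-15 SM counts mentioned above.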
Thus the main differences of the GPU from the CPU are the large number of cores, the small caches, and memory with a wide data bus optimized for shared access. While the CPU is busy computing, the next portion of data can already be loaded into its large, fast caches; the absence of a large cache on the GPU means that a different mechanism is needed to deal with this problem. The solution lies in the thread mechanism described below.
When a large number of execution threads is launched, the wait for data in some threads is covered by the work of others; therefore, for efficient loading of the graphics card (that is, so that the computing resources are not idle while waiting for the required data), significantly more execution threads must be run than there are CUDA cores. Usually a graphics card reaches optimal performance when running several tens of thousands of threads. It is worth noting that for the CPU the situation is quite different: there the optimal number of threads is equal to the number of execution cores.
The mechanisms of nested parallelism, thread launching, the particulars of memory, hardware configuration, and so on each deserve separate discussion. As for search algorithms, one thing can definitely be pointed out: CUDA is effective in those algorithms where the CPU is forced to repeat an iteration many times, while CUDA can do the same work in different threads. In other words, when migrating an algorithm from the CPU to the GPU, we get rid of the loop and move it into threads.
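This loop-to-threads migration can be sketched on element-wise vector addition (an illustrative example, not from the paper): the CPU version iterates over indices, while in the CUDA version each thread derives its own index and the loop disappears.

```cuda
#include <cuda_runtime.h>

// CPU version: an explicit loop over all elements.
//   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
//
// GPU version: each thread handles exactly one element.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partial last block
        c[i] = a[i] + b[i];
}

// Launch: enough 256-thread blocks to cover all n elements.
// add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

The loop counter i of the CPU version becomes the thread's position in the grid, so all iterations can proceed in parallel.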
It is also worth noting the common observation that CUDA is effective in applications where the input is a large array of data and the output is likewise a large array of data.
References:
1. Sanders J., Kandrot E. CUDA by Example: An Introduction to Programming Graphics Processors. Trans. from English by A.A. Slinkin, scientific editor A.V. Boreskov. Moscow: DMK Press, 2011. 232 p.
3. Kirk D.B., Wen-mei W.H. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010. 280 p.
4. Cook Sh. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Morgan Kaufmann, 2012. 600 p.