
A Survey on Parallel Computing and its Applications in Data-Parallel Problems Using GPU Architectures

Published online by Cambridge University Press:  03 June 2015

Cristóbal A. Navarro*
Affiliation:
Department of Computer Science (DCC), Universidad de Chile, Santiago, Chile; Centro de Estudios Científicos (CECS), Valdivia, Chile
Nancy Hitschfeld-Kahler*
Affiliation:
Department of Computer Science (DCC), Universidad de Chile, Santiago, Chile
Luis Mateu*
Affiliation:
Department of Computer Science (DCC), Universidad de Chile, Santiago, Chile

Abstract


Parallel computing has become an important subject in computer science and has proven critical to research on high-performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores confirms that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Super-computing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing, with an emphasis on GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; several technical restrictions must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model and cellular automata simulations. These examples are representative of the kind of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems and achieve speedups of up to two orders of magnitude over sequential implementations.

Type
Review Article
Copyright
Copyright © Global Science Press Limited 2014
