SlideShare feed: slideshows by user ahayashi10 (last updated Tue, 25 Jun 2019 17:19:04 GMT)

GPUIterator: Bridging the Gap between Chapel and GPU Platforms
Posted Tue, 25 Jun 2019 17:19:04 GMT (/slideshow/gpuiterator-bridging-the-gap-between-chapel-and-gpu-platforms/151801664)
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019), co-located with PLDI 2019 / ACM FCRC 2019. PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
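
The hybrid CPU+GPU execution mentioned above amounts to splitting the iteration space of a forall loop so that one portion runs on the host cores while the rest is handed to a hand-tuned native GPU kernel. Chapel code cannot be reproduced here, so the following C++ sketch only illustrates that partitioning idea under stated assumptions: launch_gpu_kernel is a hypothetical placeholder for a native (e.g., CUDA) kernel wrapper, the CPU share is expressed with OpenMP, and the cpuPercent split mirrors the kind of knob the GPUIterator exposes; it is not the module's actual API.

    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in for a hand-tuned native GPU kernel (e.g., a CUDA wrapper).
    // Stubbed out on the CPU here so the sketch is self-contained and compilable.
    static void launch_gpu_kernel(float* a, const float* b, std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) a[i] = 2.0f * b[i];
    }

    // Minimal sketch of a hybrid CPU+GPU split: iterations [0, cpuCount) stay on the
    // host cores, iterations [cpuCount, n) go to the native kernel. In the real
    // GPUIterator the two portions execute concurrently inside a Chapel forall.
    void hybrid_forall(std::vector<float>& a, const std::vector<float>& b, int cpuPercent) {
        const std::size_t n = a.size();
        const std::size_t cpuCount = n * static_cast<std::size_t>(cpuPercent) / 100;

        #pragma omp parallel for
        for (long long i = 0; i < static_cast<long long>(cpuCount); ++i) {
            a[i] = 2.0f * b[i];   // CPU portion of the loop body
        }
        if (cpuCount < n) {
            launch_gpu_kernel(a.data(), b.data(), cpuCount, n);  // GPU portion
        }
    }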

Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
Posted Mon, 26 Feb 2018 23:53:09 GMT (/slideshow/exploration-of-supervised-machine-learning-techniques-for-runtime-selection-of-cpu-vs-gpu-execution-in-java-programs/89020580)
Fourth Workshop on Accelerator Programming Using Directives (WACCPD2017, co-located with SC17). While multi-core CPUs and many-core GPUs are both viable platforms for parallel computing, programming models for them can impose large burdens upon programmers due to their complex and low-level APIs. Since managed languages like Java are designed to run on multiple platforms, parallel language constructs and APIs such as the Java 8 Parallel Stream APIs can enable high-level parallel programming with the promise of performance portability for mainstream (non-ninja) programmers. To achieve this goal, it is important for the selection of the hardware device to be automated rather than specified by the programmer, as is done in current programming models. Due to a variety of factors affecting performance, predicting a preferable device for faster performance of individual kernels remains a difficult problem. While a prior approach uses machine learning to address this challenge, there is no comparable study on good supervised machine learning algorithms and good program features to track. In this paper, we explore 1) program features to be extracted by a compiler and 2) various machine learning techniques that improve prediction accuracy, thereby improving performance. The results show that an appropriate selection of program features and machine learning algorithm can further improve accuracy. In particular, support vector machines (SVMs), logistic regression, and J48 decision trees are found to be reliable techniques for building accurate prediction models from just two, three, or four program features, achieving accuracies of 99.66%, 98.63%, and 98.28%, respectively, under 5-fold cross-validation.
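
To make the idea of "prediction models built from a handful of program features" concrete, the sketch below shows, in C++, the shape of a runtime device-selection decision. The feature names, thresholds, and the hard-coded decision logic are purely illustrative assumptions; the paper instead trains SVM, logistic regression, and J48 models offline on features extracted by the compiler.

    #include <cstdint>
    #include <iostream>

    // Hypothetical per-kernel features a compiler/runtime might extract.
    // These names and the thresholds below are illustrative only; they are not
    // the features or models reported in the paper.
    struct KernelFeatures {
        std::int64_t parallelRange;    // number of loop iterations
        std::int64_t bytesTransferred; // host<->device data volume if run on the GPU
        bool hasCoalescedAccess;       // whether memory accesses are contiguous
    };

    enum class Device { CPU, GPU };

    // A tiny decision-tree-like predictor of the kind a trained model
    // (SVM, logistic regression, J48) would replace.
    Device predictDevice(const KernelFeatures& f) {
        if (f.parallelRange < 100000) return Device::CPU;           // too little work to amortize offload
        if (!f.hasCoalescedAccess)    return Device::CPU;           // irregular access favors the CPU
        if (f.bytesTransferred > (64LL << 20)) return Device::CPU;  // transfer cost dominates
        return Device::GPU;
    }

    int main() {
        KernelFeatures k{4'000'000, 32LL << 20, true};
        std::cout << (predictDevice(k) == Device::GPU ? "GPU" : "CPU") << "\n";
    }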

Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages
Posted Mon, 26 Feb 2018 23:47:28 GMT (/ahayashi10/chapelonx-exploring-tasking-runtimes-for-pgas-languages)
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirements of exascale systems, where a large number of asynchronous tasks are executed. While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR-based and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
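
Chapel's task parallelism has to be lowered onto whichever tasking runtime backs the implementation (Qthreads, OCR, or HClib in this study). The C++ sketch below is only an analogy using std::async and futures, not any of those runtime APIs: it shows the spawn-many-tasks-then-synchronize pattern whose costs the tasking and synchronization comparison in this work is concerned with.

    #include <cstdio>
    #include <future>
    #include <vector>

    // Analogy for a coforall-style pattern: spawn one task per chunk, then
    // synchronize on all of them. A Chapel tasking layer maps this onto the
    // underlying runtime's task-spawn and task-wait primitives.
    int main() {
        const int numTasks = 8;
        const int chunk = 1'000'000;
        std::vector<std::future<long long>> tasks;

        for (int t = 0; t < numTasks; ++t) {
            tasks.push_back(std::async(std::launch::async, [=] {
                long long local = 0;
                for (int i = 0; i < chunk; ++i) local += static_cast<long long>(t) * chunk + i;
                return local;
            }));
        }

        long long total = 0;
        for (auto& f : tasks) total += f.get();  // "sync": wait for all spawned tasks
        std::printf("total = %lld\n", total);
    }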

Introduction to Polyhedral Compilation
Posted Tue, 27 Dec 2016 23:45:17 GMT (/slideshow/introduction-to-polyhedral-compilation/70482946)
A brief overview of the polyhedral model.

Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator Model on a POWER8+GPU Platform
Posted Tue, 06 Dec 2016 17:56:42 GMT (/ahayashi10/exploring-compiler-optimization-opportunities-for-the-openmp-4x-accelerator-model-on-a-power8gpu-platform)
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16). While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms. OpenMP is a directive-based shared-memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of GPU architectures. However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code written with low-level programming models. To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
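
As a concrete illustration of the accelerator model discussed above, the following minimal C++ example offloads a vector kernel with the OpenMP 4.x target directives, including explicit host-device data mapping. It is a generic example, not one of the benchmarks evaluated in the paper; when built with an offloading-capable compiler (e.g., clang or IBM XL) the loop runs on the default accelerator device, and otherwise it falls back to the host.

    #include <cstdio>

    int main() {
        constexpr int n = 1'000'000;
        static float x[n], y[n];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // OpenMP 4.x accelerator model: offload the loop to the device.
        // map(to: ...) copies inputs to the device; map(tofrom: ...) copies y back.
        #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i) {
            y[i] = 2.0f * x[i] + y[i];
        }

        std::printf("y[0] = %f\n", y[0]);
        return 0;
    }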

LLVM-based Communication Optimizations for PGAS Programs
Posted Sun, 15 Nov 2015 20:58:25 GMT (/slideshow/llvmbased-communication-optimizations-for-pgas-programs/55136772)
The Second Workshop on the LLVM Compiler Infrastructure in HPC (co-located with SC15). While Partitioned Global Address Space (PGAS) programming languages such as UPC/UPC++, CAF, Chapel, and X10 provide high-level programming models for facilitating large-scale distributed-memory parallel programming, it is widely recognized that compiler analysis and optimization for these languages has been very limited, unlike the optimization of SMP models such as OpenMP. One reason for this limitation is that current optimizers for PGAS programs are specialized to different languages. This is unfortunate since communication optimization is an important class of compiler optimizations for PGAS programs running on distributed-memory platforms, and these optimizations need to be performed more widely. Thus, a more effective approach would be to build a language-independent and runtime-independent compiler framework for optimizing PGAS programs so that new communication optimizations can be leveraged by different languages. To address this need, we introduce an LLVM-based (Low Level Virtual Machine) communication optimization framework. Our compilation system leverages existing optimization passes and introduces new PGAS language-aware, runtime-dependent/independent passes to reduce communication overheads. Our experimental results show average performance improvements of 3.5× and 3.4× on 64 nodes of a Cray XC30 supercomputer and 32 nodes of a Westmere cluster, respectively, for a set of benchmarks written in the Chapel language. Overall, we show that our new LLVM-based compiler optimization framework can effectively improve the performance of PGAS programs.
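
The communication overheads targeted by such passes are easiest to see on a small example. The sketch below uses a hypothetical pgas_get() call to stand in for a fine-grained remote read issued by a PGAS runtime (the actual runtime interfaces and pass names in the paper differ); the second version shows the classic effect of aggregating many small remote reads into one bulk transfer, which is the kind of rewrite a compiler pass can perform once remote accesses are visible in the IR.

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Hypothetical PGAS runtime call: read 'bytes' bytes from a remote address
    // into a local buffer. Stubbed with memcpy so the sketch is self-contained.
    static void pgas_get(void* local, const void* remote, std::size_t bytes) {
        std::memcpy(local, remote, bytes);
    }

    // Unoptimized: one fine-grained remote read per iteration (n round trips).
    double sum_fine_grained(const double* remoteArray, std::size_t n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double x;
            pgas_get(&x, remoteArray + i, sizeof(double));
            sum += x;
        }
        return sum;
    }

    // After a communication-aggregation style rewrite: one bulk transfer,
    // then purely local computation.
    double sum_aggregated(const double* remoteArray, std::size_t n) {
        std::vector<double> local(n);
        pgas_get(local.data(), remoteArray, n * sizeof(double));
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += local[i];
        return sum;
    }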

Machine-learning based performance heuristics for Runtime CPU/GPU Selection in Java
Posted Sun, 15 Nov 2015 20:54:12 GMT (/slideshow/machinelearning-based-performance-heuristics-for-runtime-cpugpu-selection-in-java/55136684)
10th Workshop on Challenges for Parallel Computing (co-located with IBM CASCON2015).

Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Posted Wed, 09 Sep 2015 19:07:32 GMT (/slideshow/machinelearningbased-performance-heuristics-for-runtime-cpugpu-selection/52598674)
12th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2015.

Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multicore Processors
Posted Wed, 28 May 2014 15:42:14 GMT (/slideshow/studies-on-automatic-parallelization-for-heterogeneous-and-homogeneous-multicore-processors/35232685)
Ph.D. defense.

LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in Chapel-
Posted Wed, 28 May 2014 15:24:43 GMT (/slideshow/akihirohayashichiuw2014/35232034)
Akihiro Hayashi, Rishi Surendran, Jisheng Zhao, Michael Ferguson, Vivek Sarkar. The 1st Chapel Implementers and Users Workshop (CHIUW), May 23rd, 2014, Phoenix, AZ (co-located with IPDPS2014).

Speculative Execution of Parallel Programs with Precise Exception Semantics on GPUs
Posted Wed, 28 May 2014 15:19:28 GMT (/slideshow/akihirohayashilcpc2013/35231821)
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013, Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).

Accelerating Habanero-Java Program with OpenCL Generation
Posted Wed, 28 May 2014 15:14:18 GMT (/slideshow/akihirohayashipppj2013/35231633)
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.

About the author: Dr. Hayashi is a research scientist at Rice University. He received his Ph.D. degree from Waseda University in Japan in 2012. His research interests are focused on parallel computing and include automatic parallelization, programming languages, and compiler optimizations for parallel computer systems. He is now working on 1) GPGPU code generation from high-level languages (e.g., Java) and 2) LLVM-based compiler optimizations for PGAS (Partitioned Global Address Space) programs. Detailed information, including publications, can be found at http://ahayashi.blogs.rice.edu/