Authors
Wataru Endo, Shigeyuki Sato, Kenjiro Taura
Publisher
Information Processing Society of Japan
Journal
Journal of Information Processing (ISSN:18826652)
Volume/Issue/Pages/Date
vol.30, pp.269-282, 2022 (Released:2022-03-15)
Number of references
26
Number of citations
1

User-level threading and task-parallel systems have been developed over decades to provide, for both parallel and concurrent programming, efficient and flexible threading features that kernel-level threading lacks. Some state-of-the-art user-level threading libraries provide interfaces for customizing the thread scheduling implementation to adapt to different workloads from applications and upper-level systems. However, most of them are built as large sets of monolithic components that achieve customizability only at additional cost through concrete C APIs. We have noticed that the zero-overhead abstractions of C++ are beneficial for assembling flexible user-level threading in a clearer manner. To demonstrate our ideas, we have implemented a new user-level threading library, ComposableThreads, which provides customizability while minimizing interfacing costs. We show that users can pick up, insert, or replace individual classes of ComposableThreads for their own purposes. ComposableThreads offers several characteristic abstractions for building high-level constructs of user-level threading, including suspended threads (one-shot continuations) and lock delegators. We evaluate both the customizability and the performance of our runtime system through microbenchmark and application results.
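As a rough illustration of the compile-time composition the abstract refers to, the following C++ sketch shows how a user-level scheduler could be assembled from replaceable policy classes. The class and function names (fifo_queue, scheduler, spawn, run) are hypothetical and are not the ComposableThreads API.

// A minimal sketch (not the ComposableThreads API): composing a user-level
// scheduler from replaceable policy classes via templates, so the composition
// is resolved at compile time with no virtual-call overhead.
#include <deque>
#include <functional>
#include <utility>

// A FIFO ready-queue policy; it could be swapped for a LIFO or work-stealing one.
struct fifo_queue {
    std::deque<std::function<void()>> q;
    void push(std::function<void()> t) { q.push_back(std::move(t)); }
    bool pop(std::function<void()>& t) {
        if (q.empty()) return false;
        t = std::move(q.front());
        q.pop_front();
        return true;
    }
};

// The scheduler is parameterized by the queue policy; picking a different
// policy only requires instantiating the template with another class.
template <class ReadyQueue>
struct scheduler {
    ReadyQueue ready;
    void spawn(std::function<void()> t) { ready.push(std::move(t)); }
    void run() {                       // run ready tasks to completion
        std::function<void()> t;
        while (ready.pop(t)) t();
    }
};

int main() {
    scheduler<fifo_queue> sched;
    sched.spawn([] { /* task body */ });
    sched.run();
}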
Authors
Satoshi Matsuoka, Hideharu Amano, Kengo Nakajima, Koji Inoue, Tomohiro Kudoh, Naoya Maruyama, Kenjiro Taura, Takeshi Iwashita, Takahiro Katagiri, Toshihiro Hanawa, Toshio Endo
Journal
研究報告ハイパフォーマンスコンピューティング(HPC) (ISSN:21888841)
Volume/Issue/Pages/Date
vol.2016-HPC-155, no.32, pp.1-14, 2016-08-01

A slowdown and inevitable end of the exponential scaling of processor performance, that is, the end of the so-called “Moore's Law,” is predicted to occur around the 2025-2030 timeframe. Because CMOS semiconductor voltage is also approaching its limits, logic transistor power will become constant, and as a result system FLOPS will cease to improve, with serious consequences for IT in general and supercomputing in particular. Existing attempts to overcome the end of Moore's Law are rather limited in their future outlook or applicability. We claim that data-oriented parameters such as bandwidth and capacity, or BYTES, are the new parameters that will allow continued performance gains even after computing performance, or FLOPS, ceases to improve, owing to continued advances in storage device technologies, optics, and manufacturing technologies including 3-D packaging. Such a transition from FLOPS to BYTES will lead to disruptive changes in overall systems, from applications, algorithms, and software to architecture, with respect to which parameters to optimize for in order to achieve continued performance growth over time. We are launching a new set of research efforts to investigate and devise new technologies enabling such disruptive changes from FLOPS to BYTES in the Post-Moore era, focusing on HPC, where sensitivity to performance is extreme, and we expect the results to disseminate to the rest of IT.
Authors
Takato Hideshima, Shigeyuki Sato, Kenjiro Taura
Publisher
Information Processing Society of Japan
Journal
Journal of Information Processing (ISSN:18826652)
Volume/Issue/Pages/Date
vol.30, pp.464-475, 2022 (Released:2022-06-15)
Number of references
31

Page-based distributed shared memory (PDSM) is a programming environment on distributed-memory computers that allows programmers to freely allocate shared regions in a virtual address space accessible from any computer. It hides the distributed physical memory from programmers and enables shared-memory programming over a uniform virtual address space. PDSM systems are typically equipped with coherent caches to improve performance while hiding communication, but the management cost is treated as an implementation detail and remains complex and implicit. Consequently, it is easy to fail to gain speedup, and it is difficult to perform the cost-aware programming needed to address this. In this study, we explore cost-aware programming for ArgoDSM, a state-of-the-art PDSM. In particular, based on the observation that there are three effective measures for reducing PDSM-derived costs, namely 1) informing the PDSM of changes in access patterns to shared regions, 2) inspecting the data to be placed in shared regions, and 3) performing writes with an awareness of the original owner of the shared region, we extend ArgoDSM with APIs that help with these measures. We performed cost-aware programming on the extended ArgoDSM for benchmark programs and experimentally showed that PDSM-derived costs can be significantly reduced. The proposed programming measures substantially improve situations where performance falls below sequential performance and allow programs to benefit from the scalability of distributed-memory computers under the high-level abstraction of PDSM.
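The sketch below illustrates measures 1) and 3) in code form. The dsm::* calls are mock placeholders standing in for the kind of extended APIs the abstract describes, not actual ArgoDSM identifiers, and ownership is faked as a block distribution for the purpose of the example.

// Hypothetical sketch of measures 1) and 3); the dsm::* calls are mock
// placeholders, not real ArgoDSM identifiers.
#include <cstddef>
#include <vector>

namespace dsm {                                        // mock runtime, for illustration only
enum class hint { read_mostly, write_shared };
constexpr int nodes = 4, me = 0;                       // pretend we are node 0 of 4
inline void set_access_hint(void*, std::size_t, hint) {}    // (1) announce a pattern change
inline int  owner_of(std::size_t i, std::size_t n) {         // home node of element i
    return static_cast<int>(i * nodes / n);                  // block distribution
}
}

void write_phase(std::vector<double>& data) {
    // (1) Tell the runtime the coming phase is write-heavy so it can adapt
    //     its coherence strategy instead of paying for read replication.
    dsm::set_access_hint(data.data(), data.size() * sizeof(double), dsm::hint::write_shared);

    // (3) Restrict writes to elements whose home node is this node, keeping
    //     updates local and avoiding remote write-back traffic.
    for (std::size_t i = 0; i < data.size(); ++i)
        if (dsm::owner_of(i, data.size()) == dsm::me)
            data[i] *= 2.0;
}

int main() {
    std::vector<double> data(1 << 10, 1.0);
    write_phase(data);
}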
Authors
Shigeki Akiyama, Kenjiro Taura
Journal
情報処理学会論文誌コンピューティングシステム(ACS) (ISSN:18827829)
Volume/Issue/Pages/Date
vol.9, no.1, 2016-03-08

Task parallelism on large-scale distributed-memory environments is still a challenging problem. The focus of our work is the flexibility of the task model and the scalability of inter-node load balancing. General task models provide functionality for suspending and resuming tasks at any program point, and such a model enables flexible task scheduling to achieve higher processor utilization, locality-aware task placement, and so on. To realize such a task model, we have to employ a thread (an execution context containing register values and stack frames) as the representation of a task and implement thread migration for inter-node load balancing. However, an existing thread migration scheme, iso-address, has a scalability limitation: it requires virtual memory on each node proportional to the total number of processors. In large-scale distributed-memory environments, this results in huge virtual memory usage beyond the virtual address space limit of current 64-bit CPUs. Furthermore, this huge virtual memory consumption makes it impossible to implement one-sided work stealing with Remote Direct Memory Access (RDMA) operations. One-sided work stealing is a popular approach to achieving highly efficient load balancing; therefore, this also limits the scalability of distributed-memory task parallelism. In prior work, we proposed uni-address, a new thread migration scheme that significantly reduces virtual memory usage for thread stacks and enables RDMA-based work stealing, and we implemented a lightweight multithreading library supporting RDMA-based work stealing on top of the Fujitsu FX10 system. In this paper, we port the library to an x86-64 InfiniBand cluster with the GASNet communication library. We develop one-sided and non-one-sided implementations of inter-node work stealing and evaluate the performance and efficiency of both implementations.

(This is a preprint of an article intended for publication in the Journal of Information Processing (JIP). This preprint should not be cited. This article should be cited as: Journal of Information Processing, Vol.24 (2016), No.3 (online).)
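The following sketch illustrates only the fixed-virtual-address mapping that a uni-address-style scheme relies on: if every node maps the region holding user-level thread stacks at the same agreed virtual address, pointers inside a migrated stack remain valid after the stack is transferred by RDMA. The base address and sizes are illustrative assumptions, and the snippet is not the library's code.

// A minimal sketch (not the library's code) of mapping a stack region at an
// address agreed upon by all nodes in advance.
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::uintptr_t kAgreedBase = 0x7f0000000000ULL;     // same value on every node (assumption)
constexpr std::size_t    kStackSize  = std::size_t(1) << 20;  // 1 MiB per thread stack
constexpr std::size_t    kNumSlots   = 256;                   // stack slots in the shared region

void* map_uni_address_region() {
    void* want = reinterpret_cast<void*>(kAgreedBase);
    void* got  = mmap(want, kStackSize * kNumSlots,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    // The address is only a hint here; a real runtime must guarantee the
    // placement (e.g., MAP_FIXED_NOREPLACE on Linux) or retry elsewhere.
    if (got == MAP_FAILED) return nullptr;
    if (got != want) { munmap(got, kStackSize * kNumSlots); return nullptr; }
    return got;
}

int main() {
    if (void* region = map_uni_address_region())
        std::printf("uni-address stack region mapped at %p\n", region);
    else
        std::fputs("could not obtain the agreed address\n", stderr);
}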
Authors
Nan Dun, Sugianto Angkasa, Kenjiro Taura, Ting Chen
Journal
研究報告ハイパフォーマンスコンピューティング(HPC)
Volume/Issue/Pages/Date
vol.2011, no.39, pp.1-7, 2011-07-20

Cynk is a hybrid file system using rsync and SSH for data-intensive cloud computing. By automatically synchronizing the local file system with cloud storage, Cynk enables users to transparently access local and remote data while online and to continue working when disconnected from the network. Thanks to its hybrid architecture, Cynk allows users to simultaneously access locally synchronized or cached data and online remote data over the network via a uniform file system interface. Cynk uses the rsync tool with a partially reasoning-based protocol to synchronize files from the local to the remote file system and vice versa, and it only requires a client installation on the local side. By seamlessly bridging the local file system and cloud storage, Cynk especially simplifies the work cycle of developing, testing, and deploying data-intensive applications.
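As a rough illustration of the underlying mechanism described above (not Cynk's implementation), the following sketch shells out to rsync over SSH to push local changes and pull remote ones; the host name and paths are placeholders, and only rsync and ssh on the client side are assumed.

// Minimal sketch: bidirectional directory synchronization via rsync over SSH,
// letting rsync's delta transfer move only the data that differs.
#include <cstdlib>
#include <string>

int sync_with_remote(const std::string& local_dir, const std::string& remote) {
    std::string push = "rsync -az -e ssh " + local_dir + "/ " + remote + "/";
    std::string pull = "rsync -az -e ssh " + remote + "/ " + local_dir + "/";
    if (int rc = std::system(push.c_str())) return rc;  // stop if the push fails
    return std::system(pull.c_str());
}

int main() {
    // Placeholder endpoints for the example.
    return sync_with_remote("/home/user/work", "user@cloud.example.com:work");
}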