Authors
Akihiro MUSA, Yoshiei SATO, Ryusuke EGAWA, Hiroyuki TAKIZAWA, Koki OKABE, Hiroaki KOBAYASHI
Publisher
The Editorial Committee of the Interdisciplinary Information Sciences
Journal
Interdisciplinary Information Sciences (ISSN: 1340-9050)
Volume/Issue/Pages, Publication Date
vol.15, no.1, pp.51-66, 2009 (Released:2009-03-25)
Number of references
21
Number of citations
2 3

Thanks to their high effective memory bandwidth, vector systems can achieve high computational efficiency for computation-intensive scientific applications. However, they have been encountering the memory wall problem, and their effective memory bandwidth rate has decreased; as a result, the bytes-per-flop (B/FLOP) rate of recent vector systems has dropped from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting worse as more functional units and/or cores are integrated into a single chip, because the pin bandwidth is limited and does not scale. To solve this problem, we propose an on-chip cache, called the vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism between the main memory and the register files under software control. We evaluate the performance of the vector cache on NEC SX vector processor architectures with 2 B/FLOP and 1 B/FLOP rates to clarify its basic characteristics. For the evaluation, we use the NEC SX-7 simulator extended with the vector cache mechanism. The benchmark programs are two DAXPY-like loops and five leading scientific applications. The results indicate that the vector cache boosts the computational efficiency of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. In particular, when the cache hit rate exceeds 50%, the 2 B/FLOP system achieves performance comparable to the 4 B/FLOP system, because the vector cache with the bypass mechanism can supply data from the main memory and the cache simultaneously. In addition, from the viewpoint of cache design, we investigate the impact of cache associativity on the cache hit rate and the relationship between cache latency and performance. The results also suggest that the associativity hardly affects the cache hit rate, and that the effect of cache latency depends on the vector loop length of the application. A shorter cache latency improves the performance of applications with shorter loop lengths, even in the 4 B/FLOP system. For longer loop lengths of 256 or more, the latency can effectively be hidden, and the performance is not sensitive to it. Finally, we discuss the effects of selective caching using the bypass mechanism and of loop unrolling on the vector cache performance for the scientific applications. Selective caching is effective for efficient use of the limited cache capacity. Loop unrolling is also effective for improving performance and has a synergistic effect with caching. However, there are exceptional cases in which loop unrolling worsens the cache hit rate because the working set of the unrolled loops exceeds the cache capacity; in such cases, the increase in the cache miss rate cancels the gain obtained by unrolling.
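The abstract names two DAXPY-like loops among the benchmarks and notes that loop unrolling can enlarge the working set beyond the vector cache. Since the actual kernels are not listed, the following C sketch is only an assumed illustration of such a loop and a 4-way unrolled variant; it omits any SX-specific vectorization directives and is not the authors' benchmark code.

```c
/* Illustrative sketch only: a generic DAXPY kernel and a 4-way unrolled
 * variant, assumed for explanation.  On an SX-class vector machine the loop
 * body would be executed by vector instructions; unrolling increases the
 * data touched per pass (the working set), which is the effect the abstract
 * says can exceed the vector cache and raise the miss rate. */
#include <stddef.h>

/* Plain DAXPY: y[i] += a * x[i] */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* 4-way unrolled DAXPY: less loop-control overhead and more independent
 * operations per pass, at the cost of a larger working set. */
void daxpy_unroll4(size_t n, double a, const double *x, double *y)
{
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)  /* remainder iterations */
        y[i] += a * x[i];
}
```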
Authors
Ye Gao, Ryusuke Egawa, Hiroyuki Takizawa, Hiroaki Kobayashi
Journal
IPSJ SIG Technical Report on Computer Architecture (ARC)
Volume/Issue/Pages, Publication Date
vol.2010-ARC-190, no.24, pp.1-10, 2010-07-27

Nowadays, multimedia applications (MMAs) form an important workload for general-purpose processors. Vector processing is considered the most promising approach for MMAs because of the abundant data-level parallelism they contain. However, traditional vector architectures obey an in-order issue policy (IIP), which blocks subsequent instructions from being issued, regardless of whether they are ready. This paper proposes a media-oriented vector architectural extension with an out-of-order vector processing mechanism (OVPM). The OVPM overcomes the inefficient utilization of the memory bandwidth and the vector functional units. As a result, the proposed architecture achieves higher performance at lower hardware cost than the traditional one. This paper also evaluates the proposed architecture over its architectural design parameters and identifies the most efficient size of the vector architecture for executing MMAs.
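The abstract contrasts the traditional in-order issue policy, under which a stalled instruction at the head of the issue window blocks younger, ready instructions, with the out-of-order issue of the OVPM. The short C sketch below is a hypothetical, software-level illustration of that difference only; it does not model the authors' OVPM hardware, its vector pipelines, or its cost.

```c
/* Minimal, hypothetical sketch of the issue-policy difference described in
 * the abstract.  Instructions in a small issue window are marked ready or
 * not; in-order issue must stop at the first non-ready entry, while
 * out-of-order issue may pick any ready entry in the window. */
#include <stdio.h>
#include <stdbool.h>

#define WINDOW 4

typedef struct {
    const char *name;
    bool ready;    /* operands available and functional unit free */
    bool issued;
} Inst;

/* In-order issue: stop as soon as the oldest unissued instruction is not ready. */
static int issue_in_order(Inst win[WINDOW])
{
    int issued = 0;
    for (int i = 0; i < WINDOW; i++) {
        if (win[i].issued) continue;
        if (!win[i].ready) break;      /* younger ready instructions are blocked */
        win[i].issued = true;
        issued++;
    }
    return issued;
}

/* Out-of-order issue: any ready, unissued instruction may be selected. */
static int issue_out_of_order(Inst win[WINDOW])
{
    int issued = 0;
    for (int i = 0; i < WINDOW; i++) {
        if (!win[i].issued && win[i].ready) {
            win[i].issued = true;
            issued++;
        }
    }
    return issued;
}

int main(void)
{
    /* The head of the window waits on memory; two younger instructions are ready. */
    Inst a[WINDOW] = {{"vload", false}, {"vadd", true}, {"vmul", true}, {"vstore", false}};
    Inst b[WINDOW] = {{"vload", false}, {"vadd", true}, {"vmul", true}, {"vstore", false}};

    printf("in-order issued this cycle:     %d\n", issue_in_order(a));      /* prints 0 */
    printf("out-of-order issued this cycle: %d\n", issue_out_of_order(b));  /* prints 2 */
    return 0;
}
```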