Article list: IPSJ Transactions on Advanced Computing Systems (ACS) (journal)

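OA Accelerating Astrophysical Computations with FPGAs

Authors: 中里直人, 濱田剛
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.47, no.SIG7(ACS14), pp.162-171, 2006-05-15

In astrophysics, GRAPE, a special-purpose computer for the gravitational many-body problem, has produced major results. To overcome the drawback that GRAPE's arithmetic functions are fixed, this paper accelerates astrophysical computations by executing floating-point operations on a computer built from Field Programmable Gate Arrays (FPGAs). Using PGR, software we developed for implementing floating-point arithmetic circuits on FPGAs, we not only sped up the gravitational many-body problem substantially but also achieved the world's first acceleration of the SPH method on a special-purpose computer. These results demonstrate that floating-point arithmetic on FPGAs is practical.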

http://id.nii.ac.jp/1001/00018338/

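A Defense Method Against DoS Attacks Based on Process Behavior in Web Application Servers

Authors: 中川岳, 追川修一
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.10, no.1, pp.1-18, 2017-05-25

Various methods have been proposed to defend web applications against Denial-of-Service (DoS) attacks, including using a Web Application Firewall (WAF), using the resource-limiting mechanisms of the OS, and judging the likelihood of an attack from trends in client requests. However, these methods cannot adequately handle DoS attacks that exploit vulnerabilities in a web application to make it consume large amounts of resources. This paper therefore proposes, as a defense for web applications, resource limiting based on the memory-consumption trend of processes. When a request that triggers a DoS attack arrives, the process handling it rapidly consumes a large amount of resources. The proposed method detects this rapid consumption and restricts the resource usage of that process, which suppresses the waste of resources caused by the attack and prevents the processing performance for legitimate requests from degrading. Based on this proposal, we designed and implemented a countermeasure mechanism driven by memory-consumption trends and evaluated it experimentally. The mechanism improved the request-processing performance of a web application under DoS attack by up to 4.3 times. We also confirmed that the degradation in request-processing performance caused by the method itself is very small, at most about 5.0%.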
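
The detection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' mechanism: the sampling format, the function names `memory_growth_rate` and `should_limit`, and the 100 MB/s threshold are all assumptions made for illustration. The sketch only shows the idea of flagging a process whose memory consumption grows unusually fast.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # (time in seconds, memory in bytes); hypothetical format


def memory_growth_rate(samples: Sequence[Sample]) -> float:
    """Average growth in bytes per second between the first and last sample."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (m1 - m0) / (t1 - t0)


def should_limit(samples: Sequence[Sample], max_rate_bytes_per_s: float) -> bool:
    """Flag a process whose memory grows faster than the given threshold."""
    return len(samples) >= 2 and memory_growth_rate(samples) > max_rate_bytes_per_s


# Hypothetical samples for a normally behaving process and an attacked one.
normal = [(0.0, 50_000_000), (1.0, 51_000_000), (2.0, 52_000_000)]
attacked = [(0.0, 50_000_000), (1.0, 250_000_000), (2.0, 650_000_000)]
THRESHOLD = 100_000_000  # 100 MB/s, an arbitrary illustrative value

limit_normal = should_limit(normal, THRESHOLD)
limit_attacked = should_limit(attacked, THRESHOLD)
```

In a real deployment the flagged process would then be subjected to a resource limit; how the paper applies that limit is not described in the abstract, so it is left out here.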

http://id.nii.ac.jp/1001/00178962/

Applying Control Charts to the Detection of Performance Anomaly Symptoms in Web Applications

Authors: 岩田聡, 河野健二
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.3, no.3, pp.221-234, 2010-09-17

Performance anomalies are becoming a serious problem in web applications, and they need to be detected early, before the damage becomes critical. However, request processing time, which is used as one performance metric, fluctuates widely even under normal conditions, so it is difficult to detect the symptoms of an anomaly from slight changes in it. In this paper, we apply control charts, a statistical method for detecting deviations from the standard quality of products, to detect such symptoms from slight changes in request processing time. Two refinements improve detection accuracy and yield information that is useful for diagnosing the cause after detection. First, instead of monitoring individual request processing times, we monitor four statistics of the processing times computed over fixed time intervals. Second, we build a control chart for each request type rather than for the web application as a whole. In case studies with RUBiS, an auction site modeled after ebay.com, control charts detected symptoms of performance anomalies: (1) an increase in processing time due to inappropriate database index design, and (2) an increase in processing time due to improper settings of performance parameters. We confirmed that the detected anomalies can be resolved by adding database indexes and tuning the performance parameters.
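
As a rough illustration of the control-chart idea, the sketch below builds a Shewhart-style chart with 3-sigma limits from a baseline and flags later points that fall outside the limits. The abstract does not name the four statistics the authors monitor, so this sketch assumes a single statistic (the mean processing time per interval, in milliseconds); the data and the function names are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower limit, center line, upper limit) from baseline observations."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - k * spread, center, center + k * spread


def out_of_control(values, limits):
    """Indices of observations that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical per-interval mean processing times (ms) for one request type.
baseline_means = [100, 102, 98, 101, 99, 100, 103, 97]
limits = control_limits(baseline_means)

# Later intervals to be monitored; one interval is clearly slower.
monitored = [101, 99, 104, 120, 100]
flagged = out_of_control(monitored, limits)
```

Following the abstract, one such chart would be kept per request type, so that a flagged chart already points at the kind of request that slowed down.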

https://ci.nii.ac.jp/naid/110007990322

Achieving Over 1 Petaflops Linpack Performance on the TSUBAME 2.0 Supercomputer

Authors: 遠藤敏夫, 額田彰, 松岡聡
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.4, no.4, pp.169-179, 2011-10-05

We report Linpack benchmark results on the TSUBAME 2.0 supercomputer, a petascale heterogeneous system with Intel processors and more than 4,000 NVIDIA GPUs that began operation in November 2010. The system has about 1,400 compute nodes, each equipped with two CPUs and three GPUs, connected by a Dual-Rail QDR InfiniBand network with a full-bisection fat-tree topology. The theoretical peak performance is 2.4 PFlops, about 30 times that of its predecessor TSUBAME 1.0, at a power consumption similar to that of TSUBAME 1.0. We improved and tuned the Linpack benchmark code for the characteristics of large-scale GPU systems and achieved 1.192 PFlops. This is the first result by a Japanese supercomputer to exceed 1 PFlops, and it ranked 4th in the Top500 supercomputer ranking. In addition, with a performance-per-watt ratio of 958 MFlops/W, TSUBAME 2.0 received "the Greenest Production Supercomputer in the World" award in the Green500 ranking.

https://ci.nii.ac.jp/naid/40019259212

An OpenMP Parallelization Method for Current Calculation in Plasma Particle Simulations

Authors: 臼井英之, 杉崎由典, 冨田清司, 大村善治, 三宅洋平, 青木正樹
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.1, no.2, pp.250-260, 2008-08-21

In Particle-In-Cell (PIC) plasma simulations, the current density is needed to advance the electromagnetic fields, and it is obtained by gathering the velocity moment of each particle onto the adjacent grid points. Because particles are distributed randomly in space and their positions are independent of the index of the current-density array, the current calculation is not easy to parallelize. In this paper, we propose a new thread-parallelization algorithm that uses particle position information to assign particles to threads explicitly, and we evaluate an OpenMP implementation of it. Each thread is in charge of one part of the current-density array, which is divided by the number of threads. We found that scalability depends on the number of spatial grid points and is independent of the number of particles per grid cell. When the part of the array assigned to each thread becomes small enough to fit in the CPU data cache, parallel speedup is easy to obtain. At around 10 threads in particular, the speedup becomes super-linear and execution is faster than a current routine built with an automatically parallelizing compiler. In addition, the method is a redundant parallelization in which every thread scans the entire particle array to find the particles it is responsible for, so the work arrays used by the conventional algorithm are unnecessary and the memory required for the simulation is greatly reduced.
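
The ownership scheme in this abstract can be sketched in a few lines. The sketch is a sequential Python stand-in for the OpenMP version, under simplifying assumptions that are not from the paper: a one-dimensional grid, nearest-grid-point deposition (the cell index is the integer part of a non-negative position), and velocity used directly as the deposited quantity. Each pass of the outer loop plays the role of one thread, which scans all particles and deposits only those that fall into its own slice of the current array.

```python
def deposit_serial(positions, velocities, n_cells):
    """Reference version: one loop deposits every particle."""
    current = [0.0] * n_cells
    for x, v in zip(positions, velocities):
        current[int(x) % n_cells] += v
    return current


def deposit_by_owner(positions, velocities, n_cells, n_threads):
    """Each simulated thread owns a contiguous slice of the current array."""
    current = [0.0] * n_cells
    chunk = (n_cells + n_threads - 1) // n_threads  # slice size, rounded up
    for tid in range(n_threads):  # stands in for one OpenMP thread per pass
        lo = tid * chunk
        hi = min(lo + chunk, n_cells)
        # Redundant scan: every thread looks at all particles ...
        for x, v in zip(positions, velocities):
            cell = int(x) % n_cells
            # ... but writes only to cells inside its own slice.
            if lo <= cell < hi:
                current[cell] += v
    return current


# Hypothetical particle data on an 8-cell grid.
positions = [0.2, 3.7, 1.1, 3.2, 7.9, 5.5, 1.8]
velocities = [1.0, 2.0, -0.5, 0.25, 4.0, 1.5, 0.5]
serial = deposit_serial(positions, velocities, 8)
owned = deposit_by_owner(positions, velocities, 8, 3)
```

Because no two simulated threads ever write to the same cell, a real threaded version of this scheme needs neither atomic updates nor per-thread work arrays, which matches the memory saving the abstract reports.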

https://ci.nii.ac.jp/naid/110007990189

OA Scalable Work Stealing of Native Threads on an x86-64 Infiniband Cluster

Authors: Shigeki Akiyama, Kenjiro Taura
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.9, no.1, 2016-03-08

Task parallelism on large-scale distributed memory environments is still a challenging problem. Our work focuses on the flexibility of the task model and the scalability of inter-node load balancing. General task models provide functionality for suspending and resuming tasks at any program point, and such a model enables flexible task scheduling to achieve higher processor utilization, locality-aware task placement, and so on. To realize such a task model, we have to employ a thread (an execution context containing register values and stack frames) as the representation of a task, and implement thread migration for inter-node load balancing. However, an existing thread migration scheme, iso-address, has a scalability limitation: it requires virtual memory proportional to the number of processors in each node. In large-scale distributed memory environments, this results in huge virtual memory usage beyond the virtual address space limit of current 64-bit CPUs. Furthermore, this huge virtual memory consumption makes it impossible to implement one-sided work stealing with Remote Direct Memory Access (RDMA) operations. One-sided work stealing is a popular approach to achieving highly efficient load balancing; therefore this also limits the scalability of distributed memory task parallelism. In prior work, we proposed uni-address, a new thread migration scheme that significantly reduces virtual memory usage for thread stacks and enables RDMA-based work stealing, and implemented a lightweight multithread library supporting RDMA-based work stealing on top of the Fujitsu FX10 system. In this paper, we port the library to an x86-64 InfiniBand cluster with the GASNet communication library. We develop one-sided and non-one-sided implementations of inter-node work stealing, and evaluate the performance and efficiency of the work stealing implementations.

This is a preprint of an article intended for publication in the Journal of Information Processing (JIP). This preprint should not be cited. This article should be cited as: Journal of Information Processing Vol.24 (2016) No.3 (online).

http://id.nii.ac.jp/1001/00158016/

Design and Implementation of an Asynchronous Communication Mechanism Suited to the MPI Communication Model

Authors: 松田元彦, 石川裕, 工藤知宏, 手塚宏史
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.45, no.11, pp.14-23, 2004-10-15

To implement an efficient MPI communication library for large-scale commodity-based clusters, we designed and implemented a new communication mechanism, the O2G driver. O2G leaves the TCP/IP protocol layer itself unchanged and embeds the receive-queue management needed by an MPI implementation into the protocol handler. Inside the protocol handler, which is invoked by interrupts, received data is read from the TCP receive buffer and copied into user space. This suppresses the stalls in message flow caused by receive-buffer overflow and keeps communication performance from degrading. It also eliminates the polling required with the conventional socket API, reducing system-call overhead. On the IS benchmark of the NAS Parallel Benchmarks, an MPI implementation using O2G ran three times faster than conventional MPI implementations. A bandwidth evaluation further shows that, while the bandwidth of a socket-based MPI implementation drops as the number of connections grows, O2G sustains high-performance data reception regardless of the number of connections.

https://ci.nii.ac.jp/naid/110002712277

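OA Optimal Broadcast Scheduling in Tree-Structured Networks

Authors: 蓬来祐一郎, 西田晃, 小柳義夫
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.45, no.SIG03(ACS5), pp.100-108, 2004-03-15

The scheduling of collective communication has a large effect on communication time. Previous studies abstracted the network and avoided more realistic models such as hubs and heterogeneous networks. However, the problem is growing in importance with rising interest in grid computing and increasing demand for applications such as distributed databases. In this study, we therefore consider optimal broadcast scheduling on tree structures, where scheduling is expected to have a large impact. We first show that the problem becomes NP-hard when heterogeneous networks are taken into account, and we propose a method that searches for the optimal solution by branch and bound with depth-first search. We also introduce a technique that uses a fast tree-isomorphism algorithm to eliminate the redundancy arising from the symmetry of the tree, and we show its effectiveness. Finally, tests on real machines compare the method with MPI_Bcast, the broadcast function of a general-purpose MPI implementation, and show that broadcast execution time can be reduced substantially in some cases.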

http://id.nii.ac.jp/1001/00018512/

High-Performance "Small-Scale" Simulation of Star Cluster Evolution on Cray XD1

Authors: 似鳥啓吾, 牧野淳一郎, 阿部譲司
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.48, no.8, pp.54-61, 2007-05-15

In this paper, we describe the implementation and performance of an N-body simulation code for a star cluster with 64k stars on a Cray XD1 system with 400 dual-core Opteron processors. There have been many reports on the parallelization of astrophysical N-body simulations. For parallel implementations on more than a few tens of processors, performance has usually been measured with very large numbers of particles. For example, all previous entries for the Gordon Bell Prize used at least 700k particles. The reason for this preference is parallel efficiency: it is very difficult to achieve high performance on large parallel machines when the number of particles is small. However, for many scientifically important problems the calculation cost scales as O(N^3.3), so it is very important to be able to use large machines for relatively small numbers of particles. We achieved 2.03 Tflops, or 57.7% of the theoretical peak performance, using a direct O(N^2) calculation with the individual timestep algorithm on 64k particles. The best efficiency previously reported for a similar calculation with 64k or fewer particles is 7.8% (9 Gflops) on a Cray T3E-900 with 128 processors. Our implementation is based on a two-dimensional parallelization scheme that is more scalable than conventional methods, and the low-latency communication network of the Cray XD1 turned out to be essential for achieving this level of performance.

https://ci.nii.ac.jp/naid/110006274062

A Performance Model for Supporting the Development of GPGPU Applications

Authors: 伊藤信悟, 伊野文彦, 萩原兼一
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.48, no.13, pp.235-246, 2007-08-15
Cited by: 2

GPGPU stands for general-purpose computation on graphics processing units (GPUs), an effort to apply the GPU to general problems beyond graphics. This paper presents a performance model for typical GPGPU implementations that predicts the speedup achievable with the GPU. The model focuses on the fact that most GPGPU implementations deal with memory-intensive problems and access data regularly. Based on this fact, we represent overall performance as the transfer performance of the data paths connecting main memory, video memory, and the processors inside the GPU. Each transfer performance is represented simply by a pair of bandwidth and latency, both of which can be measured independently of the GPGPU application. We applied the model to an image filter and LU decomposition and evaluated it on three generations of GPUs. The model has an error of 20% in the worst case. Because the observed errors are within 10% when the GPU cache does not have a significant effect on performance, we consider the model useful for estimating the speedup that typical implementations can expect from the GPU.
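
A bandwidth-and-latency model of this kind can be sketched as follows. The abstract says only that each data path is described by a bandwidth and a latency; how the paper combines the paths is not stated, so this sketch assumes the simplest composition, in which the transfers do not overlap and their times add up. The class name `DataPath`, the function `predicted_time`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPath:
    """One data path, described only by its bandwidth and latency."""
    bandwidth_bytes_per_s: float
    latency_s: float

    def transfer_time(self, n_bytes, n_transfers=1):
        # Each transfer pays the latency once; the payload streams at full bandwidth.
        return n_transfers * self.latency_s + n_bytes / self.bandwidth_bytes_per_s


def predicted_time(stages):
    """Sum the times of (path, bytes, transfers) stages, assuming no overlap."""
    return sum(path.transfer_time(n_bytes, n) for path, n_bytes, n in stages)


# Hypothetical paths: main memory to video memory, video memory to GPU
# processors, and video memory back to main memory.
upload = DataPath(bandwidth_bytes_per_s=2e9, latency_s=10e-6)
on_gpu = DataPath(bandwidth_bytes_per_s=40e9, latency_s=1e-6)
readback = DataPath(bandwidth_bytes_per_s=1e9, latency_s=20e-6)

image_bytes = 4 * 1024 * 1024  # a 4 MiB image, chosen arbitrarily
estimate_s = predicted_time([
    (upload, image_bytes, 1),
    (on_gpu, 2 * image_bytes, 1),  # read the input once, write the output once
    (readback, image_bytes, 1),
])
```

Comparing such an estimate with the measured CPU time gives a quick indication of whether moving a memory-bound kernel to the GPU is likely to pay off.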

https://ci.nii.ac.jp/naid/110006367097

Teramem: A Remote Swapping System for Terascale Computing

Authors: 山本和典, 石川裕
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.2, no.3, pp.142-152, 2009-09-18
Cited by: 3

We propose Teramem, a remote swapping system that efficiently provides a large virtual memory space on 64-bit commodity architectures. Teramem has two main features. i) It is implemented as a loadable Linux kernel module. Because it runs at the kernel level, it can implement swap-out algorithms such as pseudo-LRU based on memory management information, unlike traditional user-level remote swapping systems. ii) It is implemented independently of the Linux swap mechanism, so page transfer to remote memory is optimized. The evaluation shows that the GNU sort benchmark runs more than 40 times faster on Teramem than with disk swapping, and that a 1 MB memory block is swapped in with a latency of about 1.2 msec.

https://ci.nii.ac.jp/naid/110007990257

A Proposal of the Two-Path Limited Speculation Method

Authors: 横田隆史, 斎藤盛幸, 大津金光, 古川文人, 馬場敬信
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.46, no.16, pp.1-13, 2005-12-15
References: 20
Cited by: 23

As LSI integration advances, the hardware resources available to computer systems keep growing, while improvements in clock speed are saturating. Effective execution methods that exploit instruction-level and thread-level parallelism are therefore required. This paper addresses path-based speculative multithreading, in which the execution path to be taken in the next iteration of a frequently executed hot loop is predicted and executed speculatively, and examines a practical and effective way to obtain thread-level parallelism. Speculating on paths eases inter-thread dependence problems and allows thread code to be optimized, but it raises the question of how to predict and speculate on paths effectively. We show that in typical programs, restricting prediction and speculation to the two most frequently executed paths causes no practical problem in most cases, and we propose the two-path limited speculation method, which speculates on only those two paths. When the top two paths are dominant, execution efficiency improves because, even if speculation on the first path fails, the runner-up path succeeds with high probability. We model the proposed method and estimate its performance analytically, and we demonstrate its effectiveness with a trace-based simulator that uses a path predictor based on a two-level branch predictor.
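
The observation behind the two-path limit can be checked on a trace with a few lines of code. The sketch below is only an illustration of that measurement, not the paper's predictor or simulator: it takes a hypothetical list of path identifiers, one per hot-loop iteration, and reports which two paths are most frequent and what fraction of iterations they cover.

```python
from collections import Counter


def top_two_coverage(path_trace):
    """Return the two most frequent paths and the fraction of iterations they cover."""
    if not path_trace:
        raise ValueError("path_trace must not be empty")
    top = Counter(path_trace).most_common(2)
    covered = sum(count for _, count in top)
    return [path for path, _ in top], covered / len(path_trace)


# Hypothetical trace: the path taken in each iteration of one hot loop.
trace = ["A", "A", "B", "A", "C", "A", "B", "A", "A", "B"]
paths, coverage = top_two_coverage(trace)
```

A coverage close to 1 means that speculating on only these two paths loses little, which is the situation the abstract describes as typical.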

https://ci.nii.ac.jp/naid/110002973618

Performance Optimization of a Seismic Wave Simulation Code on the K Computer

Authors: 井上俊介, 堤重信, 前田拓人, 南一生
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.6, no.3, pp.22-30, 2013-09-25

To optimize performance on the K computer, RIKEN selected six priority applications from various scientific fields and has been improving their single-CPU performance and massive parallelization. Seism3D, a seismic wave simulation code selected from the earth science field, is characterized by computations that demand a relatively high Byte/Flop ratio and by communication only between neighboring processes. The key optimization points are therefore making full use of memory bandwidth, using the cache efficiently, and realizing optimal neighbor communication on the six-dimensional mesh/torus network. We estimated the attainable fraction of peak performance from the Byte/Flop ratio required by the code, used the detailed profiler to identify bottlenecks, and then measured, tuned, and verified both the single-CPU improvements and the communication part. As a result, we achieved 17.9% of theoretical peak performance (1.9 PFLOPS) with 82,944-way parallelism, which we report in this paper.
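
An estimate of peak-ratio performance from a required Byte/Flop ratio can be sketched with a simple bandwidth bound. The abstract does not describe the authors' estimation procedure, so the sketch assumes the usual memory-bound argument: if a code needs more bytes per floating-point operation than the machine can deliver, the attainable fraction of peak is the ratio of the two. The Byte/Flop values below are hypothetical; only the 1.9 PFLOPS and 17.9% figures come from the abstract.

```python
def attainable_peak_fraction(machine_bytes_per_flop, required_bytes_per_flop):
    """Upper bound on the fraction of peak for a memory-bandwidth-bound code."""
    if machine_bytes_per_flop <= 0 or required_bytes_per_flop <= 0:
        raise ValueError("Byte/Flop ratios must be positive")
    return min(1.0, machine_bytes_per_flop / required_bytes_per_flop)


def implied_peak(achieved, fraction_of_peak):
    """Peak performance implied by an achieved rate and its share of peak."""
    return achieved / fraction_of_peak


# Hypothetical ratios: the machine supplies 0.5 Byte/Flop, the code needs 2.0.
estimate = attainable_peak_fraction(0.5, 2.0)

# Figures quoted in the abstract: 1.9 PFLOPS at 17.9% of theoretical peak.
peak_pflops = implied_peak(1.9, 0.179)
```

The second calculation is just a consistency check on the reported numbers: 1.9 PFLOPS at 17.9% of peak implies a theoretical peak of roughly 10.6 PFLOPS for the processes used.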

https://ci.nii.ac.jp/naid/110009606657

Worst-Case Response Time Analysis of CAN Messages with Offsets

Authors: 飯山真一, 冨山宏之, 高田広章, 城戸正利, 細谷伊知郎
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.45, no.11, pp.455-464, 2004-10-15
References: 8
Cited by: 5

CAN (Controller Area Network) is the de facto standard for automotive control networks, and methods for evaluating the worst-case response time of CAN messages have been proposed. However, these conventional methods cannot handle messages whose transmission request times have offsets. To cover a wider range of real systems, this paper proposes an analysis method for a message model in which the transmission request times of grouped messages have offsets. We also applied the method to a message set being considered for use in the control network of an actual vehicle and confirmed its effectiveness.

https://ci.nii.ac.jp/naid/110002712315

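OA A Task Mapping Method Focusing on Communication Contention for Many-Core Processors

Authors: 佐野伸太郎, 吉瀬謙二
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.4, no.4, pp.96-109, 2011-10-05

In many-core architectures whose cores are connected by a Network-on-Chip, performance varies greatly with how parallelized tasks are assigned to cores, so it is desirable to find the optimal task placement automatically. This paper proposes a pattern-based placement method as a task mapping method aimed at improving the performance of many-core processors. An evaluation with a simulator confirmed the usefulness of the proposed method on the NAS Parallel Benchmarks.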

http://id.nii.ac.jp/1001/00078056/

A Helper Thread Execution Method for Multicores Aimed at Improving Parallel Processing Performance

Authors: 福本尚人, 佐々木広, 井上弘士, 村上和彰
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.5, no.3, pp.101-111, 2012-05-29

This paper proposes a helper-thread execution method for improving the performance of multicore processors. Integrating multiple processor cores into a single chip can raise peak performance by exploiting thread-level parallelism (TLP), but the memory-wall problem becomes more critical in multicore processors, resulting in poor performance in spite of high TLP. To address this, we use processor cores not only for computation but also for improving memory performance: the cores assigned to memory performance run helper threads that perform prefetching. The proposed method runs helper threads on cores left idle by, for example, inter-core synchronization. In addition, when memory performance is the bottleneck, it reduces the number of cores executing the parallel program and runs helper threads on the freed cores. By varying the number of cores devoted to memory performance according to the characteristics of the program, it strikes an appropriate balance between compute performance and memory performance. In a simulator-based evaluation, the proposed method achieved up to a 42% performance improvement over conventional execution on all cores.

https://ci.nii.ac.jp/naid/40019469301

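OA Application-Aware Control of In-Kernel Behavior by a VMM

Authors: 尾上浩一, 大山恵弘, 米澤明憲
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.3, no.2, pp.163-176, 2010-06-21

Attacks by malware that runs at the kernel level (kernel-level malware) are a serious threat because they can damage the entire system and are hard to detect. Various security systems against kernel-level malware have been proposed, but they leave room for improvement in their overly conservative restrictions on kernel extensions and in the runtime performance degradation they cause. This paper proposes ShadowXeck, a security system that uses a virtual machine monitor (VMM) to control the behavior of the OS kernel running inside a VM. The control is realized by protecting read-only memory regions and by controlling the indirect call and indirect jump instructions issued by the OS kernel. Because ShadowXeck exerts its control from the VMM, which runs at a higher privilege level than the OS kernel, it is difficult to disable its behavior-control mechanism from inside the VM. We implemented ShadowXeck using Xen on the AMD64 architecture, confirmed its control of OS kernel behavior using existing kernel-level malware, and measured its runtime overhead.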

http://id.nii.ac.jp/1001/00069743/

An MPI Collective Communication Algorithm Using Multiple Lanes in Grid Environments

Authors: 千葉立寛, 遠藤敏夫, 松岡聡
Publisher: Information Processing Society of Japan
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.48, no.8, pp.104-113, 2007-05-15

The performance of MPI collective operations, such as broadcast and reduction, depends heavily on network topology, especially in grid environments, and many techniques for constructing efficient broadcast trees have been proposed for grids. Meanwhile, nodes in recent cluster systems are often equipped with multiple network interface cards (NICs), but most previously proposed methods build their topology on the assumption that each node has a single port for sending and receiving, and so fail to use these NICs effectively. We propose a multi-lane broadcast tree construction algorithm that makes maximal use of the bandwidth of the two NICs in each node. The algorithm splits the message to be broadcast into two pieces, builds two independent binary trees using the two NICs, and transfers the pieces along the trees in a pipelined fashion, swapping them between the two trees. The algorithm is general: it works effectively on both clusters and grids, and it can also be applied to nodes with a single NIC by opening multiple sockets that share the NIC. Experiments with broadcast on an emulated network environment show that it achieves higher performance than traditional methods, regardless of network topology or message size.

https://ci.nii.ac.jp/naid/110006274067

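OA Speeding Up LogTM by Eliminating Starving Writers

Authors: 江藤正通, 堀場匠一朗, 浅井宏樹, 津邑公暁, 松尾啓志
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.5, no.5, pp.55-65, 2012-10-15

In parallel programming on multicore environments, locks have generally been used to arbitrate memory accesses. Locks, however, bring problems such as deadlock and reduced parallelism. LogTM has been proposed as a concurrency control mechanism that does not use locks. LogTM resolves conflicts using a flag called possible_cycle. With this conflict resolution scheme, however, starving writers occur, and long stalls and repeated conflicts degrade performance severely. This paper proposes a method for resolving starving writers. A simulation-based evaluation confirmed that the proposed method improves performance over the existing LogTM by up to 18.7%, and by 6.6% on average.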

http://id.nii.ac.jp/1001/00086043/

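OA An Abort Reduction Method for LogTM Through Stricter Deadlock Detection

Authors: 堀場匠一朗, 江藤正通, 浅井宏樹, 津邑公暁, 松尾啓志
Journal: IPSJ Transactions on Advanced Computing Systems (ACS) (ISSN: 1882-7829)
Volume, issue, pages / publication date: vol.5, no.5, pp.43-54, 2012-10-15

In parallel programming on multicore environments, locks have been widely used to control access to shared memory. Locks, however, have problems such as reduced parallelism and deadlock. Transactional memory has been proposed as a concurrency control mechanism that does not use locks. LogTM, one hardware implementation of it, detects deadlock using a flag called possible_cycle. This approach, however, produces false positives in deadlock detection, so aborts may occur excessively. This paper proposes a method that detects deadlock by taking into account the dependences among three or more transactions, and also examines how to select an appropriate transaction to abort. In a simulation-based evaluation, the proposed method suppressed aborts and reduced costs such as writing back the log, yielding a speedup of up to 31.5%.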
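
The difference between a conservative deadlock flag and an exact check can be illustrated in software. The sketch below is not the hardware mechanism proposed in the paper, whose details are not given in the abstract. It only shows the textbook criterion that the abstract alludes to: a deadlock exists exactly when the dependences among transactions form a cycle, which may involve three or more of them. The wait-for graph representation and the function name `find_cycle` are assumptions for illustration.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of transactions, or None.

    waits_for maps a transaction id to the set of transactions it is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(waits_for.get(node, ())):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # reached a transaction already on the path
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(waits_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Three transactions waiting on each other in a ring: a real deadlock.
deadlocked = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
# Two transactions waiting on the same third one: contention, but no deadlock.
contended = {"T1": {"T2"}, "T3": {"T2"}, "T2": set()}

cycle = find_cycle(deadlocked)
no_cycle = find_cycle(contended)
```

Once a cycle is known, any one transaction on it can be chosen as the abort victim; which choice is best is the second question the abstract says the paper examines.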

http://id.nii.ac.jp/1001/00086042/