This document summarizes a group project to parallelize the solving of Nonogram puzzles. The group explored several approaches to parallelization, including restructuring the code to exploit SIMD instructions, improving memory access patterns, and implementing several MPI versions for distributed processing. Evaluation showed that the MPI version with dynamic scheduling and asynchronous communication achieved the best speedup, roughly 5x on 10 cores. Further work on load balancing and data dependencies could yield additional gains, but the problem's inherently sequential structure limits how much parallelism can be extracted.
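To make the winning configuration concrete, below is a minimal sketch of MPI dynamic scheduling with asynchronous communication in a master-worker layout. It is not the project's actual code: `solve_puzzle`, `NUM_PUZZLES`, and the tag names are hypothetical placeholders, and the overlap of the completion report (`MPI_Isend`) with the wait for the next task is one plausible way such a scheme could be structured.

```c
/* Sketch only: master hands out puzzle indices on demand (dynamic scheduling);
 * workers report results with a non-blocking send so the report overlaps
 * with waiting for the next task. solve_puzzle() is a placeholder. */
#include <mpi.h>

#define NUM_PUZZLES 64   /* hypothetical task count */
#define TAG_WORK 1
#define TAG_DONE 2

static void solve_puzzle(int id) { (void)id; /* placeholder for the solver */ }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master */
        int next = 0, active = size - 1, done;
        MPI_Status st;
        for (int w = 1; w < size; ++w) {      /* seed every worker once */
            if (next < NUM_PUZZLES) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                int stop = -1;                /* nothing to do: shut it down */
                MPI_Send(&stop, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                --active;
            }
        }
        while (active > 0) {                  /* refill whichever worker finishes */
            MPI_Recv(&done, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_PUZZLES) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                int stop = -1;
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {                                  /* worker */
        int task, done;
        MPI_Request req;
        MPI_Recv(&task, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        while (task >= 0) {
            solve_puzzle(task);
            done = task;
            /* Non-blocking report: overlap it with waiting for the next task. */
            MPI_Isend(&done, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD, &req);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}
```

The dynamic element is that the master assigns the next puzzle to whichever worker reports back first, which is what helps when individual puzzles take very different amounts of time to solve.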