Julia编程(三): 高性能计算

本文是Julia编程系列的第三篇文章，着重讲述它在高性能计算方面的应用。

1 引言

2 语言分析¹

2.1 Profiling

Julia中自带Profile包，实现了所谓的采样(sampling)或统计分析(statistical profiler)。在任务执行期间，它通过周期性地回溯(backtrace)来工作。每次回溯捕获当前运行的函数和行号，以及导致这行的完整函数调用链，因此这是当前执行状态的一个快照。一个统计分析软件并不提供完整的逐行覆盖，因为回溯是在一定时间段发生的，默认情况下，Unix系统为1ms，Windows系统为10ms。并且采样是在所有执行点的一个稀疏子集上进行的，因此这样的数据具有一定的统计噪声。

function myfunc()
  A = rand(100, 100, 200) 
  maximum(A)  # expensive
end

@profile myfunc()
Profile.print()

@profile (for i = 1:100; myfunc(); end)
Profile.print()  # default "tree" dump

Profile.print(format=:flat)

代码中存在递归

dumbsum(n::Integer) = n == 1 ? 1 : 1 + dumbsum(n-1)
dumbsum3() = dumbsum(3)
@profile dumbsum3()
Profile.print()

累积(accumulation)&清除

@profile产生的结果都累积在一个缓冲区中，如果在@profile下运行多个代码块，那么Profile.print()将显示组合结果。我们可以调用Profile.clear()刷新缓冲区。

print参量说明

function print(io::IO = STDOUT, data = fetch(); format = :tree, C = false, combine = true, cols = tty, maxdepth, sortedby = :count, noisefloor, mincount)

data = copy(Profile.fetch())
Profile.clear()
@profile Profile.print(STDOUT, data) # Prints the previous results 
Profile.print() # Prints results from Profile.print()

s = open("/tmp/prof.txt","w")
Profile.print(s, cols = 500)
close(s)

配置

Profile.init() # returns the current settings
Profile.init(n, delay)  #default n = 10^6, delay = 0.001
Profile.init(delay = 0.01)

2.2 内存分配分析

减少内存分配是一种最常见的提升性能的技术。分配总数可以通过@time和@allocated来估量，特定的触发分配的行可以通过Profiling中由这些行引起的垃圾回收(garbage collection)代价推理出来。当然这也可以直接通过代码每一行分配的内存量测量出来。

$ julia  --track-allocation=<setting>

这里的<setting>选项包括none, user和all。

当退出Julia时，结果被写入相同目录下的.mem文件中。Coverage包包含一些基本分析工具，如按分配字节数的顺序将行进行排序。

2.3 Stack Traces

stacktrace()

example() = stacktrace()
example()

@noinline child() = stacktrace()
@noinline myparent() = child()  # avoid Base.parent
grandparent() = myparent()
grandparent()

信息抽取

top_frame = stacktrace()[1]
top_frame.func
top_frame.file
top_frame.line
top_frame.linfo
top_frame.inlined
top_frame.from_c
top_frame.pointer

错误处理

@noinline bad_function() = undeclared_variable

@noinline example() = try 
  bad_function()
catch
  stacktrace()
end

example()

@noinline example() = try 
  bad_function()
catch
  catch_stacktrace()
end

example()

@noinline child() = error("Whoops!")
@noinline myparent() = child()
@noinline function grandparent() 
  try
    myparent() 
  catch err
    println("ERROR: ", err.msg)
    catch_stacktrace()
  end
end

grandparent()

比较

backtrace()返回Ptr{Void}向量

trace = backtrace()
stacktrace(trace)

默认情况下，stacktrace()从stack栈中移除了低级C函数

stacktrace(trace, true)

pointer = backtrace()[1]
frame = StackTraces.lookup(pointer)
println("The top frame is from $(frame[1].func)!")

2.4 性能提升的技巧

避免全局变量

任何性能攸关或作为基准的代码都应该放置在函数中
置为常量
```
 	const DEFAULT_VAL = 0
```
声明类型
```
  global x
  y = f(x::Int + 1)
```
REPL中的所有代码都处于全局作用域

用`@time`评估性能&关注内存分配

@time不同于tic()toc()函数，它除了输出花费时间外还输出占用内存。

function f(n)
  s=0
  for i = 1:n
    s += i/2
  end
  s
end

@time f(1)
@time f(10^6)

工具

关于第三方包更具体的使用，请看下面第二部分。

Profiling
ProfileView包：用于复杂项目
@code_warntype
Lint包
TypeCheck包

避免带抽象类型参量的容器

a = Real[] # typeof(a) = Array{Real,1} 
if (f = rand()) < .8
  push!(a, f)
end

因为a是Real抽象类型的数组，所以它必须要能够承载任意Real值。Real对象是任意大小和任意结构的，所以a必须表示成指向单独分配的Real对象的指针数组。因为f是Float64类型的，所以a = Float64[]更好。这将创建可以高效操纵的64位浮点数值的连续块，这在系列文章的第一篇入门中也经常提到。

类型声明

避免抽象类型的域

struct MyAmbiguousType 
  a
end
b = MyAmbiguousType("Hello")
c = MyAmbiguousType(17)
typeof(b)
typeof(c)

编译器使用对象类型而不是值来确定如何构建代码，但是从对象中编译器并不能推出有用信息。下面的b和c具有相同类型，但是它们在内存中的基本数据表示是有很大不同的。即使你在域a中存储数值，UIint8和Float64类型的数值在内存中的表示仍然不同，因此CPU需要使用两种不同的指令来处理它们。因为所需信息在下面的类型中不能获知，因此这样的决策需要在运行时进行。

mutable struct MyType{T<:AbstractFloat} 
  a::T
end
m = MyType(3.2)
typeof(m)

m.a = 4.5f0
typeof(m.a)  # not change

func(m::MyType) = m.a + 1
code_llvm(func,(MyType{Float64},)) 
code_llvm(func,(MyType{AbstractFloat},))
code_llvm(func, (MyType, ))

避免抽象容器的域

mutable struct MySimpleContainer{A<:AbstractVector} 
  a::A
end
c = MySimpleContainer(1:3)
typeof(c)
c = MySimpleContainer([1:3;])
typeof(c)

对于a的不同类型，定义不同函数

function sumfoo(c::MySimpleContainer) 
  s=0
  for x in c.a
    s += foo(x)
  end
  s
end

foo(x::Integer) = x
foo(x::AbstractFloat) = round(x)

function myfun{T<:AbstractFloat}(c::MySimpleContainer{Vector{T}})   
  ...
end
function myfun{T<:Integer}(c::MySimpleContainer{Vector{T}})
  ...
end

对于UnitRange{T}或其它抽象类型又需要重新定义，这样非常单调乏味。

mutable struct MyContainer{T, A<:AbstractVector}
  a::A
end
MyContainer(v::AbstractVector) = MyContainer{eltype(v), typeof(v)}(v)
b = MyContainer(1.3:5)
typeof(b)

function myfunc{T<:Integer, A<:AbstractArray}(c::MyContainer{T, A}) 
  return c.a[1] + 1
end
# Note: because we can only define MyContainer for
# A<:AbstractArray, and any unspecified parameters are arbitrary,
# the previous could have been written more succinctly as
#     function myfunc{T<:Integer}(c::MyContainer{T})
function myfunc{T<:AbstractFloat}(c::MyContainer{T})
  return c.a[1] + 2
end
function myfunc{T<:Integer}(c::MyContainer{T,Vector{T}})
  return c.a[1] + 3
end
myfunc(MyContainer(1:3))
myfunc(MyContainer(1.0:3))
myfunc(MyContainer([1:3]))

但是很明显，上面我们并没有约束A的元素类型为T，下面我们用内部构造器进行修改

mutable struct MyBetterContainer{T<:Real, A<:AbstractVector}
  a::A
  MyBetterContainer{T, A}(v::AbstractVector{T}) where {T, A} = new(v) 
end
MyBetterContainer(v::AbstractVector) = MyBetterContainer{eltype(v),typeof(v)}(v)
b = MyBetterContainer(1.3:5)
typeof(b)

声明取自无类型位置处的值

作出这样的声明对于性能提升具有很多好处，如果值不是期望类型，将抛出一个运行时错误

function foo(a::Array{Any,1}) 
  x = a[1]::Int32
  b = x+1
  ...
end

声明关键字参量的类型

关键字声明不会影响函数内代码的性能。但是它们却可以减少包含关键字参量的函数调用的负载，因为对于调用站(call site)而言，带关键字参量的函数负载几乎为零，调用站只传递基于位置的参量。

尽量避免在性能攸关的代码中使用关键字参量列表，因为形如f(x; keywords...)的关键字动态列表的传递是很慢很慢的

function with_keyword(x; name::Int = 1)
  ...
end

将函数拆成多个定义

这使得编译器能够直接调用那些最适用的代码，甚至可以内连(inline)

function norm(A)
  if isa(A, Vector)
    return sqrt(real(dot(A,A))) 
  elseif isa(A, Matrix)
    return max(svd(A)[2]) 
  else
    error("norm: invalid argument") 
  end
end

norm(x::Vector) = sqrt(real(dot(x,x)))
norm(A::Matrix) = max(svd(A)[2])

写类型稳定的函数

pos(x) = x < 0 ? zero(x) : x  # or use oftype(x, 0)

避免改变变量类型

function foo() 
  x = 1
  for i = 1:10
    x = x/bar()
  end
  return x 
end

上面x = 1x为整型，在循环中，x变成浮点型，下面有几种修改方法

x = 1.0
x :: Float64 = 1
x = one(T)

核心计算独立

正如下面代码所示，如果不将核心计算部分独立出来，由于数组是随机产生的，编译器并不知道a的类型，而下面的代码中对于a的不同类型，内部循环被重新编译成fill_two!的一部分，并且这样的代码风格更适合代码重用。

function fill_twos!(a) 
  for i=1:length(a)
    a[i] = 2 
  end
end

function strange_twos(n)
  a = Array(rand(Bool) ? Int64 : Float64, n)
  fill_twos!(a)
  return a
end

值作参量的类型

编译器很容易明白A的类型Array{Float64, 2}，因为它知道填充值为浮点数，维数为NTuple{2, Int}

A = fill(5.0, (3, 3))

但是类型推理并不能提前预测整型值，因此可以通过Val传递。但是这种方式比较微妙，对它的无用甚至会造成更差的性能

function array3{N}(fillval, ::Type{Val{N}}) 
  fill(fillval, ntuple(d->3, Val{N}))
end

function filter3{T, N}(A::AbstractArray{T, N}) 
  kernel = array3(1, Val{N})
  filter(A, kernel)
end

避免对多指派甚至在值作参量的类型上的多指派的滥用

按列优先的顺序访问数组

column-major ordering: Fortran, Matlab, R, Julia
row-major ordering: C, Python(numpy)

x = [1 2; 3 4]
x[:]
vec(x)

预分配输出空间

function xinc!{T}(ret::AbstractVector{T}, x::T) 
  ret[1] = x
  ret[2] = x+1
  ret[3] = x+2
  nothing
end

function loopinc_prealloc() 
  ret = Array{Int}(3) 
  y=0
  for i = 1:10^7
    xinc!(ret, i)
    y += ret[2]
  end
  y
end

@time loopinc_prealloc()

避免在I/O中引入字符串插值

println(file, "$a $b")  # bad
println(file, a, " ", b)

在并行执行期间优化网络I/O

responses = Vector{Any}(nworkers())
@sync begin
  for (idx, pid) in enumerate(workers())
    @async responses[idx] = remotecall_fetch(pid, foo, args...)
  end 
end

微调(tweaks)

避免不必要的数组使用: x + y + z 好于 sum([x, y, z])
复数z模平方:abs2(z) 好于 abs(z) ^ 2
截断除法: div(x, y) 好于 trunc(x / y)
向下舍入除法: fld(x, y) 好于 floor(x / y)
向上舍入除法: cld(x, y) 好于 ceil(x / y)

将异常值(subnormal numbers)视为零

需要小心使用

set_zero_subnormals(true)

function timestep{T}( b::Vector{T}, a::Vector{T}, Δt::T ) 
  @assert length(a)==length(b)
  n = length(b)
  b[1] = 1 # Boundary condition 
  for i=2:n-1
    b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt   
  end
  b[n] = 0 # Boundary condition 
end

function heatflow{T}( a::Vector{T}, nstep::Integer ) 
  b = similar(a)
  for t=1:div(nstep,2) 
    timestep(b,a,T(0.1)) 
    timestep(a,b,T(0.1))
  end
end

heatflow(zeros(Float32,10),2) # Force compilation 
for trial=1:6
  a = zeros(Float32,1000)
  set_zero_subnormals(iseven(trial)) # Odd trials use strict IEEE arithmetic 
  @time heatflow(a,1000)
end

@code_warntype

code_warntype()
@code_lowered
@code_typed
@code_llvm
@code_native

pos(x) = x < 0 ? 0 : x

function f(x)
  y = pos(x)
  sin(y * x + 1)
end

@code_warntype f(3.2)

3 第三方包的使用

High Performance Computing²

Jeff Bezanson, Stefan Karpinski, Viral Shah, Alan Edelman, et al., Julia Language Documentation(Release 0.6.0-dev). ↩
Avik Sengupta, Julia High Performance, 2016. ↩

Written on February 14th, 2016 by 李军

Feel free to share!

Bruce Lee random walk

About Writing Contact

Julia编程(三): 高性能计算

1 引言

2 语言分析¹

2.1 Profiling

累积(accumulation)&清除

配置

2.2 内存分配分析

2.3 Stack Traces

信息抽取

错误处理

比较

2.4 性能提升的技巧

避免全局变量

用`@time`评估性能&关注内存分配

工具

避免带抽象类型参量的容器

类型声明

避免抽象类型的域

避免抽象容器的域

声明取自无类型位置处的值

声明关键字参量的类型

将函数拆成多个定义

写类型稳定的函数

避免改变变量类型

核心计算独立

值作参量的类型

避免对多指派甚至在值作参量的类型上的多指派的滥用

按列优先的顺序访问数组

预分配输出空间

避免在I/O中引入字符串插值

在并行执行期间优化网络I/O

微调(tweaks)

更多声明

将异常值(subnormal numbers)视为零

@code_warntype

3 第三方包的使用

Julia编程(三): 高性能计算

1 引言

2 语言分析1

2.1 Profiling

累积(accumulation)&清除

配置

2.2 内存分配分析

2.3 Stack Traces

信息抽取

错误处理

比较

2.4 性能提升的技巧

避免全局变量

用@time评估性能&关注内存分配

工具

避免带抽象类型参量的容器

类型声明

避免抽象类型的域

避免抽象容器的域

声明取自无类型位置处的值

声明关键字参量的类型

将函数拆成多个定义

写类型稳定的函数

避免改变变量类型

核心计算独立

值作参量的类型

避免对多指派甚至在值作参量的类型上的多指派的滥用

按列优先的顺序访问数组

预分配输出空间

避免在I/O中引入字符串插值

在并行执行期间优化网络I/O

微调(tweaks)

更多声明

将异常值(subnormal numbers)视为零

@code_warntype

3 第三方包的使用

You may also enjoy:

Julia编程(九): 内核剖析

Julia Codes: Ideas from Lloyd Trefethen's Book

Julia编程(八): 与其它语言的比较

Julia编程(二): 通用计算

2 语言分析¹

用`@time`评估性能&关注内存分配