国模GOGO无码人体啪啪,疯狂撞击丝袜人妻

選自nextjournal，作者：Simon Danisch，機器之心編譯，參與：高璇、劉曉坤。

GPU 的并行線程可以大幅提升速度，但也使得代碼編寫變得更復(fù)雜。而 Julia 作為一種高級腳本語言，允許在其中編寫內(nèi)核和環(huán)境代碼，并可在大多數(shù) GPU 硬件上運行。本文旨在介紹 GPU 的工作原理，詳細說明當前的 Julia GPU 環(huán)境，以及展示如何輕松運行簡單 GPU 程序。

為了簡化操作，可以在 nextjournal 上注冊賬戶，點擊「edit」即可直接運行文章中的簡單代碼了。

注冊地址：https://nextjournal.com/signup

首先，什么是 GPU？

GPU 是一種大型并行處理器，有幾千個并行處理單元。例如，本文使用的 Tesla k80，能提供 4992 個并行 CUDA 核。GPU 在頻率、延遲和硬件性能方面與 CPU 有很大的不同，但實際上 Tesla k80 有點類似于具有 4992 核的慢速 CPU。

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

能夠啟動的并行線程可以大幅提升速度，但也令使用 GPU 變得更困難。當使用這種未加處理的能量時，會出現(xiàn)以下缺點：

GPU 是一種有專屬內(nèi)存空間和不同架構(gòu)的獨立硬件。因此，從 RAM 到 GPU 內(nèi)存（VRAM，顯存）的傳輸時間很長。甚至在 GPU 上啟動內(nèi)核（調(diào)用調(diào)度函數(shù)）也會帶來很大的延遲，對于 GPU 而言是 10us 左右，而對于 CPU 只有幾納秒。
在沒有高級封裝的情況下，建立內(nèi)核會變得復(fù)雜。
低精度是默認值，高精度的計算可以很容易地消除所有性能增益。
GPU 函數(shù)（內(nèi)核）本質(zhì)上是并行的，所以編寫 GPU 內(nèi)核不比編寫并行 CPU 代碼容易，而且硬件上的差異增加了一定的復(fù)雜性。
與上述情況相關(guān)的很多算法都不能很好地遷移到 GPU 上。想要了解更多的細節(jié)，請看這篇博文：https://streamhpc.com/blog/2013-06-03/the application-areas-OpenCL– cuda-can- used/。
內(nèi)核通常是用 C/ C 語言編寫的，但這并不是寫算法的最好語言。
CUDA 和 OpenCL 之間有差異，OpenCL 是編寫底層 GPU 代碼的主要框架。雖然 CUDA 只支持英偉達硬件，OpenCL 支持所有硬件，但并不精細。要看個人需求進行選擇。

而 Julia 作為一種高級腳本語言，允許在其中編寫內(nèi)核和環(huán)境代碼，同時可在大多數(shù) GPU 硬件上運行！

GPUArrays

大多數(shù)高度并行的算法都需要同時處理大量數(shù)據(jù)，以克服所有的多線程和延遲損耗。因此，大多數(shù)算法都需要數(shù)組來管理所有數(shù)據(jù)，這就需要一個好的 GPU 數(shù)組庫作為關(guān)鍵的基礎(chǔ)。

GPUArrays.jl 是 Julia 為此提供的基礎(chǔ)。它實現(xiàn)了一個專門用于高度并行硬件的抽象數(shù)組。它包含了設(shè)置 GPU、啟動 Julia GPU 函數(shù)、提供一些基本數(shù)組算法等所有必要功能。

抽象意味著它需要以 CuArrays 和 CLArrays 的形式實現(xiàn)。由于繼承了 GPUArrays 的所有功能，它們提供的接口完全相同。唯一的區(qū)別出現(xiàn)在分配數(shù)組時，這會強制用戶決定這一數(shù)組是存在于 CUDA 還是 OpenCL 設(shè)備上。關(guān)于這一點的更多信息，請參閱「內(nèi)存」部分。

GPUArrays 有助于減少代碼重復(fù)，因為它允許編寫?yīng)毩⒂谟布?GPU 內(nèi)核，這些內(nèi)核可以通過 CuArrays 或 CLArrays 編譯到本地的 GPU 代碼。因此，大多通用內(nèi)核可以在從 GPUArrays 繼承的所有包之間共享。

選擇小貼士：CuArrays 只支持 Nvidia GPU，而 CLArrays 支持大多數(shù)可用的 GPU。CuArrays 比 CLArrays 更穩(wěn)定，可以在 Julia 0.7 上使用。速度上兩者大同小異。我建議都試一試，看看哪種最有效。

本文中，我將選擇 CuArrays，因為本文是在 Julia 0.7 / 1.0 上編寫的，CLArrays 暫不支持。

性能

用一個簡單的交互式代碼示例來快速說明：為了計算 julia 集合（曼德勃羅集合），我們必須要將計算轉(zhuǎn)移到 GPU 上。

using CuArrays, FileIO, Colors, GPUArrays, BenchmarkToolsusing CuArrays: CuArray”””The function calculating the Julia set”””function juliaset(z0, maxiter) c = ComplexF32(-0.5, 0.75) z = z0 for i in 1:maxiter abs2(z) > 4f0 && return (i – 1) % UInt8 z = z * z c end return maxiter % UInt8 # % is used to convert without overflow checkendrange = 100:50:2^12cutimes, jltimes = Float64[], Float64[]function run_bench(in, out) # use dot syntax to apply `juliaset` to each elemt of q_converted # and write the output to result out .= juliaset.(in, 16) # all calls to the GPU are scheduled asynchronous, # so we need to synchronize GPUArrays.synchronize(out)end# store a reference to the last results for plottinglast_jl, last_cu = nothing, nothingfor N in range w, h = N, N q = [ComplexF32(r, i) for i=1:-(2.0/w):-1, r=-1.5:(3.0/h):1.5] for (times, Typ) in ((cutimes, CuArray), (jltimes, Array)) # convert to Array or CuArray – moving the calculation to CPU/GPU q_converted = Typ(q) result = Typ(zeros(UInt8, size(q))) for i in 1:10 # 5 samples per size # benchmarking macro, all variables need to be prefixed with $ t = Base.@elapsed begin run_bench(q_converted, result) end global last_jl, last_cu # we’re in local scope if result isa CuArray last_cu = result else last_jl = result end push!(times, t) end endendcu_jl = hcat(Array(last_cu), last_jl)cmap = colormap(“Blues”, 16 1)color_lookup(val, cmap) = cmap[val 1]save(“results/juliaset.png”, color_lookup.(cu_jl, (cmap,)))

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

using Plots; plotly()x = repeat(range, inner = 10)speedup = jltimes ./ cutimesPlots.scatter( log2.(x), [speedup, fill(1.0, length(speedup))], label = [“cuda” “cpu”], markersize = 2, markerstrokewidth = 0, legend = :right, xlabel = “2^N”, ylabel = “speedup”)

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

對于大型數(shù)組，通過將計算轉(zhuǎn)移到 GPU，可以穩(wěn)定地將速度提高 60-80 倍。獲得此加速和將 Julia 數(shù)組轉(zhuǎn)換為 GPUArray 一樣簡單。

有人可能認為 GPU 性能會受到像 Julia 這樣的動態(tài)語言影響，但 Julia 的 GPU 性能應(yīng)該與 CUDA 或 OpenCL 的原始性能相當。Tim Besard 在集成 LLVM Nvidia 編譯流程方面做得很好，能夠?qū)崿F(xiàn)與純 CUDA C 語言代碼相同（有時甚至更好）的性能。他在博客（https://devblogs.nvidia.com/gpu-computing-julia-programming-language/）中作了進一步解釋。CLArrays 方法有點不同，它直接從 Julia 生成 OpenCL C 代碼，代碼性能與 OpenCL C 相同！

為了更好地了解性能并與多線程 CPU 代碼進行比對，我整理了一些基準：https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/results/results.md

內(nèi)存

GPU 具有自己的存儲空間，包括顯存（VRAM）、不同的高速緩存和寄存器。無論做什么，運行前都要先將 Julia 對象轉(zhuǎn)移到 GPU。并非 Julia 中的所有類型都可以在 GPU 上運行。

首先讓我們看一下 Julia 的類型：

struct Test # an immutable struct# that only contains other immutable, which makes # isbitstype(Test) == true x::Float32 end# the isbits property is important, since those types can be used# without constraints on the GPU!@assert isbitstype(Test) == truex = (2, 2)isa(x, Tuple{Int, Int}) # tuples are also immutablemutable struct Test2 #-> mutable, isbits(Test2) == false x::Float32endstruct Test3 # contains a heap allocation/ reference, not isbits x::Vector{Float32} y::Test2 # Test2 is mutable and also heap allocated / a referenceendVector{Test} # <- An Array with isbits elements is contigious in memoryVector{Test2} # <- An Array with mutable elements is basically an array of heap pointers. Since it just contains cpu heap pointers, it won’t work on the GPU.

“Array{Test2,1}”

所有這些 Julia 類型在傳輸?shù)?GPU 或在 GPU 上創(chuàng)建時表現(xiàn)不同。下表概述了預(yù)期結(jié)果：

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

創(chuàng)建位置描述對象是在 CPU 上創(chuàng)建的，然后轉(zhuǎn)移到 GPU 內(nèi)核上，或者本身就由內(nèi)核內(nèi)部的 GPU 創(chuàng)建。該表顯示創(chuàng)建類型的實例是否可行，對于從 CPU 到 GPU 的轉(zhuǎn)移，該表還說明了對象是否能通過參照進行復(fù)制或傳遞。

垃圾收集

當使用 GPU 時，要注意 GPU 上沒有垃圾收集器（GC）。這不會造成太大影響，因為寫入 GPU 的高性能內(nèi)核不應(yīng)該創(chuàng)建任何 GC-跟蹤的內(nèi)存作為起始。

在 GPU 上實現(xiàn) GC 不無可能，但請記住，每個執(zhí)行內(nèi)核都是大規(guī)模并行的。在大約 1000 個 gpu 線程中的每一個創(chuàng)建和跟蹤大量堆內(nèi)存就會馬上破壞性能增益，因此實現(xiàn) GC 是得不償失的。

使用 GPUArrays 可以作為在內(nèi)核中分配數(shù)組的替代方法。GPUArray 構(gòu)造函數(shù)將創(chuàng)建 GPU 緩沖區(qū)并將數(shù)據(jù)轉(zhuǎn)移到 VRAM。如果調(diào)用 Array(gpu_array)，數(shù)組將被轉(zhuǎn)移回 RAM，變?yōu)槠胀ǖ?Julia 數(shù)組。這些 gpu 數(shù)組的 Julia 操作由 Julia 的 GC 跟蹤，如果不再使用，GPU 內(nèi)存將被釋放。

因此，只能在設(shè)備上使用堆棧分配，并且只能被其他的預(yù)先分配的 GPU 緩沖區(qū)使用。由于轉(zhuǎn)移代價很高，因此在編寫 GPU 時，往往要盡可能重用和預(yù)分配。

GPUArray 構(gòu)造函數(shù)

using CuArrays, LinearAlgebra# GPU Arrays can be constructed from all Julia arrays containing isbits types!A1D = cu([1, 2, 3]) # cl for CLArraysA1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays# Float32 array – Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64diagonal_matrix = CuArray{Float32}(I, 100, 100)filled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77randy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU# The array constructor also accepts isbits iterators with a known size# Note, that since you can also pass isbits types to a gpu kernel directly, in most cases you won’t need to materialize them as an gpu arrayfrom_iter = CuArray(1:10)# let’s create a point type to further illustrate what can be done:struct Point x::Float32 y::Float32endBase.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])# because we defined the above convert from a tuple to a point# [Point(2, 2)] can be written as Point[(2,2)] since all array # elements will get converted to Pointcustom_types = cu(Point[(1, 2), (4, 3), (2, 2)])typeof(custom_types)

“CuArray{point,1}”

數(shù)組操作

我們已經(jīng)定義了許多操作。最重要的是，GPUArrays 支持 Julia 的融合點廣播表示法（fusing dot broadcasting notation）。此表示法允許你將函數(shù)應(yīng)用于數(shù)組的每個元素，并使用 f 的返回值創(chuàng)建新數(shù)組。此功能通常稱為映射（map）。broadcast 指的是形狀各異的數(shù)組被 broadcast 成相同形狀。

工作原理如下：

x = zeros(4, 4) # 4×4 array of zerosy = zeros(4) # 4 element arrayz = 2 # a scalar# y’s 1st dimension gets repeated for the 2nd dimension in x# and the scalar z get’s repeated for all dimensions# the below is equal to `broadcast( , broadcast( , xx, y), z)`x . y . z

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

發(fā)生「融合」是因為 Julia 編譯器會重寫該表達式為一個傳遞調(diào)用樹的 lazy broadcast 調(diào)用，然后可以在循環(huán)遍歷數(shù)組之前將整個調(diào)用樹融合到一個函數(shù)中。

如果你想要更詳細的了解 broadcast，可以看該指南：julia.guide/broadcasting。

這意味著在不分配堆內(nèi)存（僅創(chuàng)建 isbits 類型）的情況下運行的任何 Julia 函數(shù)，都可以應(yīng)用于 GPUArray 的每個元素，并且多點調(diào)用會融合到一個內(nèi)核調(diào)用中。由于內(nèi)核調(diào)用會有很大延遲，所以這種融合是一個非常重要的優(yōu)化。

using CuArraysA = cu([1, 2, 3])B = cu([1, 2, 3])C = rand(CuArray, Float32, 3)result = A . B .- Ctest(a::T) where T = a * convert(T, 2) # convert to same type as `a`# inplace broadcast, writes directly into `result`result .= test.(A) # custom function work# The cool thing is that this composes well with custom types and custom functions.# Let’s go back to our Point type and define addition for itBase.:( )(p1::Point, p2::Point) = Point(p1.x p2.x, p1.y p2.y)# now this works:custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])# This particular example also shows the power of broadcasting: # Non array types are broadcasted and repeated for the whole lengthresult = custom_types . Ref(Point(2, 2))# So the above is equal to (minus all the allocations):# this allocates a new array on the gpu, which we can avoid with the above broadcastbroadcasted = fill(CuArray, Point(2, 2), (3,))result == custom_types . broadcasted

true

GPUArrays 支持更多操作：

實現(xiàn) GPU 數(shù)組轉(zhuǎn)換為 CPU 數(shù)組和復(fù)制
多維索引和切片 (xs[1:2, 5, :])
permutedims
串聯(lián) (vcat(x, y), cat(3, xs, ys, zs))
映射，融合 broadcast(zs .= xs.^2 . ys .* 2)
填充 (CuArray, 0f0, dims)，填充 (gpu_array, 0)
減小尺寸 (reduce( , xs, dims = 3), sum(x -> x^2, xs, dims = 1)
縮減為標量 (reduce(*, xs), sum(xs), prod(xs))
各種 BLAS 操作 (matrix*matrix, matrix*vector)
FFT，使用與 julia 的 FFT 相同的 API

GPUArrays 實際應(yīng)用

讓我們直接看一些很酷的實例。

GPU 加速煙霧模擬器是由 GPUArrays CLArrays 創(chuàng)建的，可在 GPU 或 CPU 上運行，GPU 版本速度提升 15 倍：

還有更多的例子，包括求微分方程、FEM 模擬和求解偏微分方程。

演示地址：https://juliagpu.github.io/GPUShowcases.jl/latest/index.html

讓我們通過一個簡單的機器學(xué)習示例，看看如何使用 GPUArrays：

using Flux, Flux.Data.MNIST, Statisticsusing Flux: onehotbatch, onecold, crossentropy, throttleusing Base.Iterators: repeated, partitionusing CuArrays# Classify MNIST digits with a convolutional networkimgs = MNIST.images()labels = onehotbatch(MNIST.labels(), 0:9)# Partition into batches of size 1,000train = [(cat(float.(imgs[i])…, dims = 4), labels[:,i]) for i in partition(1:60_000, 1000)]use_gpu = true # helper to easily switch between gpu/cputodevice(x) = use_gpu ? gpu(x) : xtrain = todevice.(train)# Prepare test set (first 1,000 images)tX = cat(float.(MNIST.images(:test)[1:1000])…, dims = 4) |> todevicetY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevicem = Chain( Conv((2,2), 1=>16, relu), x -> maxpool(x, (2,2)), Conv((2,2), 16=>8, relu), x -> maxpool(x, (2,2)), x -> reshape(x, :, size(x, 4)), Dense(288, 10), softmax) |> todevicem(train[1][1])loss(x, y) = crossentropy(m(x), y)accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)opt = ADAM(Flux.params(m));

# trainfor i = 1:10 Flux.train!(loss, train, opt, cb = evalcb)end

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

using Colors, FileIO, ImageShowN = 22img = tX[:, :, 1:1, N:N]println(“Predicted: “, Flux.onecold(m(img)) .- 1)Gray.(collect(tX[:, :, 1, N]))

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

只需將數(shù)組轉(zhuǎn)換為 GPUArrays（使用 gpu(array），就可以將整個計算移動到 GPU 并獲得可觀的速度提升。這要歸功于 Julia 復(fù)雜的 AbstractArray 基礎(chǔ)架構(gòu)，使 GPUArray 可以無縫集成。隨后，如果省略轉(zhuǎn)換為 GPUArray 這一步，代碼會按普通的 Julia 數(shù)組處理，但仍在 CPU 上運行?？梢試L試將 use_gpu = true 改為 use_gpu = false，重新運行初始化和訓(xùn)練單元格。對比 GPU 和 CPU，CPU 運行時間為 975 秒，GPU 運行時間為 29 秒，速度提升約 33 倍。

另一個優(yōu)勢是為了有效地支持神經(jīng)網(wǎng)絡(luò)的反向傳播，GPUArrays 無需明確地實現(xiàn)自動微分。這是因為 Julia 的自動微分庫適用于任意函數(shù)，并存有可在 GPU 上高效運行的代碼。這樣即可利用最少的開發(fā)人員就能在 GPU 上實現(xiàn) Flux，并使 Flux GPU 能夠高效實現(xiàn)用戶定義的功能。這種開箱即用的 GPUArrays Flux 不需要協(xié)調(diào)，這是 Julia 的一大特點，詳細解釋如下：為什么 Numba 和 Cython 不能代替 Julia（http://www.stochasticlifestyle.com/why）。

編寫 GPU 內(nèi)核

一般情況，只使用 GPUArrays 的通用抽象數(shù)組接口即可，而不需要編寫任何 GPU 內(nèi)核。但是有些時候，可能需要在 GPU 上實現(xiàn)一個無法通過一般數(shù)組算法組合表示的算法。

好消息是，GPUArrays 通過分層法消除了大量工作，可以實現(xiàn)從高級代碼開始，編寫類似于大多數(shù) OpenCL / CUDA 示例的低級內(nèi)核。同時可以在 OpenCL 或 CUDA 設(shè)備上執(zhí)行內(nèi)核，從而提取出這些框架中的所有差異。

實現(xiàn)上述功能的函數(shù)名為 gpu_call。調(diào)用語句為 gpu_call(kernel, A::GPUArray, args)，在 GPU 上使用參數(shù) (state, args…) 調(diào)用 kernel。State 是一個用于實現(xiàn)獲取線程索引等功能的后端特定對象。GPUArray 需要作為第二個參數(shù)傳遞，以分配到正確的后端并提供啟動參數(shù)的默認值。

讓我們使用 gpu_call 來實現(xiàn)一個簡單的映射內(nèi)核：

using GPUArrays, CuArrays# Overloading the Julia Base map! function for GPUArraysfunction Base.map!(f::Function, A::GPUArray, B::GPUArray) # our function that will run on the gpu function kernel(state, f, A, B) # If launch parameters aren’t specified, linear_index gets the index # into the Array passed as second argument to gpu_call (`A`) i = linear_index(state) if i <= length(A) @inbounds A[i] = f(B[i]) end return end # call kernel on the gpu gpu_call(kernel, A, (f, A, B))end

簡單來說，這將在 GPU 上并行調(diào)用 julia 函數(shù) kernel length(A) 次。kernel 的每個并行調(diào)用都有一個線程索引，可以利用它索引到數(shù)組 A 和 B。如果計算索引時沒有使用 linear_index，就需要確保沒有多個線程讀取和寫入相同的數(shù)組位置。因此，如果在純 Julia 中使用線程編寫，可等效如下：

using BenchmarkToolsfunction threadded_map!(f::Function, A::Array, B::Array) Threads.@threads for i in 1:length(A) A[i] = f(B[i]) end Aendx, y = rand(10^7), rand(10^7)kernel(y) = (y / 33f0) * (732.f0/y)# on the cpu without threads:single_t = @belapsed map!($kernel, $x, $y)# “on the CPU with 4 threads (2 real cores):thread_t = @belapsed threadded_map!($kernel, $x, $y)# on the GPU:xgpu, ygpu = cu(x), cu(y)gpu_t = @belapsed begin map!($kernel, $xgpu, $ygpu) GPUArrays.synchronize($xgpu)endtimes = [single_t, thread_t, gpu_t]speedup = maximum(times) ./ timesprintln(“speedup: $speedup”)bar([“1 core”, “2 cores”, “gpu”], speedup, legend = false, fillcolor = :grey, ylabel = “speedup”)

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

由于該函數(shù)未實現(xiàn)過多內(nèi)容，也得不到更多的擴展，但線程化和 GPU 版本仍然有一個很好的加速。

GPU 與線程示例相比，能顯示更復(fù)雜的內(nèi)容，因為硬件線程是以線程塊的形式分布的，gpu_call 是從簡單版本中提取出來的，但它也可以用于更復(fù)雜的啟動配置：

using CuArraysthreads = (2, 2)blocks = (2, 2)T = fill(CuArray, (0, 0), (4, 4))B = fill(CuArray, (0, 0), (4, 4))gpu_call(T, (B, T), (blocks, threads)) do state, A, B # those names pretty much refer to the cuda names b = (blockidx_x(state), blockidx_y(state)) bdim = (blockdim_x(state), blockdim_y(state)) t = (threadidx_x(state), threadidx_y(state)) idx = (bdim .* (b .- 1)) . t A[idx…] = b B[idx…] = t returnendprintln(“Threads index: n”, T)println(“Block index: n”, B)

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

上面的示例中啟動配置的迭代順序更復(fù)雜。確定合適的迭代啟動配置對于實現(xiàn)最優(yōu) GPU 性能至關(guān)重要。很多關(guān)于 CUDA 和 OpenCL 的 GPU 教程都非常詳細地解釋了這一點，在 Julia 中編程 GPU 時這些原理是相通的。

結(jié)論

Julia 為高性能的世界帶來了可組合的高級編程?，F(xiàn)在是時候為 GPU 做同樣的事了。

希望 Julia 能降低人們在 GPU 編程的門檻，我們可以為開源 GPU 計算開發(fā)可擴展的平臺。第一個成功案例是通過 Julia 軟件包實現(xiàn)自動微分解決方案，這些軟件包甚至都不是為 GPU 編寫的，因此可以相信 Julia 在 GPU 計算領(lǐng)域的擴展性和通用設(shè)計中一定會大放異彩。

教程｜在Julia編程中實現(xiàn)GPU加速（julia gpu編程）

相關(guān)新聞