null - 程序员宅基地

2016：DianNao Family Energy-Efficient Hardware Accelerators for Machine Learning-程序员宅基地

技术标签： pr

文章目录

Abstract
- 第二段
1 INTRODUCTION
2 DIANNAO: A NN ACCELERATOR

这个发表在了
https://cacm.acm.org/
communication of the ACM
上

参考文献链接是

Chen Y , Chen T , Xu Z , et al. 
DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning[J]. 
Communications of the Acm, 2016, 59(11):105-112

https://cacm.acm.org/magazines/2016/11/209123-diannao-family/fulltext
果然找到了他

特码的，我下载了，6不6

The original version of this paper is entitled “DianNao: A
Small-Footprint, High-Throughput Accelerator for Ubiq-
uitous Machine Learning” and was published in Proceed-
ings of the International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS)
49, 4 (March 2014), ACM, New York, NY, 269–284.

Abstract

ML pervasive
- broad range of applications
  - broad range of systems(embedded to data centers)

computer
- toward heterogeneous multi-cores
- a mix of cores and hardware accelerators,
designing hardware accelerators for ML
- achieve high efficiency and broad application scope

第二段

efficient computational primitives
- important for a hardware accelerator,
inefficient memory transfers can
- potentially void the throughput, energy, or cost advantages of accelerators,
an Amdahl’s law effect
become a first-order concern,

like in processors,
- rather than an element factored in accelerator design on a second step

a series of hardware accelerators
- designed for ML(nn),
- the impact of memory on accelerator design, performance, and energy.

representative neural network layers
450.65x over GPU
energy by 150.31x on average
- for 64-chip DaDianNao (a member of the DianNao family)

1 INTRODUCTION

designing hardware accelerators which realize the best possible tradeoff between flexibility and efficiency is becoming a prominent
issue.

The first question is for which category of applications one should primarily design accelerators?
Together with the architecture trend towards accelerators, a second simultaneous and significant trend in high-performance and embedded applications is developing: many of the emerging high-performance and embedded applications, from image/video/audio recognition to automatic translation, business analytics, and robotics rely on machine learning
techniques.
This trend in application comes together with a third trend in machine learning (ML) where a small number
of techniques, based on neural networks (especially deep learning techniques 16, 26 ), have been proved in the past few
years to be state-of-the-art across a broad range of applications.
As a result, there is a unique opportunity to design accelerators having significant application scope as well as
high performance and efficiency. 4

第二段

Currently, ML workloads
mostly executed on
- multicores using SIMD[44]
- on GPUs[7]
- or on FPGAs[2]

the aforementioned trends
- have already been identified
- by researchers who have proposed accelerators implementing,
CNNs[2]
Multi-Layer Perceptrons [43] ;

accelerators focusing on other domains,
- image processing,
- propose efficient implementations of some of the computational primitives used
- by machine-learning techniques, such as convolutions[37]

There are also ASIC implementations of ML
- such as Support Vector Machine and CNNs.

these works focused on
- efficiently implementing the computational primitives
  - ignore memory transfers for the sake of simplicity[37,43]
  - plug their computational accelerator to memory via a more or less sophisticated DMA. [2,12,19]

第三段

While efficient implementation of computational primitives is a first and important step with promising results,
inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators, that is, an
Amdahl’s law effect, and thus, they should become a first-
order concern, just like in processors, rather than an element
factored in accelerator design on a second step.
Unlike in processors though, one can factor in the specific nature of
memory transfers in target algorithms, just like it is done for accelerating computations.
This is especially important in the domain of ML where there is a clear trend towards scaling up the size of learning models in order to achieve better accuracy and more functionality. 16, 24

第四段

In this article, we introduce a series of hardware accelerators designed for ML (especially neural networks), including
DianNao, DaDianNao, ShiDianNao, and PuDianNao as listed in Table 1.
We focus our study on memory usage, and we investigate the accelerator architecture to minimize memory
transfers and to perform them as efficiently as possible.

2 DIANNAO: A NN ACCELERATOR

DianNao
- first of DianNao accelerator family,
accommodates sota nn techniques (dl ),
inherits the broad application scope of nn.

2.1 Architecture

DianNao
- input buffer for input (NBin)
- output buffer for output (NBout)
- buffer for synaptic(突触) weights (SB)
- connected to a computational block (performing both synapses and neurons computations)
- NFU, and CP, see Figure 1

NBin是存放输入神经元
SB是存放突触的权重的
这个NBout是存放输出神经元

我觉得图示的可以这样理解：2个输入神经元，2个突触，将这2个对应乘起来，输出是1个神经元啊。但是我的这个NFU牛逼啊，他可以一次性求两个输出神经元。

NFU

a functional block of $T_i$ inputs/synapses(突触)
- $T_n$ output neurons,
time-shared by different algorithmic blocks of neurons.

这个NFU对 $T_i$ 个输入和突触运算，得到 $T_n$ 个输出神经元，突触不是应该是 $T_i\times T_n$ 个吗？？，

Depending on the layer type,
- computations at the NFU can be decomposed in either two or three stages

For classifier and convolutional:
- multiplication of synapses $\times$ inputs:NFU-1
- , additions of all multiplications, :NFU-2
- sigmoid. :NFU-3

如果是分类层或者卷积的话的话，那就是简单的突触 $\times$ 输入，然后加起来，求sigmoid。这个我能理解哦，这种情况不就是卷积吗。

如果是分类层，那么输入就是

last stage (sigmoid or another nonlinear function) can vary.

For pooling, no multiplication(no synapse),
- pooling can be average or max.

adders(加法器) have multiple inputs,
- they are in fact adder trees,

the second stage also contains
- shifters and max operators for pooling.

要啥移位啊？？

the sigmoid function (for classifier and convolutional layers)can be efficiently implemented using ( $a_i x \times + b_i , x \in [x_i , x_{i+1} ]$ ) (16 segments are sufficient)

On-chip Storage

on-chip storage structures of DianNao
- can be construed as modified buffers of scratchpads.

While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative read, etc.) and cache conflicts.
The efficient alternative, scratchpad, is used in VLIW processors but it is known to be very difficult to compile for.
However a scratchpad in a dedicated accelerator realizes the best of both worlds: efficient
storage, and both efficient and easy exploitation of locality because only a few algorithms have to be manually adapted.

第二段

on-chip storage into three (NBin, NBout,and SB), because there are three type of data (input neurons,output neurons and synapses) with different characteristics (read width and reuse distance).

The first benefit of splitting structures is to tailor the SRAMs to the appropriate
read/write width,
and the second benefit of splitting storage structures is to avoid conflicts, as would occur in a cache.
Moreover, we implement three DMAs to exploit spatial locality of data, one for each buffer (two load DMAs for inputs, one store DMA for outputs).

2.2 Loop tiling

DianNao 用 loop tiling去减少memory access
- so可容纳大的神经网络
举例
- 一个classifier 层
  - 有 $N_n$ 输出神经元
  - 全连接到 $N_i$ 的输入
  - 如下图

$N_n$ 个输出， $N_i$ 个输入，sypase应该是 $N_n\times N_i$ 大小，用这个矩阵 $×Ni \times N_i$ 即可得到结果啊

先取出来一块
- 有点疑惑啊
- 万一右边第一个元素和左边全部元素都有关
- 你咋算啊（）
- 其实啊，我他妈算右边第一个时候
- 只需要用到和synapse的一行呀！
- 那你那个大大的synapse矩阵咋办啊
下面是原始代码和和
- tiled代码
- 他把分类层映射到DianNao

在这里插入图片描述

for(int n=0;n<Nn;n++)
	sum[n]=0;
for(int n=0;n<Nn;n++) //输出神经元
	for(int i=0;i<Ni;i++) //输入神经元
		sum[n]+=synapse[n][i]*neuron[i];
for(int n=0;n<Nn;n++)
	neuron[n]=Sigmoid(sum[n]);

俺的想法：
- 一次来Tnn个输出
- 和Tii个输入
- 然后这个东西对于硬件还是太大了
- 再拆
- 来Tn个和Ti个吧
- 就酱

for(int nnn=0;nnn<Nn;nnn+=Tnn){
    //tiling for output 神经元
//第一个for循环准备扔出去Tnn个输出
    for(int iii=0;iii<Ni;iii+=Tii){
    //tiling for input 神经元
//第二个for循环准备扔进来Tii个输入
//下面就这两个东西动手

        for(int nn=nnn;nn<nnn+Tnn;nn+=Tn){
    
//第三个for循环觉得觉得Tnn还是太大了，继续拆
//大小是Tn
//那么我们对每一个Tnn块！（开始位置是nn哦！！）
//我们如下求解

///
            for(int n=nn;n<nn+Tn;n++)
//第一步把中间结果全部搞成零！

            sum[n]=0;
//为求sum[n]，sum[n]=synapse的第n行乘neuron的全部啊！
        for(int ii=iii;ii<iii+Tii;ii+=Ti)

//上面的for是对Ti进行拆

            for(int n=nn;n<nn+Tn;n++)
                for(int i=ii;i<ii+Ti;i++)
                    sum[n]+=synapse[n][i]*neuron[i];

    for(int nn=nnn;nn<nnn+Tnn;nn+=Tn)
        neuron[n]=sigmoid(sum[n]);
///

 }   }  }

在tiled代码中， $i i$ 和 $n n$
- 表示NFU有 $T_i$ 个输入和突触
  - 和 $T_n$ 个输出神经元
输入神经元被每个输出神经元需要重用
- 但这个输入向量也太他妈大了
- 塞不到Nbin块里啊
- 所以也要对循环 $i i$ 分块，因子 $T_{ii}$

上面的代码肯定有问题，正确的如下：

	for (int nnn = 0; nnn < Nn; nnn += Tnn) {
    
		for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
    
			for (int n = nn; n < nn + Tn; n++)
				sum[n] = 0;
			for (int iii = 0; iii < Ni; iii += Tii) {
    
				for (int ii = iii; ii < iii + Tii; ii += Ti)
					for (int n = nn; n < nn + Tn; n++)
						for (int i = ii; i < ii + Ti; i++)
							sum[n] += synapse[n][i] * neuron[i];
			}
			for (int n = nn; n < nn + Tn; n++)
				printf("s%ds ", sum[n]);
		}
	}
	for (int index = 0; index < Nn; index++)
		printf("%d ", sum[index]);

本文链接：https://blog.csdn.net/zhoutianzi12/article/details/110244427

原作者删帖不实内容删帖广告或垃圾文章投诉

智能推荐

数据库安全：Hadoop 未授权访问-命令执行漏洞._hadoop未授权访问-程序员宅基地

文章浏览阅读2.9k次，点赞3次，收藏4次。Hadoop 未授权访问主要因HadoopYARN资源管理系统配置不当，导致可以未经授权进行访问，从而被攻击者恶意利用。攻击者无需认证即可通过RESTAPI部署任务来执行任意指令，最终完全控制服务器。_hadoop未授权访问

100个替代昂贵商业软件的开源应用_citadel开源中文版本-程序员宅基地

文章浏览阅读4.1k次，点赞3次，收藏18次。100个替代昂贵商业软件的开源应用面对大，中，小企业和家庭用户，立竿见影显著降低成本的开源软件。某些商业软件素以昂贵著称。随着云计算的日益普及，很多常用软件包供应商将一次性收费改为月租模式。虽然月租费貌似便宜，但也经不起长时间的累积。100个替代昂贵商业软件的开源应用尽管有许多好理由，但避免或减少使用费，仍然是许多用户看中开源应用软件的主要因素。基于这一点，我们更新了可替代_citadel开源中文版本

竞选计算机协会网络部部长,2019年计算机协会部长竞选演讲稿-程序员宅基地

文章浏览阅读55次。2019年计算机协会部长竞选演讲稿篇一:计算机协会部长竞选演讲稿尊敬的领导，敬爱的老师，亲爱的同学们：大家晚上好！俗话说：马只有驰骋千里，方知其是否为良驹；人只有通过竞争，才能知其是否为栋梁。我是来自ｘｘｘ班的伍朝海，今晚，我很荣幸能够站在这里参加这次学生会的竞选，职位是ｘｘ系的宣传窗口——新闻网络部的负责人。我知道，今晚竞选的不仅仅是个职位，也是在竞选一个为同学们服务的机会，更是在竞选一个为我们...

ipython和jupyter notebook_第02章 Python语法基础，IPython和Jupyter Notebook-程序员宅基地

文章浏览阅读185次。第2章 Python语法基础，IPython和Jupyter Notebooks当我在2011年和2012年写作本书的第一版时，可用的学习Python数据分析的资源很少。这部分上是一个鸡和蛋的问题：我们现在使用的库，比如pandas、scikit-learn和statsmodels，那时相对来说并不成熟。2017年，数据科学、数据分析和机器学习的资源已经很多，原来通用的科学计算拓展到了计算机科学家..._jupyter notebook if后面有多个条件

雪花算法生成的ID精度丢失问题_雪花算法生成id精度丢失-程序员宅基地

文章浏览阅读494次，点赞2次，收藏2次。雪花算法生成的ID精度丢失问题雪花算法ID精度丢失_雪花算法生成id精度丢失

Sublime Text关闭更新(亲测可用)_sublime关闭更新检测-程序员宅基地

文章浏览阅读2.2k次。sumlime text关闭自动更新_sublime关闭更新检测

2016：DianNao Family Energy-Efficient Hardware Accelerators for Machine Learning-程序员宅基地

文章目录

Abstract

第二段

1 INTRODUCTION

第二段

第三段

第四段

2 DIANNAO: A NN ACCELERATOR

2.1 Architecture

NFU

On-chip Storage

第二段

2.2 Loop tiling

上面的代码肯定有问题，正确的如下：

智能推荐

数据库安全：Hadoop 未授权访问-命令执行漏洞._hadoop未授权访问-程序员宅基地

100个替代昂贵商业软件的开源应用_citadel开源中文版本-程序员宅基地

竞选计算机协会网络部部长,2019年计算机协会部长竞选演讲稿-程序员宅基地

ipython和jupyter notebook_第02章 Python语法基础，IPython和Jupyter Notebook-程序员宅基地

雪花算法生成的ID精度丢失问题_雪花算法生成id精度丢失-程序员宅基地

Sublime Text关闭更新(亲测可用)_sublime关闭更新检测-程序员宅基地

随便推点

文本文件数据输入与读取_文本输入读取-程序员宅基地

极速进化,光速转录,C++版本人工智能实时语音转文字(字幕/语音识别)Whisper.cpp实践_c++语音识别库-程序员宅基地

前端（vue）导出word文档（导出图片）_前端批量docx转jpg-程序员宅基地

TaiShan 200服务器安装Ubuntu 18.04_ubuntu登录华为泰山服务器.-程序员宅基地

linux openerp,openerp-程序员宅基地

STM32+AS608指纹模块串口通讯_as068-程序员宅基地

推荐文章

热门文章

相关标签