Registration Open | AI Chip Architecture and Software Workshop 2020

Time:

Sunday, April 12, 2020

9:00-12:00 (morning)

Official workshop page:

https://event.baai.ac.cn/con/architecture-and-software-design-for-ai-chip-2020/

(copy the URL into your browser to view)

Yunji Chen (陈云霁)

Professor, Institute of Computing Technology, Chinese Academy of Sciences

BAAI Chief Scientist

Workshop Chair

Yun Liang (梁云)

Research Professor at Peking University, BAAI Young Scientist

His main research areas are computer architecture, compiler optimization, and chip design automation. He has published more than 90 papers at top conferences such as MICRO, HPCA, PPoPP, and DAC, with over 2,400 Google Scholar citations; by the CSRankings count, 24 of these are top-tier conference papers. His work has won or been nominated for best paper awards at international conferences eight times, including the best paper awards at ICCAD 2017 and FCCM 2011 and best paper nominations at DAC 2017, DAC 2012, and PPoPP 2019. He served as Program Chair of ASAP 2019 and on the TPCs of six CCF Class A conferences. He currently directs the Peking University-SenseTime Joint Laboratory on Intelligent Computing and has been supported by a Distinguished Young Scholars grant from the Beijing Natural Science Foundation.

Mingyu Gao (高鸣宇)

Assistant Professor, Institute for Interdisciplinary Information Sciences, Tsinghua University

Assistant Professor and doctoral advisor at the Institute for Interdisciplinary Information Sciences, Tsinghua University. He received his Ph.D. and M.S. in Electrical Engineering from Stanford University and his B.S. from the Department of Microelectronics and Nanoelectronics at Tsinghua University. His research focuses on computer systems and architecture, big-data system optimization, and hardware security. His main results include hardware/software near-data-processing systems for big-data applications, high-density low-power reconfigurable computing chips, and scheduling optimization for specialized neural network accelerators. He has published at top international conferences including ISCA, ASPLOS, HPCA, and PACT, and holds several granted patents. His honors include selection as an IEEE Micro Top Pick in computer architecture (2016), a HiPEAC Paper Award, and a place on the Forbes China "30 Under 30" list (2019).

Rui Hou (侯锐)

Professor and doctoral advisor, Institute of Information Engineering, Chinese Academy of Sciences

Deputy Director of the State Key Laboratory of Information Security

His main research areas include processor design and security, AI chip security and data privacy, and data center servers. He is an adjunct professor at Harbin Engineering University, vice chair of the Blockchain Committee of the China Institute of Communications, and a member of the Technical Committee on Computer Architecture of the China Computer Federation. He has led or participated in a number of major projects, including grants from the National Natural Science Foundation of China and the CAS Strategic Priority Research Program. He has long worked on domestically developed, high-performance processor chips and has led or contributed to the design of several chips. He has published more than 40 papers in journals and at conferences, including top architecture and security venues such as ACM TOCS, HPCA, ASPLOS, and IEEE S&P, and has filed more than 50 patents in China and abroad. He has served on the technical or organizing committees of several top international conferences (TPC member of MICRO 2020 and HPCA 2017/2018, both CCF Class A, and of PACT 2017, CCF Class B) and serves on the editorial board of the Journal of Parallel and Distributed Computing. As chair, he has organized multiple editions of the International Technical Forum on Data Center Servers and the International Forum on Built-in Security, and he has been invited many times to give talks at international conferences, corporate research labs, and research institutes.

Program

09:00-09:05

Opening remarks – Yunji Chen, Professor at the Institute of Computing Technology, CAS, and BAAI Chief Scientist

09:05-09:35

Yun Liang, Research Professor at Peking University, BAAI Young Scientist

Title:

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Abstract:

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost has led to high demand for flexible, portable, and high-performance library implementation on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current tensor library implementation mainly requires programmers to manually design low-level implementation and optimize from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, which falls far behind the rapid evolution of the application algorithms.

In this paper, we introduce FlexTensor, which is a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details. FlexTensor systematically explores the optimization design spaces that are composed of many different schedules for different hardware. Then, FlexTensor combines different exploration techniques, including heuristic method and machine learning method, to find the optimized schedule configuration. Finally, based on the results of exploration, customized schedules are automatically generated for different hardware. In the experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total, and FlexTensor achieves an average 1.83x performance speedup on NVIDIA V100 GPU compared to cuDNN; 1.72x performance speedup on Intel Xeon CPU compared to MKL-DNN for 2D convolution; 1.5x performance speedup on Xilinx VU9P FPGA compared to OpenCL baselines; and 2.21x speedup on NVIDIA V100 GPU compared to the state-of-the-art.
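To make the schedule-exploration idea concrete, here is a minimal, self-contained Python sketch in the spirit of the talk: it enumerates loop-tiling configurations for a small matrix multiply and keeps the fastest one. It is an illustration only, not FlexTensor's actual API (FlexTensor builds on TVM schedules and combines heuristics with machine learning rather than brute force); all names and sizes below are made up.

```python
# A minimal sketch of schedule-space exploration for a tensor computation.
# Each (tile_i, tile_j, tile_k) triple is one "schedule" in a tiny design space.
import itertools
import time

import numpy as np

N = 256
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

def blocked_matmul(A, B, tile_i, tile_j, tile_k):
    """One point in the schedule space: a loop-blocked matrix multiply."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n, tile_i):
        for j in range(0, n, tile_j):
            for k in range(0, n, tile_k):
                C[i:i+tile_i, j:j+tile_j] += (
                    A[i:i+tile_i, k:k+tile_k] @ B[k:k+tile_k, j:j+tile_j]
                )
    return C

# Exhaustively measure a tiny space; real frameworks prune it with heuristics
# and learned cost models instead of timing every candidate.
best = None
for ti, tj, tk in itertools.product([32, 64, 128], repeat=3):
    start = time.perf_counter()
    blocked_matmul(A, B, ti, tj, tk)
    elapsed = time.perf_counter() - start
    if best is None or elapsed < best[0]:
        best = (elapsed, (ti, tj, tk))

print(f"best tiling {best[1]} took {best[0]*1e3:.1f} ms")
```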

09:40-10:10

Qinyi Luo, Research Assistant, University of Southern California

Title:

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training

Abstract:

Distributed deep learning training usually adopts All-Reduce as the synchronization mechanism for data parallel algorithms due to its high performance in homogeneous environments. However, its performance is bounded by the slowest worker among all workers. For this reason, it is significantly slower in heterogeneous settings. AD-PSGD, a newly proposed synchronization method which provides numerically fast convergence and heterogeneity tolerance, suffers from deadlock issues and high synchronization overhead. Is it possible to get the best of both worlds: designing a distributed training method that has both high performance like All-Reduce in homogeneous environments and good heterogeneity tolerance like AD-PSGD?

In this paper, we propose Prague, a high-performance heterogeneity-aware asynchronous decentralized training approach. We achieve the above goal with intensive synchronization optimization by exploring the interplay between algorithm and system implementation, or statistical and hardware efficiency. To reduce synchronization cost, we propose a novel communication primitive, Partial All-Reduce, that enables fast synchronization among a group of workers. To reduce serialization cost, we propose static group scheduling in homogeneous environment and simple techniques, i.e., Group Buffer and Group Division, to largely eliminate conflicts with slightly reduced randomness. Our experiments show that in homogeneous environment, Prague is 1.2x faster than the state-of-the-art implementation of All-Reduce, 5.3x faster than Parameter Server and 3.7x faster than AD-PSGD. In a heterogeneous setting, Prague tolerates slowdowns well and achieves 4.4x speedup over All-Reduce.
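The Partial All-Reduce primitive can be pictured with a toy simulation: instead of a global barrier across all workers, a small, randomly formed group averages its model copies and continues. The sketch below is a simplified illustration of that idea, not Prague's implementation; the worker count, group size, and the "gradient" noise are arbitrary.

```python
# Toy simulation of Partial All-Reduce: only a random group of workers
# synchronizes at each step, so no worker waits for a global barrier.
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS, DIM, GROUP_SIZE, STEPS = 8, 4, 3, 5

# Each worker holds its own copy of the model parameters.
models = rng.normal(size=(NUM_WORKERS, DIM))

for step in range(STEPS):
    # Each worker takes an independent local "gradient" step (noise here).
    models += 0.1 * rng.normal(size=models.shape)
    # Partial All-Reduce: one randomly formed group averages its parameters.
    group = rng.choice(NUM_WORKERS, size=GROUP_SIZE, replace=False)
    models[group] = models[group].mean(axis=0)
    print(f"step {step}: synced group {sorted(group.tolist())}")

# Over many steps, overlapping groups mix information across all workers,
# approximating global consensus without global synchronization.
print("parameter spread across workers:", models.std(axis=0))
```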

10:20-10:50

Mingyu Gao, Assistant Professor, Institute for Interdisciplinary Information Sciences, Tsinghua University

Title:

Interstellar: Using Halide’s Scheduling Language to Analyze DNN Accelerators

Abstract:

We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide’s scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
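For readers unfamiliar with the "seven nested loops," the sketch below writes them out for a toy dense convolution; an accelerator dataflow amounts to choosing an order, blocking, and hardware parallelism for exactly these loops. The shapes are arbitrary toy values, and the einsum comparison is only a sanity check, not part of the taxonomy itself.

```python
# The seven nested loops of a dense convolutional layer, written explicitly.
import numpy as np

B, C, K = 2, 3, 4          # batch, input channels, output channels
H, W = 8, 8                # output height/width
R, S = 3, 3                # filter height/width

x = np.random.rand(B, C, H + R - 1, W + S - 1)  # padded input
w = np.random.rand(K, C, R, S)                  # filters
y = np.zeros((B, K, H, W))

for b in range(B):                          # loop 1: batch
    for k in range(K):                      # loop 2: output channels
        for c in range(C):                  # loop 3: input channels
            for h in range(H):              # loop 4: output rows
                for wi in range(W):         # loop 5: output cols
                    for r in range(R):      # loop 6: filter rows
                        for s in range(S):  # loop 7: filter cols
                            y[b, k, h, wi] += x[b, c, h + r, wi + s] * w[k, c, r, s]

# Sanity check against a direct einsum-based reference computation.
patches = np.lib.stride_tricks.sliding_window_view(x, (R, S), axis=(2, 3))
ref = np.einsum('bchwrs,kcrs->bkhw', patches, w)
assert np.allclose(y, ref)
```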

10:55-11:25

Xiaoming Chen, Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences

Title:

Communication Lower Bound in Convolution Accelerators

Abstract:

In current convolutional neural network (CNN) accelerators, communication (i.e., memory access) dominates the energy consumption. This work provides comprehensive analysis and methodologies to minimize the communication for CNN accelerators. For the off-chip communication, we derive the theoretical lower bound for any convolutional layer and propose a dataflow to reach the lower bound. This fundamental problem has never been solved by prior studies. The on-chip communication is minimized based on an elaborate workload and storage mapping scheme. In addition, we design a communication-optimal CNN accelerator architecture. Evaluations based on the 65nm technology demonstrate that the proposed architecture nearly reaches the theoretical minimum communication in a three-level memory hierarchy and that it is computation dominant. The gap between the energy efficiency of our accelerator and the theoretical best value is only 37%-87%.
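As a back-of-the-envelope illustration of why off-chip communication has a hard floor (this is not the paper's bound, which is tighter and buffer-aware): every input, weight, and output element must cross the DRAM boundary at least once, whereas a dataflow with no on-chip reuse streams both operands of every multiply-accumulate from memory. The toy layer shape below is arbitrary.

```python
# Compulsory off-chip traffic vs. a no-reuse dataflow for one conv layer.
B, C, K, H, W, R, S = 1, 64, 64, 56, 56, 3, 3   # toy conv layer shape

inputs  = B * C * (H + R - 1) * (W + S - 1)      # padded input elements
weights = K * C * R * S                          # filter elements
outputs = B * K * H * W                          # output elements

compulsory = inputs + weights + outputs          # each element moved once
macs = B * K * C * H * W * R * S                 # total multiply-accumulates

# A naive dataflow with zero on-chip reuse streams both operands of every
# MAC from DRAM and writes each output once.
no_reuse = 2 * macs + outputs

print(f"compulsory traffic: {compulsory:,} elements")
print(f"no-reuse traffic:   {no_reuse:,} elements")
print(f"ratio: {no_reuse / compulsory:.0f}x")
```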

11:30-12:00

Rui Hou, Professor, Institute of Information Engineering, Chinese Academy of Sciences

Title:

DNNGuard: An Elastic Heterogeneous Architecture for DNN Accelerator against Adversarial Attacks

Abstract:

Recent studies show that Deep Neural Networks (DNN) are vulnerable to adversarial samples that are generated by perturbing correctly classified inputs to cause the misclassification of DNN models. This can potentially lead to disastrous consequences, especially in security-sensitive applications such as unmanned vehicles, finance and healthcare. Existing adversarial defense methods require a variety of computing units to effectively detect the adversarial samples. However, deploying adversarial sample defense methods in existing DNN accelerators leads to many key issues in terms of cost, computational efficiency and information security. Moreover, existing DNN accelerators cannot provide effective support for the special computation required in the defense methods. To address these new challenges, we propose DNNGuard, an elastic heterogeneous DNN accelerator architecture that can efficiently orchestrate the simultaneous execution of the original (target) DNN network and the detection algorithm or network that detects adversarial sample attacks. The architecture tightly couples the DNN accelerator with the CPU core into one chip for efficient data transfer and information protection. An elastic DNN accelerator is designed to run the target network and the detection network simultaneously. Besides the capability to execute two networks at the same time, DNNGuard also supports non-DNN computing and allows special neural network layers to be effectively supported by the CPU core. To reduce off-chip traffic and improve resource utilization, we propose a dynamic resource scheduling mechanism. To build a general implementation framework, we propose an extended AI instruction set for neural network synchronization, task scheduling and efficient data interaction.
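A toy sketch of the elastic-scheduling idea follows: two networks (target and detector) share a fixed pool of processing elements, which a greedy scheduler repartitions each step in proportion to remaining work. Everything here, from the layer lists to the proportional policy, is a made-up illustration, not DNNGuard's actual mechanism.

```python
# Toy model of elastically partitioning processing elements (PEs) between a
# target network and a detection network running on one accelerator.
from dataclasses import dataclass

TOTAL_PES = 16  # processing elements available on the accelerator

@dataclass
class Layer:
    name: str
    work: int  # abstract amount of remaining compute in the layer

target = [Layer("conv1", 120), Layer("conv2", 80), Layer("fc", 20)]
detector = [Layer("det_conv", 40), Layer("det_fc", 10)]

def schedule(target, detector, total_pes):
    """Greedily split PEs each step in proportion to remaining work."""
    t, d = list(target), list(detector)
    step = 0
    while t or d:
        tw = sum(layer.work for layer in t)
        dw = sum(layer.work for layer in d)
        # Partition PEs proportionally; an active network gets at least one PE.
        t_pes = max(1, round(total_pes * tw / (tw + dw))) if t else 0
        d_pes = total_pes - t_pes if d else 0
        if t:
            t[0].work -= t_pes
            if t[0].work <= 0:
                print(f"step {step}: target   finished {t.pop(0).name}")
        if d:
            d[0].work -= d_pes
            if d[0].work <= 0:
                print(f"step {step}: detector finished {d.pop(0).name}")
        step += 1
    return step

print("total steps:", schedule(target, detector, TOTAL_PES))
```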

12:00-12:05 Closing remarks

Yun Liang, Research Professor at Peking University, BAAI Young Scientist

How to register

Scan the QR code to add Xiaoyuan (小源) as a WeChat friend,

send the keyword "live0412",

and you will be added to the registration WeChat group, where the slides and the live-stream link will be shared.

Or click "Read the original" (阅读原文) to visit the official workshop page.

Source: BAAI Community (智源社区)
