Overlapping Communication with Computation in Parameter Server for Scalable DL Training
Wang, Shaoqi (1); Pi, Aidi (1); Zhou, Xiaobo (1); Wang, Jun (2); Xu, Cheng Zhong (3)
2021-09-01
Source Publication: IEEE Transactions on Parallel and Distributed Systems
ISSN: 1045-9219
Volume: 32, Issue: 9, Pages: 2144-2159
Abstract: Scalability of distributed deep learning (DL) training with the parameter server (PS) architecture is often communication-constrained in large clusters. Recent efforts use a layer-by-layer strategy to overlap gradient communication with backward computation and thus reduce the impact of the communication constraint on scalability. However, these approaches can introduce significant overhead in gradient communication, and they cannot be effectively applied to overlapping parameter communication with forward computation. In this article, we propose and develop iPart, a novel approach that partitions communication and computation in various partition sizes to overlap gradient communication with backward computation and parameter communication with forward computation. iPart formulates the partitioning decision as an optimization problem and solves it with a greedy algorithm to derive communication and computation partitions. We implement iPart in the open-source DL framework BigDL and evaluate it with various DL workloads. Experimental results show that iPart improves the scalability of a 72-node cluster by up to 94 percent over the default PS and 52 percent over the layer-by-layer strategy.
Keywords: backward computation; forward computation; gradient communication; parameter communication; parameter server
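The abstract describes overlapping per-partition communication with computation and choosing partition sizes greedily. The following is a minimal illustrative sketch of that idea, not the paper's actual algorithm or the BigDL implementation: all function names are hypothetical, and the cost model (a fixed per-message latency `alpha` plus size divided by bandwidth `bw`, with a single communication channel) is an assumption made for illustration.

```python
# Illustrative sketch (assumed cost model, not iPart's implementation):
# partition i's communication can start only after its computation finishes,
# computation runs back-to-back, and the channel sends one message at a time.

def overlapped_time(comp, comm):
    """Makespan when each partition's communication is overlapped with the
    computation of later partitions."""
    t = 0.0   # computation clock
    ch = 0.0  # time at which the communication channel becomes free
    for c, m in zip(comp, comm):
        t += c                 # finish computing partition i
        ch = max(ch, t) + m    # queue partition i's message on the channel
    return max(t, ch)

def greedy_partition(comp, sizes, alpha, bw):
    """Greedily merge adjacent partitions while the overlapped makespan
    improves: coarser partitions pay less per-message overhead (alpha) but
    expose less overlap between communication and computation."""
    comp, sizes = list(comp), list(sizes)
    cost = lambda cs, ss: overlapped_time(cs, [alpha + s / bw for s in ss])
    best = cost(comp, sizes)
    improved = True
    while improved and len(comp) > 1:
        improved = False
        for i in range(len(comp) - 1):
            c2 = comp[:i] + [comp[i] + comp[i + 1]] + comp[i + 2:]
            s2 = sizes[:i] + [sizes[i] + sizes[i + 1]] + sizes[i + 2:]
            if cost(c2, s2) < best:
                comp, sizes, best = c2, s2, cost(c2, s2)
                improved = True
                break
    return comp, sizes, best
```

With a large `alpha`, the greedy merge collapses everything into one coarse partition (fewer messages); with a small `alpha`, it keeps finer partitions so communication hides behind computation. This mirrors the trade-off the abstract attributes to the layer-by-layer strategy versus variable partition sizes.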
DOI: 10.1109/TPDS.2021.3062721
Language: English
Scopus ID: 2-s2.0-85102275748
Citation statistics
Cited Times [WOS]: 1
Document Type: Journal article
Collection: Faculty of Science and Technology
Corresponding Author: Wang, Shaoqi
Affiliations:
1. Department of Computer Science, University of Colorado, Colorado Springs, United States
2. Department of Electrical and Computer Engineering, University of Central Florida, Orlando, United States
3. Faculty of Science and Technology, University of Macau, Taipa, Macao
Recommended Citation
GB/T 7714
Wang, Shaoqi, Pi, Aidi, Zhou, Xiaobo, et al. Overlapping Communication with Computation in Parameter Server for Scalable DL Training[J]. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(9): 2144-2159.
APA: Wang, Shaoqi, Pi, Aidi, Zhou, Xiaobo, Wang, Jun, & Xu, Cheng Zhong. (2021). Overlapping Communication with Computation in Parameter Server for Scalable DL Training. IEEE Transactions on Parallel and Distributed Systems, 32(9), 2144-2159.
MLA: Wang, Shaoqi, et al. "Overlapping Communication with Computation in Parameter Server for Scalable DL Training." IEEE Transactions on Parallel and Distributed Systems 32.9 (2021): 2144-2159.
Files in This Item:
There are no files associated with this item.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.