In addition to large volumes of labeled sample data and to deep learning models and algorithms, a high-performance system platform is also crucial to the success of deep learning. Deep learning involves two phases: offline training and online identification.
For the former, a high-performance cluster architecture can be adopted that combines GPU/KNM accelerators, an IB/10GE/25GE high-speed network, and distributed parallel storage.
Because the number of samples to be trained keeps growing, parallel storage with large capacity and high bandwidth is required to store sample data and access it quickly. Examples include images at the 100-million-pixel level and 100,000 hours of voice recordings, with data volumes at the PB level.
The long training period requires not only GPU acceleration but also parallel processing across large-scale cluster systems.
Some models have parameter counts at the billion level and therefore need a high-bandwidth, low-latency network, so that parameters can be updated quickly between nodes and the models can converge.
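To make the bandwidth requirement concrete, the following is a rough back-of-the-envelope sketch rather than a measurement of any Inspur system. It assumes a model with 1 billion parameters stored as 4-byte single-precision values and estimates the lower bound on the time needed to move one full copy of the parameters over a single link. The function name `sync_time_seconds` and the 100 Gb/s figure used for IB are assumptions made for illustration; latency, protocol overhead, and overlap with computation are ignored.

```python
def sync_time_seconds(num_params, bytes_per_param, link_gbps):
    """Lower bound on the time to send one full copy of the parameters over one link.

    Ignores latency, protocol overhead, and any overlap with computation.
    """
    payload_bits = num_params * bytes_per_param * 8
    return payload_bits / (link_gbps * 1e9)


# Assumed figures: 1 billion parameters stored as 4-byte single-precision floats.
NUM_PARAMS = 1_000_000_000
BYTES_PER_PARAM = 4

# Nominal line rates in Gb/s; the 100 Gb/s value for IB is an assumption.
LINKS = {"10GE": 10, "25GE": 25, "IB (assumed 100 Gb/s)": 100}

for name, gbps in LINKS.items():
    seconds = sync_time_seconds(NUM_PARAMS, BYTES_PER_PARAM, gbps)
    print(f"{name}: {seconds:.2f} s per full parameter transfer")
```

Under these assumptions the payload is 4 GB, which takes at least about 3.2 s on 10GE, 1.28 s on 25GE, and 0.32 s at 100 Gb/s. Since parameters are exchanged on every training iteration, this per-transfer cost is paid repeatedly, which is why link speed matters for large models.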
As for online identification, thousands of nodes are needed to provide external services, which poses a serious power consumption challenge.
Building the online identification platform on low-power FPGAs can help address this problem.
Inspur Deep Learning System Platform Architecture
Inspur has created a holistic system solution focused on deep learning, as illustrated below, in which high-performance parallel storage is connected to the computational acceleration nodes through a high-speed network and provides data services to them.
Computational acceleration nodes for offline training adopt high-power GPUs with strong single-precision floating-point performance, or KNM acceleration cards when available.
Computational acceleration nodes for online identification, on the other hand, adopt low-power GPUs with strong INT8 performance or low-power FPGAs customized with recognition programs.
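As background on why INT8 capacity matters for online identification, the sketch below shows one common way to map floating-point weights to 8-bit integers: symmetric per-tensor quantization. It is a generic illustration, not Inspur's or any particular framework's implementation, and the helper names `quantize_int8` and `dequantize` as well as the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Returns the quantized integers and the scale needed to recover approximate floats.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.8, -1.0, 0.3, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each weight is stored in one byte instead of four, and the value recovered after dequantization differs from the original by at most half a quantization step. Trading this small error for a smaller memory footprint and cheaper integer arithmetic is what makes INT8 attractive on low-power hardware.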
Deep learning frameworks such as TensorFlow, Caffe, and CNTK run on the computation nodes, and the AIStation management platform provides task management, a login interface, parameter tuning, and other services.
AIStation also performs state monitoring and scheduling for nodes and computational acceleration components.
The platform as a whole supports the artificial intelligence applications built on top of it.
In the future, the offline training and online identification phases of deep learning will be integrated: data collected online will be fed directly into offline training, and models trained offline will be used to update those deployed online.
One likely path to this online-offline unification is a high-performance, low-power system architecture that combines GPUs, FPGAs, an IB high-speed network, and distributed parallel storage.
Resource Encapsulation of Deep Learning Framework
Current open-source deep learning frameworks depend heavily on third-party libraries and are sensitive to the versions of those libraries, which makes framework deployment and AI application development cumbersome. This is especially the case when versions iterate quickly: frequent updates of the OS and third-party libraries create a lot of unnecessary extra work for developers. Inspur encapsulates a deep learning framework and the libraries it depends on into a single image, which can then be loaded at any time on any Inspur platform that supports resource encapsulation. Because the working environment is fully consistent with the original environment, users can start working immediately and be more productive. Inspur deep learning system solutions support mapping distributed storage into the image, as well as the storage, scheduling, management, and monitoring of images, and use resource encapsulation technology to make framework deployment more efficient and application development more productive. At the same time, they offer optimized integration of resource encapsulation technology with the system solutions.
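The version-drift problem that resource encapsulation avoids can be illustrated with a small sketch. It compares a recorded manifest of pinned dependency versions with the versions found in a running environment and reports every mismatch. This is not part of AIStation or of any Inspur tool; the function `diff_environment`, the package names, and the version strings are all hypothetical.

```python
def diff_environment(manifest, installed):
    """Return {package: (expected, found)} for every dependency that does not match.

    `found` is None when the package is missing from the running environment.
    """
    return {
        name: (expected, installed.get(name))
        for name, expected in manifest.items()
        if installed.get(name) != expected
    }


# Hypothetical pinned versions recorded when the environment was known to work.
manifest = {"framework": "1.4.0", "gpu-runtime": "8.0", "math-lib": "6.0"}

# Hypothetical state of a node after routine OS and library updates.
installed = {"framework": "1.4.0", "gpu-runtime": "9.0"}

for name, (expected, found) in diff_environment(manifest, installed).items():
    print(f"{name}: expected {expected}, found {found}")
```

When the framework and its dependencies are packaged together in one image, a check of this kind passes by construction, because the image carries the exact versions it was built with.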
End-to-End System Delivery Service
Inspur deep learning system solutions provide not only a comprehensive set of hardware, but also end-to-end delivery services for system solutions.
●Consultation on application scenarios and design of system solutions
Inspur AI solution experts work with AI end users to discuss deep learning application scenarios and jointly analyze computation hotspots and bottlenecks, then design system solutions suited to those scenarios.
●Porting and optimization of application codes
Inspur heterogeneous application experts can help clients analyze the characteristics of their CPU codes, determine whether migrating them to heterogeneous acceleration components is appropriate, and work with clients to port and optimize code hotspots, improving the computational efficiency of applications and reducing run time.
●Holistic solutions that integrate software and hardware
Inspur offers not only a comprehensive line of server, high-speed network, and parallel storage products for deep learning, but also the AIStation software management platform and the Teye feature analysis tools for deep learning. These holistic solutions, which integrate software and hardware, bring out the full performance of deep learning frameworks.
●Comparative assessment of the performance of computation acceleration components
Inspur's well-developed comparative assessment of GPU, FPGA, KNM, and other mainstream heterogeneous acceleration components helps clients choose among solutions.
●Implementation and deployment of mainstream deep learning frameworks
Inspur focuses on mainstream deep learning frameworks such as Caffe, TensorFlow, and CNTK, encapsulating the code and files of the required third-party libraries into images that can be quickly deployed to platforms. The frameworks are then easy to set up and learn, with no need to master complex deployment procedures.