Russian programmers have developed a platform for the distributed training of large neural networks. It is designed for a network of many computers of varying performance, any of which can drop out of the process at any time. As with distributed scientific computing projects such as Folding@home, this approach can, with the help of many volunteers, create a network whose computing power is comparable to that of advanced supercomputers. The developers described the platform in a preprint available at arXiv.org and published pre-alpha code on GitHub.
The performance of neural network models depends largely on their size and on the size of the training sample. For example, the leading natural language processing model at the time of this writing, GPT-3, has 175 billion parameters and was trained on 570 gigabytes of text. But training at this scale requires corresponding computing power, which, because of its high cost, is often out of reach for research groups that are not part of large IT companies.
In many areas of science, distributed computing projects solve this problem with the help of volunteers: anyone with Internet access can install a program that performs the calculations scientists need in the background. Together, thousands or even millions of computers give scientists, free of charge, a computing network with the power of leading supercomputers: in 2020, the Folding@home biomolecular simulation network crossed the one-exaflop milestone and continued to grow. But distributed computing networks have drawbacks: any computer can drop out at any time or transfer data slowly and unreliably, and, in addition, not all types of computation divide equally easily into subtasks for distribution to individual computational nodes.
Maksim Riabinin from the Higher School of Economics and Yandex, together with his colleague Anton Gusev, developed the Learning@home platform, which distributes the training of neural network models across many computers. The platform builds on the mixture-of-experts approach, in which separate "experts" (individual algorithms or, in this case, computers) are responsible for processing different inputs. The developers propose splitting the layers of the trained neural network into a set of experts. Each expert can have its own specialization, for example acting as part of a convolutional or other type of neural network.
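To make the mixture-of-experts idea concrete, here is a minimal, self-contained sketch of top-k expert routing: a gate scores every expert for a given input, the k best-scoring experts process it, and their outputs are mixed by the renormalized gate weights. This is an illustrative toy, not the platform's actual implementation; the `Expert` class, the linear gate, and all parameter values are assumptions made up for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of gating scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class Expert:
    """Stand-in for one expert; here just a scalar linear map."""
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return [self.weight * v for v in x]

def moe_forward(x, experts, gate_weights, k=2):
    # Score every expert for this input (dot product with its gate row),
    # keep the top-k, and mix their outputs by the renormalized gate.
    scores = [sum(w * v for w, v in zip(gw, x)) for gw in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    gate = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gate, top):
        y = experts[i].forward(x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

experts = [Expert(1.0), Expert(2.0), Expert(3.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = moe_forward([1.0, 2.0], experts, gate_weights, k=2)
```

In a decentralized setting each expert would live on a different volunteer machine, so only the k selected experts receive network traffic for a given input.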
A network of computers training or running neural network algorithms has a decentralized structure, and each of its computational nodes consists of three parts: a runtime, a control part, and a DHT node. The runtime performs the actual computation, that is, it acts as an expert. The control part accepts incoming data, selects the experts suited to processing it, and collects the results. The DHT node is part of a distributed hash table in which the network stores its data.
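A distributed hash table spreads key/value storage across all participating nodes, so any node can publish or find, say, an expert's network address without a central server. The toy below shows the core mechanism with a consistent-hashing ring; the class, its methods, and the key names are illustrative assumptions, not the platform's real API.

```python
import hashlib
from bisect import bisect_right

class ToyDHT:
    """Toy consistent-hashing ring: each key is stored on the first node
    whose hash follows the key's hash around the ring."""
    def __init__(self):
        self.ring = []    # sorted list of (hash, node_id)
        self.store = {}   # node_id -> {key: value}

    def _h(self, s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def join(self, node_id):
        # New nodes take over an arc of the ring when they join.
        self.ring.append((self._h(node_id), node_id))
        self.ring.sort()
        self.store.setdefault(node_id, {})

    def _owner(self, key):
        # First node clockwise from the key's hash owns the key.
        idx = bisect_right(self.ring, (self._h(key), "")) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        self.store[self._owner(key)][key] = value

    def get(self, key):
        return self.store[self._owner(key)].get(key)

dht = ToyDHT()
for node in ["node-a", "node-b", "node-c"]:
    dht.join(node)
# A hypothetical expert announces where it can be reached.
dht.put("expert.layer0.head1", "203.0.113.5:4040")
```

In the real system each node only stores its own arc of the table and forwards lookups to peers, which is what removes the need for any central coordinator.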
The authors posted the code they used for initial testing of the platform on GitHub, but noted that it should not yet be considered a ready-to-use library. They also noted that the platform in its current form has the typical disadvantages of peer-to-peer networks, including a high load on network infrastructure, as well as a susceptibility to attacks specific to such networks, which stems from their architecture rather than from any particular implementation.
Researchers around the world are working not only on improving the software of neural networks and their training, but also on the hardware. One promising area is neuromorphic chips, whose architecture mirrors the organization and operating principles of biological neural networks, including synapses, dendrites, and axons, allowing them to model interneuronal interaction more accurately. Among them are purely neuromorphic chips from IBM and Intel, as well as a hybrid chip presented last year by Chinese developers that combines blocks for classical and spiking artificial neural networks.