In the history of computing, there has been an endless push and pull between the need for general-purpose versus fine-tuned custom systems and software.
The leadership in high-performance computing (HPC) is typically dominated by general-purpose designs. However, the meticulous work involved in ASIC design, system, and software optimization eventually influences architectural thinking. Currently, ultra-specialization is likely to re-emerge for specific use cases in AI, as evidenced by the first wave of AI chip startups.
The Anton supercomputer architecture is a prime example of special-purpose supercomputing. This custom system is dedicated to solving complex problems in molecular dynamics with unprecedented speed and fidelity, surpassing even top exascale supercomputers.
The Anton system and its creator were recently recognized at the Supercomputing Conference (SC23) with a Test of Time Award. David Shaw, founder of the research firm D.E. Shaw, accepted the award and discussed the evolution of the Anton system architecture and its algorithms, which have adapted over time since the system’s unveiling in 2008.
Shaw represents a departure from tradition across the board. The Test of Time Awards have been largely centered on academic achievements, whereas Shaw took a circuitous route to computational biology. After finishing his PhD at Stanford and teaching computer science at Columbia (while working on the NON-VON parallel system architecture), Shaw joined Morgan Stanley in the mid-1980s. He then founded a hedge fund, D.E. Shaw & Company, which focused initially on optimized trading algorithms, and later founded D.E. Shaw Research.
His work, whether for trading or grand-scale science challenges, emphasized speed and optimization on large parallel systems.
The paper that first described the prototype system, published by the ACM in 2007, claimed that the massively parallel machine “should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems.”
The original paper also explained how the system, which was set to emerge in 2008 with “512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network” could “dramatically accelerate those calculations that dominate the time required for a typical MD simulation.”
Indeed, Anton 1 was able to accomplish everything that the D.E. Shaw team had initially planned. Over a decade of work on both the system architecture and application has led to significant progress in 2023. D.E. Shaw now has six drugs undergoing human clinical trials. Two of these drugs were independently developed from concept to trial, while the other four were developed in collaboration with Relay Therapeutics, a company specializing in protein dynamics for the identification of new drug candidates.
“Our long-term goal has always been to design new molecules that can serve as medications, which is something we’re finally doing, just in the last few years,” Shaw told the SC23 audience. He says that while his team is exploring the intersection of machine learning with the future of drug discovery, the supercomputing architecture piece has “been core all along and is still the most central.”
A central component of the Anton story over the years has been speed and scale: not just of the architecture (Anton 3 is 2X more powerful than the first generation) but of its ability to get the time scales of MD simulations down so far that complex interactions can be observed well enough to design highly targeted treatments for a range of diseases. Getting simulations down to the millisecond in 2008 was groundbreaking, but Anton 3 is in the 1-2 femtosecond range. As Shaw explained at SC23:
“In 2008, the fastest supercomputers of the time had simulated about 1/10th of a microsecond of time in a day. The longest that had ever been done was 10 microseconds and that lasted weeks or even months but was heroic computations. Many of the most important biological phenomena—the kind relevant to potential pharma design—all took place on scales of 10 microseconds so we were orders of magnitude away from where we needed to be.”
At those timescales, there wasn’t much to see beyond molecules vibrating; the big changes that would allow for real discoveries stayed out of reach.
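To put the quoted figures in perspective, the arithmetic can be sketched in a few lines. This is a back-of-the-envelope illustration, not anything from the D.E. Shaw team: the 0.1 microsecond per day rate comes from Shaw's remarks above, while the 2 femtosecond integration step is an assumed, typical MD value, and the function and constant names are hypothetical.

```python
# Back-of-the-envelope timescale arithmetic, done in integer femtoseconds
# so the results are exact.
FS_PER_US = 10**9    # femtoseconds in one microsecond
FS_PER_MS = 10**12   # femtoseconds in one millisecond

# Quoted 2008 rate for conventional supercomputers: ~0.1 microsecond per day.
RATE_2008_FS_PER_DAY = FS_PER_US // 10

# ASSUMPTION: a 2 fs integration step, a common choice in MD codes.
TIMESTEP_FS = 2


def steps_needed(span_fs, timestep_fs=TIMESTEP_FS):
    """Integration steps required to cover `span_fs` of simulated time."""
    return span_fs // timestep_fs


def wall_clock_days(span_fs, rate_fs_per_day=RATE_2008_FS_PER_DAY):
    """Days of wall-clock time to simulate `span_fs` at a given daily rate."""
    return span_fs / rate_fs_per_day


print("steps for 1 ms of simulated time:", steps_needed(FS_PER_MS))
print("days for 10 us at the 2008 rate:", wall_clock_days(10 * FS_PER_US))
print("days for 1 ms at the 2008 rate:", wall_clock_days(FS_PER_MS))
```

Under these assumptions, a 10 microsecond run takes about 100 days at the 2008 rate, which fits Shaw's "weeks or even months," and a full millisecond would take roughly 10,000 days, or about 27 years.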
Shaw showcased simulations that emphasized the importance of ultra-fine timescales in drug discovery. One simulation depicted a protein shifting, creating gaps and pockets that the targeted medication could locate and infiltrate. In another, he revealed a hidden target that didn’t seem to have a binding entry point until simulations demonstrated an opportunity. This would not have been possible without the ability to observe activities that lasted only a very brief period.
Protein dynamics, it turns out, is precisely what the Anton machines excel at. Protein folding, a significant discovery of our emerging century, has opened doors for scientists beyond the pharmaceutical industry. Notably, John Jumper, one of the most recognized names in this field, was part of D.E. Shaw Research during the early stages of Anton development and later spearheaded the AlphaFold project.
“New algorithms running on a conventional supercomputer would have been too slow. And a new supercomputing architecture running conventional algorithms would have been too slow.”
Shaw says that when the massively parallel Anton 1 machine emerged with its custom ASIC and special baked-in capabilities for particle interactions, it “allowed a dramatic increase in simulation length because of its speed; it was 100X faster than the fastest general-purpose [machines] of the time and allowed continuous, millisecond-long simulations of proteins,” which meant many new behaviors and interactions were observed for the first time ever.
Most of the chip area of the original machines (Anton 1 and 2) was dedicated to specialized math that homed in on the most computationally expensive parts of MD simulations. That meant there were (and still are) tradeoffs, including a lack of flexibility and programmability. It was “a very inflexible bunch of fast, stupid logic and nothing was programmable,” Shaw says. “We were brutal to the people who were designing that embedded software and put high priority on having it run fast.”
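For readers who have not seen why particle interactions dominate an MD step, a minimal sketch helps. The code below is a generic textbook pairwise Lennard-Jones force loop in reduced units, not Anton's algorithm or hardware pipeline; the function name, parameters, and cutoff value are all assumptions chosen for illustration.

```python
import math


def lennard_jones_forces(positions, epsilon=1.0, sigma=1.0, cutoff=2.5):
    """Naive all-pairs Lennard-Jones forces and total potential energy.

    positions: list of (x, y, z) tuples in reduced units.
    Returns (forces, energy), where forces is a list of [fx, fy, fz].
    The pair potential is U(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6).
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    energy = 0.0
    cutoff2 = cutoff * cutoff

    # The double loop visits n*(n-1)/2 pairs: this is the expensive part.
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if r2 > cutoff2:
                continue  # ignore pairs beyond the (assumed) cutoff
            sr6 = (sigma * sigma / r2) ** 3
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
            # Force on i is f_over_r2 * d; j gets the equal and opposite force.
            f_over_r2 = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
            for k in range(3):
                forces[i][k] += f_over_r2 * d[k]
                forces[j][k] -= f_over_r2 * d[k]
    return forces, energy


# Two particles at the potential minimum, r = 2**(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0)
forces, energy = lennard_jones_forces([(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)])
print("energy:", energy)
print("force on particle 0:", forces[0])
```

The inner arithmetic is small and fixed, while the number of pairs grows quickly with atom count, which is the kind of workload that lends itself to hardwired pipelines at the cost of programmability.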
The dataflow nature of the Anton systems meant that data went right to where it was needed; it didn’t stop along the way or pop to global memory. There’s plenty of memory on the systems but it’s distributed across the chip, which meant high bandwidth and low latency. “At the interchip level, we had some app-specific ways of minimizing latency and the overall throughput but overall, we always had the luxury of doing that because we knew what algorithms we were trying to speed.”
By the way, if this architectural discovery sounds familiar, it is of course happening among all the AI chip players who, in some ways, also have the luxury of a defined workload to optimize around.
With the introduction of Anton 2 in 2013, teams reported that the new machine was an order of magnitude faster than the original, with support for 15X as many atoms. They were able to add better flexibility and programmability and support for more accurate physical models with some new algorithms. The capacity, or total number of atoms, was a big deal for potential discoveries, but it meant more data movement between chips and further refinement of communication strategies.
With Anton 3 last year, D.E. Shaw pushed its 512-node machine into public view, showcasing its ability to simulate biological systems at unprecedented scale, in the ballpark of millions of atoms. “There were a number of architectural changes in this machine due to changes in underlying technologies, including different rates of advancement in processing versus communication parameters,” but it meant a new world of discoveries, including those that led to the drugs in clinical trials now.
The following slide highlights the improvements compared to traditional HPC architectural elements (GPU/CPU).
The x-axis is the size of the biological system; the y-axis is simulation speed (microseconds per day). As expected, the curves show what we see elsewhere: as system size goes up, performance goes down, because more interactions have to be simulated.
Anton 3 had some of the largest jumps architecturally, including refactoring the ASIC layout to minimize increasingly expensive cross-chip communication by moving to a tile-based architecture that combines sub-tiles for both the “hardwired” particle interaction pipelines and also programmable and more flexible processing units.
Designers also added specialized “bond calculators” or modules that sped up some of the slower parts of working with different bonded atoms and the interactions they propagate. “We could save area, energy, and get speed with this chunk of hardware to take the load off other parts of the chip,” Shaw explains. The team also worked on some novel data compression techniques but ultimately, communication proved the bottleneck. “We learned about our calculations and looked in detail at the underlying physics to find redundancies and opportunities but we have the luxury of looking at just one application.”
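As a rough illustration of what a bonded term looks like in software, here is a minimal sketch of a harmonic bond-stretch calculation. It is a generic textbook form, E = 0.5 * k * (r - r0)^2, and makes no claim about what Anton's bond calculators actually compute; the function name, parameters, and example values are assumptions.

```python
import math


def harmonic_bond(pos_a, pos_b, k, r0):
    """Energy and forces for one harmonic bond between atoms a and b.

    pos_a, pos_b: (x, y, z) coordinates; k: force constant; r0: rest length.
    Returns (energy, force_on_a, force_on_b).
    """
    d = [a - b for a, b in zip(pos_a, pos_b)]   # vector from b to a
    r = math.sqrt(sum(c * c for c in d))        # current bond length
    energy = 0.5 * k * (r - r0) ** 2
    # F_a = -dE/dr * (d / r): pulls a toward b when the bond is stretched.
    scale = -k * (r - r0) / r
    force_a = [scale * c for c in d]
    force_b = [-c for c in force_a]             # equal and opposite
    return energy, force_a, force_b


# A bond stretched from its rest length of 1.0 to 1.5 (arbitrary units).
energy, force_a, force_b = harmonic_bond((1.5, 0.0, 0.0), (0.0, 0.0, 0.0),
                                         k=100.0, r0=1.0)
print("energy:", energy)
print("force on a:", force_a)
print("force on b:", force_b)
```

Each bond involves only a handful of operations on a fixed list of atom pairs, so terms like this are a plausible candidate for a small dedicated unit rather than a general-purpose core.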
“Historically, it’s been hard to have special purpose machines compete with general purpose supercomputers. It was great that Anton got 100X for this application—exciting to us. But the part that relates most to our long-term goal was more to do with the underlying science, learning about molecular systems and curing people. That’s what we wanted to do.”
The Next Platform