Supercharging science with high performance computing

What do ancient dolphin DNA, Brazil’s national tree and extinction risks for the world’s flowering plants have in common?

They’re all important areas of scientific research published in the first few months of this year. But they’re also all areas of research that have benefitted from one of Scotland’s largest – and growing – supercomputing clusters, which has just hit the milestone of completing 100 million hours of computing time.

It’s the UK CropDiversity High Performance Computing (HPC) cluster managed at and by The James Hutton Institute.

By the end of 2024, the system will have a total of 497,588 combined CPU and GPU cores, with a maximum theoretical peak performance of 2.17 petaflops. That’s the equivalent of about 80,000 standard laptops. It also has significant memory capacity and network speeds of up to 100 gigabits per second.

It’s been built and managed to provide a shared collaborative computing resource that scientists and researchers across the UK and beyond could otherwise only dream of accessing to unlock the big questions around plant and animal genetics, biodiversity, phenomics, bioinformatics and more.

And it has been delivering, crunching masses of data for everything from creating seedless clementines to how antimicrobial resistance can spread among hospital patients depending on their genetic diversity!

Part of the UK CropDiversity High Performance Computing cluster at the Hutton. Photos by Paul Glendell.

“Since the first building blocks of the HPC were put in place in 2017, about 19 million analysis tasks have been run, using more than 100 million hours (or nearly 11,500 years) of computing time and covering more than a petabyte (1,000 terabytes (TBs) or 1,000,000,000,000,000 bytes) of data,” explains Iain Milne, who manages the HPC at the Hutton. “If that was MP3 tracks, it would take you 1,900 years to listen to them all, even if listening 24/7. And these numbers are growing by tens of terabytes per day.

“In the last 12 months alone, it’s handled around 400 terabytes of new data, equivalent to around 100,000 HD movies. That includes, recently, the largest data throughput for a single project, the Biodiversity for Opportunities, Livelihoods and Development (BOLD) project.”

BOLD is a Crop Trust project looking to strengthen global food and nutrition security by supporting the conservation and use of crop diversity.

Through the BOLD project, unprocessed sequencing data for more than 400 samples were received. These generated 300 million unfiltered raw variants – all the variations across those samples – that had to be processed.
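For readers who like to check the sums, the headline figures quoted for the cluster can be roughly reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: the 365.25-day year and the 128 kbps MP3 bitrate are our assumptions, not figures given by the Hutton, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the scale figures quoted for the cluster.
# Assumptions (not from the article): 365.25 days per year, 128 kbps MP3s.

HOURS_PER_YEAR = 24 * 365.25
TERABYTE = 10**12  # bytes, decimal units as used in the article
PETABYTE = 10**15  # bytes

# 100 million hours of computing time expressed in years.
compute_hours = 100_000_000
compute_years = compute_hours / HOURS_PER_YEAR

# A petabyte expressed in terabytes.
terabytes_per_petabyte = PETABYTE // TERABYTE

# How long a petabyte of MP3 audio would take to play non-stop.
mp3_bytes_per_second = 128_000 / 8  # assumed bitrate of 128 kbps
listening_years = PETABYTE / mp3_bytes_per_second / 3600 / HOURS_PER_YEAR

# Movie size implied by "400 terabytes is about 100,000 HD movies".
implied_movie_gigabytes = 400 * TERABYTE / 100_000 / 10**9

print(f"Computing time: about {compute_years:,.0f} years")
print(f"One petabyte: {terabytes_per_petabyte:,} terabytes")
print(f"MP3 listening time: about {listening_years:,.0f} years")
print(f"Implied HD movie size: {implied_movie_gigabytes:.0f} GB")
```

Under these assumptions the results land close to the rounded numbers in the quote; a different bitrate or movie size would shift the comparisons but not the overall sense of scale.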

The HPC’s capacity is also helping to fill significant gaps in knowledge about the natural world much faster than would otherwise be possible, especially using methods like machine learning.

An example is a project by the Royal Botanic Gardens, Kew, which sought to address a major knowledge gap around the conservation status of many of the world’s plants.

“Our research group has been looking at ways to increase the scale and speed at which assessments of extinction risk can be generated for plants,” explains Dr Steve Bachman, research leader in ecosystem stewardship at Kew. “With fewer than 20% of plants represented on the IUCN Red List of Threatened Species, it is vital that we fill this gap to help prioritise conservation efforts.

“After establishing that machine learning approaches could be applied to predicting extinction risk in plant groups, we wanted to scale up the application of this approach to all 330,000 species of flowering plants.

“The HPC at Hutton allowed us to process our analysis in a few days compared to the weeks it would have taken with our standard machines. We were able to rapidly generate, for the first time, a prediction of extinction risk for all species of flowering plants. We now intend to re-run the analysis on the cluster as new data on the distribution and status of plants is updated.”

Endangered Clianthus puniceus var. maximus by the entrance to the Duke’s Garden. Image by Ines Stuart Davidson © RBG Kew.

This work was one of more than 20 scientific papers either published or in pre-print up to mid-April just this year whose results relied on the UK CropDiversity HPC. Their topics range from looking at how human genetic diversity influences antibiotic resistance to the development of seedless clementines.

What’s common among them all is a hunt for insights into how animals and plants behave and adapt – and that this is work that’s both generating and involving increasing amounts of data, which in turn drives a continued need for ever more computing power.

“Sequencing in particular is driving a need for more and more computing power,” explains Dr Micha Bayer, a bioinformatics specialist at the Hutton. “As sequencing costs have fallen, the amount of data being generated is rising ever faster, allowing bigger data sets to be created, and hence better science to be done. But it needs that computing power to handle all that data. At the Hutton, the capacity is mostly used for looking at gene expression and genomes – helping to understand how plants adapt to different environments or inputs.”

Machine learning engineer Fraser Mcfarlane adds: “It can also churn through ever expanding datasets generated by the likes of high-throughput phenotyping, as we will have in our Advanced Plant Growth Centre at Invergowrie. This process uses a suite of sensors and cameras to monitor and understand how plants grow and develop.

“Additionally, the vast quantities of remote sensing data produced by satellites and aerial platforms like drones can be analysed, helping us understand the world around us. The types of machine learning-based image analysis required to process such information require vast amounts of computing power that’s inaccessible without a resource like the HPC.”

So what about the ancient dolphin DNA, you ask? That was a project looking at how today’s bottlenose dolphins differ, genetically, from their ancestors thousands of years ago and how pelagic dolphins that decided to spend more time in coastal areas have adapted to those environments – and even evolved to be more like their ancestors.

Paubrasilia, by mauroguanandi, CC BY 2.0 https://creativecommons.org/licenses/by/2.0, via Wikimedia Commons.

The work on Brazil’s national tree is focusing on supporting conservation efforts of this endangered tree – found only in the Brazilian Atlantic Forest – by better understanding its genetic diversity and any differences within the species.

Work on all of this and more has been published just this year: winter wheat varieties; what happens to the fruit fly’s genetics when their mating patterns change (that is, monogamy versus having multiple mates); the diseases that attack crops and how these evolve, studied by looking at their wild counterparts so they can be better tackled; and how wheat varieties can be developed to better take up water and nutrients.

To find out more, visit here.

The UK CropDiversity HPC was funded by Biotechnology and Biological Sciences Research Council and Advanced Life Sciences Research Technology Initiative (ALERT) grants BB/S019669/1 and BB/X019683/1 and the Department for Business, Energy and Industrial Strategy Public Sector Research Establishment Infrastructure Fund.

The HPC partners are NIAB, The Natural History Museum, Scotland’s Rural College, Royal Botanic Gardens, Kew, Royal Botanic Garden Edinburgh, University of Edinburgh and the University of St Andrews.

Disclaimer: The views expressed in this blog post are the views of the author(s), and not an official position of the institute or funder.