Five researchers unlock the power of big data

[Illustration: a giant wave filled with colourful, uneven lines, like a series of graphs, representing a vast amount of collected data.]

From DNA sequencing to artificial intelligence to literary history, high-performance computing opens up an incredible breadth of possibilities for researchers across Canada.
July 30, 2015

The incredible power of high-performance computing to unlock massive data sets in order to answer an impressive range of research questions is a hallmark of computing in the present era. Research today is increasingly driven by massive digitization initiatives, high-throughput devices, sensor platforms and computational modelling and simulation, all of which generate data that are unprecedented in size and complexity. Here are five Canadian researchers whose work relies on advanced computing capabilities.

DNA detective - Guillaume Bourque, McGill University

Pick any two people from anywhere in the world and you’ll find that 99.5 percent of their genomes are the same. The remaining 0.5 percent accounts for our differences, including our susceptibility to illnesses ranging from cancer to Alzheimer’s disease.

But because the human genome is made up of more than three billion nucleotides, even 0.5 percent is a huge amount to sort through (roughly 15 million nucleotides), and it doesn’t help that those variations can show up just about anywhere in the genome.

It’s no surprise then that when researchers look for common variations in a group of people suffering from the same illness, they do so with the help of high-powered computing.

In Canada, many researchers send blood or tumour samples to the lab of McGill University genomicist Guillaume Bourque.

After DNA sequencing, Bourque and his team take the hundreds of millions of DNA fragments from each person’s sample and reassemble them into a single genome for that individual. They then compare each person’s genome to a reference genome compiled from the genomes of healthy people.

“We might get samples from 1,000 patients with the same disease,” says Bourque. “We look to see if they have anything in common that is different from most people and that requires a lot of computing.”
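In spirit, the comparison looks something like the minimal Python sketch below, with hypothetical toy sequences standing in for real three-billion-nucleotide genomes and with all the complexity of alignment and quality control stripped away: reduce each patient’s genome to the positions where it differs from the reference, then intersect those sets across patients.

```python
# Minimal sketch of finding variants shared across patients.
# Hypothetical toy data: real pipelines first align hundreds of
# millions of sequenced fragments per person before any such comparison.

reference = "ACGTACGTACGT"  # stand-in for a 3-billion-nucleotide genome

# Each patient's reassembled genome, same length as the reference.
patients = [
    "ACGTTCGTACGA",
    "ACGTTCGTACGT",
    "ACCTTCGTACGT",
]

def variant_positions(genome, reference):
    """Return the positions where a genome differs from the reference."""
    return {i for i, (a, b) in enumerate(zip(genome, reference)) if a != b}

# Variants present in every patient are candidate disease links.
shared = set.intersection(*(variant_positions(g, reference) for g in patients))
print(sorted(shared))  # [4]: all three samples differ at position 4
```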

Protein sleuth - Régis Pomès, SickKids Hospital

Spying on proteins as they go about tasks that keep our bodies functioning can yield crucial insight into treatments for a range of illnesses. The challenge is that proteins often work in disordered clumps and at high speeds.

This disordered state allows the protein elastin, for example, to give skin, lungs and major arteries the ability to stretch and recoil.

“You need lungs to be elastic to breathe, but elastin is poorly understood,” says Régis Pomès, a computational biophysicist at Toronto’s SickKids Hospital.

Pomès wants to better understand elastin because that knowledge could lead to treatments for lung disease, as well as to artificial skin for burn victims and vascular grafts for heart patients.

Experiments designed to peer into disordered elastin clumps produce snapshots that are hard to piece together. In other words, experiments can’t give a complete picture. So Pomès mimics elastin, and other proteins, using high-performance computing, or as he puts it, “we make cartoons of biomolecular systems.”

He and his colleagues then scrutinize these cartoons to see how proteins move, how likely they are to take on certain shapes and how fast they can shift from one to another.

“We need a lot of detail about things happening in very small systems on very fast timescales,” says Pomès. “High-performance computing is essential to generate and analyze the huge amounts of data that we need to find out what is useful.”
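A toy stand-in for that kind of analysis, assuming a made-up two-state “cartoon” rather than an actual molecular dynamics simulation: let a protein hop between a compact and an extended shape with assumed probabilities, then measure how likely each shape is and how often it switches.

```python
# Toy "cartoon" of a protein hopping between two conformations.
# The transition probabilities are assumed for illustration; real
# molecular dynamics tracks every atom over millions of timesteps.
import random

random.seed(0)
P_STAY = {"compact": 0.95, "extended": 0.90}  # hypothetical values

state, trajectory = "compact", []
for _ in range(100_000):
    trajectory.append(state)
    if random.random() > P_STAY[state]:  # occasionally switch shape
        state = "extended" if state == "compact" else "compact"

# How likely is each shape, and how fast does it shift between them?
occupancy = trajectory.count("compact") / len(trajectory)
switches = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
print(f"compact {occupancy:.1%} of the time, "
      f"{switches / len(trajectory):.4f} transitions per step")
```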

Artificial intelligence interrogator - Yoshua Bengio, Université de Montréal

Yoshua Bengio wants to understand the mechanisms that give rise to intelligence, in both living creatures and machines.

“Nobody really knows what these mechanisms are, but we are developing theories and trying them out on high-performance computers,” says the Université de Montréal computer scientist.

So far, those theories have been good enough to lead to significant advancements in artificial intelligence. Two of the most notable examples are speech recognition technology and object recognition in images, which is used to search for images linked to word queries and tag images found on the internet.

Machine learning algorithms, which are recipes for computers to learn from examples, were essential to these breakthroughs, and further research on those algorithms could lead to even more significant advances.

Bengio compares the process to learning to play tennis. With the help of an instructor who recommends slight adjustments, the would-be tennis player gets better each time they practice. 

“That’s how computers learn,” he says. “They repeat it millions or billions of times. They need a lot of computing power because we are trying to make machines absorb a lot of knowledge.”
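That loop of guess, measure, nudge and repeat is, in its most minimal form, gradient descent. The one-parameter Python sketch below is a toy with made-up data, nothing like the deep networks Bengio’s group trains, but the structure of the loop is the same.

```python
# Learning by repeated small adjustments: gradient descent on a single
# parameter w so that y ~ w * x. Toy, hypothetical data; real systems
# adjust millions of parameters over millions of examples.

examples = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)]  # made-up (x, y) pairs

w = 0.0               # the learner's current guess
learning_rate = 0.01  # how big each nudge is

for _ in range(10_000):  # practice, over and over
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
    w -= learning_rate * grad  # the instructor's slight adjustment

print(f"learned w = {w:.2f}")  # about 3, the slope hidden in the data
```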

The trick is to help machines capture that knowledge so they can classify or predict correctly when given new information. Ultimately, though, Bengio would like to untangle the mystery around what he calls “unsupervised learning,” meaning learning that happens without access to the right answers.

Particle spotter - Reda Tafirout, TRIUMF

The holy grail of particle physics, the Higgs boson, was discovered in 2012. But there is still much to be learned by smashing protons together at high energies in the world’s most powerful particle accelerator, the Large Hadron Collider in Meyrin, Switzerland.

Finding the Higgs boson gave physicists more certainty that the Standard Model is correct. The model is a mathematical framework used to describe the fundamental nature of matter and the forces that shape our universe. Because the Higgs boson was the last particle in the model to be found, its discovery made headlines around the world.

The next phase involves fine-tuning scientists’ understanding of the particle and searching for new phenomena, such as dark matter. TRIUMF particle physicist Reda Tafirout says this means doubling the energy in the collider and producing more Higgs boson samples to refine the measurements.

“The Standard Model makes very precise predictions, so if we have any measurement that is not fully compatible with it — a new interaction or a new force that hasn’t been discovered — we want to know,” he says.

But with a staggering number of protons smashing together at the same time, it’s hard to single out which collisions might be relevant, which is why scientists rely on high-performance computers.

“It selects the collisions that lead to interactions that give some insight,” says Tafirout.
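The selection step can be sketched as a simple filter. Everything below is a hypothetical stand-in: the real system is a cascade of hardware and software triggers, not a one-line energy cut.

```python
# Toy sketch of keeping only the interesting collisions. Hypothetical
# event generator and threshold; the LHC's actual trigger system is a
# multi-stage hardware/software cascade.
import random

random.seed(1)

def collisions(n):
    """Yield hypothetical collision events with a summed energy in GeV."""
    for _ in range(n):
        yield {"energy": random.expovariate(1 / 50)}  # mostly low-energy

THRESHOLD = 300  # hypothetical cut: keep only rare, high-energy events

kept = [e for e in collisions(1_000_000) if e["energy"] > THRESHOLD]
print(f"kept {len(kept)} of 1,000,000 collisions for further analysis")
```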

Literary history hound - Susan Brown, University of Guelph

The Orlando Project began as a major history of women’s writing in the British Isles and has grown up to become a leading example of how to integrate text and technology.

It is not a book, nor is it a digital edition of an existing text, explains one of its directors, University of Guelph digital literary historian Susan Brown. Rather, it is a trove of information on 1,300 writers (amounting to eight million words) that combines information about their writing careers with chronological and bibliographical information.

“What makes Orlando different from similar scholarly works is the extent to which the material is structured by the encoding of the text to reflect various aspects of literary history,” says Brown. That ranges from features of literary works, such as genre or how they were received, through writers’ relationships with their publishers, to their intellectual influences, friends, political activities and health concerns.

With the help of advanced research computing, the Orlando Project’s specialized encoding allows materials to be found, sifted and reordered according to researchers’ interests and priorities. It also enables massive visualizations of writers’ networks and relationships that allow researchers to perceive new patterns in cultural history.
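A sketch of why the encoding matters, using a hypothetical tag set rather than Orlando’s actual markup scheme: once facts about a writer are tagged instead of buried in prose, a few lines of code can find, sift and reorder them.

```python
# Hypothetical semantically encoded record (not Orlando's real tag set).
import xml.etree.ElementTree as ET

record = """
<writer name="Aphra Behn">
  <work genre="drama" year="1677">The Rover</work>
  <work genre="fiction" year="1688">Oroonoko</work>
  <connection type="publisher">William Canning</connection>
</writer>
"""

root = ET.fromstring(record)

# Because genre and date are encoded, a researcher can sift and
# reorder the material: for example, list only the fiction.
for work in root.findall("work[@genre='fiction']"):
    print(root.get("name"), work.get("year"), work.text)
```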

“Orlando has been heralded as a model for other such works of digital scholarship to follow, in its use of semantic encoding to create a digital resource that leverages the power of computers in new ways,” says Brown. Orlando’s pioneering model for digital scholarship also underpins the Canadian Writing Research Collaboratory, a new online platform launching in spring 2016 that will make advanced research computing accessible to literary scholars across the country.

Sharon Oosthoek is a freelance journalist who lives in Toronto and writes about science.