Needle in a haystack: Wrangling data to identify host biomarkers of TB progression

Up to 80% of the adult South African population are estimated to be infected with Mycobacterium tuberculosis (Mtb), the bacterium that causes tuberculosis (TB). However, about 90% of healthy individuals believed to be carriers of the bacteria will never get sick or infect anyone else. If researchers could pinpoint what causes TB to progress in some individuals and not in others, they could break the back of the TB epidemic. The South African Tuberculosis Vaccine Initiative (SATVI) is working towards predicting an individual’s likelihood of developing TB, but the size of the data involved is a challenge in itself. UCT eResearch has been working with SATVI from the point of data acquisition and will continue to do so until the data is eventually published.

Researchers at SATVI, co-led by SATVI director Professor Mark Hatherill and immunology lead Professor Thomas Scriba, set out to identify what differentiates healthy individuals infected with Mtb who ultimately develop TB from those who remain healthy. A decade ago, SATVI recruited a large cohort of around 6 300 healthy adolescents, about half of whom were infected with Mtb. This cohort was followed up every six months for two years.

“While we recruited all the adolescents at the same point, healthy and without symptoms,” says Dr Virginie Rozot, a postdoctoral researcher at SATVI, “at the end of the study, we had two clear groups: those who were susceptible and developed TB, and those who stayed healthy.”

The researchers’ goal was to identify differences in immune responses between the two groups, and key blood markers that could be used to distinguish which individuals would develop TB, so that they could be treated pre-emptively.

As part of this project, Rozot developed the first mass cytometry (CyTOF) platform in Africa. This technique combines two experimental platforms – flow cytometry and elemental mass spectrometry – to allow researchers to study more properties of individual blood cells than was previously possible.

“I take samples from the cohort that have been cryopreserved, and I thaw and label them with different metal tags,” she says. “I can identify up to 45 markers on each cell, so you can imagine the exponential number of combinations of these 45 markers existing in a sample of a million cells.”
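To give a rough sense of the scale Rozot describes, the short sketch below counts the possible marker combinations under one simplifying assumption that is not from the study itself: each of the 45 markers is scored only as present or absent on a cell. The names `N_MARKERS`, `CELLS_PER_SAMPLE` and `marker_combinations` are illustrative only.

```python
# Illustrative arithmetic only. Assumption: every marker is treated as a
# simple present/absent flag, which is a simplification of real cytometry data.
N_MARKERS = 45                 # markers measured per cell (from the article)
CELLS_PER_SAMPLE = 1_000_000   # roughly a million cells per sample (from the article)


def marker_combinations(n_markers: int) -> int:
    """Number of distinct present/absent patterns across n_markers markers."""
    return 2 ** n_markers


combos = marker_combinations(N_MARKERS)
# How many possible patterns exist for every cell actually measured in one sample.
patterns_per_cell = combos // CELLS_PER_SAMPLE

print(f"Possible marker combinations: {combos:,}")
print(f"Possible combinations per measured cell: {patterns_per_cell:,}")
```

Under this assumption there are about 35 trillion possible patterns, far more than the number of cells in any one sample, so only a small fraction of the possible combinations can ever be observed in a single run.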

Rozot and her colleagues rapidly ran into difficulties due to the amount of data they generated. Every day, the cytometer would analyse a few dozen samples, each with a million-odd cells and a high number of marker combinations. The resulting data files were massive and needed to be stored until the completion of the project – and beyond – for analysis. The challenge was that the controller computer attached to the cytometer could not store the data generated by the equipment.

Ashley Rustin, senior technical specialist at UCT eResearch, helped the group with their data requirements by ensuring that the data from their instruments, such as the CyTOF, was automatically backed up to the research data central storage repository.

This ensured that the data was secure and that SATVI researchers could access the data from their computers over the network or from anywhere in the world.

“The data sets generated on the CyTOF are massive, and I soon realised that the network was a bottleneck,” says Rustin. “I arranged for the network link between the controller computer and the building switch to be upgraded. This significantly improved the speed of the backup of the data sets to the research data repository located in the Upper Campus data centre.”

Curating the data for open publishing

As the funder of this project, the Bill & Melinda Gates Foundation requires that the data sets be published in a reliable open-access repository. Thus far the team has identified UCT’s institutional repository ZivaHub as a good option for sections of the data, while the NIH-supported ImmPort repository will house the remainder.

The curation of the data is not completely straightforward, though, says Dr Mbandi Kimbung, a SATVI researcher who is working on preparing this data set so that it can eventually be published and shared with the public. Using one raw data set, the SATVI researchers are undertaking a number of different research projects. The outcome is a range of processed data sets, each looking at different aspects of the same blood samples.

“This has resulted in very rich sets of data, with a range of layers that can now be integrated,” explains Kimbung. “It is an attempt to model the immune system, to see whether that can help us to understand the development of TB.”

In addition to curating the data, Kimbung and the team also had to factor in the requirements of ethics committees around the use of samples from human participants in research. The data needed to be completely de-identified before it could be published, for instance.

“The question of who would control the data was important to the ethics committee,” says Kimbung. ZivaHub was an attractive option for hosting the clinical database, as it assigns a UCT digital object identifier (DOI) to the data that establishes its ownership. “Also, with ZivaHub, we can edit the data during the lifecycle of the project and have complete control over when the data is made public.”

Large gaps remain in our understanding of the complex interactions between the TB bacterium and the human immune system. Closing these gaps is critical to developing better interventions that will halt TB transmission.

“Systems biology and big data are key to unravelling the complex interplay between Mtb and the human host,” says Scriba. “The recent advances in technology and data science are already bearing fruit and I am very excited about the new biological insights and innovative medical interventions that are emerging.”
