3.6.6 Background and Significance

Tree Star now has a user base of over 10,000 flow cytometry analysts, and many are overwhelmed by the requirements of processing the gigabytes of data in large experiments. The number of colors (biomarkers) continues to grow, and integration of cytometric data with other techniques in genomics and proteomics necessitates enterprise-level data management and higher levels of automation. Yet the current gold standard in analysis involves manual compensation and gating. Robotic plate loaders and high-throughput cytometers make it possible to collect data faster than the lab can analyze it. The size and scope of experiments in cytometry continue to grow.

The analysis problem is greatly complicated by the lack of a ground-truth data set with known characteristics, and by the large variance in immune characteristics between subjects. Experts can cite scores of anecdotal cases of instrument malfunction, operator error, and biological contamination of every sort, along with rules of thumb for distinguishing them in a dot plot.

To study this problem, we have created a methodology for applying classification algorithms to selected data sets in high enough volume to assess the quality of the classifiers. We have created a set of intermediate file formats that map to the steps of the research design. The emphasis on modularity and extensibility, as covered in the Architectural Plan[58], enables collaborators to contribute directly to the project.

Use Cases

To develop the foundation for initial analysis of algorithms, we have constructed synthetic data sets that model different levels of noise in different distributions. We wrote Python scripts to build and combine data sets, so any number of simulated classes can be mixed into increasingly complex sets. This will permit us to quantitatively measure the ability of algorithms or humans to recognize expected patterns.
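The idea of mixing simulated classes can be illustrated with a minimal sketch. It is not the project's actual scripts: the function names, the Gaussian and uniform distributions, the two-parameter events, and all numeric values are assumptions chosen for illustration. Each event keeps its class label, which is what makes the set usable as a known answer when scoring a classifier or a human analyst.

```python
import random
from collections import Counter


def make_population(label, center, spread, n_events, rng):
    """Draw n_events from an axis-aligned Gaussian centered on `center`."""
    return [
        ([rng.gauss(mu, spread) for mu in center], label)
        for _ in range(n_events)
    ]


def make_background(n_events, low, high, n_dims, rng, label="noise"):
    """Draw uniformly distributed background events to act as noise."""
    return [
        ([rng.uniform(low, high) for _ in range(n_dims)], label)
        for _ in range(n_events)
    ]


def mix(populations, rng):
    """Combine labeled populations into one shuffled data set."""
    events = [event for population in populations for event in population]
    rng.shuffle(events)
    return events


# Hypothetical parameters: two simulated classes plus uniform background.
rng = random.Random(42)  # fixed seed so the set is reproducible
dataset = mix(
    [
        make_population("A", (200.0, 800.0), 30.0, 500, rng),
        make_population("B", (700.0, 300.0), 60.0, 300, rng),
        make_background(50, 0.0, 1024.0, 2, rng),
    ],
    rng,
)
label_counts = Counter(label for _, label in dataset)
```

Raising the spread of a class or the number of background events gives a progressively harder set, and further calls to make_population add more simulated classes to the mix.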

Synthetic data are useful in developing metrics, but have only limited applicability in simulating real-world flow cytometry data sets. The current generation of synthetic data is not rich enough to reverse-engineer the creation of compensation and calibration controls, and so cannot model the users' ability to compensate correctly. Longitudinal studies are important for detecting changes within samples collected on a single instrument. The issues of inter-instrument, and especially inter-laboratory, comparisons add too many additional dimensions to be accurately simulated. The Immune Tolerance Network[59] is a non-profit, government-funded consortium of researchers working to normalize across experimental conditions. Their results confirm our expectations of additional complexity.

We are working with academic collaborators on two specific case studies: Graft vs. Host Disease (GvHD) and the effects of Simian Immunodeficiency Virus (SIV) on immune response. Both are time studies with multiple time points, treatments, and subjects. They contain a variety of quality control problems and limitations in the results that reflect the problems of real-world flow cytometry data. Better standardization and quantitation will improve the results of automated classifiers, but our short-term goal remains focused on analysis of these pre-chosen data sets, precisely so that we cannot preselect files to improve the results.

Classifiers

We apply established classification algorithms to these data sets. The original proposal included five families of supervised and unsupervised classifiers. Early development iterations revealed that the quantitative tools we are developing to measure classifiers' concordance with experts are well-suited to studying the sets of events where experts converge and diverge in their gating. Using this measure of consensus, we are finding that we can isolate training sets that convey the assay to a supervised classifier. We can amplify or filter certain areas to model an individual agent's predispositions. We can create workspaces for training purposes and score trainees on their similarity to experts. We can score the quality of training on an assay by measuring convergence of expert opinion before and after training. Or we can study the characteristics of events where an agent diverges from consensus. Most importantly, we can provide an objective and impartial metric for ranking and comparing the upcoming generation of algorithmic flow cytometry classifiers.
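One simple way to express such a consensus measure is sketched below. It is an illustrative assumption, not the metric used in the project: each gate is represented as the set of event indices it includes, and an agent's concordance is the average, over all events, of the fraction of experts who made the same in-or-out call. The function names and the toy gates are hypothetical.

```python
def votes(expert_gates, n_events):
    """Count how many experts included each event in their gate."""
    return [sum(i in gate for gate in expert_gates) for i in range(n_events)]


def contested_events(expert_gates, n_events):
    """Events on which the experts diverge (neither all in nor all out)."""
    n_experts = len(expert_gates)
    return [
        i for i, v in enumerate(votes(expert_gates, n_events))
        if 0 < v < n_experts
    ]


def unanimous_events(expert_gates, n_events):
    """Events on which every expert agrees; a candidate training set."""
    contested = set(contested_events(expert_gates, n_events))
    return [i for i in range(n_events) if i not in contested]


def concordance(agent_gate, expert_gates, n_events):
    """Mean fraction of experts who made the same call as the agent."""
    n_experts = len(expert_gates)
    total = 0.0
    for i, v in enumerate(votes(expert_gates, n_events)):
        same_call = v if i in agent_gate else n_experts - v
        total += same_call / n_experts
    return total / n_events


# Toy example: three expert gates over six events (indices 0..5).
experts = [{0, 1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3}]
trainee = {1, 2, 3}
algorithm = {0, 1, 2, 3, 4, 5}

trainee_score = concordance(trainee, experts, 6)
algorithm_score = concordance(algorithm, experts, 6)
```

Under this definition the same score applies to a trainee or an algorithm, so agents can be ranked on one scale, and the contested events identify where closer study of divergence is warranted.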

Collaborators

The key dynamic of this project, and the reason its significance has grown since project inception, is the emergence of a broader academic, government, and industrial interest in solving this classification problem.

A distinct set of projects has arisen to apply bioinformatics processes to this domain. Most use the R programming language[11], Bioconductor packages[51], Flowcyt[47], and flowCore[57] to apply statistical processes. We have prototyped piping data from FlowJo to R and have a project in the timeline to automate the application of third-party classifiers within our framework. Cytobank[75] is a repository with analysis capabilities that has a different set of emphases, especially its strong focus on plate-based assays. We found MATLAB neural net implementations and were able to work with classifiers run there. Changes made during the elaboration of the project have defined the intermediate file formats, which make it easy to transfer classifier results between tools, or into and out of FCS format. The latter is needed to compare the results to the raw collection file.

Significance

  • FlowDx fits the “translational medicine” model of the NIH Roadmap[56]. FlowDx will reduce error in the diagnosis of diseases.
  • FlowDx will speed results to physicians, offering the opportunity for patients to learn the outcome more quickly and facilitating faster therapeutic intervention.
  • FlowDx will better accommodate large-scale research by allowing greater volumes of complex data to be much more quickly examined, compared, and quantified.
  • FlowDx will reduce the expense of analysis. Tree Star estimates a reduction of fifty percent in the cost of cell analysis, based purely on triaging out the first 90% of normal results.
  • FlowDx will bring the algorithms for population selection to the customer in a friendly and customizable way. Many clustering algorithms are written in the R programming language, which is not accessible to the majority of cytometry labs or researchers in the clinical environment.

Cytometry is a key component of many biomedical studies. Software that can classify biological populations based on cytometry data does not yet exist.