3.6.2 Research Design and Methods
(As submitted in Grant Application)
We are investigating human pattern recognition in flow cytometry data, with the intent of using models to create classification algorithms that can match or exceed human performance. We have prepared data sets from which we can quantitatively evaluate the human and algorithmic performance, and objectively compare them. The outcome of this comparison directs the development to improve the algorithm.
Ongoing work is needed to develop the experimental protocol, whereby a researcher can compare two or more classifications of identical data sets to study the differences, biases, and effectiveness of human and algorithmic classifiers. Early rounds of human analysis showed huge differences in population sizes. Better operating guidelines, tighter control over the workspaces, and more standardized processes are prerequisites for analyzing the data effectively.
With improved protocols, we can re-run the experiments on an increased scale to quantify human gating patterns. We are building a database that will randomize subsets of the experiment, construct FlowJo workspaces, and e-mail them to participating experts for gating. FlowJo will automatically submit the gates to the FlowDx database, which will support comparison to other gates from different analyses on the same data. Comparisons can be based on summary statistics, or event-wise comparison. Any number of classifications can be combined to form consensus masks, which can be used to select out events.
Research has uncovered additional statistical techniques used in analogous problems. V-Measure, Mallows distance, and ROC are similar to Match Ratio, and provide more options for comparing classifiers. We’ve abstracted calculating the metric from the processing of the populations to allow any combination of metrics.
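A minimal sketch of this separation, assuming each classification is stored as a list of 0/1 flags, one per event. The two metrics shown (simple agreement and Jaccard overlap) are stand-ins chosen for brevity, not the Match Ratio, V-Measure, or Mallows distance named above, and the names `METRICS` and `score_all` are hypothetical rather than taken from the FlowDx code.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Mask = Sequence[int]
Metric = Callable[[Mask, Mask], float]


def agreement(a: Mask, b: Mask) -> float:
    """Fraction of events that both classifications label identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a: Mask, b: Mask) -> float:
    """Events in both gates divided by events in either gate."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


# Metrics are registered by name; adding one does not touch the pairing loop.
METRICS: Dict[str, Metric] = {"agreement": agreement, "jaccard": jaccard}


def score_all(
    masks: Dict[str, Mask], metrics: Dict[str, Metric] = METRICS
) -> Dict[Tuple[str, str, str], float]:
    """Apply every registered metric to every pair of classifications."""
    results = {}
    for (name_a, a), (name_b, b) in combinations(masks.items(), 2):
        for metric_name, metric in metrics.items():
            results[(name_a, name_b, metric_name)] = metric(a, b)
    return results


masks = {"Tight": [1, 0, 0], "Wide": [1, 1, 0]}
scores = score_all(masks)
```

Because the pairing loop only sees callables, a new metric can be added by registering one more function in the dictionary.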
With each iteration of the development cycle, the software simplifies the workflow and reveals the significance of this technique. Tagging each event with the probability that it belongs to each class enables us to use our existing tools to analyze and visualize the results of the classifiers. As we incrementally add data to the existing sets, and add new assays to the experiment, we will better understand what is involved in the gating process, and how to perform it algorithmically.
Our hypothesis is that the metrics we use to rate the classifiers will also be useful in filtering and sorting events, based on level of agreement within the group. This provides a mechanism to build training sets for supervised classifiers which reflect consensus, emphasizing the events where the experts are in highest agreement.
Workflow
The fundamental question is, simply put: how do we know who is more correct?
In problems where the ground truth is known, this answer is straightforward. In flow cytometry data, where populations move around due to biological, laboratory, and instrumental variation, there is no single known answer. It is an interpretative process strongly influenced by the experts’ experience and understanding of the experimental context.
To a degree, consistency is a mitigator of these imprecisions – as long as the process is held constant, quantitative results are valid relative to the controls set up within the experiment. The statistical measure of frequency relative to parent population, as opposed to the frequency of the total sample event count, serves to remove the influence of debris and non-specific staining.
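The effect of the choice of denominator can be illustrated with a small worked example. The event counts below are hypothetical; they model the same sample acquired with and without extra debris.

```python
def frequency(child_count: int, reference_count: int) -> float:
    """Fraction of the reference population that falls in the child gate."""
    return child_count / reference_count


# Hypothetical counts: the second sample has 4,000 extra debris events,
# but the parent and child populations themselves are unchanged.
clean = {"total": 10_000, "parent": 6_000, "child": 1_500}
dirty = {"total": 14_000, "parent": 6_000, "child": 1_500}

freq_of_parent_clean = frequency(clean["child"], clean["parent"])
freq_of_parent_dirty = frequency(dirty["child"], dirty["parent"])
freq_of_total_clean = frequency(clean["child"], clean["total"])
freq_of_total_dirty = frequency(dirty["child"], dirty["total"])
```

In this sketch the frequency of parent is 0.25 in both samples, while the frequency of total drops from 0.15 to about 0.107 purely because of the added debris.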
In many cases, the border events of a gate are not critical – the center of the population is influencing the statistics.

Figure 1: Large differences in gate frequency may show insignificant differences in medians.
But showing a histogram of the Yellow (DNA) dye reveals a big difference in the G2 population, which would not have been apparent in the scatter plot of all events:

Figure 2: Tight gating above has lost the G2 population.
Experiments that compare across time, instrumentation, and institution require more sophisticated quantitation to calibrate, normalize, and compare files. Algorithmic classifiers will often look at all dimensions simultaneously, as opposed to the sequential two-dimensional drill-down that is common in manual analysis, so visualizing the differences in results will only get more complicated.
Clustering results might be input in a variety of formats. GatingML is a new, ISAC-approved standard definition of a gate, but it is still relatively unsupported, is limited to geometric definitions, and would constrain a probabilistic classifier. Exported FCS files would be a common format for clusters found with cytometry software. CSV files containing only the events of interest, or all events with a calculated column indicating the cluster, are explicitly supported as well, enabling us to include any clustering algorithm in the future.
Regardless of how populations are found, they all can be converted to a simple table called a popmask. A popmask is the table of all events, with a column for each clustering, and a 1 or 0 to indicate whether the event is in a cluster.
| Event Number | Tight | Wide |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
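As an illustration, the three-event popmask above could be assembled from per-gate lists of event numbers roughly as follows. This is a sketch under the assumption that each clustering arrives as a set of member event numbers (for example, after importing a CSV of the events of interest); `build_popmask` is a hypothetical helper, not part of FlowJo or FlowDx.

```python
from typing import Dict, List, Set


def build_popmask(n_events: int, gates: Dict[str, Set[int]]) -> List[Dict[str, int]]:
    """Return one row per event with a 1/0 column for each clustering."""
    rows = []
    for event in range(1, n_events + 1):
        row = {"event": event}
        for name, members in gates.items():
            row[name] = 1 if event in members else 0
        rows.append(row)
    return rows


# Event numbers captured by each gate in the example table.
gates = {"Tight": {1}, "Wide": {1, 2}}
popmask = build_popmask(3, gates)
```

Any source format that can be reduced to "which events are in this cluster" fits the same table, which is what makes the popmask a convenient common denominator.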
A popmask can be translated into the probability that each event is correctly classified by averaging the different classifications, called a tally.
| Event Number | Tight | Wide | P(lymph) |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0.5 |
| 3 | 0 | 0 | 0 |
This step supports any number of classifiers being included in the average, and weighting the confidence in any classifier. By changing the weighting factors, we can tune metrics to match user conditions and requirements.
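A minimal sketch of a weighted tally, assuming the popmask is held as a list of per-event rows and that weights are simple non-negative numbers per classifier. The function name `tally` and the weight values are illustrative assumptions.

```python
from typing import Dict, List, Optional, Sequence


def tally(
    popmask: List[Dict[str, int]],
    classifiers: Sequence[str],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of the 1/0 votes for each event."""
    if weights is None:
        weights = {name: 1.0 for name in classifiers}  # equal confidence
    total_weight = sum(weights[name] for name in classifiers)
    return [
        sum(weights[name] * row[name] for name in classifiers) / total_weight
        for row in popmask
    ]


popmask = [
    {"event": 1, "Tight": 1, "Wide": 1},
    {"event": 2, "Tight": 0, "Wide": 1},
    {"event": 3, "Tight": 0, "Wide": 0},
]

equal = tally(popmask, ["Tight", "Wide"])
# Hypothetical weighting: trust the Tight gate three times as much as Wide.
skewed = tally(popmask, ["Tight", "Wide"], weights={"Tight": 3.0, "Wide": 1.0})
```

With equal weights this reproduces the P(lymph) column of the table; with the skewed weights the disputed event 2 moves from 0.5 to 0.25.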
Once the confidence that each event is in each class has been tallied, the basic metric of any single method is the root mean square (RMS) of the differences between its classification and the tally, taken over all of the events. By changing the relative weights put on false positives vs. false negatives (i.e., consistency vs. completeness), the algorithm can be customized.
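One way this metric could be computed is sketched below. It assumes the difference is taken between a method's 1/0 call and the tallied probability for each event, and that a positive difference is treated as false-positive-like and a negative one as false-negative-like; the weighting scheme and the parameter names `w_fp` and `w_fn` are assumptions made for illustration.

```python
import math
from typing import Sequence


def weighted_rms(
    mask: Sequence[int],
    tally: Sequence[float],
    w_fp: float = 1.0,
    w_fn: float = 1.0,
) -> float:
    """Root mean square of (call - consensus), with separate FP/FN weights."""
    total = 0.0
    for call, p in zip(mask, tally):
        diff = call - p
        # diff > 0: the method included an event the group doubts.
        # diff < 0: the method excluded an event the group favors.
        weight = w_fp if diff > 0 else w_fn
        total += weight * diff * diff
    return math.sqrt(total / len(mask))


tally_values = [1.0, 0.5, 0.0]
tight = [1, 0, 0]
wide = [1, 1, 0]

rms_tight = weighted_rms(tight, tally_values)
rms_wide = weighted_rms(wide, tally_values)
# Penalize over-inclusion twice as heavily as omission.
rms_wide_strict = weighted_rms(wide, tally_values, w_fp=2.0)
rms_tight_strict = weighted_rms(tight, tally_values, w_fp=2.0)
```

With equal weights the two gates score identically, since each differs from the consensus by 0.5 on one event. Raising `w_fp` increases only the Wide gate's score, which is the kind of tuning toward consistency the paragraph describes.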
Sorting and filtering the events by P, the probability that the event is in a class, displays interesting new information about cytometry analyses:

Figure 3: Visualizing SIV data sets by areas of disagreement in analysis.
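Sorting and filtering by P might look like the following sketch. It assumes the tally is a list of probabilities indexed by event number; the cutoffs `low` and `high` are arbitrary illustrative values, not thresholds used in the study.

```python
from typing import Dict, List, Sequence


def split_by_agreement(
    p_values: Sequence[float], low: float = 0.1, high: float = 0.9
) -> Dict[str, List[int]]:
    """Group event numbers into consensus-in, consensus-out, and disputed."""
    groups: Dict[str, List[int]] = {"in": [], "out": [], "disputed": []}
    for event_number, p in enumerate(p_values, start=1):
        if p >= high:
            groups["in"].append(event_number)
        elif p <= low:
            groups["out"].append(event_number)
        else:
            groups["disputed"].append(event_number)
    return groups


def most_disputed_first(p_values: Sequence[float]) -> List[int]:
    """Event numbers ordered by closeness of P to 0.5 (maximum disagreement)."""
    return sorted(
        range(1, len(p_values) + 1),
        key=lambda n: abs(p_values[n - 1] - 0.5),
    )


p_values = [1.0, 0.5, 0.0, 0.75]
groups = split_by_agreement(p_values)
ordering = most_disputed_first(p_values)
```

The "in" and "out" groups are candidates for a consensus training set, while the "disputed" group identifies the events where the analyses disagree.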
Infrastructure: Project Management
The design and methods presented here are abstracted from the FlowDx project plan.
The consensus framework for best practices in software engineering is known as the Rational Unified Process (RUP)[61]. The value of RUP is in its elaboration of all the fundamental principles and disciplines of project management. Although the RUP is actually one implementation marketed by IBM, it has many variants. We use it here to refer to the terminology and general family of frameworks that derive from it. OpenUP, AgileUP, EssentialUP, Oracle Unified Method, etc. are variations on the concepts developed by NASA for the space program, downscaled or refocused to match the project scale. RUP is a big mechanical behemoth, but the concepts and terms are applicable for small, organic projects, too. For the most part, the top-level documents are quite relevant and are used herein, but AgileUP[65] is the closest to our interpretation, which downgrades the importance of tracking small-grained tasks.
The biggest risk of an RUP project is being overwhelmed by the administrative requirements. RUP recognizes this complexity and prescribes that the first step of the project management should be to edit down the requirements list within the context of the goals of the project. Nevertheless it is often tempting to include artifacts from its library as the answer to each risk uncovered. As of Nov. 2009, after cutting the scope of documents we have over 50 documents that make up the project plan, with an additional 18 in the SBIR Grant Application.
The advantage of RUP is in well-designed milestones, processes, and work products. It provides boilerplate structure for a host of reports and checklists. It requires formal review of problem resolution and risk mitigation, and iterative development.
Infrastructure: Database
The pipeline to process files from their original data, through analysis and meta-analysis, is simple. Initially anticipated to use a simple file-system-based storage, the project has grown in scope to need a database at the center of its architecture. Support of external classifiers and our workflow necessitates three different intermediate states, as classifications are processed. Comparing classifiers introduces combinatorial explosion. Comparing metrics adds another dimension to the scoring matrix. An automated repository is essential to run large-scale experiments.
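The growth in the number of scores can be estimated with simple arithmetic. The sketch below assumes every pair of classifiers is compared under every metric on every file; the counts used are hypothetical.

```python
from math import comb


def comparison_count(n_classifiers: int, n_metrics: int, n_files: int) -> int:
    """Number of scores when every classifier pair is scored on every file."""
    return comb(n_classifiers, 2) * n_metrics * n_files


# Hypothetical experiment sizes.
small = comparison_count(n_classifiers=10, n_metrics=4, n_files=100)
large = comparison_count(n_classifiers=20, n_metrics=4, n_files=100)
```

Under these assumptions, doubling the number of classifiers from 10 to 20 raises the number of scores from 18,000 to 76,000, which is why manual bookkeeping does not scale.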
Requirements
A: Storage of raw data files, workspaces, reports and artifacts related to analysis.
B: Database associations of files to specific experiments, experts, and time series.
C: Ability to generate reports and statistics on experiments, by logical query.
D: Administrative interface for the administration of experiments and data files.
E: Client neutrality. Usable from a web applet, a plugin in FlowJo, or from R.
Design
The design pattern used in our repository is based on the common LAMP solution stack, using Linux, Apache, MySQL, and PHP in a standard combination. We add Tomcat as a wrapper around FlowJo, our existing cytometry analysis software. This allows us to run analysis on a server in a scripted environment.
A: The MySQL relational database management system (RDBMS) [66] was selected for the secure storage of project data. It is a well-supported, industry-standard database solution.
B: The Apache Web server[67] in concert with the Tomcat Application Server[68] was selected for application hosting. Apache / Tomcat are common technologies that are very well-suited to the needs of reliable, high-performance data-centric applications.
C: The FlowJo Engine has been implemented as a TCP server application [69]. Multiple engine instances run on numerous servers, providing strong scalability and reliability.
D: Open scripting languages, such as Ruby, PHP, or Perl, are used for managing access to engine, analysis results, data collation, etc. The Java language [70] and Eclipse IDE [76] are used in tool creation, and tools are wrapped in Tomcat.
E: The system is designed around Linux[71] servers, but most Unix flavors (Solaris, BSD, OSX), as well as Microsoft platforms, are supported by all of these tools.
First Implementation is found at: http://flowdx.com/flowdx.sql
Database Schematics

Figure 4: Diagram of different analysis agents working with a common data set. Increased interest in this classification problem by bioinformaticians has opened the architecture.
Infrastructure: Repository
There is much more information in the study than the tables in a database. We are generating dozens of procedural documents, requirements, and specifications, and that trend will continue to increase as validation documentation and end-user documentation are created. IBM’s tools for the task are too complicated and expensive for a project this size, and content management solutions are ubiquitous.
Blogs, Wikis, & GoogleDocs
These contemporary media are very useful for distributing information, for soliciting feedback over the Internet, and for building collaborative documents. Cloud computing is a strong trend in software, and we’ve found these tools easy to adopt but hard to master. Rational tools have become web-based, so there is very little difference between a high-end project browser and a free blogging account.
These are important resources to the FlowDx project:
Treestar.typepad.com – a blog keeps time-stamped meeting notes and drafts.
FlowDx.com – access to project plan, all documents and data.
Rup.hops-fp6.org – Worksheets, glossary and requirements for software engineering.
FlowJo
A major influence in our design of FlowDx is the library of functionality we have available in our existing commercial software, FlowJo. By creating a command line interface to our normally interactive application, we are able to execute predefined compensation scripts, gates, statistics, and reports from a shell command. This implies that the server-based utilities can execute any number of steps in a FlowJo workspace, without human intervention. We are able to compensate and export calculated parameters to other classifiers, releasing them from the matrix derivations and complicated transformations used in cytometry.
Remote Analysis Support
The project architecture that was intended to talk to remote servers will also provide access to remote humans. Through some very simple XML extensions we are able to extend FlowJo workspaces to include return address information. This allows us to send templates out to labs by e-mail, have them analyze files that reside on our server, and receive a copy of their analysis, when they save the analysis. Not only is this a benefit to collaborators on this phase of the project, but it has great implications for testing, training, validation, collaboration, and support.
This connection has overcome the primary obstacle encountered in early tests, where population naming conventions were not enforced and users labeled populations arbitrarily, instead of following the SOP. With this extension to our software, the user cannot save the analysis until all required populations are defined.
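The save-time check described here can be pictured with the sketch below. It is not FlowJo's implementation; the required population names and both function names are hypothetical, and exact name matching is assumed because the SOP prescribes the names.

```python
from typing import Iterable, List

# Hypothetical list of populations an SOP might require.
REQUIRED = ["Lymphocytes", "CD4+", "CD8+"]


def missing_populations(required: Iterable[str], defined: Iterable[str]) -> List[str]:
    """Required population names that the analyst has not yet defined."""
    defined_set = set(defined)
    return [name for name in required if name not in defined_set]


def can_save(required: Iterable[str], defined: Iterable[str]) -> bool:
    """Allow saving only when every required population is present."""
    return not missing_populations(required, defined)


incomplete = missing_populations(REQUIRED, ["Lymphocytes", "my cd4 cells"])
complete_ok = can_save(REQUIRED, ["CD8+", "CD4+", "Lymphocytes"])
```

An arbitrarily labeled population such as "my cd4 cells" does not satisfy the requirement, so the analysis would be reported as incomplete until the prescribed names are used.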
Summary: Issues, Changes, and Future Work
Much of this preliminary work was devoted to learning the Rational Unified Process, administrative planning of the project, and some very laborious manual analyses in Microsoft Excel. The need for a smart repository was discovered during the first grant phase. The design and implementation of this tool has postponed the large-scale research and analysis; however, it has led to a modular structure that allows new algorithms and comparison metrics to be easily added as desired.
From the synthetic data experiments we observe that all of the automated classifiers that we tested can classify two populations that are discernible to the eye with greater than 99% accuracy. This is an important result for users of FlowDx who would use the software in a high-throughput setting to make many relatively simple classifications automatically and rely on the machine to flag samples with measures outside of the normal. The emergent value of flowsim is to be able to compose synthetic data with distributions modeling real-life data and the ability to add synthetic noise to actual data to model confounding issues with real clinical samples.
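The idea of composing synthetic populations and adding noise can be illustrated in one dimension. The sketch below is not flowsim; it assumes two Gaussian populations with hypothetical means and spreads, and uses a fixed threshold as a stand-in for the automated classifiers that were tested.

```python
import random
from typing import List, Tuple

Event = Tuple[float, int]  # (measured value, true population label)


def make_population(
    rng: random.Random, mean: float, sd: float, n: int, label: int
) -> List[Event]:
    """Draw n events from a Gaussian and tag them with their true label."""
    return [(rng.gauss(mean, sd), label) for _ in range(n)]


def threshold_accuracy(events: List[Event], threshold: float) -> float:
    """Fraction of events that a simple cutoff assigns to the right population."""
    correct = sum(
        1 for value, label in events if (value > threshold) == (label == 1)
    )
    return correct / len(events)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Two well-separated populations, discernible to the eye.
events = make_population(rng, 0.0, 1.0, 1000, 0) + make_population(
    rng, 10.0, 1.0, 1000, 1
)
accuracy = threshold_accuracy(events, threshold=5.0)

# Add synthetic noise to the same events to model a confounded sample.
noisy = [(value + rng.gauss(0.0, 3.0), label) for value, label in events]
noisy_accuracy = threshold_accuracy(noisy, threshold=5.0)
```

Because the true label travels with every synthetic event, accuracy can be measured directly, which is not possible with real samples that lack ground truth.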
The outlying samples resulting from poor preparation or acquisition are crucially important to include in training data sets, but the outlier is often edited out by the well-intentioned operator who recognizes a problem. Therefore, as we continue our efforts, we will be working to obtain use case data and create synthetic data that represent outliers and rejected samples of the quality-checking process in our collaborators' labs. Recognition of bad data by an initial screening analysis is an important step in the workflow.
Automated generation and processing of workspaces using human classification will enable larger scale experiments into the nuances of gating. Functionality within FlowJo has enabled us to prototype different levels of automation in gating, but the procedure does not support R scripts or MATLAB analyses, which we recognize as useful. This tool will incorporate external algorithms in batches.
Since the original grant was proposed, the presence of the R / Bioconductor community has grown significantly. There is considerably more momentum along these lines: The Data Standards Task Force and Dr. Brinkman's work on MIFlowCyte[77], ACS[78], and Gating-ML[79] are posted for public comment, in advance of adoption as ISAC standards. Support in the creation of these standards has become an important focus in this project.