Database/Repository

Project challenges fall into three parts: clustering, cluster evaluation, and software automation.
Clustering research will continue on the model of Phase I: we will continue to refine results connecting clustering algorithms with data domains and protocol demands. We need to know more about which algorithms produce the best results on a particular data type. Step one is to identify clustering mechanisms that can create measurable populations. Step two is to tune the algorithm so that its output conforms to consensus data. Step three is to compare the performance of multiple methods on the same data.

Dedicated Database
As the project plan grew, it became evident that we need a specialized repository for our use case data. We want to be able to execute algorithms across all data sets, and as our experiment grew in dimension as well as range, it became imperative to extend the capability of the data repository. After proposing and investigating multiple libraries, content management systems and LIMS options, we found a very good pattern to follow in a protein classification database from the International Centre for Genetic Engineering and Biotechnology (ICGEB)17.
We discovered this option by looking at the most serious competitor to FlowDx's innovation, the Bioconductor / R community's recent expansion from gene array analysis into flow cytometry. FLAME, flowFlowJo, flowUtils, etc. are exciting new tools that span the gap between manual and mechanical analysis. We have actively been supporting this community by pre-digesting (compensating) data and exporting it to R datasets. We provide our analysis software free of charge to bioinformatics researchers because, historically, our customer is the biologist.
We have designed a variant of the ICGEB database locally, customized to our data format, algorithmic calculations and cluster comparison formulae.

The interface for configuring a research experiment iteration is shown here.


Figure 1. Configure a Neural Network Training Run (Prior to Pressing 'Calculate')


The researcher selects training populations; the chosen classifier learns the classes and applies them to designated data. The required outputs are colored plots and the ability to export a vector of classes, with or without the data.

The researcher can choose one of the built-in items from a list, or select "import" to call a script from another program, run it in a parallel client, and record the resulting vector of classes in the database/repository for further analysis.

We will include the ability to export the vector of classes so that investigators can do further manipulations in R or MATLAB. When the researcher runs multiple algorithms, he or she can overlay the results and plot them as a heat map or pseudocolor plot, where brighter colors represent a higher inclusion rate and dimmer colors mark cells that are less frequently included.
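The inclusion rate behind such an overlay can be sketched as follows. This is a minimal illustration, assuming each algorithm exports one class label per cell in the same cell order; the function name, the sample labels, and the three algorithm names are hypothetical and not part of the software described here.

```python
def inclusion_rate(class_vectors, target):
    """For each cell, return the fraction of algorithms that put it in `target`.

    class_vectors: list of class vectors, one per algorithm, all in the
    same cell order (an assumption of this sketch).
    """
    if not class_vectors:
        raise ValueError("need at least one class vector")
    n_cells = len(class_vectors[0])
    if any(len(v) != n_cells for v in class_vectors):
        raise ValueError("all class vectors must cover the same cells")
    n_algorithms = len(class_vectors)
    return [
        sum(1 for v in class_vectors if v[i] == target) / n_algorithms
        for i in range(n_cells)
    ]


# Hypothetical class vectors from three algorithms over five cells.
kmeans = ["T", "T", "B", "B", "T"]
neural_net = ["T", "B", "B", "B", "T"]
mixture = ["T", "T", "T", "B", "T"]

rates = inclusion_rate([kmeans, neural_net, mixture], target="T")
```

Each rate lies between 0 and 1 and could be mapped directly to color brightness in the plot, so a cell that every algorithm includes is drawn brightest.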

An important reason to use an automated classifier is to classify data in more than three dimensions; therefore the multigraph overview platform is linked to the "M" button. It is possible to track a cell through various parameters and color it by population consistently. Users can scroll through, change parameters, and construct a graphical report.

Figure 2. Results Display After Calculation


Figure 2 shows a static picture explaining what the classifier is and what its inputs and outputs are. Drop-down menus allow users to adjust parameters, which are prepopulated with default values. We also need to allow users to use fewer than all of the parameters.

Database Schema and example files






One of the specific aims from Phase I was the implementation of a specific metric (the Match Ratio) to compare the events between analogous gates. This was done in order to compare a calculated gate to an expert manual gate. Analysis of the Phase I data collection did not reassure us concerning the utility of that metric (see Progress Report). Literature published since the last phase offers several other cluster comparison methods, described below, that are worthy of consideration as alternatives.
This expands the rectangular assessment matrix devised for Phase I (at right) into a cube, defined as the cross product of each use case by each algorithm by each metric. We are hesitant to expand the scope excessively, but there are multiple reasons why we have found this to be imperative:
GvHD variance was too large. The level of controls used in flow is probably not sufficient for reliable gating, human or algorithmic.
The V-measure paper (quoted below) raises an interesting discussion of homogeneity vs. completeness. We want both independently, as well as their ratio, to evaluate our use cases.
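As published, the V-measure is the harmonic mean of two entropy-based scores, homogeneity and completeness. The sketch below computes all three from a reference labeling (for us, the consensus gate) and a calculated labeling. It is a minimal illustration under that definition, not our implementation; the function names and the sample labels are hypothetical.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """H(a | b) for two equal-length label vectors."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(
        (n_ab / n) * math.log(n_ab / b_counts[b_val])
        for (_, b_val), n_ab in joint.items()
    )


def v_measure(reference, predicted):
    """Return (homogeneity, completeness, V) with equal weighting."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("label vectors must be non-empty and equal length")
    h_ref = _entropy(reference)
    h_pred = _entropy(predicted)
    # By convention a score is 1 when the corresponding entropy is zero.
    homogeneity = (
        1.0 if h_ref == 0
        else 1.0 - _conditional_entropy(reference, predicted) / h_ref
    )
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - _conditional_entropy(predicted, reference) / h_pred
    )
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Hypothetical example: every cell placed in its own cluster.
consensus = ["A", "A", "B", "B"]
over_split = [0, 1, 2, 3]
h, c, v = v_measure(consensus, over_split)
```

In this example the over-split result is perfectly homogeneous but only half complete, which is exactly the kind of imbalance that reporting the two scores separately, and their ratio, would expose.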

The spreading effect seen in flow data defeats traditional metrics used in clustering. To address this, we expand the assessment into a third dimension, the metric, to allow more sophisticated combined metrics.
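The assessment cube (use case by algorithm by metric) can be represented as a simple keyed structure. The sketch below is only an illustration of the cross product; the use case, algorithm, and metric names other than GvHD, Match Ratio, and V-measure are placeholders, and the scoring function is a hypothetical stub.

```python
from itertools import product


def build_assessment_cube(use_cases, algorithms, metrics, score):
    """Return {(use_case, algorithm, metric): score} for every combination."""
    return {
        (u, a, m): score(u, a, m)
        for u, a, m in product(use_cases, algorithms, metrics)
    }


def placeholder_score(use_case, algorithm, metric):
    """Hypothetical stub: no measurement has been taken yet."""
    return None


cube = build_assessment_cube(
    use_cases=["GvHD", "use case B"],  # second name is a placeholder
    algorithms=["algorithm 1", "algorithm 2", "algorithm 3"],  # placeholders
    metrics=["Match Ratio", "V-measure"],
    score=placeholder_score,
)
```

Keying on the full triple keeps every cell of the cube addressable, so results from a new metric or algorithm can be added without reshaping what is already stored.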

Each data point used in flow cytometry analysis is indexed, and its measured parameters give it a location in n-dimensional space. Any group, cluster, or gated population can therefore be described as having a centroid from which distance measures can be calculated. Our classifications can be described as right or wrong with respect to a desired answer (usually that of the "experts," or in our nomenclature, the Consensus Gate), so all of the following tools for assessing the goodness of a classification are candidates.
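As a minimal sketch of the centroid idea, assuming each event is a tuple of n measured parameters and that Euclidean distance is an acceptable measure (the text does not fix one), a population's centroid and the distance of each event from it could be computed as follows. The function names and sample values are hypothetical.

```python
import math


def centroid(events):
    """Mean position of a population in n-dimensional parameter space."""
    if not events:
        raise ValueError("population is empty")
    n_params = len(events[0])
    return tuple(
        sum(event[d] for event in events) / len(events)
        for d in range(n_params)
    )


def distances_to_centroid(events):
    """Euclidean distance of every event from its population centroid."""
    center = centroid(events)
    return [math.dist(event, center) for event in events]


# Hypothetical gated population with three measured parameters per event.
gate = [(1.0, 2.0, 3.0), (3.0, 2.0, 1.0), (2.0, 5.0, 2.0)]
center = centroid(gate)
spread = distances_to_centroid(gate)
```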

Mallows distance: The Mallows distance is a measure of similarity between two clustering results. It addresses both whether an item has been placed in the correct cluster (when the "right" answer is known; when it is not, we are assessing how two clustering results differ) and how important it is that a cell has been classified properly, based on its distance from each centroid. This method measures how great a mistake a mislabeling is.
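The sketch below is not the Mallows distance itself, which requires solving an optimal matching between the two clustering results. It is a deliberately simpler stand-in that illustrates only the second idea above: weighting each mislabeled cell by how far wrong the label is. The penalty rule (extra distance to the assigned centroid compared with the reference centroid, floored at zero), the function name, and the sample data are all assumptions made for illustration.

```python
import math


def mislabel_costs(events, predicted, reference, centroids):
    """Per-cell cost of a mislabeling, weighted by centroid distances.

    A correctly labeled cell costs 0. A mislabeled cell costs the extra
    distance to the centroid it was assigned to, compared with the
    centroid of its reference class (never below 0).
    """
    costs = []
    for event, pred, ref in zip(events, predicted, reference):
        if pred == ref:
            costs.append(0.0)
            continue
        to_assigned = math.dist(event, centroids[pred])
        to_reference = math.dist(event, centroids[ref])
        costs.append(max(0.0, to_assigned - to_reference))
    return costs


# Hypothetical two-parameter data with two populations.
centroids = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
events = [(0.0, 0.0), (4.0, 0.0), (10.0, 0.0), (9.0, 0.0)]
reference = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "A"]

costs = mislabel_costs(events, predicted, reference, centroids)
mean_cost = sum(costs) / len(costs)
```

Here the second cell sits near the boundary between the two populations and costs little when mislabeled, while the fourth cell sits close to the center of its reference population and costs much more.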


ROC (Receiver Operating Characteristic): ROC analysis is a tool that can be used to assess the quality of a classification tool under a number of conditions, usually when weighing whether sensitivity or specificity is more important (e.g., is it worse to be told you have cancer when you don't, or to be told you are fine when you really have cancer?).15 To build one of these, vary the cost of misclassification from 0 to 1 and, at regular increments, calculate the "loss": the sum, over each type of error, of the cost of that error times the rate of making it.
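The loss sweep described above can be sketched as follows. This is an illustration under stated assumptions: the two error costs are taken to sum to one (a false positive costs c and a false negative costs 1 - c), the sweep uses evenly spaced steps, and the function names and sample labels are hypothetical.

```python
def error_rates(reference, predicted, positive):
    """Return (false positive rate, false negative rate) for one class."""
    pairs = list(zip(reference, predicted))
    false_pos = sum(1 for r, p in pairs if p == positive and r != positive)
    false_neg = sum(1 for r, p in pairs if p != positive and r == positive)
    positives = sum(1 for r in reference if r == positive)
    negatives = len(reference) - positives
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


def loss_curve(fpr, fnr, steps=10):
    """Loss at evenly spaced false-positive costs from 0 to 1.

    Assumes cost_fp + cost_fn == 1, so a single number sets the trade-off.
    """
    curve = []
    for i in range(steps + 1):
        cost_fp = i / steps
        cost_fn = 1.0 - cost_fp
        curve.append((cost_fp, cost_fp * fpr + cost_fn * fnr))
    return curve


# Hypothetical gate membership for ten cells.
reference = ["in"] * 4 + ["out"] * 6
predicted = ["in", "in", "in", "out", "in", "out", "out", "out", "out", "out"]

fpr, fnr = error_rates(reference, predicted, positive="in")
curve = loss_curve(fpr, fnr)
```

Plotting the loss against the cost for several classifiers shows which one is preferable at each trade-off between the two kinds of error.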

Our goals are to evaluate and test the algorithms described in the Specific Aims, as well as new algorithms as they are developed and published, to determine the best method for each of our use cases.