Database/Repository
Project challenges fall into three parts: Clustering - Cluster evaluation - Software Automation.
Clustering research will
continue on the model of Phase I, refining results that connect
clustering algorithms with data domains and protocol demands. We need
to know more about which algorithms produce the best results on a
particular data type. Step one is identifying clustering mechanisms
that can create measurable populations. Step two is tuning the algorithm
to conform its output to consensus data. Step three is comparing the
performance of multiple methods on the same data.
Dedicated Database
As
the project plan grew, it became evident that we need a dedicated
repository for our use case data. We want to be able to execute
algorithms across all data sets, and as our experiment grew in
dimension as well as range, it became imperative to extend the
capability of the data repository. After proposing and investigating
multiple libraries, content management systems and LIMS options, we
found a very good pattern to follow in a protein classification
database from the International Centre for Genetic Engineering and
Biotechnology (ICGEB)17.
We discovered this option while examining the most serious
competitor to FlowDx's innovation, the Bioconductor /
R community's recent expansion from gene array analysis into flow
cytometry. FLAME, flowFlowJo, flowUtils, etc. are exciting new tools
that span the gap between manual and mechanical analysis.
(We have actively been supporting this community by pre-digesting
(compensating) data and exporting it to R datasets. We provide our
analysis software free of charge to bioinformatics researchers because,
historically, our customer is the biologist.)
We have designed a variant of the ICGEB database locally, customized to our data format, algorithmic calculations and cluster comparison formulae.
The interface for configuring a research experiment iteration is shown here.

Figure 1. Configure a Neural Network Training Run (Prior to Pressing 'Calculate')
The researcher selects
training populations, and the chosen classifier learns the classes and
applies them to designated data. The needed outputs are colored plots and
the ability to export a vector of classes with or without the data.
The researcher can choose one of the built-in
items from a list, or select "import" to call a script from another
program, run it in a parallel client, and record in the database/repository a vector of
classes for further analysis.
We will include the ability to export the vector
of classes so that investigators can do further manipulations in R or
MATLAB. When the researcher runs multiple algorithms, he or she can
overlay the results and plot them as a heat map or pseudocolor
plot, where brighter colors represent cells with a higher inclusion rate and dimmer colors
represent cells that are less frequently included.
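As a minimal sketch of the overlay described above, the inclusion rate of each cell can be computed as the fraction of algorithms that place it in a given population. The function name, population labels, and class vectors below are hypothetical illustrations, not part of the software's actual interface. The per-cell rates are the values that a heat map or pseudocolor plot would map to brightness.

```python
def inclusion_rate(class_vectors, population):
    """Return, for each cell, the fraction of algorithms that
    assigned the cell to `population`.

    class_vectors: list of class vectors, one per algorithm, all
                   covering the same cells in the same order.
    """
    if not class_vectors:
        raise ValueError("at least one class vector is required")
    n_algorithms = len(class_vectors)
    n_cells = len(class_vectors[0])
    if any(len(v) != n_cells for v in class_vectors):
        raise ValueError("all class vectors must cover the same cells")
    return [
        sum(1 for v in class_vectors if v[i] == population) / n_algorithms
        for i in range(n_cells)
    ]


# Hypothetical class vectors from three algorithms over five cells.
algorithm_a = ["T", "T", "B", "B", "T"]
algorithm_b = ["T", "B", "B", "B", "T"]
algorithm_c = ["T", "T", "B", "T", "T"]

rates = inclusion_rate([algorithm_a, algorithm_b, algorithm_c], "T")
print(rates)
```

A rate of 1.0 marks a cell that every algorithm included (brightest), and a rate of 0.0 marks a cell that none included (dimmest).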
An important reason to use an automated
classifier is to classify data in more than three dimensions; a
multigraph overview platform is therefore linked to the "M" button. It is possible to track a cell through various
parameters and color it by population consistently. Users can scroll
through, change parameters, and construct a graphical report.

Figure 2. Results Display After Calculation
Figure 2 shows a static picture explaining what the classifier is and what
its inputs and outputs are. Drop-down menus, prepopulated with default
values, allow users to adjust parameters. We also need to allow users to
use fewer than all of the parameters.
Database Schema and example files
One
of the specific aims from Phase I was the implementation of a specific
metric (the Match Ratio) to compare the events between analogous gates.
This was done in order to compare a calculated gate to an expert manual
gate. Analysis of the Phase I data collection did not reassure us
concerning the utility of that metric (see Progress Report). Literature
published since the last phase offers several other cluster comparison
methods, described below, that are worthy of consideration as alternatives.
This
extends the rectangular assessment matrix devised for Phase I (at
right) to a cube, defined as the cross product of each use case by each
algorithm by each metric. We are hesitant to expand the scope
excessively, but there are multiple reasons why we have found this to be
imperative:
GvHD variance was too large. The level of controls used in flow is probably
not sufficient for reliable gating, human or algorithmic.
The V-measure paper (quoted below) raises an interesting discussion of
homogeneity vs. completeness. We want both independently, as well as the
ratio, to evaluate our use cases.
The spreading effect seen in flow data defeats traditional metrics used in clustering. To address this, we expand
the metric into a third dimension to allow more sophisticated
combined metrics.
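To make the second reason concrete, the sketch below computes the two quantities that V-measure combines, homogeneity and completeness, together with their harmonic mean (the V-measure with equal weighting). It uses the standard entropy-based definitions; the function names and example label vectors are hypothetical, and the consensus labels stand in for a Consensus Gate.

```python
import math
from collections import Counter


def _entropy(counts, total):
    """Shannon entropy (natural log) of a list of counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def v_measure(consensus, predicted):
    """Return (homogeneity, completeness, v) for two label vectors."""
    n = len(consensus)
    if n == 0 or n != len(predicted):
        raise ValueError("label vectors must be non-empty and equal length")
    joint = Counter(zip(consensus, predicted))
    class_counts = Counter(consensus)
    cluster_counts = Counter(predicted)

    h_class = _entropy(class_counts.values(), n)
    h_cluster = _entropy(cluster_counts.values(), n)
    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        (m / n) * math.log(m / cluster_counts[k]) for (c, k), m in joint.items()
    )
    h_cluster_given_class = -sum(
        (m / n) * math.log(m / class_counts[c]) for (c, k), m in joint.items()
    )

    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Hypothetical example: every cluster is pure (homogeneous), but each
# consensus population is split across two clusters (incomplete).
consensus = ["a", "a", "b", "b"]
over_split = [0, 1, 2, 3]
h, c, v = v_measure(consensus, over_split)
print(h, c, v)
```

Reporting the two components separately shows why a result scores poorly: here the over-split clustering is fully homogeneous but only half complete.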
Each event in a flow cytometry data set is an indexed record whose
measured parameters give its coordinates in n-dimensional space.
Any group, cluster, or gated
population can therefore be described as having a centroid from which distance
measures can be calculated. Our classifications can be described as
right or wrong with respect to a desired answer (usually the "experts,"
or in our nomenclature, the Consensus Gate) and so all of the following
tools for assessing the goodness of a classification are candidates.
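The centroid and distance calculation mentioned above can be sketched as follows. Euclidean distance is used here as one possible choice of distance measure, which is an assumption; the function names and the three-parameter events are hypothetical.

```python
import math


def centroid(events):
    """Mean position of a population; each event is a tuple of
    measured parameters (one coordinate per parameter)."""
    if not events:
        raise ValueError("a population needs at least one event")
    n = len(events)
    dims = len(events[0])
    return tuple(sum(e[d] for e in events) / n for d in range(dims))


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical gated population of two events with three parameters each.
population = [(1.0, 2.0, 3.0), (3.0, 4.0, 5.0)]
center = centroid(population)
distances = [euclidean(event, center) for event in population]
print(center, distances)
```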
Mallows distance:
Mallows distance is a measure of similarity between two clustering
results. It addresses both whether an item has been correctly
placed in a cluster (when the "right" answer is known; when it is not,
we are assessing how two clustering results
differ) and how important it is that a cell has been classified
properly, based on its distance from the center of each cluster. This
method measures how great a mistake a mislabeling is.
ROC
(Receiver Operating Characteristic): ROC analysis is a tool that can be
used to assess the quality of a classification tool under a number of
conditions, usually when weighing whether sensitivity or specificity
is more important (e.g., is it worse to be told you have cancer when
you don't, or to be told you are fine when you really have
cancer)15. To build one of these, vary the cost of
misclassification from 0 to 1 and, at regular increments, calculate the
"loss": the combination of the cost of each type of error times the rate
of making that error.
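The procedure above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the two error costs are taken to be complementary (a false positive costs c and a false negative costs 1 - c), the "combination" is taken to be a weighted sum, and the function names and example labels are hypothetical.

```python
def error_rates(consensus, predicted, positive):
    """Return (false_positive_rate, false_negative_rate) of `predicted`
    against `consensus` for the class labeled `positive`."""
    fp = fn = n_pos = n_neg = 0
    for truth, guess in zip(consensus, predicted):
        if truth == positive:
            n_pos += 1
            if guess != positive:
                fn += 1
        else:
            n_neg += 1
            if guess == positive:
                fp += 1
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in the consensus")
    return fp / n_neg, fn / n_pos


def loss_curve(fp_rate, fn_rate, steps=10):
    """Loss at regular cost increments from 0 to 1.

    Assumption: cost of a false positive is c, cost of a false
    negative is 1 - c, and loss is their rate-weighted sum.
    """
    curve = []
    for i in range(steps + 1):
        cost_fp = i / steps
        cost_fn = 1.0 - cost_fp
        curve.append((cost_fp, cost_fp * fp_rate + cost_fn * fn_rate))
    return curve


# Hypothetical labels: 1 = cell is in the gate, 0 = cell is outside it.
consensus = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
fp_rate, fn_rate = error_rates(consensus, predicted, positive=1)
curve = loss_curve(fp_rate, fn_rate, steps=4)
for cost, loss in curve:
    print(f"cost of false positive = {cost:.2f}, loss = {loss:.3f}")
```

Plotting loss against cost for several classifiers on the same axes shows which one is preferable at each trade-off between the two kinds of error.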
Goals are to evaluate and test the algorithms described in the Specific Aims as well as new algorithms, as they develop and are published, to determine the best method for each of our use cases.