4.3 Algorithm Discussion

2.3 Algorithm Methods
3.3 Algorithm Results Overview

Each of the four identified algorithms has been run on at least a subset of the data. Preliminary results favor supervised algorithms (ANN, SVM, Magnetic Gating) over unsupervised ones. A 38% reduction in misclassification relative to human gating is a promising (but very early!) result, as detailed in 3.3.3, the ANN report.
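To make a figure of this kind concrete, the sketch below shows one way a reduction in misclassification could be computed. It assumes the figure is a relative reduction measured against some reference labeling; this section does not state how the 38% was derived, so the function names, the labels, and the eight-event example are all hypothetical and are not data from this project.

```python
def misclassification_rate(predicted, reference):
    """Fraction of events whose label differs from the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("no events to compare")
    errors = sum(1 for p, r in zip(predicted, reference) if p != r)
    return errors / len(reference)


def relative_reduction(baseline_rate, new_rate):
    """Reduction of new_rate expressed as a fraction of baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical labels for eight events (illustration only).
reference = ["T", "T", "B", "B", "T", "B", "T", "B"]
human = ["T", "B", "B", "T", "T", "B", "B", "B"]  # differs from reference at 3 events
ann = ["T", "T", "B", "T", "T", "B", "B", "B"]    # differs from reference at 2 events

human_rate = misclassification_rate(human, reference)
ann_rate = misclassification_rate(ann, reference)
reduction = relative_reduction(human_rate, ann_rate)
print(f"human {human_rate:.3f}, ANN {ann_rate:.3f}, relative reduction {reduction:.1%}")
```

A relative figure depends heavily on the baseline rate, so it is worth reporting the two absolute rates alongside it.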

WEKA, the data-mining tool initially proposed, was evaluated and found cumbersome: it limits the event count and has no FCS reader, which complicated the workflow. It has been dropped from the project plan and replaced by R and Matlab. FlowJo has been extended with a command-line interface, which makes it a more powerful resource for this sort of experiment.

We feel that we have made significant progress on the two major problems of supervised classifiers: how to establish the training sets, and how to communicate the "black box" results to the human supervisor. The early implementation has shown us numerous similarities between the metrics used in this project to evaluate human gating and the tools needed to develop training sets for supervised classifiers. There is a strong business model in having the tools to build custom solutions, as discussed in our Commercialization Plan.

Many issues arose with unsupervised classifiers, which tend to choose statistical significance over biological significance and to favor large clusters over small ones. This is best covered in the Probability Cluster progress report. Ultimately the solution is to build more of the biological model into the constraints, but this has not yet been pursued. We do not currently have the ability to constrain algorithms to specific subpopulation calculations in any automated way, but the need for it is made clear in the magnetic gating report. We still feel that unsupervised classifiers have potential applications in this study, but there needs to be a level of indirection so that small clusters are defined by their position relative to the control samples. This requires an additional level of specification: interrelationships between populations that dynamically adjust to the sample. FJML extensions in the Apple Macintosh version of FlowJo have this capability.
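As an illustration of defining a small population by its position relative to a control, the sketch below derives a cutoff on a single channel from the control sample (median plus a multiple of the median absolute deviation) and applies it to a test sample. This is one simple possibility chosen for illustration; it is not the FJML mechanism, and the function names, the factor k, and the intensity values are hypothetical.

```python
from statistics import median


def control_threshold(control_values, k=3.0):
    """Cutoff placed k median absolute deviations above the control median."""
    med = median(control_values)
    mad = median(abs(v - med) for v in control_values)
    return med + k * mad


def gate_relative_to_control(sample_values, control_values, k=3.0):
    """Indices of sample events that lie above the control-derived cutoff."""
    cutoff = control_threshold(control_values, k)
    return [i for i, v in enumerate(sample_values) if v > cutoff]


# Hypothetical single-channel intensities (illustration only).
control = [100, 110, 95, 105, 102, 98, 107]
sample = [101, 240, 99, 113, 260, 115]

cutoff = control_threshold(control)
positive_events = gate_relative_to_control(sample, control)
print(cutoff, positive_events)
```

Because the cutoff is recomputed from each control, the gate moves with the sample instead of sitting at a fixed position, which is the kind of indirection described above.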

Only minor work on the algorithms is required in the near iterations. The experimental infrastructure and the analysis metrics will be developed before there is any need to apply algorithms widely. File format definitions are now established for the intermediate results of each step, so metrics and algorithms can be developed in parallel and kept orthogonal. Larger-scale testing of the algorithms is unproductive until we advance the repository. We are focusing on the infrastructure needed to automate this workflow before we deal with the assessment or configuration of any particular algorithm.

The primary risk of the project, that none of the algorithms will perform suitably, has been reduced by externalizing the algorithmic part of the project. We now have a modular framework that supports any external classification method. All indications are that R is the most crucial platform to support. However, Mt. Sinai Medical Center has a group working on classifiers for PhosFlow data in Fortran, so we can never make assumptions about the tool set. We have widened the circle of collaborators from our original proposal and will continue to do so as new opportunities appear in the literature and at conferences. Creating a pipeline of analysis steps provides intermediate states that serve as test points and as hooks for supporting new classification tools, such as Flame, published by MIT.
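The idea of a pipeline with intermediate states can be sketched as follows. This is a minimal illustration and not the project's framework: the step names, the JSON file layout, and the threshold classifier are hypothetical stand-ins. In practice the classification step would hand its input file to an external tool such as an R script and read back the file that tool writes.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(events, steps, workdir):
    """Run the steps in order, writing each step's output to its own JSON file."""
    workdir = Path(workdir)
    data = events
    written = []
    for index, (name, step) in enumerate(steps):
        data = step(data)
        path = workdir / f"{index:02d}_{name}.json"
        path.write_text(json.dumps(data))  # intermediate state, usable as a test point
        written.append(path)
    return data, written


def scale(events):
    """Hypothetical preprocessing step: divide every channel value by 1000."""
    return [[value / 1000 for value in event] for event in events]


def threshold_classifier(events):
    """Stand-in for an external classifier (R, Matlab, Fortran, ...)."""
    return [
        {"event": event, "label": "pos" if event[0] > 0.5 else "neg"}
        for event in events
    ]


with tempfile.TemporaryDirectory() as tmp:
    result, files = run_pipeline(
        [[200, 30], [900, 45]],  # two hypothetical events with two channels each
        [("scale", scale), ("classify", threshold_classifier)],
        tmp,
    )
    file_names = [path.name for path in files]
    reloaded = json.loads(files[0].read_text())  # inspect the first intermediate state

labels = [item["label"] for item in result]
print(file_names, labels)
```

Since every step communicates only through its output file, a classifier can be replaced without touching the steps around it, and any intermediate file can be compared against an expected result.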