4.4 Metric Discussion
Comparison metrics measure how well the biological outcomes or diagnostic decisions predicted by the algorithms correlate with those obtained by other methods. Three classes of results are to be evaluated: qualitative outcomes, scalar values, and vector values.
The diagnostic decision is often qualitative, and all decisions that lead to the same treatment are de facto equivalent. The goal of this project is to describe each use case in enough detail to determine which quantitative measures best correlate with a diagnosis, and with what confidence the diagnostic decision can be made. The advantage of quantitative processing is that improvement in the results can be measured over time, and the underlying models and justifications remain available to support the decision.
Tree Star has shown that the results of the metrics agree with each other overall; however, individual metrics may correlate better with different qualitative properties such as robustness, specificity, sensitivity, and accuracy. We expect to explore these issues in the next phase of our project. We do not anticipate Pareto optimality; that is, we will not get every benefit in every case and must decide on priorities.
ROC analysis is the classical approach to the problem, and we could not find reasons why it should not apply here. Personal communication with the Brinkman lab indicates that ROC analysis was not evaluated in their study, and they concur that it is a good direction in which to proceed. Principal Component Analysis is not considered relevant; it seems oversensitive to the debris and clumps that are conventionally removed by scatter gating. The misclassification rate is another metric we have been using; it supports the inclusion of a confusion matrix so that false-positive and false-negative diagnoses can be given different weights. The Finak/Gottardo paper [50] provides an excellent elaboration on the statistics and discusses the "noisy gold standard" that lies at the heart of this project.
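As an illustration of the two measures discussed above, the following minimal Python sketch computes a confusion-matrix tally, the plain and weighted misclassification rates, and the area under the ROC curve. It is a sketch under stated assumptions, not the project's implementation: the function names, the "pos"/"neg" labels, and the weight values are hypothetical, and the expert labels are treated as the (noisy) reference.

```python
from collections import Counter


def confusion_counts(expert, predicted, positive="pos"):
    """Tally TP, FP, FN, TN, treating the expert labels as the reference."""
    counts = Counter()
    for e, p in zip(expert, predicted):
        if p == positive:
            counts["tp" if e == positive else "fp"] += 1
        else:
            counts["fn" if e == positive else "tn"] += 1
    return counts


def misclassification_rate(counts):
    """Fraction of events on which algorithm and expert disagree."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total


def weighted_cost(counts, fp_weight=1.0, fn_weight=1.0):
    """Misclassification rate with separate (hypothetical) error weights."""
    total = sum(counts.values())
    return (fp_weight * counts["fp"] + fn_weight * counts["fn"]) / total


def roc_auc(expert, scores, positive="pos"):
    """Area under the ROC curve, computed as the probability that a random
    positive event scores higher than a random negative one (ties count 1/2)."""
    pos = [s for e, s in zip(expert, scores) if e == positive]
    neg = [s for e, s in zip(expert, scores) if e != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    expert = ["pos", "pos", "neg", "neg", "neg"]
    predicted = ["pos", "neg", "pos", "neg", "neg"]
    scores = [0.9, 0.4, 0.6, 0.2, 0.1]  # hypothetical classifier scores
    counts = confusion_counts(expert, predicted)
    print(misclassification_rate(counts))
    print(weighted_cost(counts, fn_weight=5.0))  # a miss costs 5x a false alarm
    print(roc_auc(expert, scores))
```

Setting both weights to 1 recovers the plain misclassification rate; the appropriate weights would have to come from the use case, since they encode which kind of error matters more for the diagnosis.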
The deficiency of many of these methods within our domain is that some misclassifications are more important than others, and the preferred metrics are expected to vary by use case. We are developing a way to sort events by the ambiguity of their classification. Rare-event analysis is known to confound most classification metrics. Our match ratio shares with V-Measure and Mallows Distance the desirable characteristic of following the experts' lead where that is more appropriate than conventional clustering.
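For readers unfamiliar with V-Measure, the sketch below computes it from the standard definition: the weighted harmonic mean of homogeneity and completeness, each derived from conditional entropy. This is an illustrative Python version only; the function names are our own, the expert gating is assumed to supply the reference classes, and it is not the code used to produce the project's results.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(expert, clusters, beta=1.0):
    """V-Measure of an automated clustering against expert classes.

    homogeneity  = 1 - H(expert | clusters) / H(expert)
    completeness = 1 - H(clusters | expert) / H(clusters)
    V            = (1 + beta) * h * c / (beta * h + c)
    """
    h_expert = _entropy(expert)
    h_clusters = _entropy(clusters)
    homogeneity = (1.0 if h_expert == 0
                   else 1.0 - _conditional_entropy(expert, clusters) / h_expert)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - _conditional_entropy(clusters, expert) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return ((1 + beta) * homogeneity * completeness
            / (beta * homogeneity + completeness))


if __name__ == "__main__":
    expert = ["a", "a", "b", "b"]
    print(v_measure(expert, [1, 1, 0, 0]))  # same partition, labels renamed
    print(v_measure(expert, [0, 1, 0, 1]))  # clusters independent of expert
```

Because the score depends only on how events are partitioned, not on what the clusters are called, the automated populations do not need to be matched to expert gates by name before scoring.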
The research done in the elaboration phase has revealed a small handful of additional method options. The Aghaeepour/Brinkman report [53] covers one specific analysis of the problem and chooses V-Measure [48] as its single metric, but we do not feel the metric was run across a wide enough range of data sets. Our revised plan adds an extra dimension to the repository reporting so that any number of metrics can be applied automatically to our tally or vote files, allowing experimenters to rerun the same comparisons with their choice of metrics.
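One way to picture the extra reporting dimension is as a registry of interchangeable metric functions applied to the same tally. The Python sketch below is hypothetical throughout: the actual tally and vote file formats are not specified here, so it assumes a CSV tally with "expert" and "algorithm" columns and "pos"/"neg" labels, and the registry and function names are invented for illustration.

```python
import csv
import io

# Registry mapping a metric name to a function of (expert, predicted).
METRICS = {}


def register(name):
    """Decorator that adds a metric function to the registry."""
    def decorator(fn):
        METRICS[name] = fn
        return fn
    return decorator


@register("misclassification_rate")
def misclassification_rate(expert, predicted):
    return sum(e != p for e, p in zip(expert, predicted)) / len(expert)


@register("sensitivity")
def sensitivity(expert, predicted, positive="pos"):
    hits = [p == positive for e, p in zip(expert, predicted) if e == positive]
    if not hits:
        raise ValueError("no positive events in the expert labels")
    return sum(hits) / len(hits)


@register("specificity")
def specificity(expert, predicted, positive="pos"):
    rejects = [p != positive for e, p in zip(expert, predicted) if e != positive]
    if not rejects:
        raise ValueError("no negative events in the expert labels")
    return sum(rejects) / len(rejects)


def load_tally(text):
    """Read expert and algorithm labels from CSV text (assumed column names)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [r["expert"] for r in rows], [r["algorithm"] for r in rows]


def run_metrics(text, names=None):
    """Apply the chosen metrics (default: all registered) to one tally."""
    expert, predicted = load_tally(text)
    chosen = list(METRICS) if names is None else names
    return {name: METRICS[name](expert, predicted) for name in chosen}


if __name__ == "__main__":
    tally = ("event,expert,algorithm\n"
             "1,pos,pos\n2,pos,neg\n3,neg,neg\n4,neg,neg\n5,neg,pos\n")
    for name, value in run_metrics(tally).items():
        print(name, value)
```

Under this arrangement, adding a metric means registering one more function; the stored tallies are untouched, so an experimenter can rerun earlier comparisons with a different selection simply by passing other names.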
Initially, Tree Star's goal was to calculate the Match Ratio to evaluate automated analysis against expert manual analysis. Even in our first set of calculations on the GvHD use case data we saw potential limitations in Match Ratio as the metric, and we therefore decided to incorporate additional metrics in our analysis. Cluster comparison metrics are still being refined, as shown by the interest and participation in meetings such as the "Clustering: Science or Art? Towards Principled Approaches" workshop at the Neural Information Processing Systems (NIPS) conference. We will continue to evaluate new metrics.
Tree Star is encouraged by the level of agreement we are seeing among the metrics and will continue to apply them broadly over the scope of our use cases to gain as much information as we can about the characteristics of the various measures, in anticipation of mapping the characteristics to the preferences of the client.