2.4 Cluster Comparison Metrics Methods

3.4 Metrics Results
4.4 Metrics Discussion

We started the Elaboration Phase of the project with the Match Ratio comparison metric and have since expanded this part of the project to describe, evaluate, and implement five metrics: Match Ratio, Mallows Distance[49], V-Measure[48], Misclassification[50], and ROC[15]. A modular structure will make it straightforward to add newly described comparison metrics.

Expert technicians produce different results when they set out to draw the same gate on the same data plot. Each data point of a target cluster is indexed, and each measured parameter corresponds to a coordinate in n-dimensional space, so any group, cluster, or gated population can be described as having a centroid from which distance measures can be calculated. Our classifications can be described as right or wrong with respect to a desired answer (usually the experts'), so all of these tools for assessing the goodness of a classification can be used. Our initial goal for the Elaboration Phase was to research alternatives to the Match Ratio, which was identified in the Inception Phase of the project. We identified the other methods described below and ran the first meta-analysis of three different metrics on one population gated by eight different people.

Match Ratio: Match Ratio is a comparison between a single classification result and the consensus result of a group. The closer the Match Ratio is to 1.0, the closer an individual's gated population is to the consensus. Cells in the manually classified samples are weighted by how consistently the persons forming the consensus included or excluded them. The sum of the weights of all the cells in the sample is the total possible score. The consensus is similar to a probabilistic cluster. For a single classification act, each cell in a population is compared to the consensus probability of inclusion. If the individual's gate and the consensus gate agree on a cell, the weight of that cell is added to the individual's accumulating score. The sum of these weights is divided by the total possible score of the consensus group to produce the Match Ratio. This method has the added advantage of comparing the automated gate to an accumulated body of expertise rather than to a single “master gate”. Details in PDF format

Match Ratio:
EAI(i) = average of the experts' inclusion of event i
Weight(i) = (distance of EAI(i) from 0.5)^2, i.e. better-resolved events have more weight
SWAE = sum of weights for events where the expert average agreed with the candidate's inclusion / exclusion
Match Ratio = SWAE / sum of weights for all events
Characteristics -> If the candidate agrees with the expert average on every event, MR = 1. For every event missed, MR drops away from 1, with less weight given to events less well resolved by the experts.
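
The Match Ratio definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `match_ratio`, the 0/1 list encoding of gates, and the choice to treat an expert average of exactly 0.5 as "included" (such events carry zero weight, so the choice does not affect the score) are all assumptions.

```python
def match_ratio(expert_votes, candidate):
    """Match Ratio of one candidate gate against an expert consensus.

    expert_votes: list of per-expert lists; expert_votes[e][i] is 1 if
                  expert e included event i, else 0 (assumed encoding).
    candidate:    list with candidate[i] = 1 if the candidate gate
                  included event i, else 0.
    """
    n_experts = len(expert_votes)
    agreed_weight = 0.0   # SWAE in the text
    total_weight = 0.0    # sum of weights for all events
    for i, cand in enumerate(candidate):
        eai = sum(votes[i] for votes in expert_votes) / n_experts
        weight = (eai - 0.5) ** 2          # better-resolved events weigh more
        expert_included = eai >= 0.5       # consensus inclusion / exclusion
        total_weight += weight
        if bool(cand) == expert_included:
            agreed_weight += weight
    if total_weight == 0:
        # Every event was split exactly 50/50, so the ratio is undefined.
        return float("nan")
    return agreed_weight / total_weight


# Hypothetical data: four experts gating four events.
experts = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
print(match_ratio(experts, [1, 1, 0, 0]))  # agrees with consensus everywhere
print(match_ratio(experts, [1, 0, 0, 0]))  # misses a weakly resolved event
print(match_ratio(experts, [0, 1, 0, 0]))  # misses a unanimous event
```

Missing the unanimous event costs more than missing the event on which the experts were split 3 to 1, which is the weighting behaviour described under Characteristics.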

Mallows Distance[49]: Mallows Distance is a measure of similarity between two clustering results. It addresses two questions: whether an item has been placed in the correct cluster (when the "right" answer is known; when it is not, we are assessing how much two clustering results differ), and how important it is that a cell has been classified properly, based on its distance from the center or from each centroid. A "cluster" is a group of cells either identified by an algorithm as being of the same type or phenotype, or identified by a human with a polygon gate. A New Mallows Distance Based Metric For Comparing Clusterings - Ding Zhou 2005 ACM New York, NY, USA

Mallows Distance:
p(i) = probability of being either excluded or included, i.e.
if experts' average >= 0.5 -> experts' average
else -> 1 - experts' average.
cost(i) = 1 for disagreement with the experts, 0 otherwise.
Mallows distance = sum over all events of p(i) * cost(i)
Characteristics -> similar to Match Ratio, just not normalized.
Only events where the candidate disagrees with the experts' average inclusion / exclusion contribute to the sum (cost(i) = 0 otherwise), and an event weighs more the better it is resolved by the experts.
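
As a rough sketch, the simplified two-cluster (included / excluded) form of the distance written out above could be computed as follows. The function name `mallows_distance` and the 0/1 list encoding are assumptions made for illustration; this follows the formula in this section, not the general multi-cluster metric from the cited paper.

```python
def mallows_distance(expert_votes, candidate):
    """Unnormalized disagreement score between a candidate gate and experts.

    expert_votes: list of per-expert lists of 0/1 inclusion flags
                  (assumed encoding).
    candidate:    list of 0/1 inclusion flags for the candidate gate.
    """
    n_experts = len(expert_votes)
    distance = 0.0
    for i, cand in enumerate(candidate):
        avg = sum(votes[i] for votes in expert_votes) / n_experts
        expert_included = avg >= 0.5
        # p(i): how strongly the experts back their majority decision.
        p = avg if expert_included else 1.0 - avg
        # cost(i): 1 when the candidate disagrees with the experts, else 0.
        cost = 1.0 if bool(cand) != expert_included else 0.0
        distance += p * cost
    return distance


# Hypothetical data: four experts gating four events.
experts = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
print(mallows_distance(experts, [1, 1, 0, 0]))  # no disagreement
print(mallows_distance(experts, [1, 0, 0, 0]))  # one weakly resolved miss
```

Because the sum is not divided by a total possible score, the value grows with the number of events, so it is only comparable between gates drawn on the same sample.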

V-Measure[48]: V-Measure is a metric for evaluating how well a classification algorithm did in comparison to a known correct answer. Most assessments of clustering measure either completeness (did you put all the cells of the same type into one cluster) or homogeneity (did you make clusters that contain only one type of cell). Optimizing for one alone pushes algorithms toward results with few clusters to maximize completeness, or many clusters to maximize homogeneity, ignoring what the "best" result is. V-Measure combines homogeneity and completeness into a single score, their harmonic mean.

V-Measure compares a target clustering (e.g., a manually annotated representative subset of the available data) against an automatically generated clustering to determine how similar the two are. V-Measure is based upon two criteria for clustering usefulness, homogeneity and completeness, which capture a clustering solution’s success in including all and only datapoints from a given class in a given cluster.

  1. It evaluates a clustering solution independent of the clustering algorithm, size of the data set, number of classes and number of clusters.
  2. It does not require its user to map each cluster to a class. Therefore, it only evaluates the quality of the clustering, not a post-hoc class-cluster mapping.
  3. It evaluates the clustering of every data point, avoiding the “problem of matching.”
  4. By evaluating the criteria of both homogeneity and completeness, V-Measure is more comprehensive than those that evaluate only one.
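
To make the two criteria concrete, here is a minimal pure-Python sketch of the entropy-based computation, following the definitions in Rosenberg and Hirschberg (2007) as we understand them: homogeneity h = 1 - H(C|K)/H(C), completeness c = 1 - H(K|C)/H(K), and V = (1 + beta) * h * c / (beta * h + c), which reduces to the harmonic mean at beta = 1. The function and helper names are illustrative assumptions, not part of the project code.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(classes, clusters, beta=1.0):
    """V-Measure of a clustering against reference classes.

    classes:  reference label per event (e.g. from manual gates).
    clusters: cluster label per event from the algorithm under test.
    beta:     > 1 weights completeness more, < 1 weights homogeneity more.
    """
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    # By convention a score is 1 when its normalizing entropy is zero.
    homogeneity = 1.0 if h_classes == 0 else (
        1.0 - _conditional_entropy(classes, clusters) / h_classes)
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - _conditional_entropy(clusters, classes) / h_clusters)
    denom = beta * homogeneity + completeness
    if denom == 0:
        return 0.0
    return (1.0 + beta) * homogeneity * completeness / denom


# Hypothetical labels: cluster ids need not match class ids.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed
print(v_measure([0, 0, 1, 1], [0, 0, 0, 0]))  # everything in one cluster
```

The first call illustrates point 2 in the list above: no class-to-cluster mapping is needed, because only the partition matters. The second shows that lumping everything into one cluster is perfectly complete but not homogeneous, so the combined score collapses.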

"V-Measure is an entropy-based measure which explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied. V-Measure is computed as the harmonic mean of distinct homogeneity and completeness scores, just as precision and recall are commonly combined into F-Measure (Van Rijsbergen, 1979). As F-Measure scores can be weighted, V-Measure can be weighted to favor the contributions of homogeneity or completeness." rosenberg_hirschberg_07b.pdf

ROC [15]: ROC (receiver operating characteristic) analysis is a tool for assessing the quality of a classifier under a range of conditions, typically when weighing whether sensitivity or specificity matters more (e.g., is it worse to be told you have cancer when you don't, or to be told you are fine when you really do have cancer?). To build a ROC analysis, vary the cost of misclassification from 0 to 1 and, at regular increments, calculate the "loss": the sum over both error types of the cost of that error times the rate at which it is made.
You can then integrate the area under the resulting curve to rate how good the classifier is across the full spectrum of possible misclassification costs. It is potentially useful to us as a metric for evaluating the efficacy of a classification algorithm in lieu of, or in addition to, the Match Ratio.
ROC analysis: applications to the classification of biological sequences and 3D structures. Paolo Sonego, András Kocsor, and Sándor Pongor. Briefings in Bioinformatics, Advance Access published January 18, 2008
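
For illustration only, the area under a ROC curve can be computed with the common threshold-sweep construction: rank events by a classifier score, trace true-positive rate against false-positive rate as the decision threshold is lowered, and integrate with the trapezoidal rule. Note that this is the standard formulation rather than the cost-sweep described above, and it assumes the classifier emits a continuous score per event; the function name `roc_auc` and the input encoding are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via a threshold sweep.

    scores: classifier score per event (higher = more likely in the gate).
    labels: 1 if the event truly belongs to the population, else 0.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative")

    # Visit events from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Events with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        # Trapezoid between the previous and current curve points.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


# Hypothetical scores for four events, two of them true members.
print(roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

An area of 1.0 means every true member outranks every non-member, 0.5 is what uninformative scores produce, and values in between summarize performance across all thresholds in a single number.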