Probability Bin Clustering

Description of method

Probability Binning Cluster Analysis (PBCA): a method for population identification that combines Chi-squared analysis with an innovative binning strategy. The bins partition a multidimensional space by progressive histogram splitting, which yields a multidimensional decision-tree model. We have a commercial implementation of this published algorithm and have seen promising qualitative results in the field, but we have never conducted the detailed study needed to take the algorithm into an unsupervised environment.
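
A minimal Python sketch of the binning-plus-Chi-squared idea may help make the description concrete. It assumes that "progressive histogram splitting" means recursive median splits along the highest-variance parameter, and that two samples are compared with a normalized Chi-squared-style sum over the resulting bins. Both are assumptions made for illustration, not a description of the commercial implementation. The names `probability_bins`, `bin_counts`, `chi2_distance` and the `min_events` parameter are hypothetical, and the data are synthetic.

```python
import random
from statistics import median, pvariance

def probability_bins(events, min_events=50):
    """Recursively split events (equal-length tuples) at the median of the
    highest-variance dimension. Splitting stops when a subset holds fewer
    than 2 * min_events. Returns a list of (lower, upper) bound lists."""
    dims = len(events[0])
    bins = []

    def split(subset, lo, hi):
        if len(subset) < 2 * min_events:
            bins.append((lo, hi))
            return
        # Assumed rule: split the dimension with the largest variance.
        d = max(range(dims), key=lambda i: pvariance([e[i] for e in subset]))
        cut = median(e[d] for e in subset)
        left = [e for e in subset if e[d] < cut]
        right = [e for e in subset if e[d] >= cut]
        if not left or not right:  # all values tied, cannot split further
            bins.append((lo, hi))
            return
        left_hi = hi[:]
        left_hi[d] = cut
        right_lo = lo[:]
        right_lo[d] = cut
        split(left, lo, left_hi)
        split(right, right_lo, hi)

    split(events, [float("-inf")] * dims, [float("inf")] * dims)
    return bins

def bin_counts(events, bins):
    """Count events per bin; each bin is the half-open box [lower, upper)."""
    counts = [0] * len(bins)
    for e in events:
        for k, (lo, hi) in enumerate(bins):
            if all(l <= x < h for x, l, h in zip(e, lo, hi)):
                counts[k] += 1
                break
    return counts

def chi2_distance(control, test, bins):
    """Normalized Chi-squared-style distance between two samples."""
    c, s = bin_counts(control, bins), bin_counts(test, bins)
    total = 0.0
    for ci, si in zip(c, s):
        pc, ps = ci / len(control), si / len(test)
        if pc + ps > 0:
            total += (pc - ps) ** 2 / (pc + ps)
    return total

# Synthetic two-parameter data, for illustration only.
random.seed(0)
control = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]
shifted = [(x + 2, y) for x, y in control]
bins = probability_bins(control, min_events=100)
```

Because every bin is a rectangular box containing roughly the same number of control events, a small population changes only a few bins slightly. This is consistent with the preference for "major" clusters reported in the results.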

Preliminary Results

The goal was to evaluate a subset of the SIV Use Case Data with Probability Bin Clustering to find these target populations:

  • CD4+ T-cells (activated T-cells)
    • CD4+ IFN subset
    • CD4+ IL2 subset
    • CD4+ TNF subset
  • CD8+ T-cells (antigen specific T-cells)
    • CD8+ IFN subset
    • CD8+ IL2 subset
    • CD8+ TNF subset

Maciej Simm spent time trying to re-create these subsets with FlowJo's probability clustering tool. Finding the precise subsets proved impossible, even with extensive manual guidance, for the following reasons:

  1. Some cytokine clusters are "outliers". FlowJo's binning tool prefers "major" clusters, as magnetic gating does. Maciej was not able to tweak the platform to find a "minor" but "clearly separated" population of cytokine-positive cells breaking away from a major clump of negative cells. [Figure: histogram image]
  2. Sometimes cytokine clusters are absent (negative control). PBCA ignores the context of the tube, so it does not base its analysis on the control and instead analyzes each tube individually. FMO (tethered gating) analysis would do much better here.
  3. Transformation creates funnel-shaped (sometimes split), non-rectangular distributions, whereas PB gates are rectangles. The best example is the CD4 vs. CD8 distribution, where PB fails to pick up double-negatives or double-positives for the PMA tubes and finds only two clusters, each containing a bit of the other phenotype (usually CD4+CD8+ in one cluster and CD4-CD8+ in the other). [Figure: clustering example]
  4. Too many parameters to do in one shot. To get "clean" subsets, several stacked PBCA analyses were required:
    - SSC vs CD3 to find T cells
    - CD4 vs CD8 for sub-types of T cells
    - CD4 or CD8 vs each of the cytokines for "target populations".
  5. When SSC, CD3, CD4 and CD8 were analyzed in one shot, too many "biologically irrelevant" clusters were formed, and time was required to figure out which clusters needed to be merged. This process (and its errors) varied among tubes, so there is no straightforward way to automate it.
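
The stacked analysis in item 4 can be pictured as a chain of filters, each applied to the survivors of the previous one. The Python sketch below replaces each PBCA stage with a placeholder rectangular gate standing in for the cluster chosen at that stage. The function name `apply_stacked_gates`, the gate names, and every threshold value are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, float]
Gate = Callable[[Event], bool]

def apply_stacked_gates(
    events: List[Event], gates: List[Tuple[str, Gate]]
) -> Tuple[List[Event], List[Tuple[str, int]]]:
    """Apply gates in order. Return the surviving events and the number of
    events remaining after each stage."""
    counts = []
    current = events
    for name, gate in gates:
        current = [e for e in current if gate(e)]
        counts.append((name, len(current)))
    return current, counts

# Hypothetical thresholds; real cut-offs depend on instrument and staining.
gates = [
    ("T cells", lambda e: e["SSC"] < 300 and e["CD3"] > 1000),
    ("CD4+ CD8-", lambda e: e["CD4"] > 1000 and e["CD8"] < 200),
    ("IFN+", lambda e: e["IFN"] > 500),
]

# Four synthetic events, one per outcome of interest.
events = [
    {"SSC": 100, "CD3": 2000, "CD4": 1500, "CD8": 50, "IFN": 900},
    {"SSC": 100, "CD3": 2000, "CD4": 1500, "CD8": 50, "IFN": 10},
    {"SSC": 100, "CD3": 2000, "CD4": 20, "CD8": 1800, "IFN": 900},
    {"SSC": 800, "CD3": 30, "CD4": 10, "CD8": 10, "IFN": 5},
]

survivors, counts = apply_stacked_gates(events, gates)
```

Each stage sees only two or three parameters at a time, which is what kept the number of clusters manageable in the manual analysis.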

Suggested enhancements to improve PBCA's utility for FlowDx

  1. Allow non-rectangular gate shapes, such as ellipsoids or Gaussian distributions.
  2. Take transformation distortion into consideration: single-positive populations are often split along the "negative" axis.
  3. Have FlowJo recognize which distributions are well suited to PBCA and which are not, perhaps by comparing reagent names in the data's keywords against our "knowledge base" of "reagent distributions".
  4. Keep track of the context of the tube. Negative Control information should be used by PBCA analysis when it clusters the Test Samples and Positive Control.
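
To illustrate the first suggestion, here is a minimal Python sketch of an elliptical gate for two parameters. It fits the mean and covariance of a seed population and accepts events within a fixed Mahalanobis distance. The names `fit_ellipse_gate` and `in_ellipse` and the `radius` default are hypothetical, and this is one possible non-rectangular gate rather than a proposal for how PBCA should implement it.

```python
def fit_ellipse_gate(points):
    """Fit a 2D ellipse to (x, y) points. Returns the centre and the three
    distinct entries of the inverse covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("points are degenerate; covariance is not invertible")
    return (mx, my), (syy / det, -sxy / det, sxx / det)

def in_ellipse(point, gate, radius=2.0):
    """True if point lies within `radius` Mahalanobis units of the centre."""
    (mx, my), (a, b, c) = gate
    dx, dy = point[0] - mx, point[1] - my
    d2 = a * dx * dx + 2 * b * dx * dy + c * dy * dy
    return d2 <= radius ** 2

# Synthetic population stretched along the diagonal.
population = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4),
              (1, 2), (2, 1), (2, 3), (3, 2)]
gate = fit_ellipse_gate(population)
```

In this example the point (4, 0) lies inside the population's bounding rectangle but outside the ellipse, which is the kind of event a rectangular gate would wrongly include in a tilted distribution.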

Discussion and Future Work

Preliminary results showed clear limitations in some of the unsupervised classifiers, especially their tendency to favor large clusters over small ones. We cannot currently constrain the algorithms to specific subpopulation calculations in any automated way, although the magnetic gating report makes the need for this clear. Ultimately the solution is to build more of the biological model into the constraints, but this has not been pursued yet.

We still feel that unsupervised classifiers have potential applications in this study, but there needs to be a level of indirection so that small clusters are defined by relative position to the control samples.
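
One simple form of such indirection can be sketched in Python as follows: a positivity threshold for a cytokine channel is taken from an upper percentile of the negative control, and the test sample is scored relative to it. The choice of a percentile rule, the default `q=99.9`, and the function names are assumptions made for illustration and are not part of PBCA.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of values, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def positive_fraction(values, threshold):
    """Fraction of values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)

def control_relative_frequency(negative_control, test_sample, q=99.9):
    """Set the threshold from the negative control, then return the
    threshold and the background-subtracted positive fraction of the
    test sample (never below zero)."""
    threshold = percentile(negative_control, q)
    background = positive_fraction(negative_control, threshold)
    observed = positive_fraction(test_sample, threshold)
    return threshold, max(0.0, observed - background)

# Synthetic intensities: a dim control and a test tube with a small bright subset.
negative_control = list(range(1000))
test_sample = [0] * 900 + [5000] * 100
threshold, net_fraction = control_relative_frequency(
    negative_control, test_sample, q=99
)
```

Because the threshold comes from the control rather than from the test tube's own density, a small positive population is detected even when it would be too minor to form its own bin.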