Automated Classifier Report - 09/18/2009

by Dr. John Quinn, Tree Star Application Scientist

Outline
I. Executive Summary
II. Introduction
III. Background
a. Support Vector Machines
b. Artificial Neural Networks
IV. Procedure
a. Synthetic Data
b. SIV data
c. Classification via Support Vector Machines
d. Classification via Artificial Neural Networks
e. Quality assessment
V. Results
a. Synthetic Data
b. SIV Data
c. Quality assessment
VI. Discussion

Executive Summary

A synthetic data set and a data set taken from an SIV study were used to calibrate the performance of automated classifiers and to assess which may be best to include in FlowDx.  Ultimately the experiments may also serve as a demonstration of the usefulness of automated classifiers for future FlowDx users.

Classification of the synthetic data set demonstrated that it is trivial to find an automated classifier that can produce better than 99% accurate results when asked to separate two populations that have distinct peaks in at least one dimension.  It may be useful to have our panel of experts classify the synthetic set and then calibrate match ratios to known good and bad classification results, so that we have a basis for deciding whether a match ratio indicates “good” versus “bad” expert agreement.

Classification of the SIV data set allowed us to identify the best classifiers, and to further demonstrate the ability of automated classifiers to operate on non-trivial, non-linear classification problems.  Four of the classifiers averaged match ratio scores of 0.93 or better, marking themselves as the classifiers of most interest for native inclusion in FlowDx.  Those classifiers were support vector machines with polynomial or radial basis functions, a cascade forward back propagation network, and a radial basis function artificial neural network (RBF ANN).  The RBF ANN may have been the single most successful classifier, as its performance was the best on the most difficult data.

We also looked at alternative methods of evaluating the quality of various classification results and the match ratio metric itself.  At the very least, the metric is useful for identifying which events are difficult to classify.  We used the match ratio to identify those events and then, for each user and each automated classifier, identified the events classified as “positive” and “negative” within the difficult-to-classify subsets and created an MFI-based profile of the events of each class.  We then created a profile of the universally agreed upon “positive” and “negative” events, so that we could check whether, for example, the set of “positives” that a user or automated classifier defined among the difficult-to-classify subset matched the profile of the universally agreed upon “positives”.  In doing this analysis we have shown that the automated classifiers match the patterns most closely.  We were also able to compare match ratio scores to profile closeness scores to see whether they agree.  We found that there wasn’t complete agreement, but the particular data set might influence that.  (THIS IS IN PROGRESS)

Introduction

We have performed a series of experiments using automated classifiers on flow cytometric data for the purpose of:

  • Establishing the performance characteristics of a series of Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs) on data of measurable classification difficulty.
  • Identifying which classifiers are best suited for coding into FlowDx, which can be recommended as options for importing, and which seem ill-suited to the task.
  • Demonstrating the ability of automated classifiers to potential FlowDx users.
  • Gauging whether the match ratio is a good performance metric.

We have accomplished these tasks by creating a series of automated classifiers and using them to classify first a synthetic data set with known distributions and classes, and then a set of data collected in an SIV study with unknown “positive” populations.  In this work, positive is used to mean events that have bound a specific fluorochrome above background levels of fluorescent intensity and are usually identified as being so with a gate.

The synthetic data set was used to establish the performance characteristics of each classifier.  With the ground truth class known for these data points, a correct classification score can be easily calculated and compared to the difficulty of the decision so that we can estimate the quality of the result as a function of the separation between two populations.

The SIV data set then served as the test for data of unknown class, using match ratio as our performance metric.  Both studies then can be used to demonstrate to FlowDx users how well an automated classifier can separate positive events from negative events or how well they can be trained to match expert performance.

Finally, we have evaluated the results further to gauge how good a metric the match ratio is.  At the very least, the match ratio statistic is a good tool for objectively identifying which events are the “difficult” or borderline events.  We have used the match ratio to identify these events and have evaluated how the automated classifiers performed on them compared to the expert users.

Background

a. Support Vector Machines
Support Vector Machines (SVMs) are deterministic algorithms designed to identify vectors within training data that define the boundaries between classes of data.  Non-linear boundary problems are addressed using support vector machines by including a basis function that maps the input data into a transformed space that allows a linear discriminant to separate the classes.

In this experiment we have chosen to use SVMs with polynomial basis functions (PBF), the most common SVM, and with a radial basis function (RBF), a choice that seems appropriate for flow cytometric data, which is generally distributed in a Gaussian or radial manner.
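As a rough illustration of the two basis functions named above, the sketch below evaluates a polynomial kernel and a radial basis kernel on pairs of events. The function names and the parameters `degree`, `coef0`, and `gamma` are assumptions chosen for illustration; they are not the settings used in this study.

```python
import math

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial basis function: (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical two-parameter events on a unit scale.
event_a = [0.2, 0.8]
event_b = [0.2, 0.8]
event_c = [0.9, 0.1]

same = rbf_kernel(event_a, event_b)  # identical events give the maximum response
far = rbf_kernel(event_a, event_c)   # distant events give a smaller response
```

The radial kernel depends only on the distance between two events, which is why it seems a natural fit for populations that cluster around a centroid.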

b. Artificial Neural Networks
An Artificial Neural Network (ANN) is a weighted directed graph where the nodes are artificial neurons, and weighted directed edges connect neuron outputs with neuron inputs.  For this experiment we considered a traditional feed forward network, a feed forward radial basis function network, a competitive network, a probabilistic network, and a cascade forward network.  Each network is explained in some detail below.

Feed Forward Network
The feed forward network with back propagation of error as the learning mode is the most common ANN.  We have used one that consists of three layers: an input layer, a hidden layer, and an output layer.  The input layer accepts the data as one parameter per node, weights the inputs based on learned importance, and then passes the result to the hidden layer.  The hidden layer passes the data through a transfer function that maps the data to a space where it is linearly separable.  The transfer function in this case is sigmoidal.  The weighted connections between the hidden layer and the output layer, along with the output layer’s transfer function, then act as a linear classifier.  The transfer function in the output layer is a unit step, or threshold function.  During training the network back propagates error, and adjusts the weights and settings of the transfer functions to improve accuracy on the training data set.
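The forward pass of such a three-layer network can be sketched as follows. This is a minimal illustration only: the weights and biases are hypothetical hand-picked values rather than trained ones, and the back propagation step is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(event, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: sigmoidal hidden layer, unit-step output layer."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, event)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    activation = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    # Unit step (threshold) transfer function forces a positive/negative decision.
    return 1 if activation >= 0.0 else 0

# Hypothetical weights: the output fires only when both inputs are high.
hidden_weights = [[10.0, 0.0], [0.0, 10.0]]
hidden_biases = [-5.0, -5.0]
output_weights = [1.0, 1.0]
output_bias = -1.5

double_positive = forward([0.9, 0.9], hidden_weights, hidden_biases, output_weights, output_bias)
single_positive = forward([0.1, 0.9], hidden_weights, hidden_biases, output_weights, output_bias)
```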

Radial Basis Function network
Radial basis functions (RBFs) are transforms that can be used for interpolation and smoothing based on distance with respect to a centroid.  RBFs can be incorporated as a transfer function into a feed forward format to produce an ANN suited to learn the structure of data with a radial distribution.  The implemented radial basis function networks have three layers: an input layer, a hidden layer with radial basis transfer functions, and a linear output layer.  The training regime for RBF networks is one pass interpolation of the entire decision space based on the training data, and allows the RBF network to classify the training data without error.

Competitive network (Learning Vector Quantization network)
The learning vector quantization (LVQ) network is a hybrid of unsupervised ANN learning (Kohonen’s self-organizing map) and a feed-forward network.  It consists of three layers, an input layer, a competitive layer, and an output layer.  Once data points are input, the hidden layer neurons “compete” for them, meaning that the neuron that is closest by the chosen distance measure (Euclidean in our case) to the input is declared the winner.  The neuron is then adjusted slightly to be closer to the input.  As more data points are introduced they essentially cluster to nodes that have adjusted to best represent that particular class of input.  In the hidden layer there can be many classes of data.  In the output layer the clusters are then grouped and matched to the desired outputs in a supervised manner.
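The competitive step described above can be sketched in a few lines. This is a simplified illustration, assuming Euclidean distance and a hypothetical `learning_rate`; it shows only the winner selection and adjustment, not the supervised grouping performed in the output layer.

```python
import math

def competitive_step(prototypes, event, learning_rate=0.1):
    """Find the winning neuron and nudge it toward the input event."""
    distances = [math.dist(p, event) for p in prototypes]
    winner = distances.index(min(distances))
    prototypes[winner] = [
        p + learning_rate * (x - p) for p, x in zip(prototypes[winner], event)
    ]
    return winner

# Two hypothetical competitive neurons in a two-parameter space.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
winner = competitive_step(prototypes, [0.8, 0.9])
```

Here the second neuron wins and moves one tenth of the way toward the event, while the losing neuron is left unchanged.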

Probabilistic network
Probabilistic neural networks are another type of network that maps the data space using RBFs.  Our implementations have three layers: an input layer, a hidden layer using RBFs, and a competitive layer as the output.  The input and hidden layers function just as in the radial basis function network.  The difference is the output layer, in which the pattern of weighted radial distances from the training neurons is presented to competitive neurons and classification is made in a competitive manner.  The distance pattern is associated with the desired output classes through interpolation during training.  The data points are then assigned to the class that “wins” them, i.e. matches the pattern of distances from all of the hidden nodes most closely.
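A minimal sketch of this idea follows, assuming a Gaussian radial basis function with a hypothetical `spread` parameter and a winner-take-all output over summed responses per class. It is meant to convey the mechanism, not to reproduce the implementation used here.

```python
import math

def pnn_classify(event, training_events, training_labels, spread=0.1):
    """Assign the class whose training neurons give the largest summed RBF response."""
    totals = {}
    for neuron, label in zip(training_events, training_labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(event, neuron))
        response = math.exp(-sq_dist / (2.0 * spread ** 2))
        totals[label] = totals.get(label, 0.0) + response
    # Competitive output: the class with the strongest total response wins.
    return max(totals, key=totals.get)

# Hypothetical training events on a unit scale.
training_events = [[0.1, 0.1], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
training_labels = ["negative", "negative", "positive", "positive"]

high = pnn_classify([0.85, 0.85], training_events, training_labels)
low = pnn_classify([0.15, 0.10], training_events, training_labels)
```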

Cascade forward network

The cascade forward network is a feed forward network, with the exception that it starts with a minimal number of hidden layer neurons, and during back propagation additional neurons are added to the hidden layer as needed to improve classification.  After each iteration the network weights are frozen and the additional neurons are adjusted to allow the network to successfully map the data.

 

Figure 1

Procedure

a. Synthetic Data

Synthetic data was created by the illustrious Aaron Hart.  Twenty-five files were created, consisting of five sets of five replicates, with each replicate containing two populations of 5,000 events each.  From set to set the distance between the two populations decreases successively.  One file from each set is pictured in Figure 1 to illustrate the data.  Population one consisted of events with a median fluorescent intensity (MFI) of 1000 intensity units.  Population two’s MFI decreased across the five sets with values of 2000, 1750, 1500, 1250, and 1000 intensity units.  Each population had a standard deviation of 100 intensity units.  The distance between the centroids of the two populations therefore decreased from 10 standard deviations to zero, making the initial classification trivial and the last classification essentially random.
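A data set with the same structure can be approximated as below. This is a sketch under stated assumptions: each population is modeled as a one-dimensional Gaussian (so the median equals the mean), and the function names, the `seed` parameter, and the single-parameter layout are illustrative rather than a description of how the actual files were generated.

```python
import random

def make_synthetic_file(mfi_two, n_events=5000, mfi_one=1000.0, sd=100.0, seed=0):
    """Return (intensity, population) pairs for two Gaussian populations."""
    rng = random.Random(seed)
    events = [(rng.gauss(mfi_one, sd), 1) for _ in range(n_events)]
    events += [(rng.gauss(mfi_two, sd), 2) for _ in range(n_events)]
    return events

def separation_in_sd(mfi_two, mfi_one=1000.0, sd=100.0):
    """Distance between the two population centers, in standard deviations."""
    return abs(mfi_two - mfi_one) / sd

# Separation for each of the five sets: 10, 7.5, 5, 2.5, and 0 standard deviations.
separations = [separation_in_sd(m) for m in (2000, 1750, 1500, 1250, 1000)]
example_file = make_synthetic_file(2000)
```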

b. SIV data
Tree Star received data from an SIV experiment containing peripheral blood mononuclear cells (PBMCs) from which ten non-mutually exclusive populations are to be identified.  The data were provided by Jörn Schmitz’s group at BIDMC.  Shown hierarchically, those populations are:

  1. Lymphocytes
    1. CD3+ Lymphocytes (T-cells)
      i. CD4+ T-cells (activated T-cells)
        1. Activated T-cells & IFN+
        2. Activated T-cells & IL2+
        3. Activated T-cells & TNF+
      ii. CD8+ T-cells (antigen specific T-cells)
        1. Antigen Spec T-cells & IFN+
        2. Antigen Spec T-cells & IL2+
        3. Antigen Spec T-cells & TNF+

The total data set is composed of five subsets, each containing nine files, for a total of forty-five files.  The nine files per subset are six controls (three unstimulated and three peptide) and three files that were expected to stain positively for three cytokines: tumor necrosis factor (TNF), interferon gamma (IFN), and interleukin 2 (IL2).  In this experiment we have used the first subset of this data, nine files total.  The three unstimulated controls are identified as A01 through A03, the three peptide samples are identified as B01 through B03, and the three cytokine positive samples are identified as C01 through C03.  Match ratios were calculated in FlowJo and exported to text files that were then loaded into Matlab.  Data with FSC and SSC values above the measurable threshold were removed in Matlab by eliminating all events in the maximum bin.  Antibody dependent parameters were transformed to a logarithmic scale and then all data were normalized to a unit scale.
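The three preprocessing steps can be sketched as follows. The original work was done in Matlab; this Python version is only an illustration, and the function name, the column layout, the `max_value` parameter, and the clamp applied before the logarithm are all assumptions.

```python
import math

def preprocess(events, scatter_columns, antibody_columns, max_value):
    """Drop saturated events, log-scale antibody channels, scale all to [0, 1]."""
    # Remove events whose FSC or SSC value sits in the maximum bin.
    kept = [list(e) for e in events
            if all(e[c] < max_value for c in scatter_columns)]
    # Move antibody-dependent parameters to a log scale (clamped at 1 to avoid log of zero).
    for e in kept:
        for c in antibody_columns:
            e[c] = math.log10(max(e[c], 1.0))
    # Normalize every parameter to a unit scale.
    for c in range(len(kept[0])):
        lo = min(e[c] for e in kept)
        hi = max(e[c] for e in kept)
        span = (hi - lo) or 1.0  # guard against a constant column
        for e in kept:
            e[c] = (e[c] - lo) / span
    return kept

# Hypothetical events: columns are FSC, SSC, and one antibody parameter.
raw = [
    [100.0, 200.0, 10.0],
    [262143.0, 300.0, 1000.0],  # saturated FSC, will be dropped
    [300.0, 400.0, 100000.0],
]
clean = preprocess(raw, scatter_columns=[0, 1], antibody_columns=[2], max_value=262143.0)
```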

c. Creation of training data
For the synthetic data sets, the first of the five replicates per group was used as training data for the remainder of the set.

For the SIV data set, ‘positive’ training events were determined using the average score among experts as a metric.  Events with scores of 1.000 were considered positive events by all experts and were used as exemplars.  Events with average scores of less than 0.5 were considered ‘negative’.  The value 0.5 was chosen because the match ratio equation deems these events as negative.  We chose not to restrict the negative exemplars to events that were universally classified as negative because we observed experimentally that the training data set then became oversimplified and the resulting decision boundaries were unintuitive.
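The selection rule above can be expressed compactly. In this sketch the function name and the example scores are hypothetical; only the two thresholds (exactly 1.0 for positive, below 0.5 for negative) come from the text.

```python
def select_training_events(average_scores):
    """Split event indices into positive and negative training exemplars."""
    positives = [i for i, s in enumerate(average_scores) if s == 1.0]
    negatives = [i for i, s in enumerate(average_scores) if s < 0.5]
    return positives, negatives

# Hypothetical average expert scores for six events.
scores = [1.0, 0.0, 0.33, 0.5, 0.83, 1.0]
positives, negatives = select_training_events(scores)
```

Events scoring from 0.5 up to but not including 1.0 fall into neither list, so they are left out of the training set.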

To determine the proper number of cells for a training set, we classified the lymphocyte population using the FSC, SSC, and CD3 parameters.  Classification was performed using a varying number of training events, an SVM classifier, and data not identified as training data.  We performed classification using 100, 500, 1000, 2000, and 5000 events.  Results are shown in Figure 2 below.  It was observed that the match ratio scores fluctuated noticeably between 100 and 1000 samples but then stabilized to near even performance thereafter.  One thousand events were chosen for all training sets thereafter to avoid the less stable sets, and attempt to avoid overly specific training.

Figure 2: Determining the number of training events.

Additionally, training with a set taken from a single file was compared to training with a set that was a composite of multiple files.  As expected, the results from multiple files were better, so this procedure was used.  Data not shown.

d. Classification via Support Vector Machines
SVMs were implemented using Matlab® (Natick, MA) software, using a set of SVMs produced at Ohio State University by Chih-Chung Chang and Chih-Jen Lin.

SVMs were trained to classify the six most specific populations (those lowest on the gating hierarchy): events that were positive for either CD4 or CD8 along with one of the three cytokines TNF, IL2, or IFN.  For each of these populations two types of SVMs were trained using the parameters FSC-A, SSC-A, CD3, either CD4 or CD8 depending on the subset, and the one relevant cytokine antibody, for a five-dimensional analysis.  Data were all preprocessed as described above, and classified using each SVM.  Match ratio was calculated as the sum, over the included events, of each event’s weight factor, divided by the sum of all weights.
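The match ratio calculation as worded above can be sketched as follows. How each event's weight factor is derived is not described in this report, so the weights in the example are placeholders and the function name is an assumption.

```python
def match_ratio(included, weights):
    """Sum of the weights of included events divided by the sum of all weights."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w for flag, w in zip(included, weights) if flag) / total

# Hypothetical example: four events, three of them included by the classifier.
included = [True, True, False, True]
weights = [1.0, 0.5, 1.0, 0.5]
ratio = match_ratio(included, weights)
```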

d. Classification via Artificial Neural Networks
The training data and classification scheme used for the SVMs were applied to the five ANNs as well.  All fuzzy classification scores were rounded to force positive/negative decisions.   For expediency we chose to classify 10,000 events per data file and assume for the time being that those results extrapolate to the remainder of the population.

e. Quality assessment
Because the ground truth is never known when attempting to assess the quality of a classification of flow cytometric data, various metrics must be employed to create an estimate.  In FlowDx we have chosen to use the match ratio (explained elsewhere) as our quality metric.  In this work we have employed a separate metric, pattern matching, both to assess the quality of a classification and to attempt to validate the choice of the match ratio.
For every data file, with each event classified by six experts, the events fall into one of three categories:  events that all experts agreed to be negative, events that all experts agreed to be positive, and events on which the experts disagreed.  Lacking a ground truth answer, we were forced to assume that the universally classified events are classified correctly.  Quality assessment of these events was reduced to confidence in the expert opinion.  For the group of events on which the experts disagreed, we identified the positive and negative subsets within the group and calculated the MFI pattern across all parameters relevant to the population.  We then calculated the MFI pattern for the universally classified positive and universally classified negative cells, and compared them to each expert’s classification of the events lacking universal agreement.  We assume that an n-dimensional MFI pattern that more closely resembles the appropriate universally classified events reflects a better, i.e. more rational, classification result.  For example, cells identified by a given expert as positive should have a pattern of intensities that resembles the average MFIs of the cells that were universally classified as positive.
Furthermore, we have ranked the expert classifications by pattern matching and by match ratio and looked for correlation.  We suppose that better pattern matching indicates better consistency of classification within one user’s work, while better match ratio scores indicate more consistent classification between users.  If the match ratio scores correlate with the pattern matching, that would indicate that match ratio is also a measure of more consistent in-sample classification, because user agreement reflects consistent classification.
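The pattern comparison can be sketched as below. The distance used here, a sum of absolute per-parameter differences, is inferred from the SUM column of Table 3 and should be read as an assumption; the function names are illustrative. The example profiles are the rounded positive-event values for the universal set and the ANN from Table 3, so the result differs from the tabulated sum in the third decimal place.

```python
def mfi_profile(events):
    """Mean intensity of each parameter across a set of events."""
    n = len(events)
    return [sum(column) / n for column in zip(*events)]

def profile_distance(profile, reference):
    """Sum of absolute per-parameter differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(profile, reference))

# Rounded profiles (FSC-A, SSC-A, CD3, CD4, IFN) taken from Table 3, positive events.
universal_positive = [0.471, 0.860, 0.814, 0.637, 0.760]
ann_positive = [0.367, 0.878, 0.775, 0.629, 0.764]

distance = profile_distance(ann_positive, universal_positive)
```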

IN FLUX – We have defined “difficult to classify” events as those on which at least two users disagreed with the consensus.  Events that a single user classified differently are labeled as outliers for the time being.  We are considering dropping the outlier group and simply incorporating those events into the difficult to classify group.

Results

a. Synthetic data

The synthetic data were processed by seven total classifiers.  The average number of misclassified events per set is displayed in Figure 3 and listed in Table 1.  All classifiers were able to correctly identify over 99% of the events up to the point where the data no longer formed two distinct populations (population centroids separated by 2.5 standard deviations), at which point they were able to correctly identify greater than 85% of the events.  When the populations were completely overlapped and inseparable the correct classification rate went to 50%, the equivalent of random selection, as expected.
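These figures can be put in context with a simple theoretical ceiling. The sketch below assumes two equal-size, one-dimensional Gaussian populations with equal standard deviation, for which the best possible single threshold sits midway between the centers; the function name is hypothetical and the calculation is an idealization, not a result from this study.

```python
import math

def best_case_accuracy(separation_in_sd):
    """Accuracy of a midpoint threshold between two equal-size, equal-SD Gaussians."""
    z = separation_in_sd / 2.0
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

accuracies = {d: best_case_accuracy(d) for d in (10.0, 7.5, 5.0, 2.5, 0.0)}
```

Under these assumptions the ceiling is above 99% at a separation of 5 standard deviations, about 89% at 2.5 standard deviations, and exactly 50% at zero separation, which is in line with the rates reported above.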

Figure 3: Misclassification rate of the synthetic data set

Table 1: Misclassification rate of the synthetic data set

b. SIV data
Figure 4 and Table 2 display the match ratio scores for the seven classifiers using the SIV data.  These are the average match ratios across all six populations, the CD8 or CD4 positive cells gated for positive expression of one cytokine.  The match ratios were similar for all classifiers with the exception of the LVQ network, which consistently underperformed compared to the others.  A substantial difference in match ratio scores can be seen between the controls, A01 – B03, and the files that contained many more positive cells, C01 – C03.  The cause of this phenomenon is the relative lack of positive cells in the controls.  In all six control files over 90% of the events were clearly negative, allowing for easy agreement between experts and algorithms on a much higher fraction of the populations.  The three non-controls had substantially more positive events, were more diverse, and thus had more difficult to classify events.  In almost all cases, excluding the LVQ, the assembled networks achieved match ratio scores above 0.8 even for the more difficult files.

Figure 4: Match Ratio for SIV data set

Table 2: Match Ratio for SIV data set

Figure 5:  The synthetic data classified by the RBF SVM.  Each row represents the replicates from one group based on data spacing.  The SVMs classified over 90% of the events correctly in all files in the first three rows, over 85% correctly in the 4th row, and split the data in half in the random classification problem of row 5.

Figure 6:  A typical good classification result using ANNs and the SIV data set.  The three rows correspond to control, peptide only, and stimulated events that we expect to show some positives.  Note that there are very few positives in the first two rows, which agrees with the expert classification.  The bottom row shows a classification result that produced match ratio scores near 0.9.  The figures show all 10,000 events so there are red cells with high levels of IFN because they were excluded by some measure other than IFN, such as CD3 or the scatter parameters.  This demonstrates that the data space is in fact complex in this example.

Figure 7:  A poor result.  This plot illustrates where an automated classifier can go wrong.  The LVQ classification here included many of the IFN and CD4 high events in the positive group even though the experts defined them as negative.  The LVQ has an unsupervised competitive layer that grouped together all the high expressing events into one class.

c. Quality assessment
IN PROGRESS – This work has so far been applied to a single automated classifier on a single file for three of the six populations, so the results are still pending.  Mostly I want to examine in some detail the reasons for variability, and the results produced by making different choices in setting up the analysis.

The results of the quality assessment test that we ran revealed that the automated classifiers produce more consistent results than any individual expert.  Figure 8 displays a plot of events that were classified homogeneously and heterogeneously.  Table 3 displays a sample set of MFI patterns, and the distances from the pattern produced by universally classified cells.  As can be seen in the table, the automated classifier produced the lowest overall distances.

Figure 8:  Illustration of classification agreement or disagreement between classifiers.

 

Mean Fluorescent Intensity: POSITIVE EVENTS

            FSC-A    SSC-A    CD3      CD4      IFN
Universal   0.471    0.860    0.814    0.637    0.760
ANN         0.367    0.878    0.775    0.629    0.764
Aaron       0.318    0.879    0.770    0.635    0.756
John        0.356    0.878    0.704    0.580    0.800
Junichi     0.340    0.876    0.774    0.626    0.743
Nick        0.482    0.864    0.767    0.622    0.768
Quinjun     0.346    0.878    0.769    0.626    0.758
Sach        0.337    0.877    0.771    0.629    0.763

Mean Fluorescent Intensity: NEGATIVE EVENTS

            FSC-A    SSC-A    CD3      CD4      IFN
Universal   0.397    0.884    0.681    0.608    0.596
ANN         0.337    0.851    0.714    0.590    0.635
Aaron       0.497    0.866    0.771    0.600    0.750
John        0.365    0.876    0.771    0.627    0.753
Junichi     0.413    0.876    0.763    0.626    0.777
Nick        0.315    0.881    0.771    0.628    0.748
Quinjun     0.469    0.862    0.778    0.630    0.735
Sach        0.523    0.868    0.768    0.609    0.705

 

Difference from Universal – Positive events

            FSC-A    SSC-A    CD3      CD4      IFN      SUM
Universal   0.000    0.000    0.000    0.000    0.000
ANN         0.105    0.018    0.039    0.008    0.004    0.174
Aaron       0.154    0.019    0.044    0.002    0.004    0.222
John        0.115    0.018    0.110    0.057    0.040    0.339
Junichi     0.131    0.016    0.040    0.011    0.017    0.214
Nick        0.010    0.004    0.046    0.015    0.009    0.084
Quinjun     0.125    0.018    0.045    0.011    0.002    0.202
Sach        0.135    0.017    0.043    0.008    0.003    0.206

Difference from Universal – Negative events

            FSC-A    SSC-A    CD3      CD4      IFN      SUM
Universal   0.000    0.000    0.000    0.000    0.000
ANN         0.060    0.033    0.032    0.018    0.039    0.182
Aaron       0.101    0.018    0.090    0.008    0.155    0.371
John        0.032    0.008    0.090    0.019    0.158    0.307
Junichi     0.017    0.008    0.081    0.018    0.182    0.306
Nick        0.082    0.003    0.090    0.020    0.153    0.347
Quinjun     0.073    0.022    0.097    0.022    0.140    0.353
Sach        0.127    0.016    0.086    0.001    0.110    0.340

Table 3: For sample C03, focusing on the population that is CD4+IFN+.  The top table displays the MFI (normalized to a unit scale) of the events identified as positive by each user.  The next table displays the MFIs of events classified as negative, and the following two tables show the difference from the universal score for the positives and negatives, respectively.

Table 4 below shows the ranking by distance and match ratio.  The rankings don’t totally match up – other than to indicate that John is the worst gater! – but there is a lot of data to analyze and explain before we draw any conclusions.

 

 

                        Aaron      John       Junichi    Nick       Quinjun    Sach
Match ratio             0.999575   0.988577   0.996458   0.999246   0.99952    0.998853
By match ratio          1          6          5          3          2          4
By distance from uni    5          6          2          1          4          3

Table 4:  Unfinished and inconclusive results showing relations between distance and match ratio.
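One way to put a number on how well the two rankings in Table 4 agree is a Spearman rank correlation. This is an illustrative calculation on the six tabulated ranks only, using the standard formula for untied ranks; it is not part of the original analysis, and with so few users it should not be read as a conclusion.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two untied rankings of the same items."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

# Ranks from Table 4, in the order Aaron, John, Junichi, Nick, Quinjun, Sach.
by_match_ratio = [1, 6, 5, 3, 2, 4]
by_distance = [5, 6, 2, 1, 4, 3]

rho = spearman_rho(by_match_ratio, by_distance)
```

For these ranks the coefficient comes out at roughly 0.03, which is consistent with the observation that the two rankings do not line up.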

 

Discussion

From the synthetic data experiments we can observe that all of the classifiers that we tested can classify two populations that are discernible to the eye with greater than 99% accuracy, even a classifier that later proved to be inferior on more complex problems.  This is an important result for users of FlowDx who would use the software in a high throughput setting to make many relatively simple classifications automatically and ask the machine to flag samples with measures outside of the normal.

From the SIV experiment we can observe that we found six types of networks that can match the performance of an expert reliably, producing match ratio scores on average of 0.875 ± 0.06 or better, with four of them averaging 0.93 or better.  We were also able to observe that the best networks at matching the experts were the two SVMs, the cascade forward network, and the radial basis function ANN.  Of special interest is that the RBF ANN performed best on the most difficult data.  Thus if we were to code a handful of classifiers into FlowJo, these would be the best choices, and if we were picking just one from the lot it may be the RBF ANN.

From the plots of the SIV data set, we can observe that there is significant overlap in much of the test data.  These plots show all events (except those excluded as being above the measurable FSC or SSC threshold), and so it is not surprising that some events that are positive for a cytokine may be a negative example as they were excluded in some prior gating step.  What this does demonstrate is that this is a non-trivial, non-linear classification problem.

We also looked at alternative methods of evaluating the quality of various classification results and the match ratio metric itself.  At the very least, the metric is useful for identifying which events are difficult to classify.  We used the match ratio to identify those events and then, for each user and each automated classifier, identified the events classified as “positive” and “negative” within the difficult-to-classify subsets and created an MFI-based profile of the events of each class.  We then created a profile of the universally agreed upon “positive” and “negative” events, so that we could check whether, for example, the set of “positives” that a user or automated classifier defined among the difficult-to-classify subset matched the profile of the universally agreed upon “positives”.  In doing this analysis we have shown that the automated classifiers match the patterns most closely.  We were also able to compare match ratio scores to profile closeness scores to see whether they agree.  We found that there wasn’t complete agreement, but the particular data set might influence that.  (THIS IS IN PROGRESS)

As a future experiment we may want to have experts classify the synthetic data to associate match ratio scores with known good and bad classification results, so that we have a basis for deciding whether a match ratio indicates “good” versus “bad” expert agreement.

In summary, these experiments have demonstrated the usefulness of these techniques for gating flow cytometric data rapidly and non-heuristically, and have guided us toward a short list of networks that seem appropriate for our niche.