Synthetic Data Description
To support Tree Star's goal of developing a process for selecting optimal systems to analyze a diverse array of clinical cytometry assays, we have constructed a flow cytometer simulation utility, Flowsim. The tool lets us define a model for the contents of an FCS file and then generate large numbers of FCS files whose data conform to the specified model(s).
Methods:
Flowsim, a flow cytometry data simulator, was developed by Aaron Hart. It generates synthetic FCS files that contain populations of events drawn from a mathematical model, such as a Gaussian distribution. Populations and their underlying distributions are generated from definitions located in a configuration file. Arbitrary numbers of populations, dimensions, and distributions can be combined to generate a very diverse collection of synthetic data files (Download Examples).
To Generate Synthetic Data Using Flowsim:
- Instructions for installing the Python script, Flowsim:
  - Download the zip file.
  - Extract the zip file into an appropriate location.
  - Install the Enthought Python Distribution.
- Instructions for use:
  - Open flowsim.config with a text editor.
  - Edit the population and distribution definitions and the number of parameters to reflect the desired output. Note that scaling is currently limited to linear, with a range of 0-4096.
  - Run fcsgen.py, optionally passing the output filename as an argument: "python fcsgen.py [outfile]", where [outfile] is the desired FCS file name. The script adds the fcs file extension.
An example Flowsim configuration file that combines several Gaussian distributions looks like this:
config file for fcs generator
to use fcs generator, modify this config file to define the desired populations, and run fcsgen.py with the optional argument of the output filename i.e., "python fcsgen.py [outfile]"
number of parameters = 3
distributions:
name='one' mean='1000' stdev='100' type='normal'
name='two' mean='1250' stdev='150' type='normal'
name='three' mean='1750' stdev='150' type='normal'
name='four' mean='2500' stdev='450' type='normal'
name='five' mean='750' stdev='75' type='normal'
name='six' mean='3000' stdev='300' type='normal'
populations:
name='one' distributions='five,five,one' numberOfEvents='1000'
name='two' distributions='two,one,two' numberOfEvents='20000'
name='five' distributions='two,one,six' numberOfEvents='50000'
name='three' distributions='three,two,three' numberOfEvents='20000'
name='four' distributions='three,four,two' numberOfEvents='10000'
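To make the relationship between the two sections of the configuration file concrete, the following is a minimal sketch of how such a file could be read and turned into events. It is not Flowsim's actual code: the function names parse_config and generate_events are hypothetical, only type='normal' is handled, the configuration is abbreviated to two distributions and one population, and writing the FCS file itself is omitted.

```python
import random
import re

# Abbreviated copy of the example configuration (two distributions, one population).
CONFIG_TEXT = """\
number of parameters = 3
distributions:
name='one' mean='1000' stdev='100' type='normal'
name='five' mean='750' stdev='75' type='normal'
populations:
name='one' distributions='five,five,one' numberOfEvents='1000'
"""

# Matches key='value' pairs such as mean='1000'.
FIELD = re.compile(r"(\w+)='([^']*)'")


def parse_config(text):
    """Return (distributions, populations) parsed from the config text (hypothetical helper)."""
    distributions, populations = {}, []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("distributions:", "populations:"):
            section = line.rstrip(":")
            continue
        fields = dict(FIELD.findall(line))
        if not fields:
            continue  # skips comment lines and the "number of parameters" line
        if section == "distributions":
            distributions[fields["name"]] = (float(fields["mean"]), float(fields["stdev"]))
        elif section == "populations":
            populations.append({
                "name": fields["name"],
                "distributions": fields["distributions"].split(","),
                "n_events": int(fields["numberOfEvents"]),
            })
    return distributions, populations


def generate_events(distributions, populations, seed=0):
    """Draw one Gaussian value per parameter for every event of every population."""
    rng = random.Random(seed)
    events = []
    for pop in populations:
        # One (mean, stdev) pair per parameter, in the order listed for the population.
        params = [distributions[name] for name in pop["distributions"]]
        for _ in range(pop["n_events"]):
            events.append([rng.gauss(mean, stdev) for mean, stdev in params])
    return events


distributions, populations = parse_config(CONFIG_TEXT)
events = generate_events(distributions, populations)
print(len(events), "events with", len(events[0]), "parameters each")
```

Each population names one distribution per parameter, so population 'one' above draws its first two parameters from distribution 'five' and its third from distribution 'one'.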
The first set of data consisted of one FCS file and a prototype analysis showing the differences between manual gating and automated gating.
A second synthetic data experiment was constructed based upon the underlying model of a single-color antibody titration. Twenty-five files were created, consisting of five sets of five replicates, with each replicate containing two populations of 5,000 events each. From set to set, the distance between the two populations decreased. One file from each set is pictured in Figure 1 to illustrate the data. Population one had a median fluorescence intensity (MFI) of 1000 intensity units in every set. Population two's MFI decreased across the five sets, with values of 2000, 1750, 1500, 1250, and 1000 intensity units. Each population had a standard deviation of 100 intensity units. The distance between the centroids of the two populations therefore decreased from 10 standard deviations to zero, making the first classification trivial and the last essentially random.
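The design of this experiment can be sketched in a few lines. The code below is an illustration under stated assumptions, not the script used to produce the files: the function name simulate_titration and the dictionary layout are hypothetical, and events are kept in memory rather than written to FCS files. It reproduces the five sets of five replicates and expresses the separation between the populations in standard deviations.

```python
import random

N_REPLICATES = 5
EVENTS_PER_POPULATION = 5000
STDEV = 100.0
POP_ONE_MEAN = 1000.0
POP_TWO_MEANS = [2000.0, 1750.0, 1500.0, 1250.0, 1000.0]  # one value per set


def simulate_titration(seed=0):
    """Return 25 simulated files: five sets of five replicates (hypothetical helper)."""
    rng = random.Random(seed)
    files = []
    for set_index, pop_two_mean in enumerate(POP_TWO_MEANS, start=1):
        for replicate in range(1, N_REPLICATES + 1):
            pop_one = [rng.gauss(POP_ONE_MEAN, STDEV) for _ in range(EVENTS_PER_POPULATION)]
            pop_two = [rng.gauss(pop_two_mean, STDEV) for _ in range(EVENTS_PER_POPULATION)]
            files.append({
                "set": set_index,
                "replicate": replicate,
                "events": pop_one + pop_two,  # population one first, then population two
            })
    return files


# Separation between the two population centers, in units of the standard deviation.
separations = [(mean - POP_ONE_MEAN) / STDEV for mean in POP_TWO_MEANS]
files = simulate_titration()
print("separations (SD):", separations)
print("number of files:", len(files))
```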
These files were written out from an array sorted by population identity, so classification results can be compared to the correct answer based on the order of events in the data file (Download Files).
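Because the event order encodes the true population, scoring a classifier reduces to comparing each predicted label with the label implied by the event's position. The sketch below assumes population one occupies the first 5,000 positions and population two the rest, which is one reading of the sort order described above. The midpoint threshold classifier is only a stand-in for illustration and is not one of the analysis methods used in this study; all function names are hypothetical.

```python
import random

EVENTS_PER_POPULATION = 5000
STDEV = 100.0
POP_ONE_MEAN = 1000.0


def make_sorted_file(pop_two_mean, rng):
    """Events sorted by population identity: population one first, then population two."""
    pop_one = [rng.gauss(POP_ONE_MEAN, STDEV) for _ in range(EVENTS_PER_POPULATION)]
    pop_two = [rng.gauss(pop_two_mean, STDEV) for _ in range(EVENTS_PER_POPULATION)]
    return pop_one + pop_two


def true_label(index):
    """The correct answer follows from the event's position in the file (assumed layout)."""
    return 1 if index < EVENTS_PER_POPULATION else 2


def threshold_classifier(events, pop_two_mean):
    """Stand-in classifier: cut halfway between the two population centers."""
    cut = (POP_ONE_MEAN + pop_two_mean) / 2
    return [1 if value < cut else 2 for value in events]


def accuracy(predicted):
    """Fraction of events whose predicted label matches the position-derived label."""
    hits = sum(1 for i, label in enumerate(predicted) if label == true_label(i))
    return hits / len(predicted)


rng = random.Random(0)
results = {}
for pop_two_mean in (2000.0, 1250.0, 1000.0):
    events = make_sorted_file(pop_two_mean, rng)
    predicted = threshold_classifier(events, pop_two_mean)
    results[pop_two_mean] = accuracy(predicted)
    print(pop_two_mean, round(results[pop_two_mean], 3))
```

With this stand-in, accuracy should be close to 1 at a separation of 10 standard deviations and close to 0.5 when the two populations coincide, matching the trivial and essentially random extremes of the experiment.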
This synthetic data set has been analyzed by three methods: manually, using FlowJo's cluster analysis platform on the Macintosh, and using Artificial Neural Networks. The results of these analyses can be found in the results document.