Gene expression data sets
We consider three gene expression data sets that the Broad Institute provides as its "Cancer Program Data Sets" and that were analyzed in (Hoshida et al. 2007). There, the samples were clustered using additional data sets, and the clusters were then verified by gene set enrichment analysis. The goal here is to re-identify these clusters without the additional data sets.
The data sets can be downloaded from:
http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
or
http://www.broadinstitute.org/publications/broad823
The following data sets have been downloaded:
- Breast-A
- Multi-A
- DLBCL-B
Data sets and results:
- data: the data sets
- data_bicat: the data sets in bicat format
- data_plaid: the data sets in plaid format
- densities: density plots for the data sets
- output_software: the original output of the software, if it is not in R
- biclusters_R: the biclusters of all methods, stored in the biclust class of the package biclust
- results: the results of the experiments
Description of the data sets
1. BREAST CANCER:
The first data set is from (van't Veer et al. 2002), where the
goal was to find a gene signature to predict the outcome of a
cancer therapy. We removed array S54 because it is an outlier,
which leaves a data set with 97 samples and 1213 probe sets,
with a skewness of 0.45 and an excess kurtosis of 0.93 after
standardization. In (Hoshida et al. 2007), 3 subclasses were
found. These classes are biologically meaningful because
50/61 cases from classes 1 and 2 were estrogen receptor positive,
compared with only 3/36 cases from class 3.
2. MULTIPLE TISSUE TYPES:
The second data set is from (Su et al. 2002), where gene expression
in human and mouse samples across a diverse set of tissues, organs,
and cell lines was profiled. The goal was to provide a reference for
the normal mammalian transcriptome. The data set contains 102 samples
and 5565 probe sets, with a skewness of 0.15 and an excess kurtosis
of 1.3 after standardization. We try to re-identify the tissue types.
3. DIFFUSE LARGE-B-CELL LYMPHOMA (DLBCL):
The third data set is from (Rosenwald et al. 2002) and consists of
180 samples and 661 probe sets, with a skewness of -0.05 and an
excess kurtosis of 0.35 after standardization. The goal of that
study was to predict survival after chemotherapy. In (Hoshida et
al. 2007), 3 classes were found: "OxPhos" (oxidative
phosphorylation), "BCR" (B-cell response), and "HR" (host response).
Our goal is to identify these subclasses directly by biclustering.
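
The skewness and excess kurtosis values quoted above can be checked with
a few lines of code. The following is a minimal sketch only: it assumes
that "standardization" means scaling each probe set (row) to zero mean
and unit variance, and that the statistics are pooled over all entries
of the matrix. Neither assumption is confirmed by this description, and
the function names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_rows(matrix):
    """Scale each row (assumed: one probe set) to zero mean, unit variance."""
    scaled = []
    for row in matrix:
        mu = fmean(row)
        sd = pstdev(row, mu)
        if sd == 0:
            # Constant row: there is no variation to scale, so map it to zeros.
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(x - mu) / sd for x in row])
    return scaled


def skewness_and_excess_kurtosis(matrix):
    """Return (skewness, excess kurtosis) pooled over all matrix entries."""
    values = [x for row in matrix for x in row]
    mu = fmean(values)
    # Central moments of order 2, 3, and 4 (population form, divide by n).
    m2 = fmean((x - mu) ** 2 for x in values)
    m3 = fmean((x - mu) ** 3 for x in values)
    m4 = fmean((x - mu) ** 4 for x in values)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a Gaussian distribution
    return skewness, excess_kurtosis


if __name__ == "__main__":
    # Toy matrix for illustration only; these are not values from the data sets.
    toy = [[1.0, 2.0, 3.0, 10.0], [4.0, 5.0, 6.0, 5.0]]
    skew, kurt = skewness_and_excess_kurtosis(standardize_rows(toy))
    print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A positive excess kurtosis, as reported for all three data sets, indicates
heavier tails than a Gaussian distribution.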
REFERENCES:
- Hoshida Y, Brunet J-P, Tamayo P, Golub TR, Mesirov JP, "Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets", PLoS ONE 2(11):e1195, 2007.
- van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, et al. "Gene expression profiling predicts clinical outcome of breast cancer", Nature 415:530-536, 2002.
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, et al. "The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma", New Engl. J. Med. 346:1937-1947, 2002.
- Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, et al. "Large-scale analysis of the human and mouse transcriptomes", Proc. Natl. Acad. Sci. USA 99:4465-4470, 2002.