FABIA - gene_expression

Gene expression data sets

We consider 3 gene expression data sets that are provided by the Broad Institute as part of their "Cancer Program Data Sets" and were analyzed in (Hoshida et al. 2007). There, the samples were clustered using additional data sets, and the clusters were then verified by gene set enrichment analysis. Our goal is to re-identify these clusters without additional data sets.

The datasets can be downloaded via:
http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
or
http://www.broadinstitute.org/publications/broad823

The following data sets have been downloaded:

Breast-A
Multi-A
DLBCL-B

Data sets and results:

data: the data sets
data_bicat: the data sets in bicat format
data_plaid: the data sets in plaid format
densities: density plot for the data sets
output_software: the original output of software if not in R
biclusters_R: the biclusters of all methods, stored in the biclust class of the package biclust
results: the results of the experiments

Description of the data sets

1. BREAST CANCER:
The first data set is from (van't Veer et al. 2002), where the goal was to find a gene signature to predict the outcome of a cancer therapy. We removed array S54 because it is an outlier, which leaves a data set with 97 samples and 1213 probe sets, with skewness of 0.45 and excess kurtosis of 0.93 after standardization. In (Hoshida et al. 2007), 3 subclasses were found. These classes are biologically meaningful because 50/61 cases from classes 1 and 2 were estrogen receptor positive, compared with only 3/36 cases from class 3.

2. MULTIPLE TISSUE TYPES:
The second data set is from (Su et al. 2002), where gene expression from human and mouse samples across diverse tissues, organs, and cell lines was profiled. The goal was to provide a reference for the normal mammalian transcriptome. The data set contains 102 samples and 5565 probe sets, with skewness of 0.15 and excess kurtosis of 1.3 after standardization. We try to re-identify the tissue types.

3. DIFFUSE LARGE-B-CELL LYMPHOMA (DLBCL):
The third data set is from (Rosenwald et al. 2002) and consists of 180 samples and 661 probe sets, with skewness of -0.05 and excess kurtosis of 0.35 after standardization. The goal was to predict survival after chemotherapy. In (Hoshida et al. 2007), 3 classes were found: "OxPhos" (oxidative phosphorylation), "BCR" (B-cell response), and "HR" (host response). Our goal is to identify these subclasses directly by biclustering.
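
The skewness and excess kurtosis values quoted for the data sets could be computed along the following lines. This is only a minimal Python sketch, not the procedure used to produce the numbers above: it assumes that "standardization" means scaling each probe set (row) to zero mean and unit variance, and that the moments are then taken over all matrix entries. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def standardize_rows(x):
    """Scale each row (assumed: one probe set per row) to zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)
    scale = centered.std(axis=1, keepdims=True)
    scale[scale == 0] = 1.0  # constant rows stay at zero instead of dividing by 0
    return centered / scale


def skewness_and_excess_kurtosis(x):
    """Return (skewness, excess kurtosis) computed over all entries of x."""
    values = np.asarray(x, dtype=float).ravel()
    z = (values - values.mean()) / values.std()
    skewness = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0  # a Gaussian has excess kurtosis 0
    return skewness, excess_kurtosis


if __name__ == "__main__":
    # Toy stand-in for an expression matrix (probe sets x samples), not real data.
    rng = np.random.default_rng(0)
    toy = rng.normal(size=(200, 50))
    skew, kurt = skewness_and_excess_kurtosis(standardize_rows(toy))
    print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

Positive excess kurtosis, as reported for all three data sets, indicates heavier tails than a Gaussian distribution.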

REFERENCES:

Hoshida Y, Brunet J-P, Tamayo P, Golub TR, Mesirov JP, "Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets", PLoS ONE 2(11): e1195, 2007.
van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, et al. "Gene expression profiling predicts clinical outcome of breast cancer", Nature 415:530-536, 2002.
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, et al. "The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma", New Engl. J. Med. 346: 1937-1947, 2002.
Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, et al. "Large-scale analysis of the human and mouse transcriptomes", Proc. Natl. Acad. Sci. USA 99:4465-4470, 2002.