<?xml version="1.0" encoding="UTF-8"?><mods xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.2" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-2.xsd"><titleInfo><title>Improving breast cancer outcome prediction by combining multiple data sources</title></titleInfo><name><namePart>Van Vliet, M.H.</namePart></name><name><namePart>Reinders, M.J.T.</namePart></name><name><namePart>TU Delft</namePart></name><name><namePart>TU Delft, Delft University of Technology</namePart></name><subject lang="nl"><topic>breast cancer</topic><topic>disease outcome prediction</topic><topic>microarray</topic><topic>signature</topic></subject><accessCondition></accessCondition><location><url>http://resolver.tudelft.nl/uuid:c50cc1ba-35df-4db3-ad49-189e6e6c74bf</url></location><language><languageTerm type="text">en</languageTerm></language><genre authority="local">thesis</genre><originInfo><publisher>MH van Vliet</publisher><dateIssued>2010-04-06</dateIssued></originInfo><identifier type="isbn">9789090251783</identifier><abstract>Cancer has recently become the number one cause of death in The Netherlands. Breast cancer is the most prevalent form of cancer among females, with a lifetime risk of 12.8% (i.e. the 1 in 8 rule). As a result, more and more research is devoted to getting a better insight into cancer, and to further develop strategies for personalized treatment.

Cancer is often induced by damage to the DNA, a molecule present in all cells. Usually, damage to the DNA is not much of a problem, because it is repaired by one of several repair mechanisms in the cell. However, sometimes the repair is incorrect, or the repair mechanisms themselves are damaged, resulting in aberrations in the DNA that can eventually lead to abnormal cell behavior. If the DNA is altered in a particular way, a cell can start to replicate uncontrollably, which causes cells to accumulate. This results in the formation of a tumor, and if the cell growth continues long enough it can produce a palpable lump in the breast. Eventually, cells from the tumor will spread through the body and invade other tissues and organs (with metastases as a result), which is often fatal. Upon diagnosis of breast cancer, the tumor is surgically removed, followed by chemotherapy to eradicate any tumorigenic cells that may still be present. This strategy has led to overtreatment of patients in the past, since some patients would have remained metastasis-free even without additional therapy.

Knowing beforehand whether or not a patient will eventually develop a metastasis (i.e. the outcome of the disease) would be of great value in the clinic. For example, low-risk patients could then be spared unnecessary treatment. Hence, several models have been developed that predict outcome based on clinical and pathological parameters recorded at the time of diagnosis. More recently, microarray analysis has enabled the measurement of the mRNA levels of cells (a derivative of the genes on the DNA). Since research institutes have deep-frozen databanks of breast tumors, this presented the opportunity to generate rich microarray datasets. Based on these microarray datasets, new predictors of breast cancer outcome, so-called classifiers, can be developed that outperform those based on clinical data. These classifiers select a subset of genes, often referred to as a &apos;signature&apos;, that show distinctly different behavior between patients who develop metastases and those who do not.

The work in this dissertation focuses on ways to improve the performance of the classifiers that can be derived from microarray datasets. Each of these classifiers is hampered by the fact that we have a relatively small number of patients and a huge number of genes (the &apos;small sample size problem&apos;). This causes classifiers to be easily overtrained, because they pick up spurious associations that are present by chance in a small dataset. As a countermeasure, we first increased the sample size by constructing compendia of several microarray datasets and analyzing these datasets simultaneously. Secondly, we incorporated additional, independent data (data that is either sample specific or generally applicable to all samples). Finally, we considered using data from model systems, which are extensively used to study various aspects of cancer and cellular behavior.

Combining several datasets into a compendium is a straightforward way to increase the statistical power. However, this comes at the price of increasing the biological heterogeneity within the compendium, due to differences in the clinical composition of the datasets. At the same time, the technical heterogeneity will also increase when datasets were obtained using different microarray techniques or normalization protocols. The specifics of the trade-off between the gain in statistical power and the effect of increased heterogeneity are as yet unknown. Using several publicly available breast cancer microarray datasets, we found that pooling datasets does indeed result in better classifier performance. This indicates that the gain in statistical power is larger than the detrimental effect of increased heterogeneity, and that it is useful to construct a breast cancer compendium. Notably, the performance gain seems to level off beyond 500 samples. Thus, further improvement from adding even more datasets is unlikely, at least for the nearest mean classifier used here. More complex classifiers might still benefit from more samples.

From dataset to dataset, the sets of genes that are selected in the signatures show a remarkably low degree of overlap. Similarly, signatures from classifiers that are trained on overlapping subsets of samples from a single dataset (a repeated random resampling strategy) show little overlap. However, there is no &apos;gold standard&apos; to compare the signatures against. Therefore, we set up an experiment with artificial datasets, for which we know the optimal ranking of genes. By doing so, we were able to show that the resampling method actually produces inferior rankings of genes compared to methods that use all samples. Each breast cancer dataset can be seen as a resampling from the underlying breast cancer population. Therefore, we compared signatures derived from single datasets with those derived from multiple datasets. We found that the average overlap increases significantly when signatures are derived from multiple datasets. Thus, we pinpointed the small sample size problem as a key cause underlying the limited overlap effect. Several other explanations for the limited overlap effect have been proposed, but have never been rigorously tested. Our analyses yielded no evidence to support these explanations. These results show clear advantages of using a breast cancer compendium.

Apart from microarray data, various other data sources are available. These might contain complementary information, which could help to increase the classification performance. We make a distinction between general knowledge from databases and additional measurements from the same samples. First of all, we used gene sets from pathway databases (e.g. the Gene Ontology) to derive pathway-based features from the microarray datasets. Using this higher-level feature representation helped very little in improving classifier performance. However, these gene sets do help in getting a much richer representation of the underlying biology. This may help biologists in pinpointing genes and/or pathways for follow-up experiments. Secondly, extensive annotation of clinical and pathological variables is available for a few datasets. These variables are ignored when training a classifier on microarray data only. Therefore, we investigated training a classifier on both data sources simultaneously. Three different integration strategies were considered: early integration, where the two sets of features are concatenated; intermediate integration, where the outputs from the two individual classifiers are combined in a weighted fashion; and late integration, where the binary outputs of the two individual classifiers are combined using Boolean logic. For a range of classifiers, the integration strategies result in classifiers with a better performance compared to those derived from a single data source. This is an indication that additional sample-specific measurements hold the promise to increase the performance even further.

Many genes that are important in tumor development have been extensively studied in model systems. For instance, these genes have been over-expressed or knocked down in cell lines, leading to particular changes in the activity of other genes. The expression patterns of human tumors result from the interplay of various over-expressions and knock-downs. Under the assumption that the individual effects contribute linearly to the overall expression pattern, we set up a regularized linear decomposition framework. Using such a model we were able to predict the mutations most likely present in a given tumor. In a similar fashion, we used gene sets and hallmarks as components in the decomposition. Our analysis unveiled specific patterns in the weights (representing the degree of presence of a hallmark) for the five subtypes of breast cancer. Although this setup does not increase the classification performance, this type of analysis does provide a new angle for interpreting breast cancer.</abstract></mods>
