Cancer genome sequencing is the whole genome sequencing of a single, homogeneous or heterogeneous group of cancer cells. It is a biochemical laboratory method for the characterization and identification of the DNA or RNA sequences of cancer cell(s).
Unlike whole genome (WG) sequencing, which typically uses blood cells (as in J. Craig Venter's[1] and James D. Watson’s[2] WG sequencing projects), saliva, epithelial cells or bone, cancer genome sequencing involves direct sequencing of primary tumor tissue, adjacent or distal normal tissue, the tumor microenvironment such as fibroblast/stromal cells, or metastatic tumor sites.
Similar to whole genome sequencing, the information generated from this technique includes: identification of nucleotide bases (DNA or RNA), copy number and sequence variants, mutation status, and structural changes such as chromosomal translocations and fusion genes.
Cancer genome sequencing is not limited to WG sequencing and can also include exome, transcriptome, micronome sequencing, and end-sequence profiling. In addition to providing sequence data, these methods can be used to quantify gene expression and miRNA expression and to identify alternative splicing events.
The first report of cancer genome sequencing appeared in 2006. In this study, 13,023 genes were sequenced in 11 breast and 11 colorectal tumors.[3] A follow-up was published in 2007, in which the same group added just over 5,000 more genes and almost 8,000 transcript species to complete the exomes of 11 breast and colorectal tumors.[4] The first whole cancer genome to be sequenced was from cytogenetically normal acute myeloid leukaemia by Ley et al. in November 2008.[5] The first breast cancer tumor was sequenced by Shah et al. in October 2009,[6] the first lung and skin tumors by Pleasance et al. in January 2010,[7][8] and the first prostate tumors by Berger et al. in February 2011.[9]
Historically, cancer genome sequencing efforts have been divided between transcriptome-based sequencing projects and DNA-centered efforts.
The Cancer Genome Anatomy Project (CGAP) was first funded in 1997[10] with the goal of documenting the sequences of RNA transcripts in tumor cells.[11] As technology improved, the CGAP expanded its goals to include the determination of gene expression profiles of cancerous, precancerous and normal tissues.[12]
The CGAP published the largest publicly available collection of cancer expressed sequence tags in 2003.[13]
The Sanger Institute's Cancer Genome Project, first funded in 2005, focuses on DNA sequencing. It has published a census of genes causally implicated in cancer,[14] and a number of whole-genome resequencing screens for genes implicated in cancer.[15]
The International Cancer Genome Consortium (ICGC) was founded in 2007 with the goal of integrating available genomic, transcriptomic and epigenetic data from many different research groups.[16][17] As of December 2011, the ICGC includes 45 committed projects and has data from 2,961 cancer genomes available.[16]
The process of tumorigenesis that transforms a normal cell into a cancerous cell involves a series of complex genetic and epigenetic changes.[18][19][20] Identification and characterization of all these changes can be accomplished through various cancer genome sequencing strategies.
The power of cancer genome sequencing lies in the heterogeneity of cancers and patients. Most cancers have a variety of subtypes, and in addition to these ‘cancer variants’ there are differences between a cancer subtype in one individual and the same subtype in another individual. Cancer genome sequencing allows clinicians and oncologists to identify the specific and unique changes a patient has undergone to develop their cancer. Based on these changes, a personalized therapeutic strategy can be undertaken.[21][22]
A major contributor to cancer death and failed cancer treatment is clonal evolution at the cytogenetic level, as seen for example in acute myeloid leukaemia (AML).[23][24] In a Nature study published in 2011, Ding et al. identified cellular fractions characterized by common mutational changes to illustrate the heterogeneity of a particular tumor pre- and post-treatment vs. normal blood in one individual.[25]
These cellular fractions could only have been identified through cancer genome sequencing, which illustrates both the information that sequencing can yield and the complexity and heterogeneity of a tumor within one individual.
The two main projects focused on complete cancer characterization in individuals, both heavily involving sequencing, are the Cancer Genome Project, based at the Wellcome Trust Sanger Institute, and the Cancer Genome Atlas, funded by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). Alongside these efforts, the International Cancer Genome Consortium (a larger organization) is a voluntary scientific organization that provides a forum for collaboration among the world's leading cancer and genomic researchers.
The Cancer Genome Project's goal is to identify sequence variants and mutations critical in the development of human cancers. The project involves the systematic screening of coding genes and flanking splice junctions of all genes in the human genome for acquired mutations in human cancers. To investigate these events, the discovery sample set will include DNA from primary tumor, normal tissue (from the same individuals) and cancer cell lines. All results from this project are amalgamated and stored within the COSMIC cancer database. COSMIC also includes mutational data published in scientific literature.
The Cancer Genome Atlas (TCGA) is a multi-institutional effort to understand the molecular basis of cancer through genome analysis technologies, including large-scale genome sequencing techniques. Hundreds of samples are being collected, sequenced and analyzed. Currently, the cancer tissues being collected include: central nervous system, breast, gastrointestinal, gynecologic, head and neck, hematologic, thoracic, and urologic.
The components of the TCGA research network include: Biospecimen Core Resources, Genome Characterization Centers, Genome Sequencing Centers, Proteome Characterization Centers, a Data Coordinating Center, and Genome Data Analysis Centers. Each cancer type will undergo comprehensive genomic characterization and analysis. The data and information generated are freely available through the project's TCGA data portal.
The ICGC’s goal is “To obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe”.[16]
Cancer genome sequencing uses the same technology as whole genome sequencing. DNA sequencing originated in 1977 with two independently developed methods: Frederick Sanger’s enzymatic dideoxy DNA sequencing technique[26] and the chemical degradation technique of Allan Maxam and Walter Gilbert.[27] More than 20 years after these landmark papers, ‘Second Generation’ high-throughput next generation sequencing (HT-NGS) emerged, followed by ‘Third Generation HT-NGS technology’ in 2010.[28] The figures to the right illustrate the general biological pipeline and companies involved in second and third generation HT-NGS sequencing.
Three major second generation platforms are Roche/454 pyrosequencing, ABI/SOLiD sequencing by ligation, and Illumina’s bridge amplification sequencing technology. Three major third generation platforms are Pacific Biosciences Single Molecule Real Time (SMRT) sequencing, Oxford Nanopore sequencing, and Ion semiconductor sequencing.
As with any genome sequencing project, the reads must be assembled to form a representation of the chromosomes being sequenced. With cancer genomes, this is usually done by aligning the reads to the human reference genome.
Since even non-cancerous cells accumulate somatic mutations, it is necessary to compare the sequence of the tumor to that of a matched normal tissue in order to discover which mutations are unique to the cancer. In some cancers, such as leukemia, it is not practical to match the cancer sample to a normal sample of the same tissue, so a different non-cancerous tissue must be used.[25]
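The tumor-versus-normal comparison can be sketched as a set difference over variant calls. This is a minimal illustration rather than a production somatic caller: the `Variant` record, the `somatic_candidates` function, and the example coordinates are all hypothetical, and real pipelines also weigh read depth, allele fraction, and sequencing error.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """A simplified variant call (hypothetical record layout)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_candidates(tumor_calls: Iterable[Variant],
                       normal_calls: Iterable[Variant]) -> List[Variant]:
    """Return variants called in the tumor but absent from the matched normal."""
    normal = set(normal_calls)
    # Deduplicate tumor calls, drop anything also seen in the normal sample,
    # and sort by (chrom, pos, ref, alt) for a stable result.
    return sorted(v for v in set(tumor_calls) if v not in normal)


# Illustrative data only; these positions do not refer to real mutations.
tumor = [
    Variant("chr1", 1000, "A", "G"),
    Variant("chr2", 500, "C", "T"),   # also present in the normal sample
    Variant("chr7", 250, "G", "A"),
]
normal = [Variant("chr2", 500, "C", "T")]

somatic = somatic_candidates(tumor, normal)
for v in somatic:
    print(v.chrom, v.pos, f"{v.ref}>{v.alt}")
```

Variants shared by both samples are treated as inherited or as pre-existing somatic changes and are excluded; only the remainder are candidates for cancer-specific mutations.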
It has been estimated that discovery of all somatic mutations in a tumor would require 30-fold sequencing coverage of both the tumor genome and a matched normal tissue.[29] By comparison, the original draft of the human genome had approximately 65-fold coverage.[30] To facilitate further improvement in somatic mutation detection in cancer, the Sequencing Quality Control Phase 2 Consortium has established a pair of tumor-normal cell lines as community reference samples and data sets for benchmarking cancer mutation detection.[31]
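To make the notion of fold coverage concrete, mean coverage can be approximated as the total number of sequenced bases divided by the genome size. The sketch below applies that relationship to the 30-fold figure; the genome size (3.1 billion bases) and read length (150 bases) are illustrative assumptions, not values taken from the cited studies, and the function names are hypothetical.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate mean fold coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: int, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total bases reach the target coverage."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # integer ceiling division


# Assumed values for illustration only.
GENOME_SIZE = 3_100_000_000   # approximate human genome size in bases
READ_LENGTH = 150             # a typical short-read length

n = reads_needed(30, READ_LENGTH, GENOME_SIZE)
print(n)  # 620000000
print(mean_coverage(n, READ_LENGTH, GENOME_SIZE))
```

Because both the tumor and the matched normal sample are sequenced, the total read count for one patient is roughly double this figure under the same assumptions.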
A major goal of cancer genome sequencing is to identify driver mutations: genetic changes which increase the mutation rate in the cell, leading to more rapid tumor evolution and metastasis.[32] It is difficult to determine driver mutations from DNA sequence alone, but drivers tend to be the most commonly shared mutations amongst tumors, cluster around known oncogenes, and tend to be non-silent.[29] Passenger mutations, which are not important in the progression of the disease, are randomly distributed throughout the genome. It has been estimated that the average tumor carries approximately 80 somatic mutations, fewer than 15 of which are expected to be drivers.[33]
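Two of the heuristics mentioned above, recurrence across tumors and non-silent effect, can be combined in a simple ranking. The following is a rough sketch under stated assumptions, not an established driver-detection method: the gene names are placeholders, the effect labels and the `recurrent_genes` function are hypothetical, and real analyses also correct for gene length and background mutation rate.

```python
from collections import Counter
from typing import Dict, List, Tuple

Mutation = Tuple[str, str]  # (gene, effect), e.g. ("GENE_A", "missense")


def recurrent_genes(tumor_mutations: Dict[str, List[Mutation]],
                    min_tumors: int = 2) -> List[Tuple[str, int]]:
    """Rank genes by the number of tumors carrying a non-silent mutation."""
    counts: Counter = Counter()
    for mutations in tumor_mutations.values():
        # Count each gene at most once per tumor, ignoring silent changes.
        genes = {gene for gene, effect in mutations if effect != "silent"}
        counts.update(genes)
    ranked = [(gene, n) for gene, n in counts.items() if n >= min_tumors]
    # Most frequently mutated first; ties broken alphabetically.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Placeholder data for illustration only.
tumors = {
    "T1": [("GENE_A", "missense"), ("GENE_B", "silent"), ("GENE_C", "nonsense")],
    "T2": [("GENE_A", "missense"), ("GENE_C", "silent")],
    "T3": [("GENE_A", "frameshift"), ("GENE_C", "missense"), ("GENE_D", "missense")],
}

candidates = recurrent_genes(tumors)
print(candidates)  # [('GENE_A', 3), ('GENE_C', 2)]
```

Genes mutated in only one tumor, or only silently, fall below the threshold and are treated here as likely passengers.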
A personal-genomics analysis requires further functional characterization of the detected mutant genes, and the development of a basic model of the origin and progression of the tumor. This analysis can be used to make pharmacological treatment recommendations.[21][22] As of February 2012, this has only been done for patients in clinical trials designed to assess the personal genomics approach to cancer treatment.[22]
A large-scale screen for somatic mutations in breast and colorectal tumors showed that many low-frequency mutations each make a small contribution to cell survival.[33] If cell survival is determined by many mutations of small effect, it is unlikely that genome sequencing will uncover a single "Achilles heel" target for anti-cancer drugs. However, somatic mutations tend to cluster in a limited number of signalling pathways,[29][33][34] which are potential treatment targets.
Cancers are heterogeneous populations of cells. When sequence data is derived from a whole tumor, information about the differences in sequence and expression pattern between cells is lost.[35] This difficulty can be ameliorated by single-cell analysis.
Clinically significant properties of tumors, including drug resistance, are sometimes caused by large-scale rearrangements of the genome, rather than single mutations.[36] In this case, information about single nucleotide variants will be of limited utility.[35]
Cancer genome sequencing can be used to provide clinically relevant information in patients with rare or novel tumor types. Translating sequence information into a clinical treatment plan is highly complicated, requires experts in many different fields, and is not guaranteed to lead to an effective treatment plan.[21][22]
The incidentalome is the set of detected genomic variants not related to the cancer under study.[37] (The term is a play on the name incidentaloma, which designates tumors and growths detected on whole-body imaging by coincidence).[38] The detection of such variants may result in additional measures such as further testing or lifestyle management.[37]