American Statistical Association
Large-scale cancer genome projects have sequenced tens of thousands of tumor genomes. The question we pose in this article is how much more information we need to have a satisfactorily complete characterization of the mutational variants we expect to observe in future cancer patients. We approach this issue by focusing on two quantitative goals: estimating, for a new tumor, the probability of observing a variant that has never previously been observed; and estimating the total number of variants that have not yet been observed. We draw upon statistical methodology that has been developed in other fields of study, notably species estimation in ecology and word frequencies in computational linguistics. These methods are applied to the TCGA dataset encompassing whole-exome sequencing of 10,000 tumor genomes and validated on a clinical cohort of 10,000 tumors sequenced by a targeted cancer gene panel. We find that the predicted number of new variants in the coding regions of a gene is highly influenced by the proportion of singletons (variants seen only once previously) and by the skewness of the observed variant distribution. Genes with the mass of the variant distribution concentrated at singletons tend to have larger numbers of expected new variants, while genes with a longer right tail and variants concentrated at "hotspots" tend to have a lower number. We observe substantial variability in variant richness, even among genes with similar mutation rates. Our analyses also show that gene-specific variant frequencies can potentially be used to distinguish cancer genes from reference genes.
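As a rough illustration of the two goals named in the abstract, the sketch below uses two classical estimators from the species-estimation literature: the Good-Turing estimate of the unseen probability mass (singletons divided by total observations) and the bias-corrected Chao1 estimate of the number of unseen classes. The abstract does not say which estimators the speaker uses, so this is an assumption for illustration only. The function names and the per-gene variant counts are hypothetical.

```python
def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the probability that the next observation
    is a variant not seen so far: (number of singletons) / (total observations).

    `counts` holds, for each distinct observed variant, how many times it was seen.
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one observation")
    singletons = sum(1 for c in counts if c == 1)
    return singletons / total


def chao1_unseen_count(counts):
    """Bias-corrected Chao1 estimate of the number of variants not yet observed:
    f1 * (f1 - 1) / (2 * (f2 + 1)), where f1 and f2 are the numbers of variants
    seen exactly once and exactly twice. It is usually read as a lower bound.
    """
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical per-variant counts for two genes (not real data).
singleton_heavy_gene = [1, 1, 1, 1, 1, 1, 2, 2, 3]  # mass concentrated at singletons
hotspot_gene = [40, 1, 1, 2, 3]                     # one dominant "hotspot" variant

for name, counts in [("singleton-heavy", singleton_heavy_gene),
                     ("hotspot", hotspot_gene)]:
    print(name,
          round(good_turing_unseen_mass(counts), 3),
          chao1_unseen_count(counts))
```

With these made-up counts, the singleton-heavy gene gets both a larger unseen probability and a larger estimated number of unseen variants than the hotspot gene, which matches the qualitative pattern the abstract describes.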
Date: Wednesday, April 10, 2019
Time: 4:00 - 5:00 P.M.
Memorial Sloan Kettering Cancer Center
Department of Epidemiology and Biostatistics
485 Lexington Avenue
(Between 46th & 47th Streets)
2nd Floor, Conference Room B
New York, New York
**Outside visitors: please email firstname.lastname@example.org for building access. You must be on the security list to enter the floor.