Selected Research Papers

Cao L, Kong Y, Fan Y, Ni M, Tourancheau A, Ksiezarek M, Mead EA, Koo T, Gitman M, Zhang X-S & Fang G#, mEnrich-seq: Methylation-guided enrichment sequencing of bacterial taxa of interest from microbiome, Nature Methods, 2024 (link)

Kong Y, Mead EA & Fang G#, Navigating the pitfalls of mapping DNA and RNA modifications, Nature Reviews Genetics, 2023 (link)

Kong Y, Cao L, Deikus G, Fan Y, Mead EA, Lai W, Zhang Y, Yong R, Sebra R, Wang H, Zhang X-S & Fang G#, Critical assessment of DNA adenine methylation across eukaryotes using quantitative deconvolution, Science, 2022 (link)

Tourancheau A, Mead EA, Zhang X-S and Fang G#, Discovering multiple types of DNA methylation from individual bacteria and microbiome using nanopore sequencing, Nature Methods, 2021 (link)

Oliveira PH, Kim A, Sekulovic O, Garrett EM, Trzilova D, Mead EA, Pak T, Zhu S, Deikus G, ..., Patel G, Wallach F, Hamula C, Huprikar S, Roberts RJ, Schadt EE, Sebra R, van Bakel H, Kasarskis A, Tamayo R, Shen A# & Fang G#, Epigenomic characterization of Clostridioides difficile finds a conserved DNA methyltransferase that mediates sporulation and pathogenesis, Nature Microbiology, 2020 (link)

Beaulaurier J, Schadt EE & Fang G#, Deciphering bacterial epigenomes using modern sequencing technologies, Nature Reviews Genetics, 20, pages 157–172 (2019) (link)

Flaherty E*, Zhu S*, Barretto N, Cheng E, Deans MP, Fernando M, …, Fitzgerald M, Ladran I, Gochman P, Rapoport J, Tsankova N, McCullumsmith R, Hoffman GE, Sebra R, Fang G# & Brennand KJ#, Neuronal impact of patient-specific aberrant NRXN1α splicing, Nature Genetics, 2019 (link)

Beaulaurier J, Zhu S, Deikus G, Mogno I, Zhang X-S, Davis-Richardson A, Canepa R, Triplett EW, Faith JJ, Sebra R, Schadt EE & Fang G#, Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation, Nature Biotechnology, 2018 (link)

Fang G#*, Wang W*, Paunic V, Heydari H, Costanzo M, Liu X, Liu X, VanderSluis B, Oately B, Steinbach M, van Ness B, Schadt EE, Pankratz N, Boone C, Kumar V#, Myers C#, Discovering genetic interactions bridging pathways in genome-wide association studies, Nature Communications, (link)

Fang G*, Munera D*, Friedman DI, Mandlik A, Chao MC, Banerjee O, Feng Z, Losic B, Mahajan MC, Jabado OJ, Deikus G, et al., Genome-wide map of methylated adenine residues using single-molecule real-time sequencing in pathogenic Escherichia coli, Nature Biotechnology, 2012 (link)

Full list (since 2012)

mEnrich-seq: methylation-guided enrichment sequencing of bacterial taxa of interest from microbiome
Lei Cao, Yimeng Kong, Yu Fan, Mi Ni, Alan Tourancheau, Magdalena Ksiezarek, Edward A. Mead, Tonny Koo, Melissa Gitman, Xue-Song Zhang and Gang Fang#

Metagenomics has enabled the comprehensive study of microbiomes. However, many applications would benefit from a method that sequences specific bacterial taxa of interest, but not most background taxa. We developed mEnrich-seq (in which ‘m’ stands for methylation and seq for sequencing) for enriching taxa of interest from metagenomic DNA before sequencing. The core idea is to exploit the self versus nonself differentiation by natural bacterial DNA methylation and rationally choose methylation-sensitive restriction enzymes, individually or in combination, to deplete host and background taxa while enriching targeted taxa. This idea is integrated with library preparation procedures and applied in several applications to enrich (up to 117-fold) pathogenic or beneficial bacteria from human urine and fecal samples, including species that are hard to culture or of low abundance. We assessed 4,601 bacterial strains with mapped methylomes so far and showed broad applicability of mEnrich-seq. mEnrich-seq provides microbiome researchers with a versatile and cost-effective approach for selective sequencing of diverse taxa of interest. 

Nature Methods, 2024
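
The enzyme-selection idea described in the abstract above can be illustrated with a small sketch. This is only a toy under stated assumptions, not the published mEnrich-seq procedure: the enzyme names (`EnzA`, `EnzB`, `EnzC`), the taxon labels, and the motif sets are hypothetical, and each enzyme is assumed to cut its recognition site only when that site is unmethylated.

```python
def select_enzymes(enzymes, methylomes, target):
    """Pick enzymes that spare the target taxon and cut at least one background taxon.

    enzymes:    dict mapping a (hypothetical) enzyme name to its recognition motif
    methylomes: dict mapping a taxon name to the set of motifs it methylates
    target:     name of the taxon to enrich
    """
    chosen = []
    for name, site in enzymes.items():
        # The target is protected only if it methylates the enzyme's site.
        target_protected = site in methylomes[target]
        # Background taxa lacking that methylation are digested (depleted).
        depleted = sorted(
            taxon for taxon, motifs in methylomes.items()
            if taxon != target and site not in motifs
        )
        if target_protected and depleted:
            chosen.append((name, depleted))
    return chosen


# Hypothetical inputs for illustration only.
enzymes = {"EnzA": "GATC", "EnzB": "CCWGG", "EnzC": "GANTC"}
methylomes = {
    "target": {"GATC", "GANTC"},
    "bg1": {"CCWGG"},
    "bg2": {"GATC"},
}
selection = select_enzymes(enzymes, methylomes, "target")
```

In this toy input, combining the selected enzymes depletes both background taxa, which mirrors the "individually or in combination" wording of the abstract.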

Getty image: zhuweiyi49

Navigating the pitfalls of mapping DNA and RNA modifications
Yimeng Kong, Edward A. Mead, and Gang Fang#

Chemical modifications to nucleic acids occur across the kingdoms of life and carry important regulatory information. Reliable high-resolution mapping of these modifications is the foundation of functional and mechanistic studies, and recent methodological advances based on next-generation sequencing and long-read sequencing platforms are critical to achieving this aim. However, mapping technologies may have limitations that sometimes lead to inconsistent results. Some of these limitations are technical in nature and specific to certain types of technology. Here, however, we focus on common (yet not always widely recognized) pitfalls that are shared among frequently used mapping technologies and discuss strategies to help technology developers and users mitigate their effects. Although the emphasis is primarily on DNA modifications, RNA modifications are also discussed.

Nature Reviews Genetics, 2023

Critical assessment of DNA adenine methylation across eukaryotes using quantitative deconvolution
Yimeng Kong, Lei Cao, Gintaras Deikus, Yu Fan, Edward A. Mead, Weiyi Lai, Yizhou Zhang, Raymund Yong, Robert Sebra, Hailin Wang, Xue-Song Zhang, Gang Fang#

The discovery of N6-methyldeoxyadenine (6mA) across eukaryotes led to a search for additional epigenetic mechanisms. However, some studies have highlighted confounding factors that challenge the prevalence of 6mA in eukaryotes. We developed a metagenomic method to quantitatively deconvolve 6mA events from a genomic DNA sample into species of interest, genomic regions, and sources of contamination. Applying this method, we observed high-resolution 6mA deposition in two protozoa. We found that commensal or soil bacteria explained the vast majority of 6mA in insect and plant samples. We found no evidence of high abundance of 6mA in Drosophila, Arabidopsis, or humans. Plasmids used for genetic manipulation, even those from Dam methyltransferase mutant Escherichia coli, could carry abundant 6mA, confounding the evaluation of candidate 6mA methyltransferases and demethylases. On the basis of this work, we advocate for a reassessment of 6mA in eukaryotes.

Science, 10.1126/science.abe7489, 2022
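
To make the notion of deconvolving 6mA by source more concrete, the sketch below tallies 6mA calls per source of origin. It is a toy bookkeeping example, not the quantitative deconvolution method from the paper: it assumes each read has already been assigned to a source (for example the species of interest or a bacterial contaminant), and the function name, tuple layout, and numbers are hypothetical.

```python
from collections import defaultdict


def deconvolve_6ma(reads):
    """Summarize 6mA by source.

    reads: iterable of (source, adenines, methylated_adenines) tuples, where
           `source` is the label the read was assigned to (assumed given).
    Returns, per source, the 6mA/A level and the share of all 6mA calls.
    """
    totals = defaultdict(lambda: [0, 0])  # source -> [adenines, 6mA calls]
    for source, n_a, n_6ma in reads:
        totals[source][0] += n_a
        totals[source][1] += n_6ma

    all_6ma = sum(n_6ma for _, n_6ma in totals.values())
    report = {}
    for source, (n_a, n_6ma) in totals.items():
        report[source] = {
            "level": n_6ma / n_a if n_a else 0.0,
            "share": n_6ma / all_6ma if all_6ma else 0.0,
        }
    return report


# Hypothetical counts: most 6mA calls come from a low-abundance bacterial source.
example = deconvolve_6ma([
    ("fly", 1000, 1),
    ("fly", 1000, 0),
    ("bacteria", 100, 9),
])
```

With these made-up counts, the bacterial reads contribute most of the 6mA signal even though they are a small fraction of the sample, which is the kind of confounding the abstract describes.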

Discovering multiple types of DNA methylation from individual bacteria and microbiome using nanopore sequencing
Alan Tourancheau, Edward A. Mead, Xue-Song Zhang, Gang Fang#

Bacterial DNA methylation occurs at diverse sequence contexts and plays important functional roles in cellular defense and gene regulation. Existing methods for detecting DNA modification from nanopore sequencing data do not effectively support de novo study of unknown bacterial methylomes. In this work, we observed that nanopore sequencing signal displays complex heterogeneity across methylation events of the same type. To enable nanopore sequencing for broadly applicable methylation discovery, we generated a training dataset from an assortment of bacterial species and developed a method, named nanodisco, that couples the identification and fine mapping of the three forms of methylation into a multi-label classification framework. We applied it to individual bacteria and mouse gut microbiome for reliable methylation discovery. In addition, we demonstrated the use of DNA methylation for binning metagenomic contigs, associating mobile genetic elements with their host genomes, and identifying misassembled metagenomic contigs.

Nature Methods, 10.1038/s41592-021-01109-3 (2021)

Epigenomic characterization of Clostridioides difficile finds a conserved DNA methyltransferase that mediates sporulation and pathogenesis
Pedro H. Oliveira, John W. Ribis, Elizabeth M. Garrett, Dominika Trzilova, Alex Kim, Ognjen Sekulovic, Edward A. Mead, Theodore Pak, Shijia Zhu, Gintaras Deikus, Marie Touchon, Martha Lewis-Sandari, Colleen Beckford, Nathalie E. Zeitouni, Deena R. Altman, Elizabeth Webster, Irina Oussenko, Supinda Bunyavanich, Aneel K. Aggarwal, Ali Bashir, Gopi Patel, Frances Wallach, Camille Hamula, Shirish Huprikar, Eric E. Schadt, Robert Sebra, Harm van Bakel, Andrew Kasarskis, Rita Tamayo, Aimee Shen# & Gang Fang#

Clostridioides (formerly Clostridium) difficile is a leading cause of healthcare-associated infections. Although considerable progress has been made in the understanding of its genome, the epigenome of C. difficile and its functional impact has not been systematically explored. Here, we perform a comprehensive DNA methylome analysis of C. difficile using 36 human isolates and observe a high level of epigenomic diversity. We discovered an orphan DNA methyltransferase with a well-defined specificity, the corresponding gene of which is highly conserved across our dataset and in all of the approximately 300 global C. difficile genomes examined. Inactivation of the methyltransferase gene negatively impacts sporulation, a key step in C. difficile disease transmission, and these results are consistently supported by multiomics data, genetic experiments and a mouse colonization model. Further experimental and transcriptomic analyses suggest that epigenetic regulation is associated with cell length, biofilm formation and host colonization. These findings provide a unique epigenetic dimension to characterize medically relevant biological processes in this important pathogen. This study also provides a set of methods for comparative epigenomics and integrative analysis, which we expect to be broadly applicable to bacterial epigenomic studies.

Nature Microbiology, 5, pages 166–180 (2020)

Media coverage: GenomeWeb, Technology Networks, Medical News, PHYS, MEDPAGE Today, PacBio Blog

Conserved DNA Methyltransferases: A Window into Fundamental Mechanisms of Epigenetic Regulation in Bacteria
Pedro H. Oliveira# & Gang Fang#

An increasing number of studies have reported that bacterial DNA methylation has important functions beyond the roles in restriction-modification systems, including the ability of affecting clinically relevant phenotypes such as virulence, host colonization, sporulation, biofilm formation, among others. Although insightful, such studies have a largely ad hoc nature and would benefit from a systematic strategy enabling a joint functional characterization of bacterial methylomes by the microbiology community. In this opinion article, we propose that highly conserved DNA methyltransferases (MTases) represent a unique opportunity for bacterial epigenomic studies. These MTases are rather common in bacteria, span various taxonomic scales, and are present in multiple human pathogens. Apart from well-characterized core DNA MTases, like those from Vibrio cholerae, Salmonella enterica, Clostridioides difficile, or Streptococcus pyogenes, multiple highly conserved DNA MTases are also found in numerous human pathogens, including those belonging to the genera Burkholderia and Acinetobacter. We discuss why and how these MTases can be prioritized to enable a community-wide, integrative approach for functional epigenomic studies. Ultimately, we discuss how some highly conserved DNA MTases may emerge as promising targets for the development of novel epigenetic inhibitors for biomedical applications.

Trends in Microbiology, 29:1, pp 28-40 (2020)

Neuronal impact of patient-specific aberrant NRXN1α splicing
Erin Flaherty*, Shijia Zhu*, Natalie Barretto, Esther Cheng, P. J. Michael Deans, Michael B. Fernando, Nadine Schrode, Nancy Francoeur, Alesia Antoine, Khaled Alganem, Madeline Halpern, Gintaras Deikus, Hardik Shah, Megan Fitzgerald, Ian Ladran, Peter Gochman, Judith Rapoport, Nadejda M. Tsankova, Robert McCullumsmith, Gabriel E. Hoffman, Robert Sebra, Gang Fang# & Kristen Brennand#

NRXN1 undergoes extensive alternative splicing, and non-recurrent heterozygous deletions in NRXN1 are strongly associated with neuropsychiatric disorders. We establish that human induced pluripotent stem cell (hiPSC)-derived neurons well represent the diversity of NRXN1α alternative splicing observed in the human brain, cataloguing 123 high-confidence in-frame human NRXN1α isoforms. Patient-derived NRXN1+/− hiPSC-neurons show a greater than twofold reduction in half of the wild-type NRXN1α isoforms and express dozens of novel isoforms from the mutant allele. Reduced neuronal activity in patient-derived NRXN1+/− hiPSC-neurons is ameliorated by overexpression of individual control isoforms in a genotype-dependent manner, whereas individual mutant isoforms decrease neuronal activity levels in control hiPSC-neurons. In a genotype-dependent manner, the phenotypic impact of patient-specific NRXN1+/− mutations can occur through a reduction in wild-type NRXN1α isoform levels as well as the presence of mutant NRXN1α isoforms.

Nature Genetics, 51, pages 1679–1690 (2019)

Media coverage: ScienceDaily, Medical News, Medical Express, PacBio Blog.

Deciphering bacterial epigenomes using modern sequencing technologies
John Beaulaurier, Eric Schadt & Gang Fang#

Prokaryotic DNA contains three types of methylation: N6-methyladenine, N4-methylcytosine and 5-methylcytosine. The lack of tools to analyse the frequency and distribution of methylated residues in bacterial genomes has prevented a full understanding of their functions. Now, advances in DNA sequencing technology, including single-molecule, real-time sequencing and nanopore-based sequencing, have provided new opportunities for systematic detection of all three forms of methylated DNA at a genome-wide scale and offer unprecedented opportunities for achieving a more complete understanding of bacterial epigenomes. Indeed, as the number of mapped bacterial methylomes approaches 2,000, increasing evidence supports roles for methylation in regulation of gene expression, virulence and pathogen–host interactions.

Nature Reviews Genetics, 2019

Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation
John Beaulaurier, Shijia Zhu, Gintaras Deikus, Ilaria Mogno, Xue-Song Zhang, Austin Davis-Richardson, Ronald Canepa, Eric Triplett, Jeremiah Faith, Robert Sebra, Eric Schadt & Gang Fang#

Shotgun metagenomics methods enable characterization of microbial communities in human microbiome and environmental samples. Assembly of metagenome sequences does not output whole genomes, so computational binning methods have been developed to cluster sequences into genome ‘bins’. These methods exploit sequence composition, species abundance, or chromosome organization but cannot fully distinguish closely related species and strains. We present a binning method that incorporates bacterial DNA methylation signatures, which are detected using single-molecule real-time sequencing. Our method takes advantage of these endogenous epigenetic barcodes to resolve individual reads and assembled contigs into species- and strain-level bins. We validated our method using synthetic and real microbiome sequences. In addition to genome binning, we show that our method links plasmids and other mobile genetic elements to their host species in a real microbiome sample. Incorporation of DNA methylation information into shotgun metagenomics analyses will complement existing methods to enable more accurate sequence binning. 

Nature Biotechnology, 10.1038/nbt.4037, 2018

Highlighted in Nature Methods (link)

Media coverage: GEN News, PacBio, GenomeWeb, MD Magazine, BioITWorld, Science Daily, Infection Control Today, PHYS.
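
The idea of using methylation signatures as epigenetic barcodes, as in the binning paper above, can be sketched as follows. This is a simplified toy, not the published binning method: it assumes per-site methylation calls (1 or 0) are already available for each contig, summarizes each contig by its methylated fraction per motif, and links a plasmid to the host with the closest profile by Euclidean distance. The function names, motifs, and calls are hypothetical.

```python
from math import dist


def methylation_profile(site_calls, motifs):
    """Fraction of sites called methylated for each motif, in a fixed motif order.

    site_calls: dict mapping a motif to a list of 0/1 calls on one contig
    motifs:     ordered list of motifs defining the profile dimensions
    """
    profile = []
    for motif in motifs:
        calls = site_calls.get(motif, [])
        profile.append(sum(calls) / len(calls) if calls else 0.0)
    return profile


def nearest_host(plasmid_profile, host_profiles):
    """Return the host whose methylation profile is closest to the plasmid's."""
    return min(host_profiles, key=lambda host: dist(plasmid_profile, host_profiles[host]))


# Hypothetical calls for two host genomes and one plasmid.
motifs = ["GATC", "CCWGG"]
hosts = {
    "host_A": methylation_profile({"GATC": [1, 1, 1, 0], "CCWGG": [0, 0]}, motifs),
    "host_B": methylation_profile({"GATC": [0, 0], "CCWGG": [1, 1]}, motifs),
}
plasmid = methylation_profile({"GATC": [1, 1], "CCWGG": [0]}, motifs)
assigned = nearest_host(plasmid, hosts)
```

A real analysis would work from sequencing kinetics and many more motifs; the point here is only that contigs sharing a methylation profile can be grouped even when sequence composition and abundance are uninformative.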

Mapping and characterizing N6-methyladenine in eukaryotic genomes using single molecule real-time sequencing
Shijia Zhu, John Beaulaurier, Gintaras Deikus, Tao Wu, Maya Strahl, Ziyang Hao, Guanzheng Luo, James A Gregory, Andrew Chess, Chuan He, Andrew Xiao, Robert Sebra, Eric E Schadt, and Gang Fang#

N6-methyladenine (m6dA) has been discovered as a novel form of DNA methylation prevalent in eukaryotes; however, methods for high-resolution mapping of m6dA events are still lacking. Single-molecule real-time (SMRT) sequencing has enabled the detection of m6dA events at single-nucleotide resolution in prokaryotic genomes, but its application to detecting m6dA in eukaryotic genomes has not been rigorously examined. Herein, we identified unique characteristics of eukaryotic m6dA methylomes that fundamentally differ from those of prokaryotes. Based on these differences, we describe the first approach for mapping m6dA events using SMRT sequencing specifically designed for the study of eukaryotic genomes, and provide appropriate strategies for designing experiments and carrying out sequencing in future studies. We apply the novel approach to study two eukaryotic genomes. For green algae, we construct the first complete genome-wide map of m6dA at single nucleotide and single molecule resolution. For human lymphoblastoid cells (hLCLs), it was necessary to integrate SMRT sequencing data with independent sequencing data. The joint analyses suggest putative m6dA events are enriched in the promoters of young full-length LINE-1 elements (L1s), but call for validation by additional methods. These analyses demonstrate a general method for rigorous mapping and characterization of m6dA events in eukaryotic genomes.

Genome Research, doi: 10.1101/gr.231068.117, 2018    

MatrixEpistasis: ultrafast, exhaustive epistasis scan for quantitative traits with covariate adjustment
Shijia Zhu# and Gang Fang#

For many traits, causal loci uncovered by genetic mapping studies explain only a minority of the heritable contribution to trait variation. Multiple explanations for this ‘missing heritability’ have been proposed. Single nucleotide polymorphism (SNP)–SNP interaction (epistasis), as one of the compelling models, has been widely studied. However, the genome-wide scan of epistasis, especially for quantitative traits, poses huge computational challenges. Moreover, covariate adjustment is largely ignored in epistasis analysis due to the massive extra computational undertaking. In the current study, we found striking differences among epistasis models using both simulation data and real biological data, suggesting that not only can covariate adjustment remove confounding bias, it can also improve power. Furthermore, we derived mathematical formulas, which enable the exhaustive epistasis scan together with full covariate adjustment to be expressed in terms of large matrix operation, therefore substantially improving the computational efficiency (∼10^4× faster than existing methods). We call the new method MatrixEpistasis. With MatrixEpistasis, we re-analyze a large real yeast dataset comprising 11 623 SNPs, 1008 segregants and 46 quantitative traits with covariates fully adjusted and detect thousands of novel putative epistasis with P-values < 1.48e-10.

Bioinformatics, Volume 34, Issue 14, Pages 2341–2348, 2018
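
One way to see how an exhaustive scan with covariate adjustment can reduce to matrix operations is the standard Frisch-Waugh-Lovell projection argument, sketched below. This is a simplified illustration and not the derivation from the MatrixEpistasis paper: it assumes a model with covariates plus a single interaction term per SNP pair and leaves out the SNP main effects. All symbols (C, Q, M, Z, q) are notation introduced here for the sketch.

```latex
% Assumed notation (not from the paper):
%   y      : n-vector of trait values
%   C      : n x q covariate matrix (including the intercept), full column rank
%   z_{ij} : element-wise product x_i \circ x_j of the genotype vectors of SNPs i and j
\begin{align}
  C &= QR, \qquad M = I_n - QQ^{\top} \\
  r_{ij} &= \frac{z_{ij}^{\top} M y}
                 {\sqrt{z_{ij}^{\top} M z_{ij}}\;\sqrt{y^{\top} M y}} \\
  t_{ij} &= r_{ij}\,\sqrt{\frac{n - q - 1}{1 - r_{ij}^{2}}}
\end{align}
% Stacking all z_{ij} as the columns of Z, every numerator comes from one product:
%   Z^{\top} M y = Z^{\top} y - (Q^{\top} Z)^{\top} (Q^{\top} y).
```

Here r_ij is the partial correlation between the interaction term and the trait given the covariates, and t_ij is the corresponding t statistic with n - q - 1 degrees of freedom. Because the covariates enter only through Q, they are factored once and reused for every SNP pair, which is why the scan can be written as a few large matrix products.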

DNA methylation on N6-adenine in mammalian embryonic stem cells
Tao P. Wu, Tao Wang, Matthew G. Seetin, Yongquan Lai, Shijia Zhu, Kaixuan Lin, Yifei Liu, Stephanie D. Byrum, Samuel G. Mackintosh, Mei Zhong, Alan Tackett, Guilin Wang, Lawrence S. Hon, Gang Fang, James Swenberg & Andrew Xiao

It has been widely accepted that 5-methylcytosine is the only form of DNA methylation in mammalian genomes. Here we identify N6-methyladenine as another form of DNA modification in mouse embryonic stem cells. Alkbh1 encodes a demethylase for N6-methyladenine. An increase of N6-methyladenine levels in Alkbh1-deficient cells leads to transcriptional silencing. N6-methyladenine deposition is inversely correlated with the evolutionary age of LINE-1 transposons; its deposition is strongly enriched at young (<1.5 million years old) but not old (>6 million years old) L1 elements. The deposition of N6-methyladenine correlates with epigenetic silencing of such LINE-1 transposons, together with their neighbouring enhancers and genes, thereby resisting the gene activation signals during embryonic stem cell differentiation. As young full-length LINE-1 transposons are strongly enriched on the X chromosome, genes located on the X chromosome are also silenced. Thus, N6-methyladenine developed a new role in epigenetic silencing in mammalian evolution distinct from its role in gene activation in other organisms. Our results demonstrate that N6-methyladenine constitutes a crucial component of the epigenetic regulation repertoire in mammalian genomes. 

Nature, 10.1038/nature17640, 2016  

Dysregulation of miRNA-9 in a Subset of Schizophrenia Patient-Derived Neural Progenitor Cells
Aaron Topol*, Shijia Zhu*, Brigham J. Hartley, Jane English, Mads E. Hauberg, Ngoc Tran, Chelsea Ann Rittenhouse, Anthony Simone, Douglas M. Ruderfer, Jessica Johnson, Ben Readhead, Yoav Hadas, Peter A. Gochman, Ying-Chih Wang, Hardik Shah, Gerard Cagney, Judith Rapoport, Fred H. Gage, Joel T. Dudley, Pamela Sklar, Manuel Mattheisen, David Cotter, Gang Fang# & Kristen J. Brennand#

Converging evidence indicates that microRNAs (miRNAs) may contribute to disease risk for schizophrenia (SZ). We show that microRNA-9 (miR-9) is abundantly expressed in control neural progenitor cells (NPCs) but also significantly downregulated in a subset of SZ NPCs. We observed a strong correlation between miR-9 expression and miR-9 regulatory activity in NPCs as well as between miR-9 levels/activity, neural migration, and diagnosis. Overexpression of miR-9 was sufficient to ameliorate a previously reported neural migration deficit in SZ NPCs, whereas knockdown partially phenocopied aberrant migration in control NPCs. Unexpectedly, proteomic- and RNA sequencing (RNA-seq)-based analysis revealed that these effects were mediated primarily by small changes in expression of indirect miR-9 targets rather than large changes in direct miR-9 targets; these indirect targets are enriched for migration-associated genes. Together, these data indicate that aberrant levels and activity of miR-9 may be one of the many factors that contribute to SZ risk, at least in a subset of patients. 

Cell Reports, 2016

(*co-first author; #co-corresponding authors)

Single molecule-level detection and long read-based phasing of epigenetic variations in bacterial methylomes
John Beaulaurier, Xue-Song Zhang, Shijia Zhu, Robert Sebra, Chaggai Rosenbluh, Gintaras Deikus, Nan Shen, Diana Munera, Matthew K Waldor, Martin J Blaser, Andrew Chess, Eric E Schadt#, Gang Fang#

Beyond its role in host defense, bacterial DNA methylation also plays important roles in the regulation of gene expression, virulence and antibiotic resistance. Bacterial cells in a clonal population can generate epigenetic heterogeneity to increase population-level phenotypic plasticity. Single molecule, real-time (SMRT) sequencing enables the detection of N6-methyladenine and N4-methylcytosine, two major types of DNA modifications comprising the bacterial methylome. However, existing SMRT sequencing-based methods for studying bacterial methylomes rely on a population-level consensus that lacks the single-cell resolution required to observe epigenetic heterogeneity. Here, we present SMALR (single-molecule modification analysis of long reads), a novel framework for single molecule-level detection and phasing of DNA methylation. Using seven bacterial strains, we show that SMALR yields significantly improved resolution and reveals distinct types of epigenetic heterogeneity. SMALR is a powerful new tool that enables de novo detection of epigenetic heterogeneity and empowers investigation of its functions in bacterial populations. 

Nature Communications, doi:10.1038/ncomms8438, 2015

Press Release;
Media coverage: GEN, GenomeWeb, PHYS, Infection Control, ScienceDaily among others.

A Cytosine Methyltransferase Modulates the Cell Envelope Stress Response in the Cholera Pathogen
Michael C. Chao, Shijia Zhu, Satoshi Kimura, Brigid M. Davis, Eric E. Schadt, Gang Fang#, Matthew K. Waldor#

Methylation of DNA is used by numerous organisms to regulate a wide variety of cellular processes, but specific roles for most DNA methyltransferases have not been defined. We studied one such enzyme in Vibrio cholerae, the cholera pathogen, using genome-wide approaches to compare DNA methylation, gene expression, and the sets of genes required or dispensable for growth in bacterial strains that produced or lacked this enzyme. These studies allowed us to identify numerous cellular processes regulated, either directly or indirectly, by this cytosine methyltransferase. In particular, we found that an absence of enzyme activity was associated with reduced levels of a bacterial stress response; consequently, a stress response pathway that is essential in wild type bacteria is not needed for survival of the mutant lacking the methyltransferase. Similar genome-wide analyses can likely be used to define the cellular roles of many additional uncharacterized DNA methyltransferases.

PLoS Genetics, doi:10.1371/journal.pgen.1005666, 2015

(#co-corresponding authors)

Autotransporters but not pAA are critical for rabbit colonization by Shiga toxin-producing Escherichia coli O104:H4
Diana Munera, Jennifer M. Ritchie, Stavroula K. Hatzios, Rod Bronson, Gang Fang, Eric E. Schadt, Brigid M. Davis & Matthew K. Waldor

The outbreak of diarrhoea and haemolytic uraemic syndrome that occurred in Germany in 2011 was caused by a Shiga toxin-producing enteroaggregative Escherichia coli (EAEC) strain. The strain was classified as EAEC owing to the presence of a plasmid (pAA) that mediates a characteristic pattern of aggregative adherence on cultured cells, the defining feature of EAEC that has classically been associated with virulence. Here we describe an infant rabbit-based model of intestinal colonization and diarrhoea caused by the outbreak strain, which we use to decipher the factors that mediate the pathogen's virulence. Shiga toxin is the key factor required for diarrhoea. Unexpectedly, we observe that pAA is dispensable for intestinal colonization and development of intestinal pathology. Instead, chromosome-encoded autotransporters are critical for robust colonization and diarrhoeal disease in this model. Our findings suggest that conventional wisdom linking aggregative adherence to EAEC intestinal colonization is false for at least a subset of strains.

Nature Communications, doi:10.1038/ncomms4080, 2014

Altered WNT Signaling in Human Induced Pluripotent Stem Cell Neural Progenitor Cells Derived from Four Schizophrenia Patients
Aaron Topol, Shijia Zhu, Ngoc Tran, Anthony Simone, Gang Fang, Kristen J. Brennand

Schizophrenia (SZ) is a devastating psychiatric disorder hypothesized to be a neurodevelopmental condition arising as a consequence of dysregulation of brain development. WNT signaling is important for neural patterning, proliferation and migration, and synapse formation; converging postmortem, rodent, and pharmacologic evidence suggests that WNT signaling may contribute to SZ. We used human induced pluripotent stem cell (hiPSC) derived forebrain patterned neural progenitor cells (NPCs) to investigate canonical WNT activity in a pilot cohort of four patients with SZ. Future studies comprising larger patient cohorts are necessary to determine whether aberrant canonical WNT signaling is a causal molecular factor contributing to aberrant neural patterning and neuronal maturation in SZ or simply a noncell autonomous consequence of increased oxidative stress.

Biological Psychiatry, doi: 10.1016/j.biopsych, 2015

Phenotypic differences in hiPSC NPCs derived from patients with schizophrenia
Kristen Brennand, Jeffrey Savas, Yongsung Kim, Ngoc Tran, Anthony Simone, Kazue Hashimoto-Torii, Kristin Beaumont, Hyung Joon Kim, Aaron Topol, Ian Ladran, Mohammed Abdelrahim, Bridget Matikainen-Ankney, Shih-hui Chao, Milan Mrksich, Pasko Rakic, Gang Fang, Bin Zhang, John Yates III, Fred H. Gage

Consistent with recent reports indicating that neurons differentiated in vitro from human-induced pluripotent stem cells (hiPSCs) are immature relative to those in the human brain, gene expression comparisons of our hiPSC-derived neurons to the Allen BrainSpan Atlas indicate that they most resemble fetal brain tissue. This finding suggests that, rather than modeling the late features of schizophrenia (SZ), hiPSC-based models may be better suited for the study of disease predisposition. We now report that a significant fraction of the gene signature of SZ hiPSC-derived neurons is conserved in SZ hiPSC neural progenitor cells (NPCs). We used two independent discovery-based approaches—microarray gene expression and stable isotope labeling by amino acids in cell culture (SILAC) quantitative proteomic mass spectrometry analyses—to identify cellular phenotypes in SZ hiPSC NPCs from four SZ patients. From our findings that SZ hiPSC NPCs show abnormal gene expression and protein levels related to cytoskeletal remodeling and oxidative stress, we predicted, and subsequently observed, aberrant migration and increased oxidative stress in SZ hiPSC NPCs. These reproducible NPC phenotypes were identified through scalable assays that can be applied to expanded cohorts of SZ patients, making them a potentially valuable tool with which to study the developmental mechanisms contributing to SZ.

Molecular Psychiatry, doi: 10.1038/mp.2014.22, 2014

Modeling Kinetic Rate Variation in Third Generation DNA Sequencing Data to Detect Putative Modifications to DNA Bases
Eric E. Schadt*, Onureena Banerjee*, Gang Fang*, Zhixing Feng, Wing H. Wong, Xuegong Zhang, Andrey Kislyuk, Tyson A. Clark, Khai Luong, Vipin Kumar, Alice Chen-Plotkin, Neal Sondheimer, Jonas Korlach, Andrew Kasarskis.

While significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently single molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date, no statistical framework has been proposed to enhance the power to detect these events while also controlling for false positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events while others represent putative chemically modified sites of unknown types.

Genome Research, doi:10.1101/gr.136739.111, 2012 (*co-first authors)

Comprehensive methylome characterization of Mycoplasma genitalium and Mycoplasma pneumoniae, at single-base resolution
Maria Lluch Senar, Khai Luong, Veronica Llorens, Javi Delgado, Gang Fang, Kristi Spittle, Tyson Clark, Eric Schadt, Steve Turner, Jonas Korlach, Luis Serrano

We define the methylome of two closely related bacteria, M. genitalium and M. pneumoniae, by single-molecule real-time (SMRT) DNA sequencing. In M. pneumoniae we found two previously unknown N6-methyladenine methyltransferase specificities, one of which is also found in M. genitalium. The common methyltransferase is a Dam-like methylase, and was attributed to its corresponding gene using cloned plasmids in a methyltransferase-free E. coli strain, while the second methylase is of type I and uniquely present in M. pneumoniae. Analysis of the distribution of methylation sites across the genome of M. pneumoniae at exponential and stationary growth suggests a potential role for methylation in regulating the cell cycle as well as in gene regulation.

PLoS Genetics, 9(1): e1003191. doi:10.1371/journal.pgen.1003191, 2013

Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic
Zhixing Feng, Gang Fang, Jonas Korlach, Tyson Clark, Khai Luong, Xuegong Zhang, Wing Wong, and Eric Schadt

DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single-molecule real-time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction, allowing for the possibility of estimating the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model could greatly increase DNA modification detection accuracy and reduce the requirement for control data coverage. For some DNA modifications that have a strong signal, a control sample is not even needed when historical data are used as an alternative to the control. Thus, sequencing cost can be greatly reduced by using the model.
PLoS Computational Biology, 9(3): e1002935. doi:10.1371/journal.pcbi.1002935, 2013

High-order SNP Combinations Associated with Complex Diseases: Efficient Discovery, Statistical Power and Functional Interactions
Gang Fang*, Majda Haznadar, Wen Wang, Haoyu Yu, Michael Steinbach, Tim Church, William Oetting, Brian Van Ness and Vipin Kumar*.

There has been increased interest in discovering combinations of single-nucleotide polymorphisms (SNPs) that are strongly associated with a phenotype even if each SNP has little individual effect. Efficient approaches have been proposed for searching two-locus combinations from genome-wide datasets. However, for high-order combinations, existing methods either adopt a brute-force search which only handles a small number of SNPs (up to a few hundred), or use heuristic search that may miss informative combinations. In addition, existing approaches lack statistical power because of the use of statistics with high degrees-of-freedom and the huge number of hypotheses tested during combinatorial search. We designed an efficient and effective framework for high-order combinations in case-control datasets. The substantially improved efficiency and scalability demonstrated on synthetic and real datasets with several thousands of SNPs allows the study of several important mathematical and statistical properties of SNP combinations with order as high as eleven. We further explore functional interactions in high-order combinations and reveal a general connection between the increase in discriminative power of a combination over its subsets and the functional coherence among the genes comprising the combination, supported by multiple datasets. Finally, we study several significant high-order combinations discovered from a lung-cancer dataset and a kidney-transplant-rejection dataset in detail to provide novel insights on the complex diseases. Interestingly, many of these associations involve combinations of common variations that occur in small fractions of the population. Thus, our approach is an alternative methodology for exploring the genetics of rare diseases for which the current focus is on individually rare variations.
PLoS ONE, 7(4): e33531. doi:10.1371/journal.pone.0033531, 2012 (*co-corresponding authors) (software)

Mining Low-support Discriminative Patterns from Dense and High-dimensional Data
Gang Fang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach and Vipin Kumar.

Discriminative patterns can provide valuable insights into data sets with class labels, that may not be available from the individual features or the predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional data sets. However, for dense and high-dimensional data sets, they have to use high thresholds to produce the complete results within limited time, and thus, may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such data sets. We propose a family of antimonotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional data sets. Experiments on both synthetic data sets and a cancer gene expression data set demonstrate that there are low-support patterns that can be discovered using SupMaxPair but not by existing approaches. Furthermore, we show that the low-support discriminative patterns that are only discovered using SupMaxPair from the cancer gene expression data set are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery.

IEEE Transactions on Knowledge and Data Engineering, vol 24(2), p 279-294, 2012 (software)

Discovering genetic interactions bridging pathways in genome-wide association studies
Gang Fang#*, Wen Wang*, Vanja Paunic, Hamed Heydari, Michael Costanzo, Xiaoye Liu, Xiaotong Liu, Benjamin VanderSluis, Benjamin Oately, Michael Steinbach, Brian Van Ness, Eric E. Schadt, Nathan D. Pankratz, Charles Boone, Vipin Kumar# & Chad L. Myers#

Genetic interactions have been reported to underlie phenotypes in a variety of systems, but the extent to which they contribute to complex disease in humans remains unclear. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions, but existing methods for identifying them from GWAS data tend to focus on testing individual locus pairs, which undermines statistical power. Importantly, a global genetic network mapped for a model eukaryotic organism revealed that genetic interactions often connect genes between compensatory functional modules in a highly coherent manner. Taking advantage of this expected structure, we developed a computational approach called BridGE that identifies pathways connected by genetic interactions from GWAS data. Applying BridGE broadly, we discover significant interactions in Parkinson’s disease, schizophrenia, hypertension, prostate cancer, breast cancer, and type 2 diabetes. Our novel approach provides a general framework for mapping complex genetic networks underlying human disease from genome-wide genotype data.

Nature Communications, 10, Article number: 4274, 2019

Genome-wide map of methylated adenine residues using single-molecule real-time sequencing in pathogenic Escherichia coli
Gang Fang, Diana Munera, David I. Friedman, Anjali Mandlik, Michael C. Chao, Onureena Banerjee, Zhixing Feng, Bojan Losic, Milind C. Mahajan, Omar J. Jabado, Gintaras Deikus, Tyson A. Clark, Khai Luong, Iain A. Murray, Brigid M. Davis, Alona Keren-Paz, Andrew Chess, Richard J. Roberts, Jonas Korlach, Steve W. Turner, Vipin Kumar, Matthew K. Waldor, Eric E. Schadt

Single-molecule real-time (SMRT) DNA sequencing allows the systematic detection of chemical modifications such as methylation but has not previously been applied on a genome-wide scale. We used this approach to detect 49,311 putative 6-methyladenine (m6A) residues and 1,407 putative 5-methylcytosine (m5C) residues in the genome of a pathogenic Escherichia coli strain. We obtained strand-specific information for methylation sites and a quantitative assessment of the frequency of methylation at each modified position. We deduced the sequence motifs recognized by the methyltransferase enzymes present in this strain without prior knowledge of their specificity. Furthermore, we found that deletion of a phage-encoded methyltransferase-endonuclease (restriction-modification; RM) system induced global transcriptional changes and led to gene amplification, suggesting that the role of RM systems extends beyond protecting host genomes from foreign DNA. 

Nature Biotechnology, doi:10.1038/nbt.2432, 2012

Also highlighted in Nature Reviews Genetics and Nature Reviews Microbiology. Media coverage includes: Bio-IT World, The Scientist, PHYS.