We develop and support open-source, "big data"–enabled AI analysis tools designed to be accessible to biology researchers without extensive computational expertise. Our group is fully committed to FAIR compliance in all of our software development. We engage with the national community to develop, support, and disseminate novel technologies of benefit to all. All code is available here: https://github.com/EpiGenomicsCode
We build scalable, reproducible, and FAIR-aligned platforms to support domain-specific computation and analysis. We recently developed a high-throughput containerized solution for large-scale protein structure modeling and design. These containers are architected to make efficient use of available GPU and CPU resources. Our focus on interoperability and accessibility is echoed in our ScriptManager platform, an interactive genomics analysis environment that abstracts away underlying compute complexity and fosters reproducible FAIR workflows for users of varied computational literacy. To address the reproducibility crisis in genomics, we developed GenoPipe to perform genotype-of-origin detection from (epi)genomic datasets. Focusing on methodological robustness, we also developed the Adversarial Robustness and Explainability framework to integrate explainable, adversarially resistant validation checks into machine learning pipelines. This helps promote trustworthy, reproducible model evaluation that aligns with FAIR principles by documenting provenance and interpretability features.
Protein structure prediction and design for high-throughput computing.
Mathew VS, Kellogg GD, Lai WKM.
bioRxiv 2025, https://doi.org/10.1101/2025.07.18.665594
Adversarial Robustness and Explainability of Machine Learning Models.
Gafur J, Goddard S, Lai WKM.
Practice and Experience in Advanced Research Computing 2024, https://doi.org/10.1145/3626203.3670522
GenoPipe: identifying the genotype of origin within (epi)genomic datasets.
Lang O, Srivastava D, Pugh BF, Lai WKM.
Nucleic Acids Research 2023, 51 (22), 12054-12068. PMID: 37933851; PMCID: PMC10711449.
ScriptManager: an interactive platform for reducing barriers to genomics analysis.
Lang O, Pugh BF, Lai WKM.
Practice and Experience in Advanced Research Computing 2022, https://doi.org/10.1145/3491418.3535161
ArchTEx: accurate extraction and visualization of next-generation sequence data.
Lai WKM, Bard JE, Buck MJ.
Bioinformatics 2012, 1;28(7):1021-3. doi: 10.1093/bioinformatics/bts063. PMID: 22302569.
Current 'Omic projects generate hundreds of terabytes of raw data and associated metadata. Properly tracking, analyzing, and visualizing all of that data requires extensive management. We developed the STENCIL platform to curate a variety of Galaxy workflows into an integrated space while providing interactive tools for further analysis. This platform was crucial to understanding and interpreting the data generated, and it provided a mechanism for biochemists with minimal genomic training to understand the results. Additionally, in response to the NIH’s continued support for enhanced experimental rigor and reproducibility, we also use and develop the PEGR platform to track the metadata associated with each experiment as it is performed. PEGR tracks all aspects of an experiment (e.g., enzyme catalog number, user ID) in real time over the course of an experiment and embeds that information into a searchable interface that links directly to the downstream sequencing results and the Galaxy platform, connecting reproducible bioinformatics all the way back to the original experimental design.
AlphaFold accessibility: an optimized open-source OOD app for Protein Structure Prediction.
Mathew VS, Hansen M, Lai WKM.
GOOD25 Conference - Harvard
PEGR: a flexible management platform for reproducible epigenomic and genomic research.
Shao D, Kellogg G, Nematbakhsh A, Kuntala PK, Mahony S, Pugh BF, Lai WKM.
Genome Biology 2022, 19;23(1):99. PMID: 35440038; PMCID: PMC9016988.
STENCIL: A web templating engine for visualizing and sharing life science datasets.
Sun Q, Nematbakhsh A, Kuntala PK, Kellogg G, Pugh BF, Lai WKM.
PLoS Computational Biology 2022, 9;18(2):e1009859. doi: 10.1371/journal.pcbi.1009859. PMID: 35139076; PMCID: PMC8863220.