
The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has fundamentally changed the world, yet its ultimate impact is unknown. The virus evolves in response to both host immune systems and intervention strategies. To diminish both the short-term and long-term impacts of COVID-19, we are developing robust, repeatable, and accessible tools to integrate and analyze the diverse data becoming available during the pandemic. We apply these techniques to study genetic variability in SARS-CoV-2 and to identify associations with clinical variables, including health outcomes. We also investigate combined metabolomics and proteomics data to characterize how these profiles change across the spectrum of COVID-19 severity.

Here we present our outcomes as resources for the research community, including ongoing projects, developed software, papers (coming soon), and team information. Please feel free to contact us if you need help using these tools, or if you would like to discuss a collaboration or arrange a meeting.


SARS-CoV-2 genome:

  •  Using a combination of phylogenetic approaches and novel, unsupervised machine learning methods, we have identified heretofore unrecognized clusters of SARS-CoV-2 genomes. We have additionally highlighted genomic features with correlated variance, elucidating the function and interconnectedness of SARS-CoV-2 coding regions. These methods and findings are useful for characterizing the SARS-CoV-2 genome's evolution patterns, monitoring disease epidemiology, and providing vital information for guiding vaccine development.

Phylogram depicting the evolutionary relationships between 2,000 SARS-CoV-2 genomes. The phylogram is annotated with colored rings that show each sample’s region of origin, time of collection, phylogenetic categorization, and omeClust grouping.
  • Furthermore, we are in the process of exploring the role of 3D protein structure in the context of disease etiology. One region in the virus's RNA, for example, codes for the three-dimensional proteins around the virus envelope that allow it to bind to receptors in the human body and begin replicating. But humans, like the virus, have different genetic backgrounds, which may lead to different health outcomes. An integrated approach that explores viral genome variation in the context of host genome variation is under development.

Heatmap depicting nucleotide distance between ORF1AB gene regions across SARS-CoV-2 genomes.
Aligned nucleotide sequence of the SARS-CoV-2 envelope coding region
Predicted structure of the S protein in the hCov ancestor
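
As a rough illustration of the nucleotide distance shown in the heatmap above, the sketch below computes a pairwise p-distance (the fraction of compared sites that differ) between aligned sequences. This is a minimal sketch, not the exact metric or pipeline used in this project: the sequence names and toy alignment are hypothetical, and skipping gap and ambiguous positions is an assumption.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, skip="-N"):
    """Fraction of compared sites that differ between two aligned sequences.

    Positions where either sequence has a gap or ambiguous base are skipped
    (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(aligned):
    """Symmetric nested dict of pairwise p-distances, keyed by sequence name."""
    names = list(aligned)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = p_distance(aligned[a], aligned[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return names, matrix


# Hypothetical toy alignment, for illustration only.
toy_alignment = {
    "genome_A": "ATGGCTACGT",
    "genome_B": "ATGGCTACGA",
    "genome_C": "ATGACTTCG-",
}
names, dist = distance_matrix(toy_alignment)
for a, b in combinations(names, 2):
    print(a, b, round(dist[a][b], 3))
```

A matrix of this shape is the kind of input that distance-based clustering and heatmap plotting typically expect.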

Metabolomics and Proteomics of COVID-19:

  •  This study aims to identify and prioritize the important features in a dataset containing both proteomic and metabolomic measurements. A deep neural network was trained on these features to predict patient labels, and the trained model was then used to derive a weight for each feature. These methods provide insights into the role that various biological features play in disease progression and outcome.

  • Furthermore, the platform we developed provides biomarker discovery techniques that characterize the virus’s effects on the human body and help explain the observed diversity of health outcomes with COVID-19.
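
One common, model-agnostic way to rank features with a trained predictor is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below illustrates that idea only; it is not the weighting procedure used in this study. The `toy_predict` function is a hypothetical stand-in for a trained network, and the feature names and values are invented for illustration.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in for a trained model: uses only the first feature."""
    return 1 if row[0] > 0.5 else 0


# Hypothetical feature names and measurements.
feature_names = ["protein_X", "metabolite_Y"]
rows = [[0.9, 0.3], [0.1, 0.7], [0.8, 0.2], [0.2, 0.9],
        [0.7, 0.5], [0.3, 0.1], [0.6, 0.8], [0.4, 0.4]]
labels = [toy_predict(r) for r in rows]

importances = permutation_importance(toy_predict, rows, labels)
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the toy model ignores the second feature, shuffling it never changes a prediction, so its importance is exactly zero. With a real network, `predict` would wrap the model's inference call.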


Deep sequencing analysis and methods:

  • We use a convolutional neural network (CNN) to find regions in the gene sequence that are informative for detecting COVID-19. The idea is based on k-mers: natural language processing techniques are applied to the k-mers to detect regions that are similar across all patients, and a model is then built on those regions.

  • Advances in omics technologies provide a broad and deep range of genotypic and phenotypic data to integrate with clinical phenotypes. Machine learning techniques such as clustering on phylogenetic distance and deep neural networks (DNNs) are well suited to linking these DNA-level changes to clinical metadata for human disease prediction, diagnosis, and therapeutics.
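
The k-mer idea mentioned above can be sketched as follows: a sequence is split into overlapping substrings of length k, which can then be treated like words in a text. This is a minimal sketch under stated assumptions, not the project's pipeline: the choice of k = 3 and the toy sequences are hypothetical, and a real model would feed k-mer encodings into a CNN instead of simply intersecting sets.

```python
from collections import Counter


def kmers(sequence, k=3):
    """Overlapping substrings of length k, in order of appearance."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence, k=3):
    """Bag-of-words style count of each k-mer in one sequence."""
    return Counter(kmers(sequence, k))


def shared_kmers(sequences, k=3):
    """k-mers that occur in every sequence of the collection."""
    sets = [set(kmers(s, k)) for s in sequences]
    return set.intersection(*sets) if sets else set()


# Hypothetical toy sequences, for illustration only.
toy_sequences = ["ATGCAT", "TTGCAA", "GGTGCA"]
print(kmers(toy_sequences[0]))      # ['ATG', 'TGC', 'GCA', 'CAT']
print(sorted(shared_kmers(toy_sequences)))  # ['GCA', 'TGC']
```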


This project develops tools within an open-source platform for documented, repeatable analyses that can be conducted in real time, allowing integration of data from patients with new treatment and vaccine strategies. This deep learning bioinformatics platform will allow the prioritization of genes associated with outcome predictors, including health, therapeutic, and vaccine outcomes, and will inform improved DNA tests for predicting disease status and severity. The computational algorithms being developed in this project help sort through this genetic (host and virus) and clinical variation to determine key features in health status.

  • omeClust

    • Landing page:

    • Github:

    • We developed omeClust, a generic tool for omics community detection, and applied it to the dissimilarity matrix of genetic sequences ranging from previous strains of this coronavirus to the newest ones, identifying the sites of genetic difference. These differences are then tested for associations with clinical variables, including geographic origin and health outcomes. Such analyses help explain why SARS-CoV-2 is so infectious and powerful, and could point to ways to defeat it.

  • btest

    • Landing page: (coming soon)

    • Github:

    • We developed btest, a tool for block-wise association testing between paired omics datasets with high dimensionality and collinearity among features. btest boosts statistical power and reports significant associations in a more informative, summarized form.

  • Deep sequencing analysis

    • We are developing a CNN approach to predict the severity of COVID-19 based on the genomes obtained from infected individuals, and to prioritize important regions in the viral genome that characterize SARS-CoV-2 strains.
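
To give a feel for community detection on a dissimilarity matrix, the sketch below groups samples whose pairwise distance falls at or below a threshold, which is equivalent to cutting a single-linkage dendrogram at that height. It is not omeClust's algorithm or API, only a simplified illustration; the sample names, distances, and threshold are hypothetical.

```python
def threshold_clusters(names, dist, threshold):
    """Group samples connected by pairwise distances <= threshold.

    Uses a small union-find structure; `dist` is a nested dict such that
    dist[a][b] is the dissimilarity between samples a and b.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[a][b] <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Hypothetical samples and dissimilarities, for illustration only.
names = ["s1", "s2", "s3", "s4"]
close_pairs = {("s1", "s2"): 0.01, ("s3", "s4"): 0.02}
dist = {
    a: {
        b: 0.0 if a == b else close_pairs.get((a, b), close_pairs.get((b, a), 0.30))
        for b in names
    }
    for a in names
}

clusters = threshold_clusters(names, dist, threshold=0.05)
print(clusters)  # [['s1', 's2'], ['s3', 's4']]
```

The resulting cluster labels could then be cross-tabulated with metadata such as geographic origin to look for associations.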


  • Ali Rahnavard, Principal Investigator

  • Keith A. Crandall, Co-PI

  • Marcos Pérez-Losada, Co-PI

  • Ranojoy Chatterjee, Senior Research Assistant

  • Tyson Dawson, PhD Student

  • Rebecca Clement, PhD Student

  • Nathaniel Starrett, PhD Student

  • Abhigya Giri, Master’s Student 

  • Sanika Kulkarni, Undergraduate Student


  • Ali Rahnavard, Suvo Chatterjee, Bahar Sayoldin, Keith A Crandall, Fasil Tekola-Ayele, Himel Mallick. Omics community detection using multi-resolution clustering. Bioinformatics, 2021, btab317 (PMID: 33974004)

  • Ali Rahnavard, Tyson Dawson, Rebecca Clement et al. Epidemiological associations with genomic variation in SARS-CoV-2. 18 May 2021, PREPRINT (Version 1) available at Research Square


This material is based upon work supported by the National Science Foundation under Grant Number 2028280 and by a Research Credit awarded by the Amazon AWS Diagnostic Development Initiative (DDI).
