ACR Meeting Abstracts

Abstract Number: 0032

Protein Language Model-Guided Homology Identifies Microbial Enzymes Linked to Fibrosis-Prone IgG4-RD and Crohn’s Disease

Kumar Thurimella1, Ahmed Mohamed2, Chenhao Li3, Tommi Vatanen4, Daniel Graham3, Roisin Owens5, Sabina Leanti La Rosa6, Damian Plichta3, Sergio Bacallado5 and Ramnik Xavier7, 1University of Colorado School of Medicine, Aurora, CO, 2Broad Institute, Boston, 3Broad Institute, Cambridge, MA, 4University of Helsinki, Helsinki, Finland, 5University of Cambridge, Cambridge, United Kingdom, 6NMBU, As, Norway, 7Harvard Medical School, Boston, MA

Meeting: ACR Convergence 2025

Keywords: genomics, IgG4 Related Disease, microbiome, Statistical methods, Systemic sclerosis

Session Information

Date: Sunday, October 26, 2025

Title: (0019–0048) Genetics, Genomics & Proteomics Poster

Session Type: Poster Session A

Session Time: 10:30AM-12:30PM

Background/Purpose: Uncharacterized microbial enzymes in metagenomic data are difficult to annotate, especially in fibrosis-prone conditions such as IgG4-related disease (IgG4-RD) and Crohn’s disease (CD), where microbial carbohydrate metabolism may influence disease. Traditional sequence homology-based tools often miss enzymes with low sequence similarity to characterized references. Protein language models (pLMs) provide a new AI-driven approach. We present CAZyLingua, the first pLM-based deep learning tool for annotating carbohydrate-active enzymes (CAZymes). Applied to IgG4-RD and CD metagenomes, CAZyLingua uncovered hundreds of previously unannotated CAZymes, especially carbohydrate esterases (CEs), broadening our understanding of the microbial enzymatic landscape in fibrotic diseases.

Methods: CAZyLingua was trained on a non-redundant CAZy database clustered at 60% sequence identity, with sequence embeddings generated using the ProtT5 pLM. The pipeline used a quadratic discriminant analysis (QDA) classifier for CAZyme detection, followed by a four-layer neural network for family and subfamily classification, employing weighted cross-entropy loss to address class imbalance. Hyperparameters were optimized with Ray Tune across 100 epochs and 20 parallel models. Performance was benchmarked against dbCAN2 using three gold-standard bacterial genomes. CAZyLingua was then applied to metagenomic gene catalogs from CD and IgG4-RD, with linear modeling to identify differentially abundant CAZymes. Functional validation of a predicted CE17 from the CD cohort was performed via recombinant protein expression and MALDI-ToF mass spectrometry.
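To illustrate how a weighted cross-entropy loss counteracts class imbalance, the following is a minimal pure-Python sketch. The abstract does not state how the class weights were derived, so the inverse-frequency weighting here is an assumption, and the family labels, logits, and function names are hypothetical toy values rather than anything taken from CAZyLingua.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: the abstract does not specify its weighting scheme;
    inverse frequency is one common choice.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, batch_labels, class_names, weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    loss_sum = 0.0
    weight_sum = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        p_true = probs[class_names.index(label)]
        w = weights[label]
        loss_sum += -w * math.log(p_true)
        weight_sum += w
    return loss_sum / weight_sum


# Hypothetical toy training labels: one common family, one rare family.
train_labels = ["GH13"] * 8 + ["CE17"] * 2
class_names = ["GH13", "CE17"]
weights = inverse_frequency_weights(train_labels)

# Two samples that the model scores identically in favor of GH13.
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
batch_labels = ["GH13", "CE17"]
loss = weighted_cross_entropy(batch_logits, batch_labels, class_names, weights)
print(weights, round(loss, 4))
```

In this toy batch the misclassified rare-family sample carries four times the weight of the correctly classified common-family sample, so the loss is pulled toward the rare class's error instead of being averaged away.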

Results: CAZyLingua outperformed dbCAN2 in precision, recall, and F1 score for CAZyme identification, with up to 10% improvement in F1 for certain strains. In disease-associated metagenomes, CAZyLingua identified hundreds of CAZymes not detected by dbCAN2, with notable enrichment of CE families, especially CE1, CE3, CE4, CE12, and CE17. In IgG4-RD, CAZyLingua predicted 437 additional CAZymes, 34% of which were CEs, a class underrepresented in reference databases. In CD, CAZyLingua identified a subset of CAZymes more abundant in disease, including CE17. Functional assays confirmed that CE17 catalyzed deacetylation of acetylated mannooligosaccharides, validating the model’s annotation and the biological relevance of these newly identified enzymes.
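For readers less familiar with the benchmark metrics, the sketch below shows one way precision, recall, and F1 could be computed by comparing a tool's predicted CAZyme gene set with a gold-standard annotation. The gene identifiers and tool names are hypothetical, and the abstract does not describe its exact scoring procedure, so this is an illustration of the metrics only.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted gene IDs against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold-standard CAZyme genes for one genome.
gold = {"g1", "g2", "g3", "g4", "g5"}

# Hypothetical predictions from two annotation tools.
tool_a = {"g1", "g2", "g3", "g9"}        # 3 correct, 1 spurious, 2 missed
tool_b = {"g1", "g2", "g3", "g4", "g8"}  # 4 correct, 1 spurious, 1 missed

for name, calls in [("tool_a", tool_a), ("tool_b", tool_b)]:
    p, r, f1 = precision_recall_f1(calls, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a tool that recovers more true CAZymes without adding many spurious calls improves on both components at once.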

Conclusion: CAZyLingua leverages pLMs to expand CAZyme annotation in fibrotic disease metagenomes, uncovering rare and structurally divergent enzymes overlooked by homology-based methods. The tool’s ability to predict and validate novel enzymatic activities, such as CE17 in CD, demonstrates its value for elucidating the microbiome’s role in fibrosis and inflammation. In IgG4-RD, CAZyLingua identified hundreds of additional CAZymes, particularly CEs, missed by standard tools, revealing an expanded repertoire of microbial enzymes potentially relevant to disease pathogenesis. These findings highlight CAZyLingua’s utility for uncovering hidden microbial functions in both CD and IgG4-RD, supporting its broader application for comprehensive protein function discovery in fibrosis-prone disease states.

Supporting image 1: The workflow of CAZyLingua starts with raw embeddings from ProtT5, which are then passed through two classifiers that determine 1) whether the embedded sequence is a CAZyme and, if so, 2) to which CAZyme family it belongs.

Supporting image 2: a) Genes enriched and depleted in the gene catalogs of patients with IgG4-RD selected on the fringe of the volcano plot (see Methods for labeling criteria). b) Predicted CEs in the enriched IgG4-RD gene set, stratified to analyze only the genes CAZyLingua predicted. c) The proportion of dbCAN2-predicted CAZymes also predicted by CAZyLingua as the decision function between CAZyme/non-CAZyme of the QDA classifier in CAZyLingua was varied. The Venn diagram shows the numbers of CAZymes predicted by CAZyLingua, dbCAN2, and both on our current model benchmarks of the QDA. d) Genes enriched and depleted in the gene catalogs of patients with CD selected on the fringe of the volcano plot (see Methods for labeling criteria). CE17 is highlighted in the circle.
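The sweep described in panel c can be sketched as follows: vary the cutoff on the detector's decision score and record what fraction of the reference tool's CAZymes are still called. This is a minimal illustration only. The scores are hypothetical values on a 0 to 1 scale (the abstract does not say how the QDA decision function is scaled), and the gene identifiers and function names are invented for the example.

```python
def recovered_fraction(scores, reference_hits, threshold):
    """Fraction of reference-tool hits whose score clears the threshold."""
    if not reference_hits:
        raise ValueError("reference_hits must not be empty")
    called = {gene for gene, score in scores.items() if score >= threshold}
    return len(called & reference_hits) / len(reference_hits)


def sweep(scores, reference_hits, thresholds):
    """Evaluate the recovered fraction at each candidate threshold."""
    return [(t, recovered_fraction(scores, reference_hits, t)) for t in thresholds]


# Hypothetical decision scores from the CAZyme/non-CAZyme detector.
scores = {"g1": 0.95, "g2": 0.80, "g3": 0.55, "g4": 0.30, "g5": 0.10}

# Hypothetical genes that the reference tool annotates as CAZymes.
reference_hits = {"g1", "g2", "g4"}

curve = sweep(scores, reference_hits, [0.2, 0.5, 0.9])
for threshold, fraction in curve:
    print(f"threshold={threshold:.1f} recovered={fraction:.2f}")
```

A stricter threshold trades agreement with the reference tool for fewer low-confidence calls, which is the trade-off such a curve makes visible.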

Supporting image 3: e) The enriched genes in CD predicted by CAZyLingua only were prioritized based on a combination of the log fold change and the probability of the CAZyme annotation from CAZyLingua. The plot is ordered from the highest fold change and CAZyLingua prediction probability (red) to the lowest fold change and prediction probability (blue). CE17 is highlighted in bold. f) Functional characterization of CE17 using MALDI-ToF mass spectrometry. Peaks are labeled by degree of polymerization (DP) and number of acetyl (Ac) groups. The annotated m/z values indicate sodium adducts. Intensity is shown in arbitrary units (a.u.). Both the KTCE17 enzyme (middle) and a previously validated CE17, FpCE17 (bottom, (60)), showed the same activity on a RiGH26-pretreated β-mannan substrate: peaks corresponding to doubly and triply acetylated oligosaccharides disappeared, peaks corresponding to mono-acetylated oligosaccharides (containing 3-O-acetylations) decreased in intensity, and deacetylated oligosaccharides accumulated.


Disclosures: K. Thurimella: None; A. Mohamed: None; C. Li: None; T. Vatanen: None; D. Graham: None; R. Owens: Electra Bio Ltd, 8; S. Leanti La Rosa: None; D. Plichta: Novonesis, 3; S. Bacallado: None; R. Xavier: Arena BioWorks, 4, ConvergenceBio, 8, Jnana Therapeutics, 8, Magnet Biomedicine, 4, MoonLake Immunotherapeutics, 4, Nestle, 4.

To cite this abstract in AMA style:

Thurimella K, Mohamed A, Li C, Vatanen T, Graham D, Owens R, Leanti La Rosa S, Plichta D, Bacallado S, Xavier R. Protein Language Model-Guided Homology Identifies Microbial Enzymes Linked to Fibrosis-Prone IgG4-RD and Crohn’s Disease [abstract]. Arthritis Rheumatol. 2025; 77 (suppl 9). https://acrabstracts.org/abstract/protein-language-model-guided-homology-identifies-microbial-enzymes-linked-to-fibrosis-prone-igg4-rd-and-crohns-disease/. Accessed .
