Statistical classifiers for diagnosing disease from immune repertoires: A case study using multiple sclerosis

Jared Ostmeyer; Scott Christley; William H. Rounds; Inimary Toby; Benjamin M. Greenberg; Nancy L. Monson; Lindsay G. Cowell

doi:10.1186/s12859-017-1814-6

Research output: Contribution to journal › Article › peer-review

47 Scopus citations

Abstract

Background: Deep sequencing of lymphocyte receptor repertoires has made it possible to comprehensively profile the clonal composition of lymphocyte populations. This opens the door for novel approaches to diagnose and prognosticate diseases with a driving immune component by identifying repertoire sequence patterns associated with clinical phenotypes. Indeed, recent studies support the feasibility of this, demonstrating an association between repertoire-level summary statistics (e.g., diversity) and patient outcomes for several diseases. In our own prior work, we have shown that six codons in VH4-containing genes in B cells from the cerebrospinal fluid of patients with relapsing remitting multiple sclerosis (RRMS) have higher replacement mutation frequencies than observed in healthy controls or patients with other neurological diseases. However, prior methods to date have been limited to focusing on repertoire-level summary statistics, ignoring the vast amounts of information in the millions of individual immune receptors comprising a repertoire. We have developed a novel method that addresses this limitation by using innovative approaches for accommodating the extraordinary sequence diversity of immune receptors and widely used machine learning approaches. We applied our method to RRMS, an autoimmune disease that is notoriously difficult to diagnose.
Results: We use the biochemical features encoded by the complementarity determining region 3 of each B cell receptor heavy chain in every patient repertoire as input to a detector function, which is fit to give the correct diagnosis for each patient using maximum likelihood optimization methods. The resulting statistical classifier assigns patients to one of two diagnosis categories, RRMS or other neurological disease, with 87% accuracy by leave-one-out cross-validation on training data (N=23) and 72% accuracy on unused data from a separate study (N=102).
Conclusions: Our method is the first to apply statistical learning to immune repertoires to aid disease diagnosis, learning repertoire-level labels from the set of individual immune repertoire sequences. This method produced a repertoire-based statistical classifier for diagnosing RRMS that provides a high degree of diagnostic capability, rivaling the accuracy of diagnosis by a clinical expert. Additionally, this method points to a diagnostic biochemical motif in the antibodies of RRMS patients, which may offer insight into the disease process.
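
The pipeline summarized in the abstract (per-receptor features fed to a detector function, a maximum likelihood fit against patient-level diagnoses, and leave-one-out cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how per-receptor scores are combined into one patient-level prediction, so taking the maximum score per patient is an assumption here, as are the linear detector, the plain gradient-ascent optimizer, the function names, and the random toy data that stands in for real CDR3 biochemical features.

```python
import numpy as np

def patient_logit(bag, w, b):
    """Score every receptor in a patient's bag and keep the highest score.
    Max-pooling is an assumption; the abstract does not name the aggregation."""
    scores = bag @ w + b
    i = int(np.argmax(scores))
    return scores[i], i

def fit(bags, labels, steps=500, lr=0.1):
    """Fit a linear detector by gradient ascent on the Bernoulli log-likelihood
    of the patient-level labels (hypothetical optimizer settings)."""
    dim = bags[0].shape[1]
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):
        grad_w, grad_b = np.zeros(dim), 0.0
        for bag, y in zip(bags, labels):
            z, i = patient_logit(bag, w, b)
            p = 1.0 / (1.0 + np.exp(-z))   # predicted probability of label 1
            grad_w += (y - p) * bag[i]     # gradient flows through the top-scoring receptor
            grad_b += y - p
        w += lr * grad_w / len(bags)
        b += lr * grad_b / len(bags)
    return w, b

def predict(bag, w, b):
    """Return 1 if the best-scoring receptor has a positive logit, else 0."""
    z, _ = patient_logit(bag, w, b)
    return int(z > 0.0)

def leave_one_out_accuracy(bags, labels):
    """Hold out each patient in turn, refit on the rest, and score the held-out patient."""
    hits = 0
    for k in range(len(bags)):
        w, b = fit(bags[:k] + bags[k + 1:], labels[:k] + labels[k + 1:])
        hits += predict(bags[k], w, b) == labels[k]
    return hits / len(bags)

def make_toy_data(n_patients=12, n_receptors=30, dim=5, seed=0):
    """Synthetic stand-in for repertoires: random features, with one planted
    'motif' receptor in each positive patient. Not real sequence data."""
    rng = np.random.default_rng(seed)
    bags, labels = [], []
    for k in range(n_patients):
        y = k % 2
        bag = rng.normal(size=(n_receptors, dim))
        if y == 1:
            bag[0, 0] += 6.0
        bags.append(bag)
        labels.append(y)
    return bags, labels

bags, labels = make_toy_data()
accuracy = leave_one_out_accuracy(bags, labels)
print(f"toy leave-one-out accuracy: {accuracy:.2f}")
```

The printed accuracy describes only the synthetic data and says nothing about the reported 87% and 72% figures. The point of the sketch is the structure: labels exist only per patient, so the likelihood is computed on a pooled patient-level score rather than on individual receptors.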

Original language	English (US)
Article number	401
Journal	BMC Bioinformatics
Volume	18
Issue number	1
DOIs	https://doi.org/10.1186/s12859-017-1814-6
State	Published - Sep 7 2017

Keywords

Antibody
CDR3
Immune repertoire
Machine learning
Multiple sclerosis
Statistical classifier

ASJC Scopus subject areas

Structural Biology
Biochemistry
Molecular Biology
Computer Science Applications
Applied Mathematics

Access to Document

10.1186/s12859-017-1814-6

Cite this

@article{4010940e28be4bd998758600b09e9e3d,

title = "Statistical classifiers for diagnosing disease from immune repertoires: A case study using multiple sclerosis",

keywords = "Antibody, CDR3, Immune repertoire, Machine learning, Multiple sclerosis, Statistical classifier",

author = "Jared Ostmeyer and Scott Christley and Rounds, {William H.} and Inimary Toby and Greenberg, {Benjamin M.} and Monson, {Nancy L.} and Cowell, {Lindsay G.}",

note = "Publisher Copyright: {\textcopyright} 2017 The Author(s).",

year = "2017",

month = sep,

day = "7",

doi = "10.1186/s12859-017-1814-6",

language = "English (US)",

volume = "18",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Statistical classifiers for diagnosing disease from immune repertoires

T2 - A case study using multiple sclerosis

AU - Ostmeyer, Jared

AU - Christley, Scott

AU - Rounds, William H.

AU - Toby, Inimary

AU - Greenberg, Benjamin M.

AU - Monson, Nancy L.

AU - Cowell, Lindsay G.

PY - 2017/9/7

Y1 - 2017/9/7

KW - Antibody

KW - CDR3

KW - Immune repertoire

KW - Machine learning

KW - Multiple sclerosis

KW - Statistical classifier

UR - http://www.scopus.com/inward/record.url?scp=85028950537&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028950537&partnerID=8YFLogxK

U2 - 10.1186/s12859-017-1814-6

DO - 10.1186/s12859-017-1814-6

M3 - Article

C2 - 28882107

AN - SCOPUS:85028950537

SN - 1471-2105

VL - 18

JO - BMC Bioinformatics

JF - BMC Bioinformatics

IS - 1

M1 - 401

ER -
