Euclidian space and grouping of biological objects

Vyacheslav N. Grishin; Nick V. Grishin

doi:10.1093/bioinformatics/18.11.1523

Research output: Contribution to journal › Article › peer-review

19 Scopus citations

Abstract

Motivation: Biological objects tend to cluster into discrete groups. Objects within a group typically possess similar properties. It is important to have fast and efficient tools for grouping objects that result in biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for polishing clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is of no less importance. Tree-building methods are typically used for such visualization. An alternative concept to visualization is a multidimensional sequence space. In this space, proteins are defined as points and distances between the points reflect the relationships between the proteins. Such a space can also be a basis for model-based clustering strategies that typically produce results correlating better with biological properties of proteins.

Results: We developed an approach to classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure. We apply the methodology to amino acid sequences. On the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. On the second step, we find the best fit approximation of the evolutionary distances by Euclidian distances and thus represent each protein by a point in a multidimensional space. The Euclidian space may be projected in two or three dimensions and the projections can be used to visualize relationships between proteins. On the third step, we find a non-parametric estimate of the probability density of the points and cluster the points that belong to the same local maximum of this density in a group. The number of groups is controlled by a σ-parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.
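
The second and third steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It makes two substitutions that should be read as assumptions: classical multidimensional scaling stands in for the "best fit approximation" of distances by Euclidean distances, and Gaussian mean-shift iteration stands in for assigning each point to a local maximum of a non-parametric density estimate. The function names, the toy coordinates, and the merge threshold of 0.1 σ are hypothetical, and the toy distance matrix is plain Euclidean rather than a matrix of evolutionary distances.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def classical_mds(D, dim=2):
    """Embed objects as points whose Euclidean distances approximate D.

    Assumption: classical MDS is used here as a stand-in for the
    best-fit procedure mentioned in the abstract.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]           # indices of the largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)        # drop negative (non-Euclidean) parts
    return V[:, idx] * np.sqrt(w_top)

def density_mode_clusters(X, sigma, n_iter=200, tol=1e-6):
    """Group points by the local maximum of a Gaussian kernel density they climb to.

    Assumption: mean-shift iteration with kernel width sigma; the merge
    threshold below is a hypothetical choice, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    Y = X.copy()
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-d2 / (2.0 * sigma ** 2))              # Gaussian kernel weights
        Y_new = (W @ X) / W.sum(axis=1, keepdims=True)    # one uphill step
        shift = np.abs(Y_new - Y).max()
        Y = Y_new
        if shift < tol:
            break
    labels = np.full(len(X), -1, dtype=int)
    modes = []
    for i, y in enumerate(Y):
        for k, m in enumerate(modes):
            if np.linalg.norm(y - m) < 0.1 * sigma:       # same density maximum
                labels[i] = k
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return labels

# Toy data: two tight groups of three points each (made-up coordinates).
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
D = pairwise_distances(toy)                   # stand-in for a distance matrix
X = classical_mds(D, dim=2)                   # points in a 2-D Euclidean space
labels_small_sigma = density_mode_clusters(X, sigma=0.5)
labels_large_sigma = density_mode_clusters(X, sigma=5.0)
print(labels_small_sigma, labels_large_sigma)
```

In this toy setting a narrow kernel keeps the two groups apart, while a wide kernel smooths the density into a single maximum and merges them, which mirrors the role the abstract assigns to the σ-parameter.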

Original language	English (US)
Pages (from-to)	1523-1533
Number of pages	11
Journal	Bioinformatics
Volume	18
Issue number	11
DOIs	https://doi.org/10.1093/bioinformatics/18.11.1523
State	Published - Nov 1 2002

ASJC Scopus subject areas

Statistics and Probability
Biochemistry
Molecular Biology
Computer Science Applications
Computational Theory and Mathematics
Computational Mathematics

Cite this

@article{40c676bcc44d4798b91a02adff17e637,

title = "Euclidian space and grouping of biological objects",

abstract = "Motivation: Biological objects tend to cluster into discrete groups. Objects within a group typically possess similar properties. It is important to have fast and efficient tools for grouping objects that result in biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for polishing clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is of no less importance. Tree-building methods are typically used for such visualization. An alternative concept to visualization is a multidimensional sequence space. In this space, proteins are defined as points and distances between the points reflect the relationships between the proteins. Such a space can also be a basis for model-based clustering strategies that typically produce results correlating better with biological properties of proteins. Results: We developed an approach to classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure. We apply the methodology to amino acid sequences. On the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. On the second step, we find the best fit approximation of the evolutionary distances by Euclidian distances and thus represent each protein by a point in a multidimensional space. The Euclidian space may be projected in two or three dimensions and the projections can be used to visualize relationships between proteins. On the third step, we find a non-parametric estimate of the probability density of the points and cluster the points that belong to the same local maximum of this density in a group. The number of groups is controlled by a σ-parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.",

author = "Grishin, {Vyacheslav N.} and Grishin, {Nick V.}",

year = "2002",

month = nov,

day = "1",

doi = "10.1093/bioinformatics/18.11.1523",

language = "English (US)",

volume = "18",

pages = "1523--1533",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "11",

}

TY - JOUR

T1 - Euclidian space and grouping of biological objects

AU - Grishin, Vyacheslav N.

AU - Grishin, Nick V.

PY - 2002/11/1

Y1 - 2002/11/1

N2 - Motivation: Biological objects tend to cluster into discrete groups. Objects within a group typically possess similar properties. It is important to have fast and efficient tools for grouping objects that result in biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for polishing clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is of no less importance. Tree-building methods are typically used for such visualization. An alternative concept to visualization is a multidimensional sequence space. In this space, proteins are defined as points and distances between the points reflect the relationships between the proteins. Such a space can also be a basis for model-based clustering strategies that typically produce results correlating better with biological properties of proteins. Results: We developed an approach to classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure. We apply the methodology to amino acid sequences. On the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. On the second step, we find the best fit approximation of the evolutionary distances by Euclidian distances and thus represent each protein by a point in a multidimensional space. The Euclidian space may be projected in two or three dimensions and the projections can be used to visualize relationships between proteins. On the third step, we find a non-parametric estimate of the probability density of the points and cluster the points that belong to the same local maximum of this density in a group. The number of groups is controlled by a σ-parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.

UR - http://www.scopus.com/inward/record.url?scp=0036856210&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036856210&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/18.11.1523

DO - 10.1093/bioinformatics/18.11.1523

M3 - Article

C2 - 12424125

AN - SCOPUS:0036856210

SN - 1367-4803

VL - 18

SP - 1523

EP - 1533

JO - Bioinformatics

JF - Bioinformatics

IS - 11

ER -
