Building a PubMed knowledge graph

Jian Xu; Sunkyu Kim; Min Song; Minbyul Jeong; Donghyeon Kim; Jaewoo Kang; Justin F. Rousseau; Xin Li; Weijia Xu; Vetle I. Torvik; Yi Bu; Chongyan Chen; Islam Akef Ebeid; Daifeng Li; Ying Ding

doi:10.1038/s41597-020-0543-2

Research output: Contribution to journal › Article › peer-review

105 Scopus citations

Abstract

PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
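The abstract reports its validation results as F1 scores. As a reminder of what that metric measures, here is a minimal sketch of the standard F1 computation (the harmonic mean of precision and recall). The function name `f1_score` and the counts in the usage line are hypothetical and chosen only for illustration; they are not taken from the paper or its evaluation code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score: the harmonic mean of precision and recall."""
    # With no true positives, precision and recall are both zero (or undefined),
    # so F1 is conventionally reported as 0.
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# precision = 90 / 100 = 0.90, recall = 90 / 120 = 0.75.
print(round(f1_score(90, 10, 30), 4))
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which is why it is a common single-number summary for entity extraction and name disambiguation tasks.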

Original language	English (US)
Article number	205
Journal	Scientific Data
Volume	7
Issue number	1
DOIs	https://doi.org/10.1038/s41597-020-0543-2
State	Published - Dec 1 2020
Externally published	Yes

ASJC Scopus subject areas

Statistics and Probability
Information Systems
Education
Computer Science Applications
Statistics, Probability and Uncertainty
Library and Information Sciences

Access to Document

10.1038/s41597-020-0543-2

Cite this

@article{fb223e584baf49d4b7fe5448662b571e,

title = "Building a PubMed knowledge graph",

abstract = "PubMed{\textregistered} is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID{\textregistered}, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51\%), with the author name disambiguation (AND) achieving an F1 score of 98.09\%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.",

author = "Jian Xu and Sunkyu Kim and Min Song and Minbyul Jeong and Donghyeon Kim and Jaewoo Kang and Rousseau, {Justin F.} and Xin Li and Weijia Xu and Torvik, {Vetle I.} and Yi Bu and Chongyan Chen and Ebeid, {Islam Akef} and Daifeng Li and Ying Ding",

note = "Funding Information: This work was supported by National Social Science Fund of China [18BTQ076], Chinese National Youth Foundation Research [61702564], Natural Science Foundation of Guangdong Province [2018A030313981], Soft Science Foundation of Guangdong Province [2019A101002020], National Research Foundation of Korea [NRF-2019R1A2C2002577] and [NRF-2017R1A2A1A17069645], and US National Institutes of Health [P01AG039347]. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing storage resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu. Funding Information: Project data from NIH ExPORTER. NIH ExPORTER provides data files that contain research projects funded by major funding agencies such as the Centers for Disease Control and Prevention (CDC), the NIH, the Agency for Healthcare Research and Quality (AHRQ), the Health Resources and Services Administration (HRSA), the Substance Abuse and Mental Health Services Administration (SAMHSA), and the U.S. Department of Veterans Affairs (VA). Furthermore, it provides publications and patents citing support from these projects. It consists of 49 data fields, including the amount of funding for each fiscal year, organization information of the PIs, and the details of the projects. According to our investigation, NIH-funded research accounts for 80.7\% of all grants recorded in PubMed. Publisher Copyright: {\textcopyright} 2020, This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.",

year = "2020",

month = dec,

day = "1",

doi = "10.1038/s41597-020-0543-2",

language = "English (US)",

volume = "7",

journal = "Scientific Data",

issn = "2052-4463",

publisher = "Nature Publishing Group",

number = "1",

}

TY - JOUR

T1 - Building a PubMed knowledge graph

AU - Xu, Jian

AU - Kim, Sunkyu

AU - Song, Min

AU - Jeong, Minbyul

AU - Kim, Donghyeon

AU - Kang, Jaewoo

AU - Rousseau, Justin F.

AU - Li, Xin

AU - Xu, Weijia

AU - Torvik, Vetle I.

AU - Bu, Yi

AU - Chen, Chongyan

AU - Ebeid, Islam Akef

AU - Li, Daifeng

AU - Ding, Ying

N1 - Funding Information: This work was supported by National Social Science Fund of China [18BTQ076], Chinese National Youth Foundation Research [61702564], Natural Science Foundation of Guangdong Province [2018A030313981], Soft Science Foundation of Guangdong Province [2019A101002020], National Research Foundation of Korea [NRF-2019R1A2C2002577] and [NRF-2017R1A2A1A17069645], and US National Institutes of Health [P01AG039347]. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing storage resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu. Funding Information: Project data from NIH ExPORTER. NIH ExPORTER provides data files that contain research projects funded by major funding agencies such as the Centers for Disease Control and Prevention (CDC), the NIH, the Agency for Healthcare Research and Quality (AHRQ), the Health Resources and Services Administration (HRSA), the Substance Abuse and Mental Health Services Administration (SAMHSA), and the U.S. Department of Veterans Affairs (VA). Furthermore, it provides publications and patents citing support from these projects. It consists of 49 data fields, including the amount of funding for each fiscal year, organization information of the PIs, and the details of the projects. According to our investigation, NIH-funded research accounts for 80.7% of all grants recorded in PubMed. Publisher Copyright: © 2020, This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.

PY - 2020/12/1

Y1 - 2020/12/1

N2 - PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.

UR - http://www.scopus.com/inward/record.url?scp=85086860682&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85086860682&partnerID=8YFLogxK

U2 - 10.1038/s41597-020-0543-2

DO - 10.1038/s41597-020-0543-2

M3 - Article

C2 - 32591513

AN - SCOPUS:85086860682

SN - 2052-4463

VL - 7

JO - Scientific Data

JF - Scientific Data

IS - 1

M1 - 205

ER -
