Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data

Yasmin H. Karimi; Douglas W. Blayney; Allison W. Kurian; Jeanne Shen; Rikiya Yamashita; Daniel Rubin; Imon Banerjee

doi:10.1200/CCI.20.00165

Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data

Yasmin H. Karimi, Douglas W. Blayney, Allison W. Kurian, Jeanne Shen, Rikiya Yamashita, Daniel Rubin, Imon Banerjee

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC). METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding. RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%). CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.

Original language	English (US)
Pages (from-to)	469-478
Number of pages	10
Journal	JCO Clinical Cancer Informatics
Volume	5
DOIs	https://doi.org/10.1200/CCI.20.00165
State	Published - 2021
Externally published	Yes

ASJC Scopus subject areas

General Medicine

Access to Document

10.1200/CCI.20.00165

Cite this

Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data. / Karimi, Yasmin H.; Blayney, Douglas W.; Kurian, Allison W. et al.
In: JCO Clinical Cancer Informatics, Vol. 5, 2021, p. 469-478.

Research output: Contribution to journal › Article › peer-review

@article{3280f1693fd24ba9a129dfb4e67ea3a4,

title = "Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data",

abstract = "PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC). METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding. RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%). CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.",

author = "Karimi, {Yasmin H.} and Blayney, {Douglas W.} and Kurian, {Allison W.} and Jeanne Shen and Rikiya Yamashita and Daniel Rubin and Imon Banerjee",

year = "2021",

doi = "10.1200/CCI.20.00165",

language = "English (US)",

volume = "5",

pages = "469--478",

journal = "JCO Clinical Cancer Informatics",

issn = "2473-4276",

publisher = "American Society of Clinical Oncology",

}

TY - JOUR

T1 - Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data

AU - Karimi, Yasmin H.

AU - Blayney, Douglas W.

AU - Kurian, Allison W.

AU - Shen, Jeanne

AU - Yamashita, Rikiya

AU - Rubin, Daniel

AU - Banerjee, Imon

PY - 2021

Y1 - 2021

N2 - PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC). METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding. RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%). CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.

AB - PURPOSE Large-scale analysis of real-world evidence is often limited to structured data fields that do not contain reliable information on recurrence status and disease sites. In this report, we describe a natural language processing (NLP) framework that uses data from free-text, unstructured reports to classify recurrence status and sites of recurrence for patients with breast and hepatocellular carcinomas (HCC). METHODS Using two cohorts of breast cancer and HCC cases, we validated the ability of a previously developed NLP model to distinguish between no recurrence, local recurrence, and distant recurrence, based on clinician notes, radiology reports, and pathology reports compared with manual curation. A second NLP model was trained and validated to identify sites of recurrence. We compared the ability of each NLP model to identify the presence, timing, and site of recurrence, when compared against manual chart review and International Classification of Diseases coding. RESULTS A total of 1,273 patients were included in the development and validation of the two models. The NLP model for recurrence detects distant recurrence with an area under the curve of 0.98 (95% CI, 0.96 to 0.99) and 0.95 (95% CI, 0.88 to 0.98) in breast and HCC cohorts, respectively. The mean accuracy of the NLP model for detecting any site of distant recurrence was 0.9 for breast cancer and 0.83 for HCC. The NLP model for recurrence identified a larger proportion of patients with distant recurrence in a breast cancer database (11.1%) compared with International Classification of Diseases coding (2.31%). CONCLUSION We developed two NLP models to identify distant cancer recurrence, timing of recurrence, and sites of recurrence based on unstructured electronic health record data. These models can be used to perform large-scale retrospective studies in oncology.

UR - http://www.scopus.com/inward/record.url?scp=85106070053&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85106070053&partnerID=8YFLogxK

U2 - 10.1200/CCI.20.00165

DO - 10.1200/CCI.20.00165

M3 - Article

C2 - 33929889

AN - SCOPUS:85106070053

SN - 2473-4276

VL - 5

SP - 469

EP - 478

JO - JCO Clinical Cancer Informatics

JF - JCO Clinical Cancer Informatics

ER -

Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data

Abstract

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this