ABSTRACT

Objective: The primary aim was to compare the independent and joint performance of three sources for retrieving smoking status: narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of narrative text and PPI for retrieving smoking strength information (ie, heavy/light smoker).

Materials and methods: Our study leveraged an existing lung cancer cohort whose smoking status, amount, and strength information had been obtained by manual chart review. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information. After that, heuristic rules were used to obtain smoking status-related information. Smoking information was also obtained from structured data sources based on diagnosis codes and PPI. Patient coverage was defined as the proportion of patients whose smoking status/strength could be effectively determined; sensitivity, specificity, and accuracy were measured among the covered patients.
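The evaluation measures named above can be illustrated with a minimal sketch. It assumes each source yields True (smoker), False (non-smoker), or None (no determination) per patient, and that sources are combined by falling back to a second source when the first has no determination. The function names `evaluate` and `combine` and the fallback rule are illustrative assumptions, not the study's actual implementation.

```python
from typing import Dict, List, Optional, Sequence


def evaluate(predicted: Sequence[Optional[bool]], gold: Sequence[bool]) -> Dict[str, float]:
    """Compute coverage over all patients, and sensitivity, specificity,
    and accuracy over covered patients only (those with a non-None prediction)."""
    covered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    tp = sum(1 for p, g in covered if p and g)          # smoker predicted, smoker in chart review
    tn = sum(1 for p, g in covered if not p and not g)  # non-smoker predicted, non-smoker in chart review
    fp = sum(1 for p, g in covered if p and not g)
    fn = sum(1 for p, g in covered if not p and g)

    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "coverage": ratio(len(covered), len(gold)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(covered)),
    }


def combine(primary: Sequence[Optional[bool]],
            fallback: Sequence[Optional[bool]]) -> List[Optional[bool]]:
    """Assumed combination rule: use the primary source where it has a
    determination, otherwise take the fallback source's value."""
    return [p if p is not None else f for p, f in zip(primary, fallback)]
```

Under these assumptions, `evaluate(combine(nlp, ppi), gold)` would score the joint NLP and PPI source against the chart-reviewed labels, where `nlp`, `ppi`, and `gold` are hypothetical per-patient lists of equal length.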

Results: NLP alone had the best overall performance for smoking status extraction (patient coverage: 0.88; sensitivity: 0.97; specificity: 0.70; accuracy: 0.88); combining PPI with NLP further improved patient coverage to 0.96. Adding ICD-9 provided no additional improvement over NLP alone or over NLP combined with PPI. For smoking strength, combining NLP with PPI yielded a slight improvement over NLP alone.

Conclusion: These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, a readily available structured data source, could be used as a complementary source for more comprehensive patient coverage.


Keywords: smoking status, smoking strength, natural language processing, ICD-9, patient-provided information 


Citation: Wang et al. Comparison of Three Information Sources for Smoking Information in Electronic Health Records. Cancer Informatics 2016:15 237–242 doi: 10.4137/CIN.S40604.

TYPE: Original Research

Received: August 01, 2016.

Resubmitted: October 09, 2016.

Accepted for publication: October 20, 2016.

Academic editor: J. T. Efird, Editor in Chief

Peer Review: Five peer reviewers contributed to the peer review report. Reviewers’ reports totaled 1292 words, excluding any confidential comments to the academic editor.

Funding: The authors gratefully acknowledge support from the National Institutes of Health (NIH) grants R01GM102282-03 and R01 LM011934-02. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.

Competing Interests: Authors disclose no potential conflicts of interest.

Correspondence: liu.hongfang@mayo.edu; wang.liwei@mayo.edu

Copyright: © the authors, publisher and licensee Libertas Academica Limited. This is an open-access article distributed under the terms of the Creative Commons CC-BY-NC 3.0 License. Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE). Published by Libertas Academica.