Comparison of Three Information Sources for Smoking Information in Electronic Health Records
The ONA take:
Narrative text may be the most reliable and comprehensive source for obtaining smoking-related information, while patient-provided information (PPI) could be used as a complementary source for more comprehensive patient data, according to a study published in Cancer Informatics.
Because few studies have evaluated the optimal strategy for obtaining smoking status information, researchers compared the performance of retrieving smoking status from 3 sources: narrative text processed by natural language processing (NLP), PPI, and diagnosis codes (ie, ICD-9). They also assessed the performance of retrieving smoking strength information from NLP and PPI.
For the study, investigators reviewed chart data from a lung cancer cohort of 561 patients aged 15 to 45 years who were diagnosed with 1 of the 21 lung cancer subtypes. For NLP-based identification of smoking status, researchers extracted smoking-related information from the Mayo Clinic electronic medical record (EMR). Patient-provided smoking information was obtained from structured PPI in EMRs, and the diagnosis code was extracted from hospital billing information used to group patients as ever smokers or never smokers.
Results showed that NLP alone had the best overall performance for extracting smoking status information, and combining PPI with NLP further improved patient coverage. Investigators found that adding ICD-9 codes did not improve extraction over NLP, with or without PPI. For smoking strength, combining NLP with PPI performed slightly better than NLP alone.
Smoking status for people aged 13 years or older is one of the core criteria for meaningful use of electronic medical records.
Objective: The primary aim was to compare independent and joint performance of retrieving smoking status through different sources, including narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of retrieving smoking strength information (ie, heavy/light smoker) from narrative text and PPI.
Materials and methods: Our study leveraged an existing lung cancer cohort whose smoking status, amount, and strength information had been obtained by manual chart review. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information. After that, heuristic rules were used to obtain smoking status-related information. Smoking information was also obtained from structured data sources based on diagnosis codes and PPI. Patient coverage was defined as the proportion of patients whose smoking status/strength can be effectively determined; sensitivity, specificity, and accuracy were measured among the patients with coverage.
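The pattern-then-heuristic pipeline described above can be illustrated with a minimal Python sketch. The regular expressions, the function names (mention_label, patient_status), and the rule that any "ever" evidence outweighs "never" mentions are all assumptions made for illustration; the abstract does not list the study's actual patterns or heuristics.

```python
import re

# Hypothetical mention patterns; the study's real pattern set is not given here.
NEVER = re.compile(
    r"\b(never smoked|never smoker|non-?smoker|denies (smoking|tobacco))\b", re.I
)
EVER = re.compile(
    r"\b(current smoker|former smoker|ex-?smoker|quit smoking|smokes|pack[- ]years?)\b",
    re.I,
)


def mention_label(sentence):
    """Label one sentence as 'never', 'ever', or None (no smoking mention)."""
    if NEVER.search(sentence):
        return "never"
    if EVER.search(sentence):
        return "ever"
    return None


def patient_status(sentences):
    """Roll sentence-level labels up to one patient-level status.

    Assumed heuristic: any 'ever' mention outweighs 'never' mentions.
    Returns None when no sentence carries smoking information, which
    means the patient is not covered by this source.
    """
    labels = [label for label in map(mention_label, sentences) if label]
    if not labels:
        return None
    return "ever" if "ever" in labels else "never"


notes = ["Denies tobacco use.", "30 pack-year history per outside records."]
print(patient_status(notes))
```

Returning None rather than a default status keeps uncovered patients distinguishable from never smokers, which matters for the coverage measure defined above.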
Results: NLP alone had the best overall performance for smoking status extraction (patient coverage: 0.88; sensitivity: 0.97; specificity: 0.70; accuracy: 0.88); combining PPI with NLP further improved patient coverage to 0.96. ICD-9 did not provide additional improvement over NLP alone or over NLP combined with PPI. For smoking strength, combining NLP with PPI showed a slight improvement over NLP alone.
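To make the reported measures concrete, the following Python sketch computes coverage over all patients and sensitivity, specificity, and accuracy over covered patients only, treating "ever smoker" as the positive class. The fallback rule in combine (use the second source only when the first has no answer) is an assumption, since the abstract does not state how sources were merged, and the five-patient lists are toy data, not study data.

```python
def combine(primary, fallback):
    """Assumed merge rule: keep the primary source, fill gaps from the fallback."""
    return [p if p is not None else f for p, f in zip(primary, fallback)]


def evaluate(predicted, gold):
    """Coverage over all patients; other metrics over covered patients only."""
    covered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    tp = sum(p == "ever" and g == "ever" for p, g in covered)
    tn = sum(p == "never" and g == "never" for p, g in covered)
    fp = sum(p == "ever" and g == "never" for p, g in covered)
    fn = sum(p == "never" and g == "ever" for p, g in covered)
    return {
        "coverage": len(covered) / len(gold),
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / len(covered) if covered else None,
    }


# Toy example (not study data): None means the source has no status.
gold = ["ever", "ever", "never", "never", "ever"]
nlp = ["ever", None, "never", "ever", "ever"]
ppi = [None, "ever", "never", None, "never"]

print(evaluate(nlp, gold))
print(evaluate(combine(nlp, ppi), gold))
```

In this toy case the fallback raises coverage without changing any answer the primary source already gave, which mirrors the role the abstract assigns to PPI as a complementary source.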
Conclusion: These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, the readily available structured data, could be used as a complementary source for more comprehensive patient coverage.
Keywords: smoking status, smoking strength, natural language processing, ICD-9, patient-provided information
Citation: Wang et al. Comparison of Three Information Sources for Smoking Information in Electronic Health Records. Cancer Informatics 2016:15 237–242 doi: 10.4137/CIN.S40604.
TYPE: Original Research
Received: August 01, 2016.
Resubmitted: October 09, 2016.
Accepted for publication: October 20, 2016.
Academic editor: J. T. Efird, Editor in Chief
Peer Review: Five peer reviewers contributed to the peer review report. Reviewers' reports totaled 1292 words, excluding any confidential comments to the academic editor.
Funding: The authors gratefully acknowledge the support from the National Institutes of Health (NIH) grants R01GM102282-03 and R01 LM011934-02. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.
Competing Interests: Authors disclose no potential conflicts of interest.
Correspondence: firstname.lastname@example.org; email@example.com
Copyright: © the authors, publisher and licensee Libertas Academica Limited. This is an open-access article distributed under the terms of the Creative Commons CC-BY-NC 3.0 License. Paper subject to independent expert blind peer review. All editorial decisions made by independent academic editor. Upon submission manuscript was subject to anti-plagiarism scanning. Prior to publication all authors have given signed confirmation of agreement to article publication and compliance with all applicable ethical and legal requirements, including the accuracy of author and contributor information, disclosure of competing interests and funding sources, compliance with ethical requirements relating to human and animal study participants, and compliance with any copyright requirements of third parties. This journal is a member of the Committee on Publication Ethics (COPE). Published by Libertas Academica.