Unstructured data obtained via natural language processing (NLP) are superior to structured data for extracting quantitative information about patient smoking habits, according to a study published in JCO Clinical Cancer Informatics.

This quantitative information can help determine which patients are eligible for low-dose computed tomography (LDCT) scanning for early detection of lung cancer according to US Preventive Services Task Force criteria, researchers reported.

The researchers examined data from 4615 adult patients randomly selected from the Medical Information Mart for Intensive Care (MIMIC-III) database. Their goal was to obtain pack-year history and, where applicable, quit date for each patient who smoked, in order to identify those eligible for LDCT.

Structured data were obtained via queries of the MIMIC-III diagnosis tables using the relevant ICD-9 codes, which were in use at the time of the study. Unstructured data were extracted from clinician notes in the electronic health record (EHR).
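
The article does not describe the NLP pipeline the researchers used. Purely as an illustration of what pulling a quantitative value out of free-text notes can involve, the following Python sketch uses a hypothetical regular expression to find a pack-year figure in a note. The pattern, the function name, and the sample notes are assumptions made for this example, not the study's method.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption for illustration): a number followed by
# "pack-year(s)", "pack year(s)", "pk-yr(s)", or similar spellings.
PACK_YEAR_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*-?\s*(?:pack|pk)[\s-]*(?:years?|yrs?)\b",
    re.IGNORECASE,
)


def extract_pack_years(note: str) -> Optional[float]:
    """Return the first pack-year figure mentioned in a note, or None."""
    match = PACK_YEAR_RE.search(note)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Made-up sample notes, not taken from MIMIC-III.
    samples = [
        "Former smoker, 30 pack-year history, quit in 2005.",
        "40-pack-year tobacco history, still smoking.",
        "Denies tobacco use.",
    ]
    for text in samples:
        print(extract_pack_years(text))
```

A pattern this simple would miss many real-world phrasings (for example, "1.5 ppd x 20 yrs"), which is one reason production systems rely on fuller NLP methods rather than a single expression.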

In the structured data group, 4040 patients either had no smoking history or had no smoking data available; the structured data could not distinguish between the two. For the remaining 575 patients, diagnostic codes identified 271 active smokers and 304 former smokers, but because no quit date or pack-year history was recorded, eligibility for LDCT could not be determined.

In the unstructured data group, 1930 ever-smokers were identified via NLP, of whom 537 were active smokers, 1299 were former smokers, and 94 had unknown quit status. Among the 537 active smokers, pack-year information was available for 200 patients; among the former smokers, quit date was available for 952 patients. Using this information, researchers were able to determine that 276 patients met the criteria for LDCT.
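
To show why pack-year and quit-date values matter, here is a minimal Python sketch of an eligibility check. The article does not state which version of the US Preventive Services Task Force criteria the study applied; the default thresholds below follow the 2021 recommendation (age 50 to 80 years, at least 20 pack-years, currently smoking or quit within the past 15 years) and are an assumption, as are the class and function names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SmokingHistory:
    age: int
    active: bool                  # True if the patient currently smokes
    pack_years: Optional[float]   # None if not documented
    quit_date: Optional[date]     # None if active or not documented


def ldct_eligible(
    h: SmokingHistory,
    today: date,
    min_age: int = 50,               # assumed: USPSTF 2021 thresholds
    max_age: int = 80,
    min_pack_years: float = 20.0,
    max_years_since_quit: float = 15.0,
) -> Optional[bool]:
    """Return True/False when eligibility can be decided, None when data are missing."""
    if not (min_age <= h.age <= max_age):
        return False
    if h.pack_years is None:
        return None  # cannot decide without a pack-year figure
    if h.pack_years < min_pack_years:
        return False
    if h.active:
        return True
    if h.quit_date is None:
        return None  # former smoker with no quit date: undetermined
    years_since_quit = (today - h.quit_date).days / 365.25
    return years_since_quit <= max_years_since_quit


if __name__ == "__main__":
    ref = date(2023, 2, 21)
    print(ldct_eligible(SmokingHistory(62, True, 35.0, None), ref))
    print(ldct_eligible(SmokingHistory(62, False, 35.0, None), ref))
```

The `None` result corresponds to the situation described for the structured data: a patient coded only as an active or former smoker, with no pack-year or quit-date value, cannot be classified either way.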

“Our results clearly demonstrate that unstructured data obtained from the clinical notes by NLP are superior to the use of structured data for determining a quantitative estimate of smoking behaviors,” the authors concluded. Patients identified via this method “can then be contacted by the LDCT program or their individual physicians with the expectation that they will be eligible for screening,” they added.

Disclosures: One author declared affiliations with biotech, pharmaceutical, and/or device companies. Please see the original reference for a full list of disclosures.


Ruckdeschel JC, Riley M, Parsatharathy S, et al. Unstructured data are superior to structured data for eliciting quantitative smoking history from the electronic health record. JCO Clin Cancer Inform. Published online February 21, 2023. doi:10.1200/CCI.22.00155