EMR/EHR data can be useful for surveillance, healthcare quality measurement, and research. One limitation is that data extraction can be difficult.1 Because of censoring, the data might not capture all points in time; there may be bias in how testing and treatment are documented; coding and standards are used inconsistently; and documentation styles vary from person to person.2 EMR data may also be specific to the site of service, so it may be representative only of a single setting, such as a community clinic, unless it is captured through an integrated system. For example, we may see only the outpatient or office visits for a patient. Other resource and site usage, such as diagnostic tests taken at other facilities, radiation therapies, emergency room visits, or inpatient visits and hospital stays, may have to be gleaned from NLP-based text mining.
In EMR data, structured fields may capture diagnoses, demographic information, clinical information, treatment orders, and laboratory tests. Unstructured data is captured through text mining, which searches for specific mentions of a term, phrase, or value. Unstructured data may include comment fields, notes dictated by physicians or other healthcare providers, or supplemental data not otherwise in the EMR. NLP uses algorithms to apply rules covering, for example, sentence boundaries; terms associated with one another; and variations, subgroups, and mentions of a token or term.3 This may let us search for qualitative measures not otherwise captured.
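As a rough illustration of rule-based text mining, the following Python sketch splits a note into sentences and searches each one for variants of a term. The lexicon `TERM_VARIANTS`, the function names, and the sample note are all hypothetical and chosen only for illustration; they do not come from any particular EMR system or NLP library. The sentence boundary rule is deliberately naive (it would split on abbreviations such as "Dr."), and the sketch does not handle negation (for example, "no emergency room visits"), which a production pipeline would need to address.

```python
import re

# Hypothetical lexicon: category -> phrase variants (illustrative only).
TERM_VARIANTS = {
    "emergency_visit": ["emergency room", "emergency department", "ER visit", "ED visit"],
    "radiation_therapy": ["radiation therapy", "radiotherapy", "XRT"],
    "inpatient_stay": ["admitted to", "hospitalized", "inpatient stay"],
}

# Naive sentence boundary rule: split on whitespace that follows ., ?, or !.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def compile_patterns(term_variants):
    """Build one case-insensitive, whole-word pattern per category."""
    patterns = {}
    for category, variants in term_variants.items():
        alternation = "|".join(re.escape(v) for v in variants)
        patterns[category] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def find_mentions(note, patterns):
    """Return each term mention with its category and containing sentence."""
    mentions = []
    for sentence in SENTENCE_BOUNDARY.split(note.strip()):
        for category, pattern in patterns.items():
            for match in pattern.finditer(sentence):
                mentions.append(
                    {"category": category, "term": match.group(0), "sentence": sentence}
                )
    return mentions


# Made-up note text, for demonstration only.
note = (
    "Patient seen in clinic today. "
    "She was hospitalized at an outside facility last month. "
    "Completed radiotherapy in March."
)
for mention in find_mentions(note, compile_patterns(TERM_VARIANTS)):
    print(mention["category"], "|", mention["term"], "|", mention["sentence"])
```

Keeping the containing sentence alongside each match makes it easier to review hits manually, which matters because a term mention alone does not confirm that the event occurred or when it occurred.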