While the aforementioned examples show some of the advantages and applications of machine learning, researchers should keep certain considerations in mind when using these tools. First, there are the questions of data quality and data interpretation. Although myriad datasets containing large amounts of data are available from various sources, not all of these data are usable for machine learning. Most datasets are not created for HEOR purposes, but rather to meet other specific needs. Administrative claims data, for example, are designed primarily to support reimbursement; the available fields are structured in a way that facilitates claims adjudication but is not necessarily conducive to research. Before using this type of data, one must take care to understand both the structure of the data and the artifacts that could skew results or produce erroneous ones (eg, miscoding, order of diagnoses, upcoding).
These limitations also apply to EMR data. While these data provide a wide variety of information that was previously not available in claims data, and at a frequency that supports real-world analyses, universal standards for the information within an EMR are lacking; thus, data can vary dramatically even between treatment settings within the same area. Although some standardization has been implemented, it is difficult to create a standard form that covers the myriad healthcare providers and settings across the country.7 Additionally, many fields still exist as free text, which cannot easily be incorporated into an analytic dataset for research. There are also still concerns regarding the consistency of these data from patient to patient. Inconsistencies due to missing records and incorrect entries abound, so the data require additional curation before they can be used for research.
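As a rough illustration of the kind of curation check described above, the following Python sketch counts missing and implausible values per field in a small EMR-style extract. It is only a minimal example under stated assumptions: the records, the field names (patient_id, age, systolic_bp, diagnosis_code), and the plausibility ranges are hypothetical and are not drawn from any real dataset or standard.

```python
from collections import Counter

# Hypothetical EMR-style extract; field names and values are illustrative only.
records = [
    {"patient_id": "A1", "age": 54, "systolic_bp": 128, "diagnosis_code": "E11.9"},
    {"patient_id": "A2", "age": None, "systolic_bp": 135, "diagnosis_code": "I10"},
    {"patient_id": "A3", "age": 430, "systolic_bp": 118, "diagnosis_code": ""},
    {"patient_id": "A4", "age": 67, "systolic_bp": None, "diagnosis_code": "I10"},
]

# Assumed plausibility ranges; real limits should come from clinical input.
PLAUSIBLE_RANGES = {"age": (0, 120), "systolic_bp": (50, 300)}


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def audit(rows, ranges):
    """Count missing and out-of-range values per field."""
    missing = Counter()
    implausible = Counter()
    for row in rows:
        for field, value in row.items():
            if is_missing(value):
                missing[field] += 1
            elif field in ranges:
                low, high = ranges[field]
                if not (low <= value <= high):
                    implausible[field] += 1
    return missing, implausible


missing, implausible = audit(records, PLAUSIBLE_RANGES)
for field in records[0]:
    share = missing[field] / len(records)
    print(f"{field}: {share:.0%} missing, {implausible[field]} implausible")
```

A summary of this kind only flags candidate problems. Deciding whether a flagged value is a true error, and how to handle it, would still call for knowledge of how the source system records data.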
Novel data sources, such as social media, pose additional problems. Some of the data from these sources are opinion-based and might not be useful beyond a qualitative assessment of a population unless care has been taken to follow rigorous methodology for eliciting information (eg, patient-reported outcomes measurement tools). It is also difficult to determine the accuracy of these data, especially in an online setting that often allows for anonymity.
Consideration of these limitations provides a clearer perspective on machine learning. To produce accurate results, most machine learning techniques rely on prior information and on the quality of that information, namely the completeness and accuracy of the data. Without completeness and accuracy, any results must be viewed with some skepticism and might require additional input in the form of clinical expertise or additional data sources before conclusions can be drawn. Even with training and model-fitting techniques, a researcher must always be careful not to overinterpret results from machine learning, especially for techniques that require little human input.