Managing data volume with new methodologies
The past 20 years have seen a dramatic increase in the data accessible to stakeholders within the healthcare industry, which has provided ever-increasing opportunities for health economics and outcomes research (HEOR).1 This increase is reflected in the growing volume of data from payer claims as well as from electronic medical records (EMR) and other sources.2 Claims and EMR data are also being linked to sociodemographic and consumer data, which gives a broader perspective on patients and physicians and on how healthcare services are being provided. EMR data can be updated in near real time, whereas some claims data can be refreshed weekly. The speed at which these data can be accessed creates new opportunities for understanding the healthcare system and for addressing questions from various stakeholders and vendors about how best to position products to serve consumers.
One of the challenges with this data is how it can best be analyzed. Traditionally, standard comparative statistical methods (eg, t tests, chi-square analyses) have been used to assess populations and make statistically significant comparisons within specific groups of patients or within segments of the data. However, with the wealth of data now available, new techniques developed from machine learning are starting to be leveraged to gain a better understanding of complex patterns within the data. Although the technology of machine learning emerged more than 50 years ago, its impact on healthcare remains to be determined. While these new techniques offer the possibility of gaining new insights into the healthcare environment, there are certain caveats that should be considered when using these methodologies.
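As an illustration of the traditional comparative approach mentioned above, the following minimal sketch computes a Pearson chi-square statistic for a 2x2 table. The counts, the group labels, and the function name `chi_square_2x2` are hypothetical and are not drawn from any dataset discussed in this article; no continuity correction is applied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
        [[a, b],
         [c, d]]
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    # Shortcut formula for 2x2 tables: n * (ad - bc)^2 / (product of the margins)
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: rows are two treatment groups,
# columns are readmitted / not readmitted.
stat = chi_square_2x2(30, 70, 45, 55)

# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05
CRITICAL_1DF_05 = 3.841
significant = stat > CRITICAL_1DF_05

print(round(stat, 3), significant)  # 4.8 True
```

A test like this answers one prespecified question about two groups, which is the contrast with the pattern-finding techniques described next.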
What Is Machine Learning?
Machine learning refers to a large group of analytic and statistical techniques, including supervised methods (eg, co-training [training of regression models], Bayesian statistics, and decision trees) and unsupervised methods (eg, clustering, neural networks [which are also widely used as supervised models], and principal component analysis), that are focused on predicting outcomes or events based on the available data.3 These techniques allow computers to learn from the data at their disposal without being explicitly programmed for each task; the underlying programs adjust themselves when exposed to new sets of data.
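The difference between supervised and unsupervised methods can be sketched on a single variable. In this illustrative example, which uses hypothetical laboratory values and hypothetical function names, `fit_stump` learns a cut point from labeled examples (a one-split decision tree), while `two_means` groups the same values into two clusters without seeing any labels (k-means with k = 2).

```python
def fit_stump(values, labels):
    """Supervised: pick the threshold that misclassifies the fewest labeled
    examples, predicting 1 when value > threshold.
    Returns None if there are fewer than two distinct values."""
    best_threshold, best_errors = None, len(values) + 1
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def two_means(values, iterations=20):
    """Unsupervised: split one variable into two clusters (1-D k-means, k = 2).
    Returns the two cluster centers."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values identical: nothing to split
            break
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


# Hypothetical lab values and diagnosis labels (1 = diagnosed), for illustration only
lab_values = [5.1, 5.4, 5.6, 5.8, 7.2, 7.9, 8.4, 9.0]
diagnosed = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(lab_values, diagnosed)  # uses the labels
centers = two_means(lab_values)               # ignores the labels
print(threshold, centers)
```

On this clean toy data both approaches separate the same two groups; on real data they often disagree, and only the supervised model is tied to a known outcome.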
These techniques generally look for patterns within the data in order to understand what might be leading to a specific event or outcome. In theory, the more complete or “explanatory” the source data is, the more robust the prediction of outcomes will be and the easier it becomes to isolate specific factors (eg, age, race, weight, clinical history, etc.) that might be predictive of certain outcomes. However, causal relationships between these factors and the outcomes cannot always be determined, even when a factor is predictive of a specific outcome. The question is how to distinguish a mere association between a specific factor and an outcome from factors that actually lead to specific outcomes and can therefore provide actionable intelligence. Methodologies for making this distinction are still being developed; they incorporate clinical input and real-world evidence to guide the conclusions drawn from these analyses. A number of machine learning applications using HEOR data have been published, and certain caveats apply to these methodologies.
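The gap between association and causation can be made concrete with a small, entirely hypothetical cohort. In the sketch below, an exposure appears to raise risk in the pooled data, yet within each age stratum the exposed and unexposed groups have identical risk; the apparent effect comes only from older patients being both more often exposed and at higher baseline risk. The counts, stratum names, and function names are assumptions chosen for illustration.

```python
# Hypothetical cohort: (events, patients) by age stratum and exposure status
cohort = {
    "under_65": {"exposed": (10, 100), "unexposed": (40, 400)},
    "65_plus": {"exposed": (160, 400), "unexposed": (40, 100)},
}


def risk(events, patients):
    return events / patients


def crude_risk_difference(cohort):
    """Risk difference (exposed minus unexposed) ignoring age."""
    totals = {"exposed": [0, 0], "unexposed": [0, 0]}
    for groups in cohort.values():
        for group, (events, patients) in groups.items():
            totals[group][0] += events
            totals[group][1] += patients
    return risk(*totals["exposed"]) - risk(*totals["unexposed"])


def stratified_risk_differences(cohort):
    """Risk difference (exposed minus unexposed) within each age stratum."""
    return {
        stratum: risk(*groups["exposed"]) - risk(*groups["unexposed"])
        for stratum, groups in cohort.items()
    }


crude = crude_risk_difference(cohort)           # about 0.18: exposure looks harmful
by_age = stratified_risk_differences(cohort)    # 0.0 in both strata
print(round(crude, 2), by_age)
```

A predictive model trained on the pooled data would rank the exposure as informative, which is correct for prediction and misleading for intervention. This is where clinical input is needed to decide which variables to stratify or adjust for.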
Machine Learning Applications in HEOR Data
Understanding a Patient Population
Understanding Treatment Options
Evaluating Treatment Patterns and Outcomes
Considerations When Using Machine Learning
While the aforementioned examples show some of the advantages and applications of machine learning, there are certain considerations that researchers should keep in mind when using these tools in their research. First, there are the questions of data quality and data interpretation. While myriad datasets containing large amounts of data are available from various sources, not all of these data are usable for machine learning. Most datasets are not created for HEOR purposes, but rather to meet other specific needs. Administrative claims data, for example, is designed primarily to support reimbursement; the available fields are structured in a way that facilitates claims adjudication but is not necessarily conducive to research. Before using this type of data, one must take care to understand both the structure of the data and the artifacts that could skew results or create erroneous ones (eg, miscoding, order of diagnoses, upcoding, etc).
These limitations also apply to EMR data. While these data provide a wide variety of information that was previously not available in claims data, with a frequency that allows for real-world analyses, universal standards for the information within an EMR are lacking; thus, data can vary dramatically even between treatment settings within the same area. Although some standardization has been implemented, it is difficult to create a standard form that covers the myriad healthcare providers and settings across the country.7 Additionally, many fields still exist as free text, which cannot easily be incorporated into an analytic dataset for research. There are also still concerns regarding the consistency of this data from patient to patient. Inconsistencies due to missing records and incorrect entries abound, so the data require additional curation before they can be used for research.
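A first curation pass often amounts to counting what is missing and what is implausible before any modeling begins. The following is a minimal sketch under stated assumptions: the field names, the plausibility limits, and the sample records are all hypothetical and would need to be replaced with the actual schema and clinically reviewed ranges of a given EMR extract.

```python
# Hypothetical schema and illustrative limits; not taken from any real EMR system
REQUIRED_FIELDS = ("patient_id", "birth_year", "systolic_bp")
PLAUSIBLE_RANGES = {"birth_year": (1900, 2025), "systolic_bp": (50, 300)}


def audit_records(records):
    """Count missing values and implausible values per field."""
    missing = {field: 0 for field in REQUIRED_FIELDS}
    implausible = {field: 0 for field in PLAUSIBLE_RANGES}
    for record in records:
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or value == "":
                missing[field] += 1
                continue
            if field not in PLAUSIBLE_RANGES:
                continue
            try:
                number = float(value)
            except (TypeError, ValueError):
                # Non-numeric entry in a numeric field (eg, free text)
                implausible[field] += 1
                continue
            low, high = PLAUSIBLE_RANGES[field]
            if not low <= number <= high:
                implausible[field] += 1
    return missing, implausible


# Hypothetical records showing the kinds of problems described above
records = [
    {"patient_id": "A1", "birth_year": 1950, "systolic_bp": 128},
    {"patient_id": "A2", "birth_year": None, "systolic_bp": 1200},  # missing year, likely keying error
    {"patient_id": "A3", "birth_year": 1875, "systolic_bp": ""},    # implausible year, missing reading
]

missing, implausible = audit_records(records)
print(missing)
print(implausible)
```

Counts like these do not fix the data, but they show which fields can support an analysis and which need review at the source.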
Novel data sources, such as social media, pose additional problems. Some of the data from these sources is opinion-based and might not be useful beyond a qualitative assessment of a population unless care has been taken to follow rigorous methodology for the elicitation of information (eg, patient-reported outcomes measurement tools). It is also difficult to determine the accuracy of this data, especially in an online setting that often allows for anonymity.
Consideration of these limitations provides a clearer perspective on machine learning. To produce accurate results, most machine learning techniques depend on prior information and on the quality of that information, in terms of both completeness and accuracy. Without these two components (completeness and accuracy), any results must be viewed with some skepticism and might require additional input in the form of clinical expertise or additional data sources before conclusions can be drawn. Even with training and model-fitting techniques, a researcher must always be careful not to over-interpret results from machine learning, especially for techniques that require little human input.
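One common guard against over-interpretation is to score a model on records it did not see during fitting. The sketch below uses a deliberately over-flexible "model" that simply memorizes its training rows; the rows, variable names, and functions are hypothetical and stand in for whatever technique is actually being evaluated.

```python
from collections import Counter


def fit_memorizer(rows):
    """An intentionally over-flexible model: memorize each training row's
    label and fall back to the most common training label for unseen rows."""
    table = {features: label for features, label in rows}
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: table.get(features, default)


def accuracy(model, rows):
    return sum(model(features) == label for features, label in rows) / len(rows)


# Hypothetical rows: ((age_band, prior_admissions), readmitted)
train = [
    (("65+", 2), 1), (("65+", 0), 0), (("<65", 1), 0),
    (("<65", 3), 1), (("<65", 0), 0),
]
held_out = [
    (("65+", 1), 1), (("65+", 3), 1), (("<65", 0), 0), (("<65", 2), 1),
]

model = fit_memorizer(train)
train_accuracy = accuracy(model, train)        # 1.0: looks perfect
held_out_accuracy = accuracy(model, held_out)  # 0.25: does not generalize
print(train_accuracy, held_out_accuracy)
```

The gap between the two numbers, rather than either number alone, is the signal that the apparent fit reflects memorized detail and not a pattern that will hold for new patients.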
Conclusion
The advent of new data sources and the increase in the volume and availability of up-to-date data have presented increasing opportunities to HEOR researchers for understanding various aspects of healthcare. Machine learning provides tools that can help interpret patterns within large datasets and identify factors that are associated with specific outcomes. However, it is still difficult to prove causal relationships from most results. Various methods have been developed to evaluate individual data sources and link data in novel ways to gain different perspectives on the data and on outcomes within specific patient populations.
Machine learning of this type has already generated valuable intelligence for the healthcare community and has the potential to yield new insights into patient populations with specific diseases and conditions. The aforementioned examples demonstrate the potential for machine learning to uncover new variables and provide better guidance on treatment practices and patient outreach using existing data.4-6 Furthermore, while most analytic approaches are hypothesis-driven, machine learning opens up new avenues for generating questions from the data rather than mapping the data to existing questions that are often generated from qualitative observations.
The current trend in machine learning is toward less human involvement in generating models and in drawing conclusions from an analysis. With the advent of artificial intelligence, more complex algorithms are being implemented to address specific questions within data as well as to generate new hypotheses for research. The caveat remains that machines are limited in their ability to interpret data. Computers will often search for the simplest solution given the data at hand, which follows the principle of parsimony, but they may not take into account the complexity of the data or the inherent limitations of data that are still, for the most part, being generated by humans. It is therefore paramount that researchers continue to validate and assess models before drawing conclusions from them. This practice will better guide healthcare practitioners and provide useful insights for the healthcare community.
References
1 Laney D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. Stamford, CT: META Group, Inc.; 2001.
2 Curtis L, Brown J, Platt R. Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Affairs 2014; 33:1178-1186.
3 Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag; 2009.
4 Razavian N, Blecker S, Schmidt AM, et al. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data 2015; 3:277-287.
5 Devinsky O, Dilley C, Ozery-Flato M, et al. Changing the approach to treatment choice in epilepsy using big data. Epilepsy Behav 2016; 56:32-37.
6 Freedman RA, Viswanath K, Vaz-Luis I, Keating NL. Learning from social media: utilizing advanced data extraction techniques to understand barriers to breast cancer treatment. Breast Cancer Res Treat 2016; 158:395-405.
7 Reynolds CL. Paper on Concept Processing (PDF). March 31, 2006. Accessed December 4, 2013.
FOCUS Magazine