Introduction
Big data is a term currently used in computer science to describe a range of technological tools capable of processing extensive data sets. Most such data are observational (also known as "real-world data") and, when analyzed, can reveal patterns, trends, and associations related to human behavior and interactions. These large-scale databases may consist of genetic, medical, environmental, economic, geographical, or social network data, and they are often so extensive and poorly organized that they cannot be analyzed using traditional computing techniques.1-4
Despite its great popularity and multiple uses, the concept of big data has no single agreed definition. Instead, it is usually characterized by the four "Vs": volume, velocity, variety, and veracity.5,6 Volume refers to the availability of massive amounts of data, which requires flexible and easily expandable management, retrieval, and storage systems. Velocity refers to the speed at which data are generated and updated, which the big data infrastructure must handle efficiently. Variety means that the data come in many formats. Finally, veracity concerns reducing the errors and unreliable information that affect data analysis and results. In other words, big data involves a large amount of heterogeneous data that is quickly updated and available for use but requires checking.5-7
Based on the above, this reflection article aims to describe general aspects of the current relevance of big data and its possible application in pharmacoepidemiology and pharmacovigilance. To this end, scientific literature published between 1 January 2000 and 30 November 2018 was searched. The databases consulted were MEDLINE via PubMed, ScienceDirect, and Scopus, and the search strategy included the MeSH terms ["Big data AND Pharmacoepidemiology"; "Big data AND Pharmacovigilance"].
Big data in the health area
Usually, multiple types of data are collected by different health professionals during administrative processes and clinical practice. These professionals include, on the one hand, physicians who record the clinical history of their patients, the therapies prescribed, the results of laboratory tests, and adverse events and, on the other hand, pharmacy personnel who record when medications are dispensed. All of this happens routinely.7,8 Since this information is not collected for scientific research purposes, the data are not always "clean" or available for analysis by researchers; as a result, data accumulate over long periods, and their value is not fully recognized or exploited.5,6 However, the usefulness of this information in health care is increasingly evident, so these data, which are a rich source of scientific evidence, need to be properly managed.7
The use of databases in the health sector began to increase in the 1990s, particularly in Europe, North America and, more recently, Asia, where they have been widely used to assess post-marketing prescription patterns, comparative efficacy, and safety of marketed drugs.9,10
The ability to link databases in the health area makes it possible to integrate various sources of information, providing an overall picture of the patient's medical history, and to carry out collaborative studies through international databases.5,6,11,12 These techniques are convenient, as it would be extremely costly and time-consuming to collect such information otherwise.13
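As a rough illustration of database linkage, the following sketch groups records from two sources under a shared patient identifier. It assumes that both sources already carry the same pseudonymized key; the field names (`patient_id`, `atc`, `icd10`), the function `link_by_patient`, and all records are hypothetical and invented for this example. Real linkage projects usually also need to handle missing or inconsistent identifiers.

```python
from collections import defaultdict

# Hypothetical extracts from two independent sources (all values are made up).
dispensings = [
    {"patient_id": "P001", "atc": "B01AA03", "date": "2018-03-01"},  # warfarin
    {"patient_id": "P002", "atc": "A10BA02", "date": "2018-03-04"},  # metformin
]
diagnoses = [
    {"patient_id": "P001", "icd10": "I48", "date": "2018-02-20"},    # atrial fibrillation
    {"patient_id": "P001", "icd10": "K92.2", "date": "2018-04-11"},  # GI haemorrhage
    {"patient_id": "P003", "icd10": "E11", "date": "2018-01-15"},    # type 2 diabetes
]


def link_by_patient(sources):
    """Group the records of every source under the shared patient key."""
    linked = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            linked[record["patient_id"]][source_name].append(record)
    return linked


linked = link_by_patient({"dispensings": dispensings, "diagnoses": diagnoses})

# Patients present in both sources have a combined history that can be studied.
in_both = sorted(
    pid for pid, history in linked.items()
    if history["dispensings"] and history["diagnoses"]
)
print(in_both)
```

Only patients found in both sources yield a combined drug and diagnosis history, which is the kind of overall picture that linkage is meant to provide.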
Large healthcare databases often contain information coded according to international classifications such as the International Classification of Diseases (ICD) and the Anatomical Therapeutic Chemical (ATC) classification system for drug information. Information can also be found in the form of free, unstructured text that requires artificial intelligence techniques such as text mining.7,14 Two main types of machine learning have been used in pharmacovigilance for automatic signal generation: supervised learning and unsupervised learning.
In unsupervised machine learning, a computer system learns associations between selected data elements on its own, i.e., without being "trained" on labeled examples; this approach has been used to identify complex drug safety signals and discover use patterns. In contrast, supervised machine learning requires "teaching" the system in advance, using examples for which the desired result is already known, so that it can build an algorithm.6,15
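The contrast between the two approaches can be sketched with a toy example. The data, the labels, and the two methods chosen here (a nearest-centroid classifier for the supervised case and plain k-means for the unsupervised case) are assumptions made for illustration; they are not the methods used in the cited studies.

```python
from statistics import mean

# Hypothetical toy data: (age in years, daily dose in mg) for six reports.
reports = [(25, 10), (30, 12), (28, 11), (70, 40), (75, 42), (68, 38)]
# Outcome known in advance (1 = adverse event confirmed); supervised learning only.
labels = [0, 0, 0, 1, 1, 1]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    return tuple(mean(axis) for axis in zip(*points))


def fit_supervised(points, known_labels):
    """Nearest-centroid classifier: one centroid per known label."""
    return {
        label: centroid([p for p, l in zip(points, known_labels) if l == label])
        for label in set(known_labels)
    }


def predict(model, point):
    return min(model, key=lambda label: squared_distance(model[label], point))


def fit_unsupervised(points, k=2, iterations=10):
    """Plain k-means: groups the points without ever seeing a label."""
    centers = list(points[:k])  # deterministic start, enough for a toy example
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(centers[i], p))
            groups[nearest].append(p)
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return [min(range(k), key=lambda i: squared_distance(centers[i], p)) for p in points]


model = fit_supervised(reports, labels)
print(predict(model, (72, 41)))   # class predicted for a new, unseen report
print(fit_unsupervised(reports))  # cluster index assigned to each report
```

The supervised model can only be built because the outcome of each report is known beforehand, whereas the clustering step finds the two groups from the measurements alone and leaves their interpretation to the analyst.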
Another potential application for big data is the so-called mobile health (mHealth) area. For some time, applications for smart electronic devices have been developed to support the management of many chronic diseases and conditions, such as diabetes, as well as tobacco cessation and even the improvement of nutritional habits.3,16 The information collected from these devices allows for predictive modeling that can result in more efficient and cheaper medical therapies with fewer adverse reactions.17
Medical device manufacturers produce tools for use in routine services that monitor clinical marker levels and automatically submit information to complete electronic health records. Taken together, this information allows healthcare providers and government agencies to adjust the treatment plan by phone, applications, e-mail, or directly through the measurement device, thus promoting healthcare compliance.2,3,5,17
Big data for drugs in the post-marketing phase
To market a novel drug, researchers and manufacturers invest a great deal of time, money, and logistics. Moreover, different phases, which go from pre-clinical research to the first clinical application, must be successfully completed before the drug is finally approved by the regulatory bodies. Once drugs are available to patients on the market, pharmacoepidemiology comes into play; it studies their use and effects (beneficial or adverse) in large populations in the post-marketing phase.1,9,18
Epidemiological surveillance has been fundamental in public health for decades, as it reports on the health status of patients based on data directly collected from healthcare institutions. These data include sociodemographic variables, clinical conditions, morbidities, laboratory reports, diagnostic and therapeutic strategies, adverse reactions, outcomes, survival, and mortality. This active surveillance is supported by smart electronic devices with internet access, through which patients report symptoms and other data that are updated in real time.1,3 The same approach can be used in pharmacovigilance for reporting adverse drug events.7,8,19
The beginning of the technological revolution in the 1970s impacted surveillance systems by improving accessibility and increasing the speed with which data were transmitted between institutions. Similarly, the number of data sources that can be used in pharmacoepidemiology and pharmacovigilance increased; these include spontaneous reporting systems, digitized healthcare databases, and adverse reaction reports, among others.3,6,8
The creation of data systems that collect information on adverse event reports has been a breakthrough in the area of drug safety. Currently, there are international databases that collect such information, continually review it through signal analysis, and issue alerts about possible associations between an adverse event and a drug.8,20,21 This methodology allows the continuous incorporation of data from various sources and its analysis in real time, which in turn allows the detection of adverse reactions that were previously unknown or whose magnitude could be greater than expected.9,13
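The text does not specify which signal analysis method these databases use. One widely used family of methods is disproportionality analysis, and the sketch below computes a reporting odds ratio (ROR) with an approximate 95% confidence interval from a 2x2 table of reports. The counts are hypothetical, the function name is chosen for this example, and the rule of flagging a signal when the lower bound exceeds 1 is a common convention assumed here rather than taken from the source.

```python
import math


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """Return (ROR, lower, upper) for a 2x2 table of spontaneous reports.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports without the drug, with the event
    d: reports without the drug, without the event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this simple estimator")
    ror = (a * d) / (b * c)
    # Standard error of ln(ROR), used for a Wald-type confidence interval.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se_log)
    upper = math.exp(math.log(ror) + z * se_log)
    return ror, lower, upper


# Hypothetical counts, invented for illustration only.
ror, lower, upper = reporting_odds_ratio(a=20, b=980, c=100, d=98900)
is_signal = lower > 1  # screening rule assumed for this sketch
print(f"ROR = {ror:.1f} (95% CI {lower:.1f} to {upper:.1f}), signal: {is_signal}")
```

A flagged pair is only a hypothesis for further review: spontaneous reports lack a denominator of exposed patients, so the ROR measures disproportionate reporting and not risk.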
Advances in pharmacoepidemiology and pharmacovigilance
Pharmacovigilance appeared more than 50 years ago in response to the harmful side effects caused by the drug thalidomide. In the early years, this science was based on anecdotal evidence and case series obtained through systematic spontaneous reporting, so it did not provide a reliable estimate of incidence or risk. The second generation brought important observational studies that contributed knowledge about the potential adverse effects of new and old drugs. Finally, third-generation pharmacovigilance began with meta-analyses of clinical trials, which also made important contributions.8
Furthermore, in recent years, the potential for research based on healthcare databases has generated interest in studies that show an association between the use of a drug and an adverse effect that could not have been identified during the follow-up time of a conventional clinical trial. Examples include the use of proton-pump inhibitors and the risk of myocardial infarction,21 and certain drug interactions in the actual clinical context of patients treated with anticoagulants.22
The study of big data as a pharmacoepidemiology and pharmacovigilance strategy began in 1990 and, to date, it has proven to be cost-effective, fast, and reliable. Accordingly, the Food and Drug Administration (FDA) has not only stated that this strategy has many advantages but has also expanded its use to analyze the growing number of reports it receives.7
According to the relevant literature, several databases contain enough information to support health studies and have potential application in drug consumption analysis and pharmacovigilance studies. They include the Danish National Health Service Prescription Database,23,24 the UK's Clinical Practice Research Datalink (CPRD),25 the US FDA Adverse Event Reporting System (FAERS),26,27 and the Scottish Prescribing Information System.28
In this context, there is evidence that different companies are increasingly using big data and artificial intelligence techniques to support pharmacovigilance activities. However, there is still a long way to go,29 especially in Latin America, where this type of technology is underdeveloped in the areas of natural sciences and health.30-32
Even with the benefits they offer, these techniques have limitations, including the lack of quality standards and validation methods for some of their records, which may be incomplete, inconsistent, and subject to a great deal of potential bias and confounding. In addition, the use of massive amounts of data may cause an existing relationship to go undetected due to the masking or dilution of a phenomenon.7,33
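Masking can be illustrated with a small worked example based on the proportional reporting ratio (PRR), a disproportionality measure that compares the share of reports mentioning an event for one drug with the share for all other reports. The source does not name this measure, and all counts and drug names below are hypothetical; the sketch only shows how a large volume of reports for a second drug can hide the signal of the first.

```python
def prr(drug_event, drug_other, rest_event, rest_other):
    """Proportional reporting ratio: event share for the drug vs. all other reports."""
    drug_rate = drug_event / (drug_event + drug_other)
    rest_rate = rest_event / (rest_event + rest_other)
    return drug_rate / rest_rate


# Hypothetical counts: drug A has 20 reports of event E out of 1,000.
a_event, a_other = 20, 980

# Background without drug B: 100 reports of E out of 99,000.
before = prr(a_event, a_other, 100, 98_900)

# Drug B adds 3,000 reports to the background, 2,000 of them for event E.
after = prr(a_event, a_other, 100 + 2_000, 98_900 + 1_000)

print(f"PRR before: {before:.2f}, after: {after:.2f}")
```

With these invented numbers, the PRR for drug A falls from about 19.8 to just under 1 once the reports for drug B enter the background, although nothing about drug A itself has changed.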
Conclusions
The availability of large amounts of healthcare data increases the power of analysis of this information and creates an opportunity to study drug use and safety. Given the high flow of information, organizing and coding unstructured and highly complex data requires big data techniques that support various analysis procedures and yield results applicable to routine medical practice. Managing and exploiting these expanding sources of information is the next challenge for the application of research methods in modern pharmacology.1,6,17,34
Another relevant advantage of the use of big data in pharmacoepidemiology and pharmacovigilance is the diversity of the data: medical records can be analyzed together with information on hospitalization, outpatient consultations, drug prescriptions, and laboratory tests, and continuous monitoring using smart electronic devices also becomes possible.1,2,6
Because secondary data sources have limitations, their interpretation poses some important challenges, such as the accumulation of estimation errors and spurious correlations.3 Analyses of these massive data flows must adjust continually to changing conditions, so the algorithmic intelligence of digital epidemiology must be harnessed. In this regard, new technologies must be regulated by public health institutions so that data are properly distributed and high standards of accuracy are maintained.1,6