Since its beginnings, Psychology has been devoted to both generating data and understanding human behavior through data analysis. Back in 1879, Dr. Wilhelm Wundt opened the first experimental psychology lab at the University of Leipzig to study reaction times. Many consider this the start of Psychology as a separate scientific discipline and of data analysis for data-driven decision making in the field (Flis, 2019; Tweney, 2003). In this Editorial, we briefly discuss how Psychology students, clinicians, and researchers may take part in the data revolution and help transform Psychology, as we know it, into Machine Learning Psychology.
1. Data Explosion in Psychology: A Place for Data Science
Nowadays, there is an explosion of data in many areas, and Psychology is no exception (Mabry, 2011; Zhu et al., 2009). In fact, considering the many branches of modern Psychology (King University, 2019; Ritchie & Grenier, 2003), the amount of data generated by psychologists seems far from decreasing. Hence, psychologists would greatly benefit from combining theoretical models with the right Data Science tools to correctly analyze data from experiments and surveys (Loftus, 1996). Training psychologists in Data Science is therefore essential for understanding and visualizing data, developing predictive models, and, as a consequence, fostering knowledge generation (Neth, 2021a, 2021b). In other words, starting from undergraduate programs, we need to provide Psychology students with the tools necessary to take part in the data revolution and, in the near future, to make data-driven decisions (Jack et al., 2018; Mandinach, 2012; Tolle et al., 2011).
Data Science is an exciting, broad, and multidisciplinary field that turns raw data into understanding and insight. It involves principles, processes, and techniques for understanding phenomena through the analysis of data, using a galaxy of connected topics ranging from basic Statistics and Probability (i.e., descriptive and inferential statistics) to Machine Learning (ML) and Artificial Intelligence (AI; Provost & Fawcett, 2013). Broadly speaking, there are five types of analytical approaches in Data Science: (1) descriptive analytics, which explains what happened; (2) diagnostic analytics, which explains why things happened; (3) predictive analytics, which uses predictive models to forecast what is likely to happen based on observed data; (4) prescriptive analytics, which recommends a course of action based on the results of a predictive model; and (5) cognitive analytics, which exploits the advances in ML and AI (i.e., intelligent systems) through High Performance Computing to develop analytic models with a human-like intelligence (Dey, 2016; Gudivada et al., 2016; Lepenioti et al., 2020). Note that these approaches open new possibilities for analyzing Psychology data that go beyond traditional summary statistics (e.g., mean, median, range, and standard deviation), correlation/regression analyses, and the assessment of the psychometric properties of a clinical instrument (Cuartas Arias, 2017).
2. ML Psychology: Predictive Models, Clustering, and Intelligent Systems
In Psychology, data can be generated from diverse sources, ranging from surveys and clinical instruments/batteries, which assess important aspects of human behavior, to EEG, reaction times, and genetic and omics data that quantify changes in the brain and the frequency distribution of traits or gene/protein expression, function, and evolution (Bell & Cuevas, 2012; Bragazzi, 2013; Jiménez-Figueroa et al., 2017; Suarez et al., 2020). ML has drawn the research community's attention for disclosing patterns, detecting objects, and developing predictive frameworks in several diseases (Dey, 2016; Dhall et al., 2020) as well as in several areas of Psychology, including Psychometrics, Experimental Psychology, Diagnosis, Treatment, Follow-up, and Personalized and Predictive Care (Dwyer et al., 2018; Jacobucci & Grimm, 2020; Koul et al., 2018; Lin et al., 2020; Orrù et al., 2020; Rosenfeld et al., 2012; Shatte et al., 2019), demonstrating its usefulness for elucidating important aspects of disease.
When using ML, the data can be of any nature (i.e., binary, multinomial, ordinal, or continuous), and the underlying assumptions are minimal. Whether or not an outcome variable is available for each individual in the sample determines the type of ML technique to be applied (i.e., supervised ML vs. unsupervised ML). Broadly speaking, supervised ML refers to developing predictive models for an outcome of interest $Y$ based on a set of predictors $X = (X_1, X_2, \ldots, X_P)^T$; the predictive model that best fits the data is selected based on an error-related measure (i.e., the root mean squared error [RMSE] and the mean absolute error [MAE] for continuous outcome variables, and the sensitivity, specificity, accuracy, and lift for dichotomous variables; Kuhn, 2008, 2020). Some of the most common supervised ML algorithms include Classification and Regression Trees (CART; Breiman et al., 1984), Random Forest (RF; Breiman, 2001), Support Vector Machines (SVMs; Cortes & Vapnik, 1995), and eXtreme Gradient Boosting (XGBoost; Chen & Guestrin, 2016).
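The error-related measures named above can be made concrete with a few lines of Python. The following is only a minimal sketch using the standard library; the function names (`rmse`, `mae`, `binary_metrics`) and all the data are hypothetical and chosen for illustration, and lift is left out for brevity.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a continuous outcome."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error for a continuous outcome."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for a 0/1 outcome (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical observed vs. predicted scores on a continuous scale.
observed_scores = [10.0, 12.0, 15.0, 20.0]
predicted_scores = [11.0, 12.0, 13.0, 22.0]
print(rmse(observed_scores, predicted_scores))  # 1.5
print(mae(observed_scores, predicted_scores))   # 1.25

# Hypothetical case/control labels (1 = case, 0 = control).
observed_status = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_status = [1, 0, 1, 0, 0, 1, 0, 1]
print(binary_metrics(observed_status, predicted_status))
```

In practice, these measures would be computed on data not used to fit the model (e.g., a held-out test set or cross-validation folds), so that candidate models are compared on their ability to generalize.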
When the data lack an outcome variable (i.e., case/control status or ‘labels’) but different measures are available (e.g., responses to a clinical battery), unsupervised ML techniques can be used to identify hidden complex structures in the data. Three of the main methods used in unsupervised ML are principal component analysis (PCA), multidimensional scaling (MDS), and clustering. PCA is an exploratory dimensionality reduction technique, based on the eigenvalue decomposition of the variance-covariance matrix, that allows visualizing high-dimensional data (i.e., $k > 3$ variables are measured) while preserving as much statistical information as possible (Joliffe & Morgan, 1992; Ringnér, 2008; Ritchie & Grenier, 2003). MDS allows visualizing the level of similarity between individuals in a data set by calculating a dissimilarity or distance function $D(X)$ such that individuals closely related to each other have low dissimilarity (Mead, 1992). In this sense, the choice of an appropriate dissimilarity function is crucial (Harmouch, 2021). Clustering methods, on the other hand, help to identify, based on a set of features or variables, groups of individuals that would be impossible to spot otherwise. Multiple clustering techniques are available in the literature (e.g., K-means clustering, hierarchical clustering, and distribution-, model-, and density-based clustering techniques; Roman, 2019). However, the choice of method depends heavily on the data and involves assessing the stability and compactness of the derived clusters using different performance measures (Pedregosa et al., 2011; Scikit-learn Project, 2021).
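To illustrate the eigenvalue decomposition behind PCA, the sketch below projects simulated questionnaire responses onto their first two principal components. It assumes NumPy is installed; the helper `pca_eigen` and the simulated data (100 respondents, 5 items driven by one latent trait) are hypothetical and serve only as an illustration.

```python
import numpy as np


def pca_eigen(data, n_components=2):
    """PCA via eigendecomposition of the variance-covariance matrix."""
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh is appropriate for symmetric matrices; it returns eigenvalues
    # in ascending order, so we re-sort them from largest to smallest.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = centered @ eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, explained_ratio


# Hypothetical data: 5 items that all reflect a single latent trait plus noise.
rng = np.random.default_rng(seed=0)
latent_trait = rng.normal(size=(100, 1))
loadings = np.array([[1.0, 0.9, 0.8, 0.7, 0.6]])
responses = latent_trait @ loadings + 0.3 * rng.normal(size=(100, 5))

scores, explained_ratio = pca_eigen(responses, n_components=2)
print(scores.shape)       # one row per respondent, one column per component
print(explained_ratio)    # share of total variance captured by each component
```

The two columns of `scores` can then be drawn as a scatter plot, giving a two-dimensional view of individuals measured on five variables.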
For high-dimensional data, combining PCA+clustering or MDS+clustering is a go-to recipe to graphically represent individuals' relationships and subgroups according to some features. Subsequent work may include developing ML predictive models that can classify new individuals into such derived groups (Roman, 2019). Interestingly, the combination of unsupervised ML techniques may lead to the identification of individuals exhibiting differential clinical profiles (i.e., extreme phenotypes; Acosta et al., 2011; Arcos-Burgos et al., 2019; Elia et al., 2009; Pérez-Gracia et al., 2010; Vidal et al., 2020; Yu et al., 2017; Yu et al., 2018), hence contributing to the development of personalized interventions, treatments, and follow-up strategies. The combination of supervised and unsupervised ML techniques, as well as the automation of the data analysis process, could allow the development of data-driven Intelligent Systems that support psychologists in making more accurate and timely decisions (de Mello & de Souza, 2019; Luxton, 2016).
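The PCA+clustering recipe described above could be sketched as follows with scikit-learn (Pedregosa et al., 2011). This is an illustrative sketch rather than a recommended analysis: the two simulated subgroups, the number of features, and the use of the silhouette coefficient as the compactness measure are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two subgroups of 60 individuals measured on 10 features.
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(60, 10))
group_b = rng.normal(loc=4.0, scale=1.0, size=(60, 10))
features = np.vstack([group_a, group_b])

# Step 1: standardize the features and reduce them to two components.
scaled = StandardScaler().fit_transform(features)
components = PCA(n_components=2).fit_transform(scaled)

# Step 2: run K-means for several candidate numbers of clusters and
# compare them with the silhouette coefficient (higher = more compact).
silhouettes = {}
for k in range(2, 6):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0)
    silhouettes[k] = silhouette_score(components, candidate.fit_predict(components))

best_k = max(silhouettes, key=silhouettes.get)
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(components)

print(best_k)                     # number of clusters with the best silhouette
print(np.bincount(final_labels))  # number of individuals in each cluster
```

With real data, the subgroups are rarely this well separated, so the resulting clusters should also be checked for stability and interpreted with clinical expertise before being used to classify new individuals.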
3. R, Python and the Democratization of ML
However promising the transition to ML Psychology may seem, data-driven decision making requires not only a proficient understanding of Data Science, Data Analytics, and ML/AI techniques, as well as of the Psychology component associated with the data at hand, but also a comprehensive set of computational tools that facilitates the implementation, validation, and deployment of ML models. Thus, ML Psychology imposes new challenges in terms of the level of training in computational tools and abstract thinking that Psychology students need to develop. In what follows, we present some open-source, free-of-charge alternatives to get started.
For roughly three decades, R (www.r-project.org) and Python (www.python.org) have taken the lead in democratizing ML algorithms by making them easily accessible to the general public at no cost. More recently, Julia (www.julialang.org) has emerged as a powerful and efficient alternative.
Established as an open-source project in 1995, R is a highly extensible language and environment that provides a great variety of statistical and graphical techniques, including classical statistical tests, predictive modelling, clustering, and other ML algorithms (R Core Team, 2021). R has multiple freely available packages that support ML workflows, namely caret, dplyr, tensorflow, DataExplorer, ggplot2, kernlab, mice, mlr3, plotly, randomForest, rpart, e1071, keras, and OneR. For more details, see the Comprehensive R Archive Network (CRAN) Task View (Hothorn, 2021).
On the other hand, Python, which was created by Guido van Rossum in 1991, is a widely used, interpreted, object-oriented, high-level programming language with dynamic semantics, used for general-purpose programming (Python Software Foundation, 2021; van Rossum, 2009). For ML in Python, scikit-learn (Pedregosa et al., 2011) is the go-to library. It offers an open-source collection of simple, reusable, and efficient classification, regression, and clustering methods, as well as dimensionality reduction techniques, model selection algorithms, and pre-processing routines (Pedregosa et al., 2011), accessible to everybody for developing predictive models.
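As a rough illustration of what developing a predictive model with scikit-learn might look like, the sketch below fits a Random Forest classifier to simulated data and estimates its accuracy. The data set is synthetic (generated with `make_classification`), and the sample size, number of predictors, and hyperparameters are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a data set with 8 predictors and a binary outcome
# (e.g., case/control status); no real participants are involved.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hold out 25% of the individuals to evaluate the final model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation on the training set estimates out-of-sample accuracy.
cv_accuracy = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Fit on the full training set and evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(cv_accuracy.mean())
print(test_accuracy)
```

The same fit/predict pattern applies to the other estimators in the library, so swapping the Random Forest for, say, a Support Vector Machine requires changing only the line that creates `model`.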
Finally, Julia is a general-purpose, dynamic, high-level, and high-performance programming language started in 2012 by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. Julia was conceived to be “as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell” (Bezanson et al., 2012, para. 4). Similar to R and Python, Julia also offers tools for Data Visualization, Data Science, and ML.
4. Getting Closer and Closer to the Promised Land
With the explosion of data in Psychology, ML methods hold promise for personalized care by tailoring treatment decisions and clustering patients into clinically meaningful taxonomies. In other words, ML methods can take us to a Promised Land where clinicians provide diagnoses and suggest treatment options based on data from an individual, instead of using a ‘one-size-fits-all’ approach (Cuartas Arias, 2019; Joyner & Paneth, 2019).
A recent review found that depression, schizophrenia, and Alzheimer’s disease were the mental health conditions most commonly studied via ML methods (Shatte et al., 2019). Other conditions included autism (Bone et al., 2015), frontotemporal dementia (Bachli et al., 2020), cognitive impairment (Na, 2019; Youn et al., 2018), and post-traumatic stress (Wani et al., 2020). Certainly, the challenge in the years to come is to expand the application of ML methods to other pathologies, especially in developing countries. Even more importantly, ML methods, properly applied, may lead to the discovery of, for example, relevant clinical aspects of understudied populations (Fröhlich et al., 2018). In this Promised Land, psychologists provide faster and more accurate diagnoses, are able to identify individuals with subtle forms of the disease, and offer appropriate treatment options.
Although ML may take us to this Promised Land, where personalized psychological care is a reality for most people, it can lead to misinformed conclusions in the absence of clinical domain expertise; focusing only on Data Science and the application of ML methods can produce misleading results (Bone et al., 2015). Thus, it is important not only to deeply understand the clinical background of the field, but also to know which ML methods can be used and how. In this regard, interdisciplinary collaboration between psychologists and researchers in areas related to Data Science and ML is crucial (Shatte et al., 2019). Because this collaboration requires continuous interaction, communication is another relevant aspect. In ML Psychology, practitioners must have excellent communication skills to express their research questions to collaborators, so that the team can work together to address them successfully. It is also important for ML Psychology practitioners to follow and interpret the results of applying ML methods, and to gain relevant insights into the psychological aspects of the condition under study (Bone et al., 2015).