1. INTRODUCTION
Author Profiling (AP) consists of recognizing demographic traits of a person, such as age, gender, personality, and emotions. Its main aim is typically to create a user profile based on unstructured data. It has applications in forensics, security, sales, marketing, healthcare, and many other sectors [1]. In e-commerce scenarios, this type of information gives companies an advantage in competitive environments because it allows them to segment customers and offer personalized products and services, which strengthens their marketing strategies [2], [3]. Moreover, in chatbot systems, this type of technology is used to segment end users in order to provide them with personalized answers. Although most demographic factors are explicitly collected through a registration process, this approach can be limited, given that most potential customers in online stores are anonymous. The automatic recognition of demographic variables such as gender or Language Variety (LV) according to geographic location can help to overcome these limitations [4].
Text data from customers can be obtained via transcripts of voice recordings, chats, surveys, and social media. These text resources can be processed to automatically recognize the gender or LV of the users. Different studies have applied Natural Language Processing (NLP) techniques to recognize Demographic Traits (DTs) of the author based on text data, mainly from social media posts [5]-[7]. Term Frequency-Inverse Document Frequency (TF-IDF) is a classical method to extract features from text data, and it is widely used to solve different NLP tasks, including AP [8], [9]. This representation describes each document by the frequency of occurrence of its words, weighted by the inverse of their frequency across all the documents in the corpus. In [10] and [11], the authors used TF-IDF to extract features from tweets in the PAN17 corpus [12], which has labels for gender and for the LVs of different Spanish-speaking countries such as Argentina, Colombia, and Venezuela. By using a Support Vector Machine (SVM), they reported accuracies of around 81 % for gender classification and 94 % for LV recognition. A similar approach was presented in [13], where the frequency of female and male emojis was used to recognize gender and LV. The authors reported accuracies on the PAN17 corpus [12] of 83.2 % for gender recognition and 96.2 % for LV classification. In spite of the high accuracies reported in [13] and [14], this type of methodology would not be accurate for modeling text written in more formal scenarios such as customer reviews, product surveys, opinion posts, and customer service chats, which have a different structure compared to the text data available in social media. Moreover, these language features highly depend on the corpus, which reduces generalization to other domains. For instance, some studies, such as [15] and [16], have concluded that females use emoticons more often than males, while another study [17] concluded the opposite.
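As a rough illustration of this representation, the following sketch computes TF-IDF features with scikit-learn; the example documents and default parameters are placeholders and do not correspond to the configurations used in the cited studies.

```python
# Minimal TF-IDF sketch (illustrative documents, default parameters).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "me encanta este producto",              # hypothetical example documents
    "el servicio fue muy lento",
    "excelente atencion y buen producto",
]

vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x vocabulary

print(X.shape)                               # (3, vocabulary size)
print(vectorizer.get_feature_names_out())    # the learned vocabulary
```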
Recently, text representations based on word embeddings have been successful in different applications, including AP. In [18], the authors proposed a Word2Vec-based system to classify the gender of the people who wrote 100,000 posts taken from Weibo (a Chinese social network similar to Twitter). The system achieved an accuracy of 62.9 %, which is nearly 3 % better than the accuracy of the human judgments reported for this corpus. This shows that recognizing gender from written text is a hard problem, even for human readers. Word2Vec has also been successfully applied to LV recognition. In [19], the authors represented words with a Word2Vec model and represented each document in the HispaBlogs database, which contains posts from five countries (Argentina, Chile, Mexico, Peru, and Spain), as the average of the word embeddings of the words that form the post. They reported an accuracy of 73.6 % in LV recognition.
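A minimal sketch of this average-embedding document representation, assuming a gensim Word2Vec model trained on a toy tokenized corpus (corpus, dimensions, and function names are illustrative):

```python
# Represent a document as the average of its Word2Vec word vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [["me", "gusta", "el", "futbol"],
          ["hoy", "llueve", "en", "la", "ciudad"]]    # toy tokenized posts

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

def document_embedding(tokens, model):
    """Average the embeddings of the in-vocabulary words of a document."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vec = document_embedding(corpus[0], w2v)
print(doc_vec.shape)   # (100,)
```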
Deep Neural Networks (DNNs), such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have been widely explored for various NLP tasks due to their high performance without the need for engineered features [20]-[22]. CNNs have been shown to be effective in author profiling tasks such as personality recognition and author identification [23], [24]. In [25], the author used a methodology based on word- and sentence-level embeddings with CNNs for gender and geographic identification. Word2Vec and FastText, a variant of Word2Vec that incorporates character-level information, were employed. The model was evaluated on the VarDial corpus, which is composed of news articles in different Spanish dialects (Argentine, Venezuelan, Guatemalan, Spanish, and others). The proposed methodology was compared with a machine learning approach based on traditional features and SVM classifiers.
Accuracies of up to 73 % and 92 % were achieved with the CNN approach for gender and geographic location identification, respectively. The results indicated that CNN models outperformed traditional machine learning algorithms for this type of AP application. RNNs have also been used to identify the author’s demographic variables. In [26], the authors proposed a methodology based on bidirectional Gated Recurrent Units (GRUs) and an attention mechanism for gender and LV recognition in the PAN17 corpus. They used Word2Vec embeddings as input to their deep learning architecture and reported accuracies of up to 72.2 % and 91.4 % for gender and LV recognition, respectively. Other studies have considered the use of embedding layers within the neural network to automatically learn text representations of the documents [25], [26]. The advantage of these models is that the extracted embeddings are specifically tailored to each corpus. However, these approaches must deal with out-of-vocabulary problems [27]. Moreover, the number of parameters in such deep architectures increases considerably; therefore, a large amount of data is required to fit the models correctly.
According to the reviewed literature, AP has been mainly explored in social media scenarios, where the language is informal and the documents do not follow a formal structure [28]. There is a gap between models trained on formal and on informal written language: a model trained with formal language data for a specific purpose will not achieve comparable results in an informal language scenario, and vice versa [29]. For this reason, it is important to validate trained models that estimate demographic variables in both types of language: formal and informal. In addition, AP has been under-explored in documents about e-commerce or customer service interactions because, compared to data from social media, it is difficult to collect a large amount of such labeled data.
This paper proposes a methodology based on RNNs and CNNs for gender and LV recognition in informal and formal language scenarios. First, the models were trained and tested on the PAN17 corpus, which is a traditional dataset for AP in tweets [12]. The models originally trained on the PAN17 corpus were then re-trained using a transfer learning strategy with data from call-center conversations, which are structured in a more formal language. The aim of this study is twofold: 1) to recognize gender and nationality in the Twitter corpus and 2) to recognize gender and dialect (Antioqueño vs. Bogotano) in the customer service corpus. Accuracies of up to 75 % and 92 % were obtained for gender and nationality recognition, respectively, in the first corpus, and of up to 68 % and 72 % for gender and dialect recognition, respectively, in the customer service corpus. The results in the customer service corpus outperformed the accuracies obtained with a baseline model widely used in AP. Finally, the best models were used to compare inter- and intra-country LV recognition. In this study, inter-country refers to the recognition of the Colombian dialect among Spanish-speaking countries, and intra-country to the recognition of LVs from different regions in Colombia. The results indicate that the proposed methodology is accurate for DT recognition in documents written either in formal or in informal language. Moreover, the fine-tuned models show that, despite the noise and lack of structure in documents written in informal language, such documents can be used to improve the accuracy of DT recognition in documents written in formal language.
2. DATA
2.1 PAN17
We used the Spanish data in the PAN17 corpus [12]. This database contains variants of Spanish from seven countries: Argentina, Chile, Colombia, Mexico, Peru, Spain, and Venezuela. The training set was composed of texts by 600 subjects per country (300 female). Since each subject had 100 tweets, the training set contained a total of 4,200 subjects and 420,000 tweets. The test set comprised data from 400 subjects per country (200 female), for a total of 2,800 subjects and 280,000 tweets. For the sake of comparison with previous studies, we kept the original training and test sets as in [30]. The training set was randomly divided into 80 % for training and 20 % for optimizing the hyper-parameters of the models (development set). All data partitions were subject-independent to avoid subject-specific bias and to guarantee a better generalization capability of the models.
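A subject-independent split of this kind can be obtained, for example, with grouped splitting utilities; the following sketch uses scikit-learn's GroupShuffleSplit on toy data (variable names and the random seed are illustrative).

```python
# Subject-independent 80/20 split: all tweets by an author stay on one side.
from sklearn.model_selection import GroupShuffleSplit

# Toy data: each author contributes several tweets (hypothetical example).
texts = ["tweet a1", "tweet a2", "tweet b1", "tweet b2", "tweet c1", "tweet c2"]
labels = ["female", "female", "male", "male", "female", "female"]
authors = ["a", "a", "b", "b", "c", "c"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, dev_idx = next(splitter.split(texts, labels, groups=authors))

X_train = [texts[i] for i in train_idx]   # training texts
X_dev = [texts[i] for i in dev_idx]       # development texts
```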
2.2 Call-Center Conversations
Conversations between customers and call-center agents of a pension administration company were collected. Transcripts of the conversations were manually generated by linguistics experts, and each conversation was labeled with the gender and perceived dialect of the customer. Customers typically use formal language when asking for a service, making a request, asking about certificates, and posing other questions about the service provided by the company. Nevertheless, this corpus was highly unbalanced: some Colombian dialects had very few samples, and some conversations could not be assigned a specific Colombian dialect. For these reasons, two sub-databases were built, one per DT. First, the gender corpus was composed of 220 samples (110 female). Second, the dialect corpus contained two classes that represent two Colombian dialects: “Antioqueño” (from Antioquia) and “Bogotano” (from Bogotá), with 80 samples per class. These two sub-databases shared 130 common customers, 76 female and 72 from Bogotá. A summary of the metadata of these corpora is provided in Table 1.
3. METHODS
We implemented two Deep Learning (DL) architectures: an RNN with bidirectional Long Short-Term Memory (LSTM) cells and a CNN with multiple temporal resolutions. These networks were first trained with data from the PAN17 corpus. Then, a transfer learning strategy was applied to recognize the trait (gender or dialect) from the call-center conversation data.
3.1 Bidirectional Long Short-Term Memory
The main idea of RNNs is to model a sequence of feature vectors under the assumption that the output depends on the input features at the present time-step and on the output at the previous time-step. Conventional RNNs have a causal structure, i.e., the output at the present time-step only contains information from the past. However, many applications also require information from the future [31]. Bidirectional RNNs address this requirement by combining a layer that processes the input sequence forward in time with an additional layer that processes it backwards. Traditional RNNs also exhibit a vanishing gradient problem, which appears when long temporal sequences are modeled. LSTM layers were proposed to alleviate this problem by including a long-term memory cell that creates paths through which the gradient can flow over long sequences, such as the sentences of a tweet or those that appear in a conversation with a call-center agent [32]. We propose the use of a Bidirectional LSTM (Bi-LSTM) network for our application. These architectures are widely used for different NLP tasks such as sentiment analysis in social media and product reviews [33]-[35]. Figure 1 presents a scheme of the implemented architecture. Words from the data are represented using a word-embedding layer. The input to the Bi-LSTM layer consists of k d-dimensional word-embedding vectors, where k is the length of the sequence. The final decision about the DTs of the subject is made at the output layer using a Softmax activation function.
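A minimal sketch of this Bi-LSTM architecture in Keras is shown below; the number of LSTM units and the sequence length are assumptions, since not all layer sizes are reported here.

```python
# Sketch of the Bi-LSTM architecture: embedding -> Bi-LSTM -> Softmax output.
import tensorflow as tf

VOCAB_SIZE = 5000      # tokenizer vocabulary (PAN17 experiments)
EMBED_DIM = 100        # word-embedding dimension d
SEQ_LEN = 60           # sequence length k (assumption)
NUM_CLASSES = 2        # e.g., female/male

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # 64 units is an assumption
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.summary()
```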
3.2 Convolutional Neural Network
CNN-based architectures extract sentence representations through a composition of convolutional layers and a max-pooling operation over all the word embeddings that form the embedding matrix. We propose a parallel CNN architecture with filters of different orders to exploit several temporal resolutions at the same time. Details of the architecture can be found in Figure 2. The output of the word-embedding layer is convolved with filters of n different orders, where the order corresponds to the number of words covered by the filter, i.e., the size of the n-gram. The proposed CNN computes the convolution only along the temporal dimension. After convolution, a max-pooling operation is applied to reduce redundant information. Finally, a fully connected layer with a Softmax activation function is employed for classification.
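The following sketch outlines such a parallel CNN with the Keras functional API; the filter orders, number of filters, and sequence length are assumptions used only for illustration.

```python
# Parallel CNN: one convolutional branch per filter order (n-gram size).
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, SEQ_LEN, NUM_CLASSES = 5000, 100, 60, 2

inputs = tf.keras.Input(shape=(SEQ_LEN,))
embed = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

branches = []
for kernel_size in (2, 3, 4):                               # filter orders (assumed)
    conv = tf.keras.layers.Conv1D(64, kernel_size, activation="relu")(embed)
    pooled = tf.keras.layers.GlobalMaxPooling1D()(conv)     # max-pooling over time
    branches.append(pooled)

merged = tf.keras.layers.Concatenate()(branches)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(merged)
model = tf.keras.Model(inputs, outputs)
model.summary()
```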
3.3 Training
The networks described in this article were implemented in TensorFlow 2.0 and trained with a sparse categorical cross-entropy loss function using the Adam optimizer. An early stopping criterion halted training when the validation loss did not improve for 10 consecutive epochs. The embedding dimension d was set to 100. The vocabulary size of the tokenizer was set to 5,000 for the experiments with the PAN17 corpus and to 1,500 for the experiments with call-center conversations. This number corresponds to the number of words whose frequency exceeds 5 % of the number of documents in the training set of each corpus. The hyper-parameters were optimized based on the validation accuracy, choosing the simplest model among those with comparable performance.
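A sketch of this training configuration is shown below, assuming the model object defined in one of the previous sketches and pre-tokenized data; the use of restore_best_weights, the maximum number of epochs, and the variable names are assumptions.

```python
# Training setup: Adam, sparse categorical cross-entropy, early stopping.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True  # restore is an assumption
)

# X_train_seq, y_train, X_dev_seq, y_dev: padded token sequences and integer
# labels prepared beforehand (illustrative names).
history = model.fit(
    X_train_seq, y_train,
    validation_data=(X_dev_seq, y_dev),
    epochs=100,                 # upper bound; early stopping decides
    callbacks=[early_stopping],
)
```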
3.4 Transfer learning
We tested two approaches for the call-center conversation data: (1) training the network using only the data from the corresponding corpus and (2) training the model via transfer learning, starting from a model pre-trained on the PAN17 corpus. In the transfer learning experiment, the most accurate model for the PAN17 data was fine-tuned while freezing the embedding layer in order to keep the original tokenizer and its larger vocabulary.
Experiments without freezing the embedding layer were also performed, but the results were not satisfactory. The motivation for using transfer learning here is to test whether the knowledge acquired by a model trained with text data in informal language is useful to improve AP systems based on texts with formal language.
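A possible implementation of this fine-tuning step is sketched below; the file name, data variables, and reduced learning rate are hypothetical.

```python
# Transfer learning: reload the PAN17 model, freeze the embedding layer,
# and fine-tune the remaining layers on the call-center data.
import tensorflow as tf

base_model = tf.keras.models.load_model("cnn_pan17_gender.h5")   # hypothetical file

for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.Embedding):
        layer.trainable = False        # keep the PAN17 embeddings and vocabulary

base_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # smaller LR is an assumption
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# X_cc_seq, y_cc: tokenized call-center transcripts and labels (hypothetical names).
base_model.fit(X_cc_seq, y_cc, validation_split=0.1, epochs=50)
```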
4. RESULTS AND DISCUSSION
Two experiments were performed to recognize each DT. The first one consisted of evaluating short text sequences: the architectures were trained on short texts, and the DT (gender or LV) of each subject was computed as the average of the classification scores of all short texts produced by that subject. In this experiment, each tweet of the PAN17 corpus was considered a short text, while each call-center transcript was divided into 60-word chunks, as in [26]. The second experiment consisted of evaluating long texts. In this case, the complete text data of each subject was fed to the network at once. In the PAN17 corpus, all tweets by the same subject were concatenated, and in the call-center corpus, the complete transcription of each conversation was used. This strategy was evaluated only with the CNN-based approach because longer segments produced vanishing gradient problems in the Bi-LSTM network.
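The short-text evaluation can be sketched as follows, assuming a Keras Tokenizer and one of the models defined earlier; function and variable names are illustrative.

```python
# Short-text evaluation: split a transcript into 60-word chunks, score each
# chunk, and average the scores to obtain the subject-level decision.
import numpy as np
import tensorflow as tf

def split_into_chunks(text, chunk_size=60):
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def predict_subject(model, tokenizer, text, seq_len=60):
    chunks = split_into_chunks(text)
    seqs = tokenizer.texts_to_sequences(chunks)
    seqs = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=seq_len)
    scores = model.predict(seqs)            # one softmax vector per chunk
    return np.argmax(scores.mean(axis=0))   # average scores, then decide
```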
The experiments with the call-center conversation data were compared with a baseline model, in which the documents were represented using TF-IDF and an SVM with a Gaussian kernel was used to classify the DTs. The vocabulary used in the baseline model was the same as that used in the embedding layer of the proposed models. This baseline model was only tested in the long-text approach, following the methodology used in different studies. This type of baseline has been successfully used as a benchmark on several AP databases, including the PAN17 corpus [10], [11].
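A minimal sketch of such a baseline, with an RBF-kernel SVM on TF-IDF features restricted to a fixed vocabulary (the toy texts, labels, vocabulary, and hyper-parameters are placeholders, not the original baseline configuration):

```python
# Baseline sketch: TF-IDF over a fixed vocabulary + RBF-kernel SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Same word list as the embedding layer of the neural models (illustrative).
vocabulary = ["pension", "certificado", "solicitud", "cuenta"]

train_texts = ["necesito un certificado de mi pension",
               "quiero revisar el saldo de mi cuenta",
               "hice una solicitud sobre mi pension",
               "no puedo entrar a mi cuenta"]          # toy transcripts
train_labels = ["antioqueno", "bogotano", "antioqueno", "bogotano"]

baseline = make_pipeline(
    TfidfVectorizer(vocabulary=vocabulary),
    SVC(kernel="rbf", C=1.0, gamma="scale"),           # hyper-parameters assumed
)
baseline.fit(train_texts, train_labels)
print(baseline.predict(["necesito el certificado de mi cuenta"]))
```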
Additionally, in this paper we present a cluster analysis based on k-means in order to implement a customer segmentation strategy using the prediction scores of our neural networks. We focused especially on Colombian DT recognition for inter-country assessment using the PAN17 corpus, and intra-country recognition using the call-center conversation corpus. The number of clusters was defined using the elbow method and the Kneedle algorithm [36].
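The cluster analysis can be sketched as follows, using scikit-learn's KMeans and the kneed implementation of the Kneedle algorithm; the score matrix here is random placeholder data rather than actual network outputs.

```python
# k-means on the softmax scores, with k chosen via the elbow (Kneedle) method.
import numpy as np
from sklearn.cluster import KMeans
from kneed import KneeLocator

rng = np.random.default_rng(0)
scores = rng.random((200, 2))   # columns: gender score, P(Colombian) -- placeholder

k_values = list(range(1, 10))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores).inertia_
            for k in k_values]

knee = KneeLocator(k_values, inertias, curve="convex", direction="decreasing")
best_k = knee.elbow if knee.elbow is not None else 3   # fall back if no elbow is found

clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(scores)
```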
4.1 AP in informal structured language (PAN17 corpus)
The results obtained for the PAN17 corpus considering only Spanish data are shown in Table 2.
The analysis based on long texts to classify gender showed an improvement of 4 % compared to that based on short texts. The improvement when classifying LV was about 3 %.
4.2 AP in formal structured language (call-center conversation corpus)
All the experiments performed with this corpus were validated following a 10-fold cross-validation strategy due to the small size of the dataset. Table 3 shows the results of the baseline model, where accuracies of up to 57 % and 63 % were obtained for gender and LV classification, respectively.
Table 4 reports the AP results of the proposed models for the call-center conversations, obtained with and without applying transfer learning. As with the PAN17 corpus, the highest accuracy was obtained with long texts. In addition, note that, for both DTs, the accuracy improved by up to 13 % when the transfer learning strategy was applied. For gender recognition, the base model was not very accurate; thus, the knowledge transferred to the target model did not cause a significant improvement, considering the complexity of the architecture and the number of samples available to fine-tune the target model.
Additionally, the models that used transfer learning outperformed the results obtained with the baseline model for both DTs. However, the results obtained with the baseline model show lower standard deviations, which likely indicates that the baseline model produces more stable, although less accurate, predictions across folds than the proposed models.
4.3 Analysis of recognized Colombian DTs for user segmentation (inter- vs. intra-country)
Figure 3 shows the results of a cluster analysis using all the samples in the test set of the PAN17 corpus. We used the best-performing model from the previous experiments in order to derive user segmentation strategies. We plotted the gender score on the horizontal axis against the probability of being classified as Colombian on the vertical axis. These scores were obtained from the output of the CNNs after the Softmax activation function. The results indicate the presence of three clusters: 95.2 % of the subjects in Cluster 1 are Colombian, while 97 % of the subjects in Clusters 2 and 3 are non-Colombian. Regarding gender, Cluster 2 is mainly formed by female subjects (75.5 %), while Cluster 3 is formed by 75.2 % male subjects. Cluster 1 does not have a dominant gender. In addition, note that Colombian dialect recognition based on text is more accurate than gender recognition; however, among the non-Colombian samples, each cluster is composed of at least 75 % of subjects of one gender.
Figure 4 shows the results of the intra-country analysis of the samples in the call-center conversation corpus. There was a total of 130 subjects, distributed as 72 Bogotanos and 58 Antioqueños, and 77 female and 53 male users. According to Figure 4, Cluster 1 is composed mainly of subjects from Bogotá, Cluster 2 of subjects from Antioquia, and Cluster 3 is roughly balanced in terms of dialect, with a slightly larger number of subjects from Bogotá. Regarding gender, only Cluster 3 has a larger percentage of female subjects; the other two clusters are not gender-specific.
In both approaches, the subjects tend to be grouped according to their LV. This is better observed in the inter-country analysis, but it also occurs in its intra-country counterpart. This can be explained by the fact that the differences in dialects in the same country are more subtle than those observed among different countries that share the same native language.
In addition, gender-dependent clusters are created in the inter-country scenario. Conversely, in the intra-country analysis, the clusters are more gender-balanced, although in some clusters there is a slight tendency toward a specific gender.
5. CONCLUSIONS
In this study, we proposed a methodology for AP in which two DTs, i.e., gender and LV, are automatically recognized in informal texts from social media posts and formal texts of call-center conversations. Different deep learning models were evaluated, including CNNs and LSTMs. We implemented a transfer learning approach where base models are pre-trained with data collected from social networks and then fine-tuned with call-center conversation data, which have a more formal structure than the social media posts used for pre-training.
The results indicate that it is possible to classify the gender and LV of a subject based on his/her social media posts, with accuracies of up to 75 % and 92 %, respectively. In formal scenarios, we obtained accuracies of up to 68 % and 72 % for gender and dialect recognition, respectively. These results outperformed those obtained with a baseline model that combines TF-IDF with an SVM classifier. The transfer learning strategy improved the accuracy in scenarios where it is more difficult to collect data, such as call-center conversations, which suggests that such a strategy is suitable for companies or sectors where it is not possible to create large datasets from scratch. The models that use transfer learning are also more stable and generalize better than those in which the neural networks are trained from scratch. Furthermore, the knowledge acquired by the models to recognize the LVs of Spanish-speaking countries can be successfully used to fine-tune models that recognize more subtle LVs, such as those within the same country.
We believe that these results are very positive because they show that AP can benefit from the large amounts of text data available in other domains, such as social media. Even though the accuracy of the models does not seem very high (especially for gender recognition in call-center conversations), it is relevant because other studies, such as [18], have reported accuracies under 61 % for text-based gender recognition performed by human judges. The proposed approaches can be extended to other AP-related traits such as age, personality, and educational attainment, which would allow building more complete and specific subject/customer profiles.