I. Introduction
Open data helps government institutions disseminate information of interest to civil society, providing transparency and social control and empowering citizens through access to information. This philosophy of openness has since spread to other areas, such as academia and research institutes, which seek to develop and improve services, plans, programs, projects and standards through collaboration among the state, citizens and companies.
To qualify as “open”, data must have the technical and legal characteristics that allow any person or entity to use, reuse and redistribute it without restriction; these parameters are stipulated in the International Open Data Charter [1].
In support of this initiative, some countries have implemented standards and portals to promote the use of open data. In Colombia, for example, Law 1712 of 2014 obliges all public entities to disclose their data, and in 2016 the nation adopted the principles established in the International Open Data Charter, making the Colombian State Data Portal available as a space for disseminating public information in the country [2]. Likewise, portals were created at the departmental and municipal levels so that each entity had its own space for opening data. In 2019, a total of 30 portals focused on the dissemination of and access to open data were registered.
However, a quality open data portal must play a dynamic role in the data life cycle and establish a relationship among data producers, publishers and consumers through interaction mechanisms that help identify the demand for data, publish data of interest to specific users, gather feedback on the data sets and the portal, and improve their quality.
Internationally, experts in the area have made several proposals for evaluating open data portals, each with different dimensions, factors or aspects. The portal is the means by which the quality of published data is guaranteed, searching is made easier for users, data is made available in usable formats, and data is published in response to a specific demand so that it meets needs that generate value. For this reason, this work presents a comprehensive evaluation methodology built on criteria and dimensions proposed by experts in preliminary work.
II. Portals Evaluation Methodologies
To lay the groundwork for an evaluation proposal that covers the different perspectives and provides a broader, more complete evaluation mechanism, a documentary review was carried out of recent works and research on portal evaluation.
Most of the works implement Tim Berners-Lee's five-star model, which evaluates the openness of data in terms of accessibility and reuse through five levels, represented by stars: 1) the data is published in any format under an open license; 2) the data is structured; 3) the data is in a non-proprietary format; 4) URIs are used to access specific data directly; and 5) the data is linked to other data, providing context [3].
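As an illustration of how the five levels could be checked in practice, the following minimal sketch counts the stars a data set earns. It assumes that the levels are cumulative (a level only counts if all previous ones are met); the class and field names are hypothetical and are not taken from [3].

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical description of a data set, one flag per star level."""
    open_license: bool      # 1 star: published in any format under an open license
    structured: bool        # 2 stars: structured data
    non_proprietary: bool   # 3 stars: non-proprietary format
    uses_uris: bool         # 4 stars: URIs give direct access to specific data
    linked: bool            # 5 stars: linked to other data that provides context


def five_star_level(profile: DatasetProfile) -> int:
    """Return the number of consecutive levels satisfied (0 to 5)."""
    levels = [
        profile.open_license,
        profile.structured,
        profile.non_proprietary,
        profile.uses_uris,
        profile.linked,
    ]
    stars = 0
    for satisfied in levels:
        if not satisfied:
            break  # assumption: a missing level blocks all higher ones
        stars += 1
    return stars
```

Under this assumption, a data set that is openly licensed, structured and non-proprietary but does not use URIs earns three stars, even if it happens to be linked to other data.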
In the case of the Barcelona Open Data Portal, the authors evaluated the quality of the portal's data in terms of its reuse. They complemented the five-star model with factors such as update frequency and geolocation of the data, and related the number of downloads and the themes to the number of stars obtained with the model [4]. The evaluation of European Union portals is a similar case: it gives relevance to analyzing the state of the data sets and the standards under which they were published, in order to issue recommendations for the general improvement of the portals [5].
Other works propose more robust portal evaluation models that largely complement the model proposed by Berners-Lee, enriching aspects of data and portal quality [6], or use indicators proposed by organizations such as the Open Knowledge Foundation (OKF) for open data programs [7]. Even so, some factors that are necessary for evaluating the quality of data and portals are left out of the scope of these studies or are not covered in depth, for example the evaluation of metadata and of the communication channels offered by the portals.
As for the other studies, there are methodologies such as Meloda, which is used for the exclusive evaluation of data reuse [8]; the evaluation of metadata from its use, availability, completeness, openness and addressability [9]; the analysis of the structural composition of the portal based on its conformation and categorization [10], and the evaluation of national portals through the general characteristics of the portals and the data set [11].
Among the evaluation methodologies, models and standards found, those presented in Table 1 stand out.
Methodology | Evaluation object | Dimensions / Evaluative Criteria |
Five stars [3] | Openness level and data usability | -Published data in any format - Structured data - Data in non-proprietary formats - Use of URI - Linked Data |
Barcelona [4] | Data quality and reuse | Additional to those contemplated in Five Stars: - Update frequency - Geolocation - Downloads - Thematic |
Meloda [8] | Data reuse | -Technical structure - Access to information - Legal framework - Data publication model |
European Union [5] | Data and portal quality | Additional to those contemplated in Five Stars: - Portal navigation - Search modes - Results presentation - Data sets status - Standards adoption - Publication formats |
Portal maturity [6] | Portal quality and maturity | Additional to those contemplated in Five Stars: - Availability - Reuse capacity - Relevance - Reputation - Granularity - Visualization |
National Level [11] | Portal quality | - Portal General characteristics: - Technical aspects - Availability and access - Communication and participation - Data set general characteristics |
Taiwan [10] | Portal Organizational Structure | - Categorization quality - Structural quality |
Brazil [7] | Portal quality and data opening level | Additional to those contemplated in Five Stars: - General information - Technical services: - Usability - Accessibility - Interoperability - Specific information |
Analytical Hierarchy Process (AHP) [9] | Metadata | - Use - Completeness - Openness - Addressability - Recoverability |
Table 2 shows a consolidation of the dimensions measured by each of the methodologies described in Table 1.
Methodology | Dimensions | |||||||||
Data | Portal | |||||||||
Technical structure | API / URIs | Reuse | Metadata | Opening | Availability and Access | Data visualization | Structure | Communication | ||
Five Stars | Barcelona | X | X | X | X | |||||
European Union | X | X | X | X | X | X | ||||
Portal Maturity | X | X | X | X | X | X | X | |||
Brazil | X | X | X | X | X | X | X | |||
Meloda | X | X | X | X | X | X | ||||
National Level | X | X | X | X | X | X | ||||
Taiwan | X | X | X | |||||||
Analytical Hierarchy Process (AHP) | X |
III. Proposed Methodology
Taking the methodologies presented in Table 1 as a reference, open data is evaluated from two approaches: 1) published data, covering quality, use and metadata, and 2) the portal, highlighting aspects of its structure, usability and communication mechanisms. Each dimension is composed of several factors, whose general criteria are explained in Table 3.
Element | Dimension | Factor | Description |
Data sets | Quality | Availability | Data sets are available for viewing, downloading, use and reuse. |
Data sets | Quality | Upgrade | Data sets are periodically updated. |
Data sets | Quality | Accessibility | Data is accessed through platforms that allow it to be requested, visualized and used. |
Data sets | Quality | Visualization | Data is presented in ways that make it easier for users to analyze and understand. |
Data sets | Quality | Publishing formats | Data is in non-proprietary, machine-processable formats. |
Data sets | Quality | Completeness | Data sets do not contain empty or null values and have enough records to identify trends or behaviors when analyzed. |
Data sets | Use | Defined demand | It is known to whom the data set is directed and what its scope is. |
Data sets | Use | Number of views | The number of views of a data set, according to the figures provided by the portal. |
Data sets | Use | Downloads | The number of downloads of a data set. |
Data sets | Use | API consumption | Data can be consumed through an API that also allows data sets to be filtered using query parameters. |
Data sets | Use | Resulting products | A clear and complete view is provided of the products resulting from the use of open data. |
Data sets | Metadata | Use | Metadata is used. |
Data sets | Metadata | Completeness | Metadata provides enough information to understand the content, scope and purpose of the data, and includes information for contacting the source. |
Data sets | Metadata | Recoverability | Metadata allows sets to be retrieved efficiently according to search criteria. |
Portal | Structure | Categorization | The categorization established in the portal is consistent with the demand for and use of data, and the sets in the same category are coherently related. |
Portal | Usability | Search | Users can easily search for specific data sets and obtain results that match their request. |
Portal | Usability | Navigability | Users can easily move through the different sections of the portal and understand the purpose of each one. |
Portal | Usability | Use / consumption / data download | The portal offers users various ways to consume the published data: downloads in different formats, access through APIs with queries, and visualizations of the sets that allow further analysis. |
Portal | Communication | Comments and discussion | The portal provides comment and discussion spaces where users can assess the status of data sets, creating feedback that leads to improved quality. |
Portal | Communication | Source-user | The portal offers mechanisms that allow users to communicate directly with data publishers. |
Portal | Communication | Request | The portal incorporates spaces for users to request data sets of interest. |
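The Completeness factor in Table 3 refers to empty or null values. One possible way to quantify it is sketched below, as the share of non-empty cells in a data set. The function name and the record layout (a list of dictionaries) are assumptions made for illustration; the methodology does not prescribe a specific measure or a minimum number of records.

```python
from typing import Any, Mapping, Sequence


def completeness_ratio(
    records: Sequence[Mapping[str, Any]],
    fields: Sequence[str],
) -> float:
    """Share of non-empty cells across the given fields (1.0 means no gaps)."""
    total_cells = len(records) * len(fields)
    if total_cells == 0:
        return 0.0  # an empty data set is treated as having no usable content
    filled_cells = sum(
        1
        for record in records
        for field in fields
        # A cell counts as empty when it is missing, None or an empty string.
        if record.get(field) not in (None, "")
    )
    return filled_cells / total_cells
```

For example, a two-record set in which one of the two fields is empty in each record would yield a ratio of 0.5.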
As part of the proposed methodology, a quantitative measurement system is used to score each of the criteria presented in Table 3. Each approach (data and portal) has a maximum score of 100 points, distributed as shown in Table 4. The final score is:
Score = (Data score × 0.6) + (Portal score × 0.4)
That is, the score obtained when evaluating the data accounts for 60% of the final score, and the portal score accounts for 40%. Although some of the proposed criteria may involve qualitative considerations, the methodology takes a quantitative approach to the evaluation of factors, in line with the use of indicators to evaluate open data initiatives by organizations such as the World Wide Web Foundation with the Open Data Barometer, or the Open Knowledge Foundation with the Global Open Data Index.
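The weighting can be expressed as a short function. This is only a sketch of the formula stated in the text; the function name and the range check are illustrative additions.

```python
DATA_WEIGHT = 0.6    # weight of the data score, as stated in the methodology
PORTAL_WEIGHT = 0.4  # weight of the portal score, as stated in the methodology


def final_score(data_score: float, portal_score: float) -> float:
    """Combine the data and portal scores (each from 0 to 100) into one value."""
    for name, value in (("data_score", data_score), ("portal_score", portal_score)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return data_score * DATA_WEIGHT + portal_score * PORTAL_WEIGHT
```

Because the two weights add up to 1, the final score stays on the same 0 to 100 scale as its two components.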
Data (60 %)
Dimension | Factor | A | B | C | Score |
Quality | Availability | 2 | 1 | 3 | 6 |
Quality | Upgrade | 2 | 2 | - | 4 |
Quality | Accessibility | 3 | 4 | - | 7 |
Quality | Visualization | 3 | 2 | - | 5 |
Quality | Publishing formats | 3 | 3 | - | 6 |
Quality | Completeness | 3 | 2 | 2 | 7 |
Use | Defined demand | 7 | - | - | 7 |
Use | Number of views | 7 | - | - | 7 |
Use | Downloads | 2 | 5 | - | 7 |
Use | API consumption | 7 | - | - | 7 |
Use | Resulting products | 2 | 2 | 3 | 7 |
Metadata | Use | 10 | - | - | 10 |
Metadata | Completeness | 3 | 4 | 3 | 10 |
Metadata | Recoverability | 5 | 5 | - | 10 |
Total | | | | | 100 |
Portal (40 %)
Dimension | Factor | A | B | C | D | Score |
Structure | Categorization | 15 | - | - | - | 15 |
Usability | Search | 5 | 5 | 5 | - | 15 |
Usability | Navigability | 4 | 3 | 3 | - | 10 |
Usability | Use / consumption / data download | 5 | 5 | 5 | 5 | 20 |
Communication | Comments and discussion | 4 | 3 | 3 | 4 | 14 |
Communication | Source-user | 3 | 3 | 4 | 4 | 14 |
Communication | Request | 6 | 6 | - | - | 12 |
Total | | | | | | 100 |
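The maximum scores per factor in Table 4 can be held in a simple nested mapping, which makes it easy to check that each approach adds up to 100 points and to see how the points are split among dimensions. The structure and function name below are illustrative choices; only the numbers come from the table.

```python
# Maximum score per factor, copied from the "Score" columns of Table 4.
MAX_SCORES = {
    "data": {
        "Quality": {
            "Availability": 6, "Upgrade": 4, "Accessibility": 7,
            "Visualization": 5, "Publishing formats": 6, "Completeness": 7,
        },
        "Use": {
            "Defined demand": 7, "Number of views": 7, "Downloads": 7,
            "API consumption": 7, "Resulting products": 7,
        },
        "Metadata": {"Use": 10, "Completeness": 10, "Recoverability": 10},
    },
    "portal": {
        "Structure": {"Categorization": 15},
        "Usability": {
            "Search": 15, "Navigability": 10,
            "Use / consumption / data download": 20,
        },
        "Communication": {
            "Comments and discussion": 14, "Source-user": 14, "Request": 12,
        },
    },
}


def dimension_totals(element: str) -> dict:
    """Return the maximum points available per dimension for one approach."""
    return {
        dimension: sum(factors.values())
        for dimension, factors in MAX_SCORES[element].items()
    }
```

Calling `dimension_totals("data")` shows that Quality and Use carry 35 points each and Metadata 30, while on the portal side Usability (45) and Communication (40) outweigh Structure (15).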
The maximum score to be obtained in each criterion that makes up each factor is presented in the boxes in Table 4. These criteria are related to the data:
Quality:
1. Availability:
2. Upgrade:
3. Accessibility:
4. Visualization:
5. Publishing formats:
6. Completeness:
Use:
1. Defined demand: it is clearly known to whom the data is directed.
2. Number of views: it is possible to determine the number of people who have viewed the data set.
3. Downloads:
4. API consumption: queries can be made through parameterizable addresses that allow specific fields of a data set to be obtained.
5. Resulting products:
Metadata:
1. Use: metadata is used to detail the characteristics of the data sets.
2. Completeness:
3. Recoverability:
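The API criterion under Use mentions parameterizable addresses that return specific fields of a data set. The sketch below builds such an address without sending any request. It assumes a Socrata-style (SODA) endpoint of the form `/resource/<dataset id>.json` with `$select`, `$where` and `$limit` parameters; the function name, the dataset identifier and the field names used in the example are hypothetical placeholders.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_query_url(
    base_url: str,
    dataset_id: str,
    fields: Sequence[str],
    where: Optional[str] = None,
    limit: int = 100,
) -> str:
    """Build a query address that selects specific fields of a data set."""
    params = {
        "$select": ",".join(fields),  # only the requested fields are returned
        "$limit": str(limit),         # cap on the number of rows
    }
    if where:
        params["$where"] = where      # optional row filter
    # Keep "$" and "," readable in the address; other characters are escaped.
    query = urlencode(params, safe="$,")
    return f"{base_url}/resource/{dataset_id}.json?{query}"


# Hypothetical example: "abcd-1234", "municipio" and "anio" are placeholders.
example_url = build_query_url(
    "https://www.datos.gov.co", "abcd-1234", ["municipio", "anio"], limit=10
)
```

An evaluator could use addresses of this kind to check whether a data set can be filtered by field and by condition, which is what this criterion asks for.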
In relation to the portal:
Structure:
1. Categorization: data sets are consistent with the other sets classified in the same category.
Usability:
1. Search:
2. Navigability:
A) The portal has a navigation map available to users that shows the structure of the site.
B) The portal has simple navigation that allows users to move through the portal and find information quickly.
C) The portal implements different elements to facilitate navigation, such as help buttons, contact buttons, navigation bars and a general menu.
3. Use / consumption / data download:
A) The portal offers the possibility to visualize data in order to facilitate its analysis and understanding.
B) Data can be downloaded from the portal, without any restriction, in different formats that make it versatile to use.
C) The portal makes available to users at least one API that allows data to be consumed and queried.
D) The portal offers statistics about users' use of data.
Communication:
1. Comments and discussion:
A) It is possible to comment on the data sets at their place of publication.
B) The portal has spaces where users can discuss topics related to the data available on the portal.
C) The portal provides support mechanisms between users through forum-like spaces.
D) A space is offered for users to view and learn about the products resulting from the use of data published on the portal.
2. Source-user:
A) Publishers are notified when comments are received about the data sets they have published.
B) Users are notified when the data sets in which they showed interest are updated or modified.
C) The portal includes the contact information of the entities or publishers.
D) The portal offers direct communication between the publisher and the end user, contributing to the improvement of data quality.
3. Request:
If the score is a decimal value, it must be rounded. Table 5 shows the score ranges with their corresponding classification.
Score | Classification |
80 - 100 | Excellent |
60 - 79 | Outstanding |
40 - 59 | Acceptable |
20 - 39 | Insufficient |
0 - 19 | Deficient |
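The rounding step and the ranges of Table 5 can be sketched as a lookup function. The text only says that decimal scores are rounded, so the half-up rule used here is an assumption, as are the function and constant names.

```python
import math

# Lower bound of each range in Table 5, from highest to lowest.
BANDS = [
    (80, "Excellent"),
    (60, "Outstanding"),
    (40, "Acceptable"),
    (20, "Insufficient"),
    (0, "Deficient"),
]


def classify(score: float) -> str:
    """Round a final score and return its classification from Table 5."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    rounded = math.floor(score + 0.5)  # assumption: halves are rounded up
    for lower_bound, label in BANDS:
        if rounded >= lower_bound:
            return label
    raise AssertionError("unreachable: every score from 0 to 100 has a band")
```

Rounding before the lookup matters at the boundaries: with this rule a score of 79.5 becomes 80 and is classified as Excellent, whereas truncation would leave it as Outstanding.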
For example, if the evaluation of a portal gives a data score of 50 points and a portal score of 63 points, the final score is:
Score = (50 × 0.6) + (63 × 0.4)
Score = 30 + 25.2 = 55.2
Rounded to 55, this score falls in the 40 - 59 range, so according to the classification proposed in the methodology the portal would have an acceptable quality.
IV. Case Study: “Colombia Open Data”
To evaluate the methodology, it was applied to the open data portal provided by the Colombian government (https://www.datos.gov.co/), based on the experience of a group of users, both expert and inexperienced. The scores obtained are presented in Table 6, which also summarizes the main aspects that justify the evaluation of each factor or criterion.
Element | Dimension | Factor | Evaluation | Justification |
Data sets | Quality | Availability | 3 | There is no way to request adjustments to or clarification of a data set; users can only contact the data provider, which does not guarantee a response. |
Data sets | Quality | Upgrade | 2 | Update periods are not clearly regulated, mainly for the data sets of public entities. |
Data sets | Quality | Accessibility | 3 | Not all sets are downloadable, and some do not allow interconnection with APIs. |
Data sets | Quality | Visualization | 3 | The visualization of many of the data sets is limited to tables. |
Data sets | Quality | Publishing formats | 5 | The portal offers multiple download formats, which makes the data easier for users to handle. |
Data sets | Quality | Completeness | 0 | This factor reveals one of the major flaws of the portal: users can upload data sets without prior validation, so the portal accumulates sets without metadata, with insufficient information (sets with five records) or with many null fields, among other problems. |
Data sets | Use | Defined demand | 3 | No description of the intended use of the data is presented, which raises the question of whether the portal takes part in the open data ecosystem or is merely a site for publishing data sets without a specific audience. |
Data sets | Use | Number of views | 4 | Although the number of visits to each data set is available, the portal does not appear to use it to rank the sets, or at least to display them in that order, which would help identify the types of data that most interest end users. |
Data sets | Use | Downloads | 2 | A large share of the data sets has not been downloaded even once, so download numbers are not at an acceptable level. |
Data sets | Use | API consumption | 6 | The portal allows connection to the Socrata API for most of its data sets. |
Data sets | Use | Resulting products | 5 | The portal provides information about the uses of data sets, but not for all of them, especially those with few downloads, so it is not possible to determine for what purpose that data is used or whether those sets are worth keeping. |
Data sets | Metadata | Use | 3 | Not all sets have metadata, so it is not possible to determine what each data field represents, and the scope and purpose of the data are not defined. Although information is available for contacting the source provider, a response is not guaranteed. No specific keywords are added. |
Data sets | Metadata | Completeness | 4 | |
Data sets | Metadata | Recoverability | 5 | |
Portal | Structure | Categorization | 9 | There is a superficial “more relevant” classification that is insufficient or unclear, and the category assigned to a set is not validated, so some sets are not in their proper category and others have no category at all. |
Portal | Usability | Search | 10 | Searching for sets by time period is not supported. |
Portal | Usability | Navigability | 7 | Web usability guidelines should be applied to improve the user experience. There is no site map or guide to orient beginner users. |
Portal | Usability | Use / consumption / data download | 15 | Alternative display mechanisms are missing. Download statistics are insufficient and are not used for decision making. |
Portal | Communication | Comments and discussion | 7 | It is not clear how to comment on data sets or interact with other users about them. |
Portal | Communication | Source-user | 11 | Communication with the data provider is possible, but it is not clear how to receive automatic updates. |
Portal | Communication | Request | 0 | The existence of this function is not evident. |
Taken together, these evaluations give the portal the following score:
Score = (48 × 0.6) + (59 × 0.4)
Score = 28.8 + 23.6 = 52.4
Consequently, according to Table 5, the portal receives an Acceptable rating. This indicates that, although it offers a range of functionalities, control points are needed to give end users greater satisfaction, such as eliminating sets that do not meet minimum quality conditions or allowing users to rate a set.
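The case study result can be reproduced from the individual evaluations in Table 6. The sketch below simply adds up the factor scores and applies the 60/40 weighting; the variable names are illustrative, and the numbers are those reported in the table.

```python
# Factor evaluations for the data approach, taken from Table 6.
DATA_EVALUATION = {
    "Availability": 3, "Upgrade": 2, "Accessibility": 3, "Visualization": 3,
    "Publishing formats": 5, "Completeness (quality)": 0,
    "Defined demand": 3, "Number of views": 4, "Downloads": 2,
    "API consumption": 6, "Resulting products": 5,
    "Metadata use": 3, "Metadata completeness": 4, "Metadata recoverability": 5,
}

# Factor evaluations for the portal approach, taken from Table 6.
PORTAL_EVALUATION = {
    "Categorization": 9, "Search": 10, "Navigability": 7,
    "Use / consumption / data download": 15,
    "Comments and discussion": 7, "Source-user": 11, "Request": 0,
}

data_total = sum(DATA_EVALUATION.values())      # 48 points out of 100
portal_total = sum(PORTAL_EVALUATION.values())  # 59 points out of 100

# Weighted combination: 60% data, 40% portal.
case_study_score = data_total * 0.6 + portal_total * 0.4

print(f"data={data_total}, portal={portal_total}, score={case_study_score:.1f}")
```

Keeping the per-factor values visible makes it easy to see where points are lost: Completeness and Request contribute nothing, which matches the flaws described in the justifications.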
V. Conclusions
Using methodologies and models to determine data quality contributes to improving the data, by identifying its status and the flaws that may occur, and supports the continuation of the open data life cycle, whose processes are under constant improvement.
Each methodology takes a different approach, depending on how its evaluation criteria are defined, so the element studied (portal or data) may receive different quality levels depending on the methodology used. Nevertheless, a more realistic quality result is obtained by combining and complementing methodologies and models so that a greater number of aspects is covered.
Open data portals play an important role in data opening initiatives, since they are the main point of access to data, most of it published by government entities. The quality of the data, the structure of the portal and the features it offers its users can therefore determine its level of use, impact and reputation, which is why portals are also responsible for improving constantly to offer users the highest possible quality.
Interaction with the Open Data portal of the Colombian State showed that a large number of data sets are available, but many of them present inconsistencies or other flaws that hinder their use. This evidences the need to evaluate the portal with regard to its data and structure, since such flaws raise questions about how the portal's resources are used.