Search | arXiv e-print repository

The OpenCitations Index

Authors: Ivan Heibi, Arianna Moretti, Silvio Peroni, Marta Soricetti

Abstract: This article presents the OpenCitations Index, a collection of open citation data maintained by OpenCitations, an independent, not-for-profit infrastructure organisation for open scholarship dedicated to publishing open bibliographic and citation data using Semantic Web and Linked Open Data technologies. The collection involves citation data harvested from multiple sources. To address the possibility of different sources providing citation data for bibliographic entities represented with different identifiers, therefore potentially representing the same citation, a deduplication mechanism has been implemented. This ensures that citations integrated into the OpenCitations Index are uniquely identified, even when different identifiers are used. This mechanism follows a specific workflow, which encompasses a preprocessing of the original source data, a management of the provided bibliographic metadata, and the generation of new citation data to be integrated into the OpenCitations Index. The process relies on another data collection, OpenCitations Meta, and on the use of a new globally persistent identifier, namely OMID (OpenCitations Meta Identifier). As of July 2024, OpenCitations Index stores over 2 billion unique citation links, harvested from Crossref, the National Institutes of Health Open Citation Collection (NIH-OCC), DataCite, OpenAIRE, and the Japan Link Center (JaLC). OpenCitations Index can be systematically accessed and queried through several services, including a SPARQL endpoint, REST APIs, and web interfaces. Additionally, dataset dumps are available for free download and reuse (under CC0 waiver) in various formats (CSV, N-Triples, and Scholix), including provenance and change tracking information.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2407.13329 [pdf]

doi 10.5281/zenodo.11841798

Why do you cite? An investigation on citation intents and decision-making classification processes

Authors: Lorenzo Paolini, Sahar Vahdati, Angelo Di Iorio, Robert Wardenga, Ivan Heibi, Silvio Peroni

Abstract: Identifying the reason for which an author cites another work is essential to understand the nature of scientific contributions and to assess their impact. Citations are one of the pillars of scholarly communication and most metrics employed to analyze these conceptual links are based on quantitative observations. Behind the act of referencing another scholarly work there is a whole world of meanings that needs to be proficiently and effectively revealed. This study emphasizes the importance of reliably classifying citation intents to provide more comprehensive and insightful analyses in research assessment. We address this task by presenting a study utilizing advanced Ensemble Strategies for Citation Intent Classification (CIC) incorporating Language Models (LMs) and employing Explainable AI (XAI) techniques to enhance the interpretability and trustworthiness of models' predictions. Our approach involves two ensemble classifiers that utilize fine-tuned SciBERT and XLNet LMs as baselines. We further demonstrate the critical role of section titles as a feature in improving models' performances. The study also introduces a web application developed with Flask and currently available at http://137.204.64.4:81/cic/classifier, aimed at classifying citation intents. One of our models sets a new state of the art (SOTA) with an 89.46% Macro-F1 score on the SciCite benchmark. The integration of XAI techniques provides insights into the decision-making processes, highlighting the contributions of individual words for level-0 classifications, and of individual models for the metaclassification. The findings suggest that the inclusion of section titles significantly enhances classification performances in the CIC task. Our contributions provide useful insights for developing more robust datasets and methodologies, thus fostering a deeper understanding of scholarly communication.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 42 pages, 14 figures, 1 table, submitted to Scientometrics Journal

arXiv:2407.02018 [pdf]

A Proposal for a FAIR Management of 3D Data in Cultural Heritage: The Aldrovandi Digital Twin Case

Authors: Sebastian Barzaghi, Alice Bordignon, Bianca Gualandi, Ivan Heibi, Arcangelo Massari, Arianna Moretti, Silvio Peroni, Giulia Renda

Abstract: In this article we analyse 3D models of cultural heritage with the aim of answering three main questions: what processes can be put in place to create a FAIR-by-design digital twin of a temporary exhibition? What are the main challenges in applying FAIR principles to 3D data in cultural heritage studies and how are they different from other types of data (e.g. images) from a data management perspective? We begin with a comprehensive literature review touching on: FAIR principles applied to cultural heritage data; representation models; both Object Provenance Information (OPI) and Metadata Record Provenance Information (MRPI), respectively meant as, on the one hand, the detailed history and origin of an object, and - on the other hand - the detailed history and origin of the metadata itself, which describes the primary object (whether physical or digital); 3D models as cultural heritage research data and their creation, selection, publication, archival and preservation. We then describe the process of creating the Aldrovandi Digital Twin, by collecting, storing and modelling data about cultural heritage objects and processes. We detail the many steps from the acquisition of the Digital Cultural Heritage Objects (DCHO), through to the upload of the optimised DCHO onto a web-based framework (ATON), with a focus on open technologies and standards for interoperability and preservation. Using the FAIR Principles for Heritage Library, Archive and Museum Collections as a framework, we look in detail at how the Digital Twin implements FAIR principles at the object and metadata level. We then describe the main challenges we encountered and we summarise what seem to be the peculiarities of 3D cultural heritage data and the possible directions for further research in this field.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2405.02113 [pdf]

A Workflow for GLAM Metadata Crosswalk

Authors: Arianna Moretti, Ivan Heibi, Silvio Peroni

Abstract: The acquisition of physical artifacts not only involves transferring existing information into the digital ecosystem but also generates information as a process itself, underscoring the importance of meticulous management of FAIR data and metadata. In addition, the diversity of objects within the cultural heritage domain is reflected in a multitude of descriptive models. The digitization process expands the opportunities for exchange and joint utilization, provided that the descriptive schemas are made interoperable in advance. To achieve this goal, we propose a replicable workflow for metadata schema crosswalks that facilitates the preservation and accessibility of cultural heritage in the digital ecosystem. This work presents a methodology for metadata generation and management in the case study of the digital twin of the temporary exhibition "The Other Renaissance - Ulisse Aldrovandi and the Wonders of the World". The workflow delineates a systematic, step-by-step transformation of tabular data into RDF format, to enhance Linked Open Data. The methodology adopts the RDF Mapping Language (RML) technology for converting data to RDF with human involvement. This last aspect entails an interaction between digital humanists and domain experts through surveys leading to the abstraction and reformulation of domain-specific knowledge, to be exploited in the process of formalizing and converting information.

Submitted 3 May, 2024; originally announced May 2024.

Comments: Submitted to AIUCD conference 2024; 1 figure, 8 pages

arXiv:2404.12069 [pdf, other]

Developing Application Profiles for Enhancing Data and Workflows in Cultural Heritage Digitisation Processes

Authors: Sebastian Barzaghi, Ivan Heibi, Arianna Moretti, Silvio Peroni

Abstract: As a result of the proliferation of 3D digitisation in the context of cultural heritage projects, digital assets and digitisation processes - being considered as proper research objects - must prioritise adherence to FAIR principles. Existing standards and ontologies, such as CIDOC CRM, play a crucial role in this regard, but they are often over-engineered for the needs of a particular application context, thus making their understanding and adoption difficult. Application profiles of a given standard - defined as sets of ontological entities drawn from one or more semantic artefacts for a particular context or application - are usually proposed as tools for promoting interoperability and reuse while being tied entirely to the particular application context they refer to. In this paper, we present an adaptation and application of an ontology development methodology, i.e. SAMOD, to guide the creation of robust, semantically sound application profiles of large standard models. Using an existing pilot study we have developed in a project dedicated to leveraging virtual technologies to preserve and valorise cultural heritage, we introduce an application profile named CHAD-AP, that we have developed following our customised version of SAMOD. We reflect on the use of SAMOD and similar ontology development methodologies for this purpose, highlighting its strengths and current limitations, future developments, and possible adoption in other similar projects.

Submitted 2 August, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

arXiv:2308.15920 [pdf]

doi 10.1016/j.daach.2023.e00309

Saving temporary exhibitions in virtual environments: the Digital Renaissance of Ulisse Aldrovandi -- acquisition and digitisation of cultural heritage objects

Authors: Roberto Balzani, Sebastian Barzaghi, Gabriele Bitelli, Federica Bonifazi, Alice Bordignon, Luca Cipriani, Simona Colitti, Federica Collina, Marilena Daquino, Francesca Fabbri, Bruno Fanini, Filippo Fantini, Daniele Ferdani, Giulia Fiorini, Elena Formia, Anna Forte, Federica Giacomini, Valentina Alena Girelli, Bianca Gualandi, Ivan Heibi, Alessandro Iannucci, Rachele Manganelli Del Fà, Arcangelo Massari, Arianna Moretti, Silvio Peroni, et al. (8 additional authors not shown)

Abstract: As per the objectives of Project CHANGES, particularly its thematic sub-project on the use of virtual technologies for museums and art collections, our goal was to obtain a digital twin of the temporary exhibition on Ulisse Aldrovandi called "The Other Renaissance", and make it accessible to users online. After a preliminary study of the exhibition, focussing on acquisition constraints and related solutions, we proceeded with the digital twin creation by acquiring, processing, modelling, optimising, exporting, and metadating the exhibition. We made hybrid use of two acquisition techniques to create new digital cultural heritage objects and environments, and we used open technologies, formats, and protocols to make available the final digital product. Here, we describe the process of collecting and curating bibliographical exhibition (meta)data and the beginning of the digital twin creation to foster its findability, accessibility, interoperability, and reusability. The creation of the digital twin is currently ongoing.

Submitted 27 December, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.13573 [pdf]

Retractions in Arts and Humanities: an Analysis of the Retraction Notices

Authors: Ivan Heibi, Silvio Peroni

Abstract: The aim of this work is to understand the retraction phenomenon in the arts and humanities domain through an analysis of the retraction notices: formal documents stating and describing the retraction of a particular publication. The retractions and the corresponding notices are identified using the data provided by Retraction Watch. Our methodology for the analysis combines a metadata analysis and a content analysis (mainly performed using a topic modeling process) of the retraction notices. Considering 343 cases of retraction, we found that many retraction notices are neither identifiable nor findable. In addition, these were not always separated from the original papers, introducing ambiguity in understanding how these notices were perceived by the community (i.e., cited). Also, we noticed that there is no systematic way to write a retraction notice. Indeed, some retraction notices presented a complete discussion of the reasons for retraction, while others tended to be more direct and succinct. We have also reported many notices having similar text while addressing different retractions. We think a further study with a larger collection should be done using the same methodology to confirm and investigate our findings further.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2307.01718 [pdf]

A Prototype for a Controlled and Valid RDF Data Production Using SHACL

Authors: Elia Rizzetto, Arcangelo Massari, Ivan Heibi, Silvio Peroni

Abstract: The paper introduces a tool prototype that combines SHACL's capabilities with ad-hoc validation functions to create a controlled and user-friendly form interface for producing valid RDF data. The proposed tool is developed within the context of the OpenCitations Data Model (OCDM) use case. The paper discusses the current status of the tool, outlines the future steps required for achieving full functionality, and explores the potential applications and benefits of the tool.

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2306.16191 [pdf]

OpenCitations Meta

Authors: Arcangelo Massari, Fabio Mariani, Ivan Heibi, Silvio Peroni, David Shotton

Abstract: OpenCitations Meta is a new database that contains bibliographic metadata of scholarly publications involved in citations indexed by the OpenCitations infrastructure. It adheres to Open Science principles and provides data under a CC0 license for maximum reuse. The data can be accessed through a SPARQL endpoint, REST APIs, and dumps. OpenCitations Meta serves three important purposes. Firstly, it enables disambiguation of citations between publications described using different identifiers from various sources. For example, it can link publications identified by DOIs in Crossref and PMIDs in PubMed. Secondly, it assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to bibliographic resources without existing external persistent identifiers like DOIs. Lastly, by hosting the bibliographic metadata internally, OpenCitations Meta improves the speed of metadata retrieval for citing and cited documents. The database is populated through automated data curation, including deduplication, error correction, and metadata enrichment. The data is stored in RDF format following the OpenCitations Data Model, and changes and provenance information are tracked. OpenCitations Meta currently incorporates data from Crossref, DataCite, and the NIH Open Citation Collection. In terms of semantic publishing datasets, it is currently the first in data volume.

Submitted 28 June, 2023; originally announced June 2023.

Comments: 26 pages, 7 figures

arXiv:2305.08477 [pdf]

Representing provenance and track changes of cultural heritage metadata in RDF: a survey of existing approaches

Authors: Arcangelo Massari, Silvio Peroni, Francesca Tomasi, Ivan Heibi

Abstract: The data within collections from all Digital Humanities fields must be trustworthy. To this end, both provenance and change-tracking systems are needed. This contribution offers a systematic review of the metadata representation models for provenance in RDF, focusing on the problem of modelling conjectures in humanistic data.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 10 pages, 2 figures, submitted to the ADHO Digital Humanities Conference 2023

arXiv:2305.06746 [pdf, ps, other]

doi 10.1038/s41597-024-03185-4

A maturity model for catalogues of semantic artefacts

Authors: Oscar Corcho, Fajar J. Ekaputra, Ivan Heibi, Clement Jonquet, Andras Micsik, Silvio Peroni, Emanuele Storti

Abstract: This work presents a maturity model for assessing catalogues of semantic artefacts, one of the keystones that permit semantic interoperability of systems. We defined the dimensions and related features to include in the maturity model by analysing the current literature and existing catalogues of semantic artefacts provided by experts. In addition, we assessed 26 different catalogues to demonstrate the effectiveness of the maturity model, which includes 12 different dimensions (Metadata, Openness, Quality, Availability, Statistics, PID, Governance, Community, Sustainability, Technology, Transparency, and Assessment) and 43 related features (or sub-criteria) associated with these dimensions. Such a maturity model is one of the first attempts to provide recommendations for governance and processes for preserving and maintaining semantic artefacts and helps assess/address interoperability challenges.

Submitted 24 March, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

Journal ref: Scientific Data, 11, 479

arXiv:2206.07476 [pdf]

OpenCitations, an open e-infrastructure to foster maximum reuse of citation data

Authors: Chiara Di Giambattista, Ivan Heibi, Silvio Peroni, David Shotton

Abstract: OpenCitations is an independent not-for-profit infrastructure organization for open scholarship dedicated to the publication of open bibliographic and citation data by the use of Semantic Web (Linked Data) technologies. OpenCitations collaborates with projects that are part of the Open Science ecosystem and complies with the UNESCO founding principles of Open Science, the I4OC recommendations, and the FAIR data principles that data should be Findable, Accessible, Interoperable and Reusable. Since its data satisfies all the Reuse guidelines provided by FAIR in terms of richness, provenance, usage licenses and domain-relevant community standards, OpenCitations provides an example of a successful open e-infrastructure in which the reusability of data is integral to its mission.

Submitted 15 June, 2022; originally announced June 2022.

arXiv:2206.03971 [pdf, other]

How to structure citations data and bibliographic metadata in the OpenCitations accepted format

Authors: Arcangelo Massari, Ivan Heibi

Abstract: The OpenCitations organization is working on ingesting citation data and bibliographic metadata directly provided by the community (e.g., scholars and publishers). The aim is to improve the general coverage of open citations, which is still far from being complete, and use the provided metadata to enrich the characterization of the citing and cited entities. This paper illustrates how the citation data and bibliographic metadata should be structured to comply with the OpenCitations accepted format.

Submitted 8 June, 2022; originally announced June 2022.

Comments: 5 pages, submitted to JCDL 2022

Journal ref: Proc. of the Workshop on Understanding LIterature references in academic full TExt (ULITE 2022), Cologne, Germany, June 20-24, 2022. Vol-3220. CEUR-WS.org

arXiv:2206.03926 [pdf, ps, other]

doi 10.1007/978-3-031-16802-4_36

Enabling Portability and Reusability of Open Science Infrastructures

Authors: Giuseppe Grieco, Ivan Heibi, Arcangelo Massari, Arianna Moretti, Silvio Peroni

Abstract: This paper presents a methodology for designing a containerized and distributed open science infrastructure to simplify its reusability, replicability, and portability in different environments. The methodology is depicted in a step-by-step schema based on four main phases: (1) Analysis, (2) Design, (3) Definition, and (4) Managing and provisioning. We accompany the description of each step with existing technologies and concrete examples of application.

Submitted 28 July, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: 8 pages, 1 PostScript figure, submitted to TPDL 2022

Journal ref: Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham

arXiv:2111.05223 [pdf]

A quantitative and qualitative open citation analysis of retracted articles in the humanities

Authors: Ivan Heibi, Silvio Peroni

Abstract: In this article, we show and discuss the results of a quantitative and qualitative analysis of open citations to retracted publications in the humanities domain. Our study was conducted by selecting retracted papers in the humanities domain and marking their main characteristics (e.g., retraction reason). Then, we gathered the citing entities and annotated their basic metadata (e.g., title, venue, subject, etc.) and the characteristics of their in-text citations (e.g., intent, sentiment, etc.). Using these data, we performed a quantitative and qualitative study of retractions in the humanities, presenting descriptive statistics and a topic modeling analysis of the citing entities' abstracts and the in-text citation contexts. As part of our main findings, we noticed that there was no drop in the overall number of citations after the year of retraction, with few entities which have either mentioned the retraction or expressed a negative sentiment toward the cited publication. In addition, on several occasions, we noticed a higher concern/awareness about citing a retracted publication among the citing entities belonging to the health sciences domain, compared to the humanities and the social science domains. Philosophy, arts, and history are the humanities areas that showed the highest concern toward the retraction.

Submitted 10 October, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

arXiv:2106.01781 [pdf]

doi 10.1371/journal.pone.0270872

A protocol to gather, characterize and analyze incoming citations of retracted articles

Authors: Ivan Heibi, Silvio Peroni

Abstract: In this article, we present a methodology which takes as input a collection of retracted articles, gathers the entities citing them, characterizes such entities according to multiple dimensions (disciplines, year of publication, sentiment, etc.), and applies a quantitative and qualitative analysis on the collected values. The methodology is composed of four phases: (1) identifying, retrieving, and extracting basic metadata of the entities which have cited a retracted article, (2) extracting and labeling additional features based on the textual content of the citing entities, (3) building a descriptive statistical summary based on the collected data, and finally (4) running a topic modeling analysis. The goal of the methodology is to generate data and visualizations that help in understanding possible behaviors related to retraction cases. We present the methodology in a structured step-by-step form following its four phases, discuss its limits and possible workarounds, and list the planned future improvements.

Submitted 3 June, 2021; originally announced June 2021.

arXiv:2012.11936 [pdf, other]

Knowledge Graphs Evolution and Preservation -- A Technical Report from ISWS 2019

Authors: Nacira Abbas, Kholoud Alghamdi, Mortaza Alinam, Francesca Alloatti, Glenda Amaral, Claudia d'Amato, Luigi Asprino, Martin Beno, Felix Bensmann, Russa Biswas, Ling Cai, Riley Capshaw, Valentina Anita Carriero, Irene Celino, Amine Dadoun, Stefano De Giorgis, Harm Delva, John Domingue, Michel Dumontier, Vincent Emonet, Marieke van Erp, Paola Espinoza Arias, Omaima Fallatah, Sebastián Ferrada, Marc Gallofré Ocaña, et al. (49 additional authors not shown)

Abstract: One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of entities. [...] This grand challenge extends this further by asking if we can create a knowledge graph of "everything" ranging from common sense concepts to location based entities. This knowledge graph should be "open to the public" in a FAIR manner democratizing this mass amount of knowledge." Although linked open data (LOD) is one knowledge graph, it is the closest realisation (and probably the only one) to a public FAIR Knowledge Graph (KG) of everything. Surely, LOD provides a unique testbed for experimenting and evaluating research hypotheses on open and FAIR KG. One of the most neglected FAIR issues about KGs is their ongoing evolution and long term preservation. We want to investigate this problem, that is to understand what preserving and supporting the evolution of KGs means and how these problems can be addressed. Clearly, the problem can be approached from different perspectives and may require the development of different approaches, including new theories, ontologies, metrics, strategies, procedures, etc. This document reports a collaborative effort performed by 9 teams of students, each guided by a senior researcher as their mentor, attending the International Semantic Web Research School (ISWS 2019). Each team provides a different perspective to the problem of knowledge graph evolution substantiated by a set of research questions as the main subject of their investigation. In addition, they provide their working definition for KG preservation and evolution.

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2012.11475 [pdf]

A qualitative and quantitative analysis of open citations to retracted articles: the Wakefield et al.'s case

Authors: Ivan Heibi, Silvio Peroni

Abstract: In this article, we show the results of a quantitative and qualitative analysis of open citations on a popular and highly cited retracted paper: "Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children" by Wakefield et al., published in 1998. The main purpose of our study is to understand the behavior of the publications citing retracted articles and the characteristics of the citations the retracted articles accumulated over time. Our analysis is based on a methodology which illustrates how we gathered the data, extracted the topics of the citing articles, and visualized the results. The data and services used are all open and free to foster the reproducibility of the analysis. The outcomes concerned the analysis of the entities citing Wakefield et al.'s article and their related in-text citations. We observed a constantly increasing number of citations in the last 20 years, accompanied by a constant increase in the percentage of those acknowledging its retraction. Citing articles started either discussing or dealing with the retraction of Wakefield et al.'s article even before its full retraction, which happened in 2010. Articles in the social sciences domain citing Wakefield et al.'s article were among those that discussed its retraction the most. In addition, when observing the in-text citations, we noticed that a large part of the citations received by Wakefield et al.'s article has focused on general discussions without recalling strictly medical details, especially after the full retraction.
Medical studies did not hesitate to acknowledge the retraction and often provided strong negative statements on it.

Submitted 24 May, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

arXiv:2011.13886 [pdf]

MITAO: a tool for enabling scholars in the Humanities to use Topic Modelling in their studies

Authors: Ivan Heibi, Silvio Peroni, Luca Pareschi, Paolo Ferri

Abstract: Automatic text analysis methods, such as Topic Modelling, are gaining much attention in the Humanities. However, scholars need to have extensive coding skills to use such methods appropriately. The need for this technical expertise prevents the broad adoption of these methods in Humanities research. In this paper, to help scholars in the Humanities with no or limited coding skills use Topic Modelling, we introduce MITAO, a web-based tool that allows the definition of a visual workflow which embeds various automatic text analysis operations and allows one to store and share both the workflow and the results of its execution with other researchers, which enables the reproducibility of the analysis. We present an example application of Topic Modelling with MITAO using a collection of English abstracts of the articles published in "Umanistica Digitale". The results returned by MITAO are shown with dynamic web-based visualizations, which allowed us to have preliminary insights about the evolution of the topics treated over time in the articles published in "Umanistica Digitale". All the results along with the defined workflows are published and accessible for further studies.

Submitted 27 November, 2020; originally announced November 2020.

arXiv:2007.16079 [pdf]

Creating RESTful APIs over SPARQL endpoints using RAMOSE

Authors: Marilena Daquino, Ivan Heibi, Silvio Peroni, David Shotton

Abstract: Semantic Web technologies are widely used for storing RDF data and making them available on the Web through SPARQL endpoints, queryable using the SPARQL query language. While the use of SPARQL endpoints is strongly supported by Semantic Web experts, it hinders broader use of RDF data by common Web users, engineers and developers unfamiliar with Semantic Web technologies, who normally rely on Web RESTful APIs for querying Web-available data and creating applications over them. To solve this problem, we have developed RAMOSE, a generic tool developed in Python to create REST APIs over SPARQL endpoints. Through the creation of source-specific textual configuration files, RAMOSE enables the querying of SPARQL endpoints via simple Web RESTful API calls that return either JSON or CSV-formatted data, thus hiding all the intrinsic complexities of SPARQL and RDF from common Web users. We provide evidence that the use of RAMOSE to provide REST API access to RDF data within OpenCitations triplestores is beneficial in terms of the number of queries made by external users to such RDF data using the RAMOSE API compared with the direct access via the SPARQL endpoint. Our findings show the importance for suppliers of RDF data of having an alternative API access service, which enables its use by those with no (or little) experience in Semantic Web technologies and the SPARQL query language.
RAMOSE can be used both to query any SPARQL endpoint and to query any other Web API, and thus it represents an easy generic technical solution for service providers who wish to create an API service to access Linked Data stored as RDF in a conventional triplestore.

Submitted 30 May, 2021; v1 submitted 31 July, 2020; originally announced July 2020.

arXiv:1904.06052 [pdf]

doi 10.1007/s11192-019-03217-6

COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations

Authors: Ivan Heibi, Silvio Peroni, David Shotton

Abstract: In this paper, we present COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations (http://opencitations.net/index/coci). COCI is the first open citation index created by OpenCitations, in which we have applied the concept of citations as first-class data entities, and it contains more than 445 million DOI-to-DOI citation links derived from the data available in Crossref. These citations are described in RDF by means of the newly extended version of the OpenCitations Data Model (OCDM). We introduce the workflow we have developed for creating these data, and also show the additional services that facilitate the access to and querying of these data via different access points: a SPARQL endpoint, a REST API, bulk downloads, Web interfaces, and direct access to the citations via HTTP content negotiation. Finally, we present statistics regarding the use of COCI citation data, and we introduce several projects that have already started to use COCI data for different purposes.

Submitted 26 July, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: Submitted to Scientometrics (https://link.springer.com/journal/11192)

arXiv:1902.02534 [pdf]

Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal

Authors: Ivan Heibi, Silvio Peroni, David Shotton

Abstract: In this paper, we analyse the current availability of open citations data in one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci) provided by OpenCitations. The results of these analyses show a persistent gap in the coverage of the currently available open citation data. In order to address this specific issue, we propose a strategy whereby the community (e.g. scholars and publishers) can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.

Submitted 21 June, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: 7 pages, 3 figures, accepted to ISSI 2019 (https://www.issi2019.org/)

Showing 1–22 of 22 results for author: Heibi, I