-
Why do you cite? An investigation on citation intents and decision-making classification processes
Authors:
Lorenzo Paolini,
Sahar Vahdati,
Angelo Di Iorio,
Robert Wardenga,
Ivan Heibi,
Silvio Peroni
Abstract:
Identifying the reason for which an author cites another work is essential to understand the nature of scientific contributions and to assess their impact. Citations are one of the pillars of scholarly communication and most metrics employed to analyze these conceptual links are based on quantitative observations. Behind the act of referencing another scholarly work there is a whole world of meanings that needs to be proficiently and effectively revealed. This study emphasizes the importance of trustfully classifying citation intents to provide more comprehensive and insightful analyses in research assessment. We address this task by presenting a study utilizing advanced Ensemble Strategies for Citation Intent Classification (CIC) incorporating Language Models (LMs) and employing Explainable AI (XAI) techniques to enhance the interpretability and trustworthiness of models' predictions. Our approach involves two ensemble classifiers that utilize fine-tuned SciBERT and XLNet LMs as baselines. We further demonstrate the critical role of section titles as a feature in improving models' performances. The study also introduces a web application developed with Flask and currently available at http://137.204.64.4:81/cic/classifier, aimed at classifying citation intents. One of our models sets a new state-of-the-art (SOTA) with an 89.46% Macro-F1 score on the SciCite benchmark. The integration of XAI techniques provides insights into the decision-making processes, highlighting the contributions of individual words for level-0 classifications, and of individual models for the metaclassification. The findings suggest that the inclusion of section titles significantly enhances classification performances in the CIC task. Our contributions provide useful insights for developing more robust datasets and methodologies, thus fostering a deeper understanding of scholarly communication.
Submitted 18 July, 2024;
originally announced July 2024.
-
Beyond Deepfake Images: Detecting AI-Generated Videos
Authors:
Danial Samadi Vahdati,
Tai D. Nguyen,
Aref Azizpour,
Matthew C. Stamm
Abstract:
Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.
Submitted 24 April, 2024;
originally announced April 2024.
-
Retention Is All You Need
Authors:
Karishma Mohiuddin,
Mirza Ariful Alam,
Mirza Mohtashim Alam,
Pascal Welke,
Michael Martin,
Jens Lehmann,
Sahar Vahdati
Abstract:
Skilled employees are the most important pillars of an organization. Despite this, most organizations face high attrition and turnover rates. While several machine learning models have been developed to analyze attrition and its causal factors, the interpretations of those models remain opaque. In this paper, we propose the HR-DSS approach, which stands for Human Resource (HR) Decision Support System, and uses explainable AI for employee attrition problems. The system is designed to assist HR departments in interpreting the predictions provided by machine learning models. In our experiments, we employ eight machine learning models to provide predictions. We further process the results achieved by the best-performing model by the SHAP explainability process and use the SHAP values to generate natural language explanations which can be valuable for HR. Furthermore, using "What-if-analysis", we aim to observe plausible causes for attrition of an individual employee. The results show that by adjusting the specific dominant features of each individual, employee attrition can turn into employee retention through informative business decisions.
Submitted 26 August, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Multitask Brain Tumor Inpainting with Diffusion Models: A Methodological Report
Authors:
Pouria Rouzrokh,
Bardia Khosravi,
Shahriar Faghani,
Mana Moassefi,
Sanaz Vahdati,
Bradley J. Erickson
Abstract:
Despite the ever-increasing interest in applying deep learning (DL) models to medical imaging, the typical scarcity and imbalance of medical datasets can severely impact the performance of DL models. The generation of synthetic data that might be freely shared without compromising patient privacy is a well-known technique for addressing these difficulties. Inpainting algorithms are a subset of DL generative models that can alter one or more regions of an input image while matching its surrounding context and, in certain cases, non-imaging input conditions. Although the majority of inpainting techniques for medical imaging data use generative adversarial networks (GANs), the performance of these algorithms is frequently suboptimal due to their limited output variety, a problem that is already well-known for GANs. Denoising diffusion probabilistic models (DDPMs) are a recently introduced family of generative networks that can generate results of comparable quality to GANs, but with diverse outputs. In this paper, we describe a DDPM to execute multiple inpainting tasks on 2D axial slices of brain MRI with various sequences, and present proof-of-concept examples of its performance in a variety of evaluation scenarios. Our model and a public online interface to try our tool are available at: https://github.com/Mayo-Radiology-Informatics-Lab/MBTI
Submitted 30 March, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
ProjB: An Improved Bilinear Biased ProjE model for Knowledge Graph Completion
Authors:
Mojtaba Moattari,
Sahar Vahdati,
Farhana Zulkernine
Abstract:
Knowledge Graph Embedding (KGE) methods have gained enormous attention from a wide range of AI communities including Natural Language Processing (NLP) for text generation, classification and context induction. Embedding a huge number of inter-relationships in a small number of dimensions requires proper modeling in both cognitive and computational aspects. Recently, numerous objective functions regarding cognitive and computational aspects of natural languages have been developed, among them the state-of-the-art methods of linearity, bilinearity, manifold-preserving kernels, projection-subspace, and analogical inference. However, the major challenge of such models lies in their loss functions that associate the dimension of relation embeddings to the corresponding entity dimension. This leads to inaccurate prediction of corresponding relations among entities when counterparts are estimated wrongly. Owing to its low computational complexity and high potential for model improvement, ProjE KGE, published by Bordes et al., is improved in this work with respect to all translative and bilinear interactions while capturing entity nonlinearity. Experimental results on benchmark Knowledge Graphs (KGs) such as FB15K and WN18 show that the proposed approach outperforms the state-of-the-art models in the entity prediction task using linear and bilinear methods and other recent powerful ones. In addition, a parallel processing structure is proposed for the model in order to improve the scalability on large KGs. The effects of different adaptive clustering and newly proposed sampling approaches, which prove to be effective in improving the accuracy of knowledge graph completion, are also explained.
Submitted 15 September, 2022; v1 submitted 15 August, 2022;
originally announced September 2022.
-
LEMON: LanguagE ModeL for Negative Sampling of Knowledge Graph Embeddings
Authors:
Md Rashad Al Hasan Rony,
Mirza Mohtashim Alam,
Semab Ali,
Jens Lehmann,
Sahar Vahdati
Abstract:
Knowledge Graph Embedding models have become an important area of machine learning. Those models provide a latent representation of entities and relations in a knowledge graph which can then be used in downstream machine learning tasks such as link prediction. The learning process of such models can be performed by contrasting positive and negative triples. While all triples of a KG are considered positive, negative triples are usually not readily available. Therefore, the choice of the sampling method to obtain the negative triples plays a crucial role in the performance and effectiveness of Knowledge Graph Embedding models. Most of the current methods fetch negative samples from a random distribution of entities in the underlying Knowledge Graph which also often includes meaningless triples. Other known methods use adversarial techniques or generative neural networks which consequently reduce the efficiency of the process. In this paper, we propose an approach for generating informative negative samples considering available complementary knowledge about entities. Particularly, Pre-trained Language Models are used to form neighborhood clusters by utilizing the distances between entities to obtain representations of symbolic entities via their textual information. Our comprehensive evaluations demonstrate the effectiveness of the proposed approach on benchmark Knowledge Graphs with textual information for the link prediction task.
Submitted 15 October, 2022; v1 submitted 9 March, 2022;
originally announced March 2022.
-
Trans4E: Link Prediction on Scholarly Knowledge Graphs
Authors:
Mojtaba Nayyeri,
Gokce Muge Cil,
Sahar Vahdati,
Francesco Osborne,
Mahfuzur Rahman,
Simone Angioni,
Angelo Salatino,
Diego Reforgiato Recupero,
Nadezhda Vassilyeva,
Enrico Motta,
Jens Lehmann
Abstract:
The incompleteness of Knowledge Graphs (KGs) is a crucial issue affecting the quality of AI-based services. In the scholarly domain, KGs describing research publications typically lack important information, hindering our ability to analyse and predict research dynamics. In recent years, link prediction approaches based on Knowledge Graph Embedding models became the first aid for this issue. In this work, we present Trans4E, a novel embedding model that is particularly fit for KGs which include N to M relations with N$\gg$M. This is typical for KGs that categorize a large number of entities (e.g., research articles, patents, persons) according to a relatively small set of categories. Trans4E was applied on two large-scale knowledge graphs, the Academia/Industry DynAmics (AIDA) and Microsoft Academic Graph (MAG), for completing the information about Fields of Study (e.g., 'neural networks', 'machine learning', 'artificial intelligence'), and affiliation types (e.g., 'education', 'company', 'government'), improving the scope and accuracy of the resulting data. We evaluated our approach against alternative solutions on AIDA, MAG, and four other benchmarks (FB15k, FB15k-237, WN18, and WN18RR). Trans4E outperforms the other models when using low embedding dimensions and obtains competitive results in high dimensions.
Submitted 3 July, 2021;
originally announced July 2021.
-
Multiple Run Ensemble Learning with Low-Dimensional Knowledge Graph Embeddings
Authors:
Chengjin Xu,
Mojtaba Nayyeri,
Sahar Vahdati,
Jens Lehmann
Abstract:
Among the top approaches of recent years, link prediction using knowledge graph embedding (KGE) models has gained significant attention for knowledge graph completion. Various embedding models have been proposed so far, among which some recent KGE models obtain state-of-the-art performance on link prediction tasks by using embeddings with a high dimension (e.g. 1000), which increases the costs of training and evaluation considering the large scale of KGs. In this paper, we propose a simple but effective performance boosting strategy for KGE models by using multiple low dimensions in different repetition rounds of the same model. For example, instead of training a model one time with a large embedding size of 1200, we repeat the training of the model 6 times in parallel with an embedding size of 200 and then combine the 6 separate models for testing, while the overall number of adjustable parameters is the same (6*200=1200) and the total memory footprint remains the same. We show that our approach enables different models to better cope with their expressiveness issues on modeling various graph patterns such as symmetric, 1-n, n-1 and n-n. In order to justify our findings, we conduct experiments on various KGE models. Experimental results on standard benchmark datasets, namely FB15K, FB15K-237 and WN18RR, show that multiple low-dimensional models of the same kind outperform the corresponding single high-dimensional models on link prediction in a certain range and have advantages in training efficiency by using parallel training, while the overall number of adjustable parameters is the same.
Submitted 30 May, 2021; v1 submitted 11 April, 2021;
originally announced April 2021.
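The parameter-budget example in the abstract above (6 runs of size 200 versus one run of size 1200) can be illustrated with a minimal sketch. Everything here is hypothetical: the random vectors stand in for trained embeddings, a TransE-style L1 score is used only for concreteness, and summing the per-run scores is an assumed combination rule, since the abstract does not say how the six models are combined.

```python
import random


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L1 distance between head + relation and tail."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def make_run(dim, seed):
    """Hypothetical stand-in for one low-dimensional run (random, untrained vectors)."""
    rng = random.Random(seed)
    return {
        name: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for name in ("head", "relation", "tail")
    }


def ensemble_score(runs):
    """Assumed combination rule: sum the scores of the separate runs."""
    return sum(transe_score(m["head"], m["relation"], m["tail"]) for m in runs)


NUM_RUNS, LOW_DIM, HIGH_DIM = 6, 200, 1200

runs = [make_run(LOW_DIM, seed) for seed in range(NUM_RUNS)]

# Each entity or relation is described by 6 * 200 numbers in total,
# the same budget as one 1200-dimensional embedding.
params_per_embedding = sum(len(m["head"]) for m in runs)
combined = ensemble_score(runs)

print(params_per_embedding == HIGH_DIM)
```

Because the runs share no parameters, each call to `make_run` could be replaced by an independent training job, which is where the parallel-training advantage mentioned in the abstract would come from.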
-
Motif Learning in Knowledge Graphs Using Trajectories Of Differential Equations
Authors:
Mojtaba Nayyeri,
Chengjin Xu,
Jens Lehmann,
Sahar Vahdati
Abstract:
Knowledge Graph Embeddings (KGEs) have shown promising performance on link prediction tasks by mapping the entities and relations from a knowledge graph into a geometric space (usually a vector space). Ultimately, the plausibility of the predicted links is measured by using a scoring function over the learned embeddings (vectors). Therefore, the capability in preserving graph characteristics including structural aspects and semantics highly depends on the design of the KGE, as well as the inherited abilities from the underlying geometry. Many KGEs use flat geometry, which renders them incapable of preserving complex structures and consequently causes wrong inferences by the models. To address this problem, we propose a neuro differential KGE that embeds nodes of a KG on the trajectories of Ordinary Differential Equations (ODEs). To this end, we represent each relation (edge) in a KG as a vector field on a smooth Riemannian manifold. We specifically parameterize ODEs by a neural network to represent various complex shape manifolds and more importantly complex shape vector fields on the manifold. Therefore, the underlying embedding space is capable of getting various geometric forms to encode complexity in subgraph structures with different motifs. Experiments on synthetic and benchmark datasets as well as social network KGs justify the ODE trajectories as a means of structure preservation and consequently avoiding wrong inferences over state-of-the-art KGE models.
Submitted 18 October, 2020; v1 submitted 13 October, 2020;
originally announced October 2020.
-
5* Knowledge Graph Embeddings with Projective Transformations
Authors:
Mojtaba Nayyeri,
Sahar Vahdati,
Can Aykul,
Jens Lehmann
Abstract:
Performing link prediction using knowledge graph embedding models has become a popular approach for knowledge graph completion. Such models employ a transformation function that maps nodes via edges into a vector space in order to measure the likelihood of the links. While mapping the individual nodes, the structure of subgraphs is also transformed. Most of the embedding models designed in Euclidean geometry usually support a single transformation type - often translation or rotation, which is suitable for learning on graphs with small differences in neighboring subgraphs. However, multi-relational knowledge graphs often include multiple sub-graph structures in a neighborhood (e.g. combinations of path and loop structures), which current embedding models do not capture well. To tackle this problem, we propose a novel KGE model (5*E) in projective geometry, which supports multiple simultaneous transformations - specifically inversion, reflection, translation, rotation, and homothety. The model has several favorable theoretical properties and subsumes the existing approaches. It outperforms them on the most widely used link prediction benchmarks.
Submitted 14 March, 2021; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Adaptive Margin Ranking Loss for Knowledge Graph Embeddings via a Correntropy Objective Function
Authors:
Mojtaba Nayyeri,
Xiaotian Zhou,
Sahar Vahdati,
Hamed Shariat Yazdi,
Jens Lehmann
Abstract:
Translation-based embedding models have gained significant attention in link prediction tasks for knowledge graphs. TransE is the primary model among translation-based embeddings and is well-known for its low complexity and high efficiency. Therefore, most of the earlier works have modified the score function of the TransE approach in order to improve the performance of link prediction tasks. Nevertheless, proven theoretically and experimentally, the performance of TransE strongly depends on the loss function. Margin Ranking Loss (MRL) has been one of the earlier loss functions which is widely used for training TransE. However, the scores of positive triples are not necessarily enforced to be sufficiently small to fulfill the translation from head to tail by using relation vector (original assumption of TransE). To tackle this problem, several loss functions have been proposed recently by adding upper bounds and lower bounds to the scores of positive and negative samples. Although highly effective, previously developed models suffer from an expansion in search space for a selection of the hyperparameters (in particular the upper and lower bounds of scores) on which the performance of the translation-based models is highly dependent. In this paper, we propose a new loss function dubbed Adaptive Margin Loss (AML) for training translation-based embedding models. The formulation of the proposed loss function enables an adaptive and automated adjustment of the margin during the learning process. Therefore, instead of obtaining two values (upper bound and lower bound), only the center of a margin needs to be determined. During learning, the margin is expanded automatically until it converges. In our experiments on a set of standard benchmark datasets including Freebase and WordNet, the effectiveness of AML is confirmed for training TransE on link prediction tasks.
Submitted 9 July, 2019;
originally announced July 2019.
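The limitation of Margin Ranking Loss described in the abstract above can be made concrete with a small sketch. It uses the standard MRL form, max(0, margin + d(positive) - d(negative)), with a TransE-style L1 distance; the vectors and the margin value are hypothetical illustration values, and the proposed Adaptive Margin Loss is not implemented here because its formula is not given in the abstract.

```python
def transe_distance(head, relation, tail):
    """TransE score: L1 distance between head + relation and tail (lower is more plausible)."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def margin_ranking_loss(pos_dist, neg_dist, margin):
    """Standard MRL: zero once the negative is at least `margin` further away than the positive."""
    return max(0.0, margin + pos_dist - neg_dist)


# Hypothetical 2-dimensional embeddings.
head = [0.0, 0.0]
relation = [1.0, 0.0]
positive_tail = [4.0, 2.0]   # a true triple whose tail is still far from head + relation
corrupted_tail = [8.0, 3.0]  # a negative (corrupted) triple

pos_dist = transe_distance(head, relation, positive_tail)   # |1 - 4| + |0 - 2| = 5
neg_dist = transe_distance(head, relation, corrupted_tail)  # |1 - 8| + |0 - 3| = 10
loss = margin_ranking_loss(pos_dist, neg_dist, margin=1.0)

print(pos_dist, neg_dist, loss)  # 5.0 10.0 0.0
```

The loss is already zero even though the positive triple sits at distance 5 from satisfying head + relation = tail, which is the gap that bounded and adaptive-margin losses aim to close.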
-
Soft Marginal TransE for Scholarly Knowledge Graph Completion
Authors:
Mojtaba Nayyeri,
Sahar Vahdati,
Jens Lehmann,
Hamed Shariat Yazdi
Abstract:
Knowledge graphs (KGs), i.e., representations of information as a semantic graph, provide a significant test bed for many tasks including question answering, recommendation, and link prediction. Various amounts of scholarly metadata have been made available as knowledge graphs by a diversity of data providers and agents. However, these large quantities of data remain far from quality criteria in terms of completeness while growing at a rapid pace. Most of the attempts in completing such KGs follow traditional data digitization, harvesting and collaborative curation approaches. In contrast, advanced AI-related approaches such as embedding models - specifically designed for such tasks - are usually evaluated on standard benchmarks such as Freebase and WordNet. The tailored nature of such datasets prevents those approaches from shedding light on more accurate discoveries. Application of such models on domain-specific KGs takes advantage of enriched meta-data and provides accurate results where the underlying domain can enormously benefit. In this work, the TransE embedding model is reconciled for a specific link prediction task on scholarly metadata. The results show a significant shift in the accuracy and performance evaluation of the model on a dataset with scholarly metadata. The newly proposed version of TransE obtains 99.9% for the link prediction task while the original TransE gets 95%. In terms of accuracy and Hit@10, TransE outperforms other embedding models such as ComplEx, TransH and TransR experimented over scholarly knowledge graphs.
Submitted 27 April, 2019;
originally announced April 2019.
-
Unveiling Scholarly Communities over Knowledge Graphs
Authors:
Sahar Vahdati,
Guillermo Palma,
Rahul Jyoti Nath,
Christoph Lange,
Sören Auer,
Maria-Esther Vidal
Abstract:
Knowledge graphs represent the meaning of properties of real-world entities and relationships among them in a natural way. Exploiting semantics encoded in knowledge graphs enables the implementation of knowledge-driven tasks such as semantic retrieval, query processing, and question answering, as well as solutions to knowledge discovery tasks including pattern discovery and link prediction. In this paper, we tackle the problem of knowledge discovery in scholarly knowledge graphs, i.e., graphs that integrate scholarly data, and present Korona, a knowledge-driven framework able to unveil scholarly communities for the prediction of scholarly networks. Korona implements a graph partition approach and relies on semantic similarity measures to determine relatedness between scholarly entities. As a proof of concept, we built a scholarly knowledge graph with data from researchers, conferences, and papers of the Semantic Web area, and apply Korona to uncover co-authorship networks. Results observed from our empirical evaluation suggest that exploiting semantics in scholarly knowledge graphs enables the identification of previously unknown relations between researchers. By extending the ontology, these observations can be generalized to other scholarly entities, e.g., articles or institutions, for the prediction of other scholarly patterns, e.g., co-citations or academic collaboration.
Submitted 18 July, 2018;
originally announced July 2018.
-
Towards a Cloud-Based Service for Maintaining and Analyzing Data About Scientific Events
Authors:
Andreas Behrend,
Sahar Vahdati,
Christoph Lange,
Christiane Engels
Abstract:
We propose the new cloud-based service OpenResearch for managing and analyzing data about scientific events such as conferences and workshops in a persistent and reliable way. This includes data about scientific articles, participants, acceptance rates, submission numbers, impact values as well as organizational details such as program committees, chairs, fees and sponsors. OpenResearch is a centralized repository for scientific events and supports researchers in collecting, organizing, sharing and disseminating information about scientific events in a structured way. An additional feature currently under development is the possibility to archive web pages along with the extracted semantic data in order to lift the burden of maintaining new and old conference web sites from public research institutions. However, the main advantage is that this cloud-based repository enables a comprehensive analysis of conference data. Based on extracted semantic data, it is possible to determine quality estimations, scientific communities, research trends as well as the development of acceptance rates, fees, and number of participants in a continuous way, complemented by projections into the future. Furthermore, data about research articles can be systematically explored using a content-based analysis as well as citation linkage. All data maintained in this crowd-sourcing platform is made freely available through an open SPARQL endpoint, which allows for analytical queries in a flexible and user-defined way.
Submitted 28 November, 2017; v1 submitted 13 November, 2017;
originally announced November 2017.
-
A Semi-Automatic Approach for Detecting Dataset References in Social Science Texts
Authors:
Behnam Ghavimi,
Philipp Mayr,
Christoph Lange,
Sahar Vahdati,
Sören Auer
Abstract:
Today, full-texts of scientific articles are often stored in different locations than the used datasets. Dataset registries aim at a closer integration by making datasets citable but authors typically refer to datasets using inconsistent abbreviations and heterogeneous metadata (e.g. title, publication year). It is thus hard to reproduce research results, to access datasets for further analysis, and to determine the impact of a dataset. Manually detecting references to datasets in scientific articles is time-consuming and requires expert knowledge in the underlying research domain. We propose and evaluate a semi-automatic three-step approach for finding explicit references to datasets in social sciences articles. We first extract pre-defined special features from dataset titles in the da|ra registry, then detect references to datasets using the extracted features, and finally match the references found with corresponding dataset titles. The approach does not require a corpus of articles (avoiding the cold start problem) and performs well on a test corpus. We achieved an F-measure of 0.84 for detecting references in full-texts and an F-measure of 0.83 for finding correct matches of detected references in the da|ra dataset registry.
Submitted 6 November, 2016;
originally announced November 2016.
-
Identifying and Improving Dataset References in Social Sciences Full Texts
Authors:
Behnam Ghavimi,
Philipp Mayr,
Sahar Vahdati,
Christoph Lange
Abstract:
Scientific full text papers are usually stored in separate places than their underlying research datasets. Authors typically make references to datasets by mentioning them for example by using their titles and the year of publication. However, in most cases explicit links that would provide readers with direct access to referenced datasets are missing. Manually detecting references to datasets in papers is time consuming and requires an expert in the domain of the paper. In order to make explicit all links to datasets in papers that have been published already, we suggest and evaluate a semi-automatic approach for finding references to datasets in social sciences papers. Our approach does not need a corpus of papers (no cold start problem) and it performs well on a small test corpus (gold standard). Our approach achieved an F-measure of 0.84 for identifying references in full texts and an F-measure of 0.83 for finding correct matches of detected references in the da|ra dataset registry.
Submitted 29 March, 2016; v1 submitted 5 March, 2016;
originally announced March 2016.
-
Semantic Publishing Challenge - Assessing the Quality of Scientific Output by Information Extraction and Interlinking
Authors:
Angelo Di Iorio,
Christoph Lange,
Anastasia Dimou,
Sahar Vahdati
Abstract:
The Semantic Publishing Challenge series aims at investigating novel approaches for improving scholarly publishing using Linked Data technology. In 2014 we had bootstrapped this effort with a focus on extracting information from non-semantic publications - computer science workshop proceedings volumes and their papers - to assess their quality. The objective of this second edition was to improve information extraction but also to interlink the 2014 dataset with related ones in the LOD Cloud, thus paving the way for sophisticated end-user services.
Submitted 25 August, 2015;
originally announced August 2015.
-
Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
Authors:
Sahar Vahdati,
Farah Karim,
Jyun-Yao Huang,
Christoph Lange
Abstract:
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a database of all EC FP7 and H2020 funded research projects, including metadata of their results (publications and datasets). These data are stored in an HBase NoSQL database, post-processed, and exposed as HTML for human consumption, and as XML through a web service interface. As an intermediate format to facilitate statistical computations, CSV is generated internally. To interlink the OpenAIRE data with related data on the Web, we aim at exporting them as Linked Open Data (LOD). The LOD export is required to integrate into the overall data processing workflow, where derived data are regenerated from the base data every day. We thus faced the challenge of identifying the best-performing conversion approach. We evaluated the performances of creating LOD by a MapReduce job on top of HBase, by mapping the intermediate CSV files, and by mapping the XML output.
Submitted 6 July, 2015; v1 submitted 12 June, 2015;
originally announced June 2015.
-
OpenCourseWare Observatory -- Does the Quality of OpenCourseWare Live up to its Promise?
Authors:
Sahar Vahdati,
Christoph Lange,
Sören Auer
Abstract:
A vast amount of OpenCourseWare (OCW) is meanwhile being published online to make educational content accessible to larger audiences. The awareness of such courses among users and the popularity of systems providing such courses are increasing. However, from a subjective experience, OCW is frequently cursory, outdated or non-reusable. In order to obtain a better understanding of the quality of OCW, we assess the quality in terms of fitness for use. Based on three OCW use case scenarios, we define a range of dimensions according to which the quality of courses can be measured. From the definition of each dimension a comprehensive list of quality metrics is derived. In order to obtain a representative overview of the quality of OCW, we performed a quality assessment on a set of 100 randomly selected courses obtained from 20 different OCW repositories. Based on this assessment we identify crucial areas in which OCW needs to improve in order to deliver up to its promises.
Submitted 14 April, 2015; v1 submitted 21 October, 2014;
originally announced October 2014.