We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink to provide scalable machine learning, inference, and querying capabilities for large knowledge graphs.
- Website: http://sansa-stack.net
- GitHub: https://github.com/SANSA-Stack
- Download: http://sansa-stack.net/downloads-usage/
- ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases
You can find usage guidelines and examples at http://sansa-stack.net/user-guide.
The following features are currently supported by SANSA:
- Reading and writing RDF files in the N-Triples, Turtle, RDF/XML, N-Quads, and TriX formats
- Reading OWL files in various standard formats
- Querying heterogeneous sources (Data Lakes) via SPARQL, with support for CSV, Parquet, MongoDB, Cassandra, and JDBC (MySQL, SQL Server, etc.)
- Support for multiple data partitioning techniques
- SPARQL querying via Sparqlify, Ontop, and tensor-based approaches
- Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
- RDFS, RDFS Simple and OWL-Horst forward chaining inference
- RDF graph clustering with different algorithms
- Terminological decision trees (experimental)
- Knowledge graph embedding approaches: TransE (beta), DistMult (beta)
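To make the last feature more concrete, here is a rough illustration of the idea behind TransE – not SANSA's implementation or API, just the scoring function the approach is built on. TransE considers a triple (h, r, t) plausible when the head embedding, translated by the relation embedding, lands near the tail embedding:

```python
import math

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    The closer to zero, the more plausible the triple."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy 3-dimensional embeddings (illustrative values, not trained ones).
h = [0.1, 0.2, 0.3]   # head entity
r = [0.4, 0.1, -0.1]  # relation
t = [0.5, 0.3, 0.2]   # tail entity

# A well-translated triple scores (near) zero; a corrupted tail scores lower.
print(transe_score(h, r, t))
print(transe_score(h, r, [1.0, 1.0, 1.0]))
```

In the real setting, the embeddings are trained so that observed triples score higher than corrupted ones; DistMult differs mainly in using a multiplicative rather than translational scoring function.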
Noteworthy changes or updates since the previous release are:
- TRIX support
- A new query engine over compressed RDF data
- OWL/XML Support
Deployment and getting started:
- Template projects for SBT and Maven are available for both Apache Spark and Apache Flink to help you get started.
- The SANSA jar files are in Maven Central, i.e., in most IDEs you can simply search for “sansa” to include the dependencies in Maven projects.
- Example code is available for various tasks.
- We provide interactive notebooks for running and testing code via Docker.
We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.
Greetings from the SANSA Development Team
From 5 to 6 December, a conference on QNLP took place at St. Aldate’s Church in Oxford. The event was organized by the Quantum Group at the Department of Computer Science of the University of Oxford, with support from the companies Cambridge Quantum Computing (CQC) and IBM. Two members of the SDA team in Dresden, Cedric Möller and Daniel Steinigen, participated in the conference. It was also the first conference ever devoted to this combination of NLP and quantum computing.
Quantum Artificial Intelligence (QAI) has attracted increasing research interest in recent years. Noisy intermediate-scale quantum (NISQ) computers already make it possible to run such algorithms and to explore potential advantages for NLP. Since the mathematical foundations of quantum theory are very similar to those of compositional NLP based on applied category theory, quantum computers should provide a natural setting for compositional NLP tasks.
Reference: Zeng & Coecke, “Quantum Algorithms for Compositional Natural Language Processing”, https://arxiv.org/abs/1608.01406
We are very pleased to announce that our group got four papers accepted for presentation at IEEE-ICSC 2020.
The 14th IEEE International Conference on Semantic Computing (ICSC2020) addresses the derivation, description, integration, and use of semantics (“meaning”, “context”, “intention”) for all types of resources, including data, documents, tools, devices, processes, and people. The scope of ICSC2020 includes, but is not limited to, analytics, semantic description languages, integration (of data and services), interfaces, and applications.
Here are the pre-prints of the accepted papers with their abstracts:
- “DISE: A Distributed in-Memory SPARQL Processing Engine over Tensor Data” by Hajira Jabeen, Eskender Haziiev, Gezim Sejdiu, and Jens Lehmann.
Abstract: SPARQL is a W3C standard for querying data stored as Resource Description Framework (RDF). SPARQL queries are represented using triple patterns and are tailored to search for these patterns in RDF. Most of the existing SPARQL evaluators provide centralized, DBMS-inspired solutions consuming high resources and offering limited flexibility. In order to deal with increasing amounts of RDF data, it is important to develop scalable and efficient solutions for distributed SPARQL query evaluation. In this paper we present DISE — an open-source distributed in-memory SPARQL engine that can scale out to a cluster of machines. DISE represents an RDF graph as a three-way distributed tensor for querying large-scale RDF datasets. This distributed tensor representation offers opportunities for novel distributed applications. DISE relies on translating SPARQL queries into Spark tensor operations by exploiting information about the query complexity and creating a dynamic execution plan. We have tested DISE on different datasets, and the results show that it is scalable and efficient while exploiting this relatively new representation format.
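The three-way tensor view of RDF that DISE builds on can be sketched locally. In this illustrative stand-in (all names are ours, not DISE's API), a plain Python set of integer coordinates plays the role of the sparse distributed tensor, and a naive scan stands in for the Spark tensor operations that DISE compiles SPARQL into:

```python
# An RDF graph as a sparse 3-way tensor: a set of (subject, predicate, object)
# integer coordinates over a dictionary-encoded vocabulary.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "SDA"),
]

# Dictionary-encode terms so the tensor is indexed by integers.
terms = sorted({x for t in triples for x in t})
idx = {term: i for i, term in enumerate(terms)}
tensor = {(idx[s], idx[p], idx[o]) for s, p, o in triples}

def match(s=None, p=None, o=None):
    """Evaluate one triple pattern; None acts as a SPARQL variable."""
    return {
        (terms[i], terms[j], terms[k])
        for (i, j, k) in tensor
        if (s is None or idx[s] == i)
        and (p is None or idx[p] == j)
        and (o is None or idx[o] == k)
    }

# SPARQL pattern "?x knows ?y" becomes a slice of the tensor along the
# predicate mode.
print(match(p="knows"))
```

In DISE the tensor is partitioned across the cluster and patterns are evaluated as distributed Spark operations rather than a local scan, but the correspondence between triple patterns and tensor slices is the same.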
- “Let’s build Bridges, not Walls – SPARQL Querying of TinkerPop Graph Databases with sparql-gremlin” by Harsh Thakkar, Renzo Angles, Marko Rodriguez, Stephen Mallette, and Jens Lehmann.
Abstract: This article presents sparql-gremlin, a tool to translate SPARQL queries to Gremlin pattern-matching traversals. Currently, sparql-gremlin is a plugin of the Apache TinkerPop graph computing framework, thus users can run queries expressed in the W3C SPARQL query language over a wide variety of graph data management systems, including both OLTP graph databases and OLAP graph processing frameworks. With sparql-gremlin, we take the first step to bridge the query interoperability gap between the Semantic Web and graph database communities. The plugin has received adoption from both academia and industry research in its short timespan.
- “VoColReg: A Registry for Supporting Distributed Ontology Development using Version Control Systems” by Abderrahmane Khiat, Lavdim Halilaj, Ahmad Hemid and Steffen Lohmann (ICSC Resource Track).
Abstract: The number of ontologies used for different purposes, such as data integration, information retrieval or search optimization, is constantly increasing. Therefore, it is crucial that ontologies can be developed and explored in an easy way by humans, and are accessible by intelligent agents. To this end, we created VoColReg on top of the VoCol platform. VoColReg provides an integrated registry that hosts VoCol instances, allowing the community to access, browse, reuse, and improve ontologies in a collaborative fashion. VoColReg integrates several improved features, such as RDF-Doctor, which is able to simultaneously identify a comprehensive list of syntax errors and automatically correct a subset of them. Currently, the VoColReg platform hosts more than 21 ontologies from various domains, where nine of them are publicly available. We analyzed those nine ontologies to discover different facts about them, such as the hosting platforms used, the expressivity of the ontologies, the number of triples, and modules.
- “Learning a Lightweight Representation: First Step Towards Automatic Detection of Multidimensional Relationships between Ideas” by Abderrahmane Khiat (ICSC Research Track, Concise Paper).
Abstract: Moving ideation from a closed paradigm (companies) to an open one (crowd) yields several benefits: (1) the crowd allows the generation of a large number of ideas, and (2) its heterogeneity increases the potential for obtaining creative ideas. In practice, however, the crowd often fails at generating innovative solutions, leading to duplicate ideas or ideas that borrow each other’s descriptions. Thus, it is practically and economically unfeasible to sift through this large number of ideas to select valuable ones. One promising solution to overcome this issue is finding relationships between idea texts, such as duplicate, generalize, disjoint, alternative solution, etc. Existing approaches either rely on human judgment, which is expensive and requires domain experts, or on automatic approaches, which compute similarity, i.e., a single dimension, and do not consider other relations. The proposed solution is based on sequence-to-sequence learning, which allows the machine to learn a lightweight structural representation that is then used to establish complex relations between ideas. This lightweight structural representation is obtained based on our investigation. We found that ideas contain the following patterns: what the idea is about (e.g., a window with heat-sensitive material), how it works (e.g., it lights up) and when it works (e.g., in case of fire). Those extracted patterns are then compared with the corresponding patterns of other ideas to establish relations. Our preliminary investigation shows promising results for learning and leveraging such a lightweight structural representation in identifying complex relationships between ideas.
We are very pleased to announce that our group got a paper accepted for publication in ESWA (Expert Systems with Applications). With an impact factor of 4.3, the journal is one of the major venues for intelligent systems and information exchange. The focus of the journal is on exchanging information relating to expert and intelligent systems applied in industry, government, and universities worldwide.
Here is the pre-print of the accepted paper with its abstract:
- “IOTA: Interlinking of Heterogeneous Multilingual Open Fiscal DaTA” by Fathoni A. Musyaffa, Maria-Esther Vidal, Fabrizio Orlandi, Jens Lehmann, and Hajira Jabeen (Elsevier Journal of Expert Systems with Applications).
Abstract: Open budget data are among the most frequently published datasets of the open data ecosystem, intended to improve public administrations and government transparency. Unfortunately, the prospects of analysis across different open budget data remain limited due to schematic and linguistic differences. Budget and spending datasets are published together with descriptive classifications. Various public administrations typically publish the classifications and concepts in their regional languages. These classifications can be exploited to perform a more in-depth analysis, such as comparing similar items across different, cross-lingual datasets. However, in order to enable such analysis, a mapping across the multilingual classifications of datasets is required. In this paper, we present the framework for Interlinking of Heterogeneous Multilingual Open Fiscal DaTA (IOTA). IOTA makes use of machine translation followed by string similarities to map concepts across different datasets. To the best of our knowledge, IOTA is the first framework to offer scalable implementation of string similarity using distributed computing. The results demonstrate the applicability of the proposed multilingual matching, the scalability of the proposed framework, and an in-depth comparison of string similarity measures.
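The translate-then-match idea at the core of IOTA can be sketched as follows. This is an illustrative stand-in, not IOTA's code: difflib's `ratio` plays the role of one of the string similarity measures the paper compares, applied after machine translation has brought labels into a common language:

```python
# Sketch: map one (machine-translated) classification label to the most
# similar label of another administration's classification.
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1] via difflib's longest-matching-blocks ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(label, candidates, threshold=0.6):
    """Return the most similar candidate, or None if nothing is close enough."""
    scored = [(similarity(label, c), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# e.g. the German label "Bildung" machine-translated to "education", matched
# against another administration's English classification.
print(best_match("education", ["health care", "education services", "defense"]))
```

IOTA's contribution is partly that this pairwise comparison, which is quadratic in the number of concepts, is implemented on top of distributed computing so it scales to large classifications.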
We are very pleased to announce that our group got a paper accepted for presentation at ICEGOV (International Conference on Theory and Practice of Electronic Governance). Established in 2007, the conference runs annually and is coordinated by the United Nations University Operating Unit on Policy-Driven Electronic Governance (UNU-EGOV). Part of the United Nations University and headquartered in the city of Guimarães in the north of Portugal, UNU-EGOV is a think tank dedicated to Electronic Governance; a core centre of research, advisory services and training; a bridge between research and public policies; an innovation enhancer; and a solid partner within the UN system and its Member States, with a particular focus on sustainable development, social inclusion and active citizenship.
Here is the pre-print of the accepted paper with its abstract:
- “Cross-Administration Comparative Analysis of Open Fiscal Data” by Fathoni A. Musyaffa, Jens Lehmann, Hajira Jabeen.
Abstract: To improve governance accountability, public administrations are increasingly publishing their open data, which includes budget and spending data. Analyzing these datasets requires both domain and technical expertise. In civil communities, this technical and domain expertise is often not available. Hence, despite the increasing size of the open fiscal datasets being published, the level of analytics done on top of these datasets is still limited. Providentially, developments in the computer science community enable further progress in data analysis in different domains, such as performing a comparative analysis of open budget and spending data (open fiscal data). This is done by adopting and applying semantics to open fiscal data. In this paper, we demonstrate the feasibility of comparative analysis over linked open fiscal data and devise an approach to perform comparative analysis across different public administrations. Open fiscal data are cleaned, analyzed, transformed (i.e., semantically lifted), and have their related concept labels connected across different public administrations, so budget/spending items from related concepts can be queried. Additionally, the growing information on linked open data (e.g., DBpedia) can also be used to provide additional context to the analysis and the query.
We are very pleased to announce that our group got a paper accepted at K-CAP 2019, the 10th International Conference on Knowledge Capture, which will be held on 19–21 November 2019 in Marina del Rey, California, United States.
The 10th International Conference on Knowledge Capture aims at attracting researchers from diverse areas of Artificial Intelligence, including knowledge representation, knowledge acquisition, the Semantic Web and World Wide Web, intelligent user interfaces for knowledge acquisition and retrieval, innovative query processing and question answering over heterogeneous knowledge bases, novel evaluation paradigms, problem-solving and reasoning, planning, agents, information extraction from text, metadata, tables and other heterogeneous data such as images and videos, machine learning and representation learning, information enrichment and visualization, as well as researchers interested in cyber-infrastructures to foster the publication, retrieval, reuse, and integration of data.
Here is the pre-print of the accepted paper with its abstract:
- “GizMO — A Customizable Representation Model for Graph-Based Visualizations of Ontologies” by Vitalis Wiens, Steffen Lohmann, and Sören Auer.
Abstract: Visualizations can support the development, exploration, communication, and sense-making of ontologies. Suitable visualizations, however, are highly dependent on individual use cases and targeted user groups. In this article, we present a methodology that enables customizable definitions for the visual representation of ontologies. The methodology describes visual representations using the OWL annotation mechanisms and separates the visual abstraction into two information layers. The first layer describes the graphical appearance of OWL constructs. The second layer addresses visual properties for conceptual elements from the ontology. Annotation ontologies and a modular architecture enable separation of concerns for individual information layers. Furthermore, the methodology ensures the separation between the ontology and its visualization. We showcase the applicability of the methodology by introducing GizMO, a representation model for graph-based visualizations in the form of node-link diagrams. The graph visualization meta ontology (GizMO) provides five annotation object types that address various aspects of the visualization (e.g., spatial positions, viewport zoom factor, and canvas background color). The practical use of the methodology and GizMO is shown using two applications that indicate the variety of achievable ontology visualizations.
This work is co-funded by the European Research Council project ScienceGRAPH (Grant agreement #819536). In addition, parts of it evolved in the context of the Fraunhofer Cluster of Excellence “Cognitive Internet Technologies”.
Looking forward to seeing you at K-CAP 2019.
We are very pleased to announce that our group got a paper accepted at ODBASE 2019, the 18th International Conference on Ontologies, DataBases, and Applications of Semantics, which will be held on 22–23 October 2019 in Rhodes, Greece.
The conference on Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems (ODBASE’19) provides a forum on the use of ontologies, rules and data semantics in novel applications. Of particular relevance to ODBASE are papers that bridge traditional boundaries between disciplines such as artificial intelligence and the Semantic Web, databases, data science, data analytics and machine learning, human-computer interaction, social networks, distributed and mobile systems, data and information retrieval, knowledge discovery, and computational linguistics.
Here is the pre-print of the accepted paper with its abstract:
- “Complex Query Augmentation for Question Answering over Knowledge Graphs” by Abdelrahman Abdelkawi, Hamid Zafar, Maria Maleshkova, and Jens Lehmann.
Abstract: Question answering systems often have a pipeline architecture that consists of multiple components. A key component in the pipeline is the query generator, which aims to generate a formal query that corresponds to the input natural language question. Even if the linked entities and relations of an underlying knowledge graph are given, finding the corresponding query that captures the true intention of the input question remains a challenging task, due to the complexity of sentence structure or the features that need to be extracted. In this work, we focus on the query generation component and introduce techniques to support a wider range of questions that are currently less represented in the question answering community.
This research was supported by the European Union H2020 project CLEOPATRA (ITN, GA. 812997) as well as by the German Federal Ministry of Education and Research (BMBF) funding for the project SOLIDE (no. 13N14456).
Looking forward to seeing you at ODBASE 2019.
We are very happy to announce that our group got one paper accepted at iiWAS 2019, the 21st International Conference on Information Integration and Web-based Applications & Services, which will be held on December 2–4 in Munich, Germany.
The 21st International Conference on Information Integration and Web-based Applications & Services (iiWAS2019) is a leading international conference for researchers and industry practitioners to share their new ideas, original research results and practical development experiences from all information integration and web-based applications & services related areas.
iiWAS2019 is endorsed by the International Organization for Information Integration and Web-based Applications & Services (@WAS), and will be held from 2–4 December 2019 in Munich, Germany (a city of innovation, technology, art, and culture), in conjunction with the 17th International Conference on Advances in Mobile Computing & Multimedia (MoMM2019).
Here is the pre-print of the accepted paper with its abstract:
- “Uniform Access to Multiform Data Lakes using Semantic Technologies” by Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer, and Jens Lehmann. Abstract: Increasing data volumes have extensively increased application possibilities. However, accessing this data in an ad hoc manner remains an unsolved problem due to the diversity of data management approaches, formats and storage frameworks, resulting in the need to effectively access and process distributed heterogeneous data at scale. For years, Semantic Web techniques have addressed data integration challenges with practical knowledge representation models and ontology-based mappings. Leveraging these techniques, we provide a solution enabling uniform access to large, heterogeneous data sources, without enforcing centralization; thus realizing the vision of a Semantic Data Lake. In this paper, we define the core concepts underlying this vision and the architectural requirements that systems implementing it need to fulfill. Squerall, an example of such a system, is an extensible framework built on top of state-of-the-art Big Data technologies. We focus on Squerall’s distributed query execution techniques and strategies, empirically evaluating its performance throughout its various sub-phases.
This work is partly supported by the EU H2020 projects BETTER (GA 776280) and QualiChain (GA 822404), and by the ADAPT Centre for Digital Content Technology funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.
Looking forward to seeing you at iiWAS 2019.
We are very pleased to announce that our group got 7 demo/poster papers accepted for presentation at ISWC 2019, the 18th International Semantic Web Conference, which will be held on October 26–30, 2019 in Auckland, New Zealand.
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises, and in the context of public institutions.
Here is the list of the accepted papers with their abstracts:
- “Querying large-scale RDF datasets using the SANSA framework” by Claus Stadler, Gezim Sejdiu, Damien Graux, and Jens Lehmann.
Abstract: In this paper, we present Sparklify: a scalable software component for efficient evaluation of SPARQL queries over distributed RDF datasets. In particular, we demonstrate a W3C SPARQL endpoint powered by our SANSA framework’s RDF partitioning system and Apache Spark for querying the DBpedia knowledge base. This work is motivated by the lack of Big Data SPARQL systems that are capable of exposing large-scale heterogeneous RDF datasets via a Web SPARQL endpoint.
- “How to feed the Squerall with RDF and other data nuts?” by Mohamed Nadjib Mami, Damien Graux, Simon Scerri, Hajira Jabeen, Sören Auer, and Jens Lehmann.
Abstract: Advances in Data Management methods have resulted in a wide array of storage solutions having varying query capabilities and supporting different data formats. Traditionally, heterogeneous data was transformed off-line into a unique format and migrated to a unique data management system, before being uniformly queried. However, with the increasing amount of heterogeneous data sources, many of which are dynamic, modern applications prefer accessing directly the original fresh data. Addressing this requirement, we designed and developed Squerall, a software framework that enables the querying of original large and heterogeneous data on-the-fly without prior data transformation. Squerall is built from the ground up with extensibility in consideration, e.g., supporting more data sources. Here, we explain Squerall’s extensibility aspect and demonstrate step-by-step how to add support for RDF data, a new extension to the previously supported range of data sources.
- “Towards Semantically Structuring GitHub” by Dennis Oliver Kubitza, Matthias Böckmann, and Damien Graux.
Abstract: With the recent increase of open-source projects, tools have emerged to enable developers to collaborate. Among these, git has received much attention, and various online platforms have been created around this tool, hosting millions of projects. Recently, some of these platforms opened APIs to allow users to query their public databases of open-source projects. Despite the common protocol core, there are currently no common structures one could use to link those sources of information. To tackle this, we propose here the first ontology dedicated to the git protocol and also describe GitHub’s features within it, to show how it is extendable to encompass more git-based data sources.
- “Microbenchmarks for Question Answering Systems Using QaldGen” by Qaiser Mehmood, Abhishek Nadgeri, Muhammad Saleem, Kuldeep Singh, Axel-Cyrille Ngonga Ngomo, and Jens Lehmann.
Abstract: Microbenchmarks are used to test the individual components of given systems. Thus, such benchmarks can provide a more detailed analysis pertaining to the different components of the systems. We present a demo of QaldGen, a framework for generating question samples for micro-benchmarking of Question Answering (QA) systems over Knowledge Graphs (KGs). QaldGen is able to select customized question samples from existing QA datasets. The sampling of questions is carried out by using different clustering techniques. It is flexible enough to select benchmarks of varying sizes and complexities according to user-defined criteria on the most important features to be considered for QA benchmarking. We evaluate the usability of the interface by using the standard system usability scale questionnaire. Our overall usability score of 77.25 (ranked B+) suggests that the online interface is recommendable, easy to use, and well integrated.
- “FALCON: An Entity and Relation Linking Framework over DBpedia” by Ahmad Sakor, Kuldeep Singh, and Maria Esther Vidal.
Abstract: We tackle the problem of entity and relation linking and present FALCON, a rule-based tool able to accurately map entities and relations in short texts to resources in a knowledge graph. FALCON resorts to fundamental principles of English morphology (e.g., compounding and headword identification) and performs joint entity and relation linking against a short text. We demonstrate the benefits of the rule-based approach implemented in FALCON on short texts composed of various types of entities. The attendees will observe the behavior of FALCON on the observed limitations of Entity Linking (EL) and Relation Linking (RL) tools. The demo is available at https://labs.tib.eu/falcon/.
- “Demonstration of a Customizable Representation Model for Graph-Based Visualizations of Ontologies – GizMO” by Vitalis Wiens, Mikhail Galkin, Steffen Lohmann, and Sören Auer.
Abstract: Visualizations can facilitate the development, exploration, communication, and sense-making of ontologies. Suitable visualizations, however, are highly dependent on individual use cases and targeted user groups. In this demo, we present a methodology that enables customizable definitions for ontology visualizations. We showcase its applicability by introducing GizMO, a representation model for graph-based visualizations in the form of node-link diagrams. Additionally, we present two applications that operate on the GizMO representation model and enable individual customizations for ontology visualizations.
- “Predict Missing Links Using PyKEEN” by Mehdi Ali, Charles Tapley Hoyt, Daniel Domingo-Fernandez, and Jens Lehmann.
Abstract: PyKEEN is a framework that integrates several approaches to compute knowledge graph embeddings (KGEs). We demonstrate the usage of PyKEEN in a biomedical use case, i.e., we trained and evaluated several KGE models on a biological knowledge graph containing genes’ annotations to pathways and pathway hierarchies from well-known databases. We used the best-performing model to predict new links and present an evaluation in collaboration with a domain expert.
Looking forward to seeing you at ISWC 2019.
We are very pleased to announce that our group got 2 papers accepted at the 4th Workshop on Data Science for Social Good.
SoGood is a peer-reviewed workshop that focuses on how Data Science can and does contribute to social good in its widest sense. The workshop has been held yearly since 2016 together with the ECML PKDD conference; this year it takes place on 20 September in Würzburg, Germany.
Here are the pre-prints of the accepted papers with their abstracts:
- “Linking Physicians to Medical Research Results via Knowledge Graph Embeddings and Twitter” by Afshin Sadeghi and Jens Lehmann.
Abstract: Informing professionals about the latest research results in their field is a particularly important task in health care, since any development in this field directly improves the health status of patients. Meanwhile, social media is an infrastructure that allows instant public sharing of information, and thus it has recently become popular in medical applications. In this study, we apply Multi Distance Knowledge Graph Embeddings (MDE) to link physicians and surgeons to the latest medical breakthroughs that are shared as research results on Twitter. Our study shows that, using this method, physicians can be informed about new findings in their field, given that they have an account dedicated to their profession.
- “Improving Access to Science for Social Good” by Mehdi Ali, Sahar Vahdati, Shruti Singh, Sourish Dasgupta, and Jens Lehmann.
Abstract: One of the major goals of science is to make the world socially a good place to live. The old paradigm of scholarly communication through publishing has generated an enormous amount of heterogeneous data and metadata. However, most scientific results are not easy to discover, in particular those results which benefit social good and are also targeted at non-scientific people. In this paper, we showcase a knowledge graph embedding (KGE) based recommendation system to be used by students involved in activities aiming at social good. The recommendation system has been trained on a scholarly knowledge graph, which we constructed. The obtained results highlight that the KGEs successfully encoded the structure of the KG, and therefore, our system could provide valuable recommendations.
This study is partially supported by the projects MLwin (Maschinelles Lernen mit Wissensgraphen, grant no. 01IS18050F), Cleopatra (grant no. 812997), and LAMBDA (GA no. 809965), by EPSRC grant EP/M025268/1, and by WWTF grant VRG18-013. The authors gratefully acknowledge financial support from the Federal Ministry of Education and Research of Germany (BMBF), which is funding MLwin, and from the European Union Marie Curie ITN that funds Cleopatra, as well as from Fraunhofer IAIS.