May 10, 2024

[PrivateAI] Data inconsistencies detection

[PrivateAI] Data inconsistencies detection uses knowledge graphs to improve the detection and exclusion of low-quality scientific papers from research pools. This document outlines our approach to consistency analysis with knowledge graphs.

Our process begins with the systematic construction of knowledge graphs from scientific texts. We use a Natural Language Processing (NLP) library, spaCy, to extract entities and relationships from papers, transforming unstructured text into a structured graph that represents the semantics of the research content.

1: Entity Recognition and Relation Extraction

We employ NLP tools to identify key entities (e.g., study variables, outcomes, methodologies) and their relations. This structured extraction forms the nodes and edges of our knowledge graph. This process has been described in our previous reports.

Here's an example code snippet:
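A minimal sketch of this extraction step, assuming spaCy v3 is installed. To avoid depending on a trained model it uses a blank English pipeline with a rule-based entity ruler. The labels (INTERVENTION, OUTCOME, METHOD), the patterns, the "co_occurs_with" relation and the function name are illustrative assumptions, and sentence-level co-occurrence is only a simple stand-in for real relation extraction.

```python
import spacy

# Blank English pipeline: tokenizer only, so no trained model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitting
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Hypothetical patterns; a production system would use a trained NER model.
    {"label": "INTERVENTION", "pattern": "Medicament A"},
    {"label": "OUTCOME", "pattern": "depression"},
    {"label": "METHOD", "pattern": "randomized controlled trial"},
])


def extract_graph_elements(text):
    """Return (nodes, edges): entities and naive same-sentence relations."""
    doc = nlp(text)
    nodes = {(ent.text, ent.label_) for ent in doc.ents}
    edges = set()
    for sent in doc.sents:
        ents = list(sent.ents)
        # Link every pair of entities that appears in the same sentence.
        for i, head in enumerate(ents):
            for tail in ents[i + 1:]:
                edges.add((head.text, "co_occurs_with", tail.text))
    return nodes, edges


text = ("A randomized controlled trial evaluated Medicament A in adults. "
        "Medicament A reduced symptoms of depression.")
nodes, edges = extract_graph_elements(text)
```

The returned nodes and edges can be loaded directly into a graph library as the nodes and edges of a per-paper knowledge graph.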

2: Knowledge Graph Integration

By integrating individual graphs from related papers, we create a unified graph that spans a body of research. This integration allows us to identify discrepancies and overlaps, providing a holistic view of the research landscape on any given topic.
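The integration step can be sketched as follows, assuming each paper has already been reduced to (subject, relation, object) triples. The triples, paper identifiers and helper names are hypothetical. Each edge keeps a "source" attribute so that every claim in the unified graph can be traced back to its paper.

```python
import networkx as nx

# Hypothetical per-paper triples: (subject, relation, object).
paper_triples = {
    "paper_1": [("Medicament A", "reduces", "Depression"),
                ("Depression", "measured_by", "PHQ-9")],
    "paper_2": [("Medicament A", "reduces", "Depression"),
                ("Medicament A", "causes", "Insomnia")],
}


def build_paper_graph(paper_id, triples):
    """Build one paper's graph; the edge key records paper and relation."""
    G = nx.MultiDiGraph()
    for subj, rel, obj in triples:
        G.add_edge(subj, obj, key=(paper_id, rel), relation=rel, source=paper_id)
    return G


def find_overlaps(G):
    """Return claims (subject, relation, object) supported by more than one paper."""
    support = {}
    for u, v, data in G.edges(data=True):
        support.setdefault((u, data["relation"], v), set()).add(data["source"])
    return {claim: papers for claim, papers in support.items() if len(papers) > 1}


graphs = [build_paper_graph(pid, triples) for pid, triples in paper_triples.items()]
unified = nx.compose_all(graphs)  # union of nodes and edges across papers
overlaps = find_overlaps(unified)
```

A multigraph is used here so that two papers making a claim about the same pair of entities produce two parallel edges instead of overwriting each other.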

3: Consistency Analysis

We apply graph analytics to evaluate the consistency of information across multiple papers. By employing algorithms that can detect anomalies and contradictions within the graph, such as incompatible data points or conflicting results, we can flag potentially low-quality research.

NetworkX is a Python library used for creating, manipulating, and studying the structure, dynamics, and functions of complex networks. In the context of knowledge graphs, NetworkX is particularly useful because it allows for easy construction and analysis of networks (or graphs), enabling tasks such as adding nodes, edges, and querying graphs for specific patterns or inconsistencies.

In our implementation, NetworkX represents and analyzes the knowledge graph. The library provides a straightforward way to attach additional data to nodes and edges, which is critical for representing the complex relationships found in scientific papers. The function “detect_inconsistencies” uses NetworkX to traverse the graph, check the relationships between nodes, and find contradictions or anomalies that could indicate low-quality data. This makes NetworkX well suited to processing and analyzing the large sets of interconnected data typical of knowledge graphs.
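One way such a check could look is sketched below. Only the function name detect_inconsistencies comes from the text; the body, the "relation" and "source" edge attributes, and the table of contradictory relation pairs are assumptions made for illustration.

```python
import networkx as nx

# Hypothetical pairs of relation labels treated as mutually exclusive.
CONTRADICTORY = {
    frozenset({"reduces", "increases"}),
    frozenset({"causes", "prevents"}),
}


def detect_inconsistencies(G):
    """Return (subject, object, relation_a, relation_b, sources) per conflict."""
    by_pair = {}
    for u, v, data in G.edges(data=True):
        # Group the reporting papers by node pair and by relation label.
        relations = by_pair.setdefault((u, v), {})
        relations.setdefault(data["relation"], set()).add(data["source"])
    issues = []
    for (u, v), relations in by_pair.items():
        names = sorted(relations)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if frozenset({a, b}) in CONTRADICTORY:
                    sources = sorted(relations[a] | relations[b])
                    issues.append((u, v, a, b, sources))
    return issues


G = nx.MultiDiGraph()
G.add_edge("Medicament A", "Depression", relation="reduces", source="paper_1")
G.add_edge("Medicament A", "Depression", relation="increases", source="paper_2")
G.add_edge("Medicament A", "Insomnia", relation="causes", source="paper_1")

issues = detect_inconsistencies(G)
```

In this toy graph the only conflict reported is between paper_1 and paper_2 on the effect of "Medicament A" on "Depression".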

To further refine our Knowledge Graphs and enhance the detection of low-quality data, we incorporate several AI techniques:

Graph Neural Networks (GNNs): These learn the topological structure of knowledge graphs, which allows them to detect complex patterns that may indicate data inconsistencies or irregularities. A good example is "Predicting Drug-Drug Interactions Using Knowledge Graphs".

Embedding Techniques: We use knowledge graph embeddings to map high-dimensional graph data into a lower-dimensional space, preserving the inherent properties of the graph while simplifying computational analysis.

Anomaly Detection Models: These models are tailored to identify outliers and anomalies in graph data, which often correspond to errors or unusual reporting in scientific papers.
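To make the anomaly detection idea concrete, here is a deliberately simple statistical stand-in rather than a GNN or a trained graph model. It assumes that several papers report a numeric value for the same edge of the graph (the effect sizes below are invented) and flags reports whose modified z-score, based on the median absolute deviation, exceeds the commonly used cutoff of 3.5.

```python
from statistics import median


def flag_outliers(reports, threshold=3.5):
    """Return paper ids whose value is an outlier by modified z-score.

    reports maps a paper id to the value that paper reports for one edge.
    """
    values = list(reports.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread among reports, nothing to compare against
    return sorted(
        paper for paper, v in reports.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical effect sizes reported for the edge Medicament A -> Depression.
effect_sizes = {
    "paper_1": 0.42,
    "paper_2": 0.38,
    "paper_3": 0.45,
    "paper_4": 0.40,
    "paper_5": 1.90,
}
outliers = flag_outliers(effect_sizes)
```

The median-based score is used because a single extreme report barely shifts the median, whereas it would inflate a mean and standard deviation and could mask itself.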

Implementation and Continuous Learning

Our implementation involves real-time processing of new research papers, continuously updating our knowledge graphs with the latest information. This dynamic system ensures that our detection mechanisms evolve with the advancing scientific discourse, maintaining a high standard of data quality and reliability.
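The continuous update can be sketched as an ingestion function that adds one paper at a time and checks only the edges that paper touches. The function name ingest_paper, the edge attributes and the contradiction table are hypothetical and do not describe the production system.

```python
import networkx as nx

# Hypothetical pairs of relation labels treated as mutually exclusive.
CONTRADICTORY = {frozenset({"reduces", "increases"})}


def ingest_paper(G, paper_id, triples):
    """Add one paper's triples to G and return conflicts with earlier papers."""
    conflicts = []
    for subj, rel, obj in triples:
        if G.has_edge(subj, obj):
            # Compare the new claim with every claim already stored for this pair.
            for data in G[subj][obj].values():
                if frozenset({rel, data["relation"]}) in CONTRADICTORY:
                    conflicts.append(
                        (subj, obj, rel, data["relation"], data["source"])
                    )
        G.add_edge(subj, obj, relation=rel, source=paper_id)
    return conflicts


G = nx.MultiDiGraph()
first = ingest_paper(G, "paper_1", [("Medicament A", "reduces", "Depression")])
second = ingest_paper(G, "paper_2", [("Medicament A", "increases", "Depression")])
```

Checking only the affected node pairs keeps the cost of each update proportional to the size of the new paper, not to the size of the whole graph.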

To merge two knowledge graphs and detect inconsistencies in their data, we can use Python with “NetworkX” for graph operations and “pandas” for handling tabular data. The example here merges knowledge graphs from two research studies that report on a hypothetical drug, "Medicament A": its effect on depression and the number of people affected by depression.

Here's how we can implement this using NetworkX:

Graph Creation: Two directed graphs (G1 and G2) are created to represent the conflicting data from two studies.

Adding Nodes and Edges: We add nodes for "Depression" and edges from "Medicament A" to "Depression", with attributes to reflect the effects and statistics reported by the studies.

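Merging Graphs: The merge_graphs function uses NetworkX's compose function to combine G1 and G2. compose takes the union of the nodes and edges of both graphs. Where a node or edge appears in both graphs with the same attribute, the value from the second graph takes precedence, so conflicting values have to be compared on the original graphs and cannot be recovered from the merged graph alone.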

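Detecting Inconsistencies: The detect_inconsistencies function checks the nodes and relationships that both studies share and reports contradictions, such as conflicting effects or statistics.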
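The four steps can be sketched as follows. The names G1, G2, merge_graphs and detect_inconsistencies come from the description above; the helper build_study_graph, the attribute names and all values are hypothetical. Because nx.compose keeps the second graph's value when both graphs set the same attribute, this sketch compares the two original graphs and does not inspect the merged one.

```python
import networkx as nx


def build_study_graph(effect, affected_millions):
    """Create one study's directed graph (all values here are invented)."""
    G = nx.DiGraph()
    G.add_node("Medicament A")
    G.add_node("Depression", affected_millions=affected_millions)
    G.add_edge("Medicament A", "Depression", effect=effect)
    return G


def merge_graphs(G1, G2):
    """Union of both graphs; on attribute clashes the value from G2 wins."""
    return nx.compose(G1, G2)


def _diff(label, a, b):
    """Yield (label, attribute, value_1, value_2) for differing shared attributes."""
    for key in sorted(a.keys() & b.keys()):
        if a[key] != b[key]:
            yield (label, key, a[key], b[key])


def detect_inconsistencies(G1, G2):
    """Compare attributes on the nodes and edges that both graphs share."""
    issues = []
    for node in sorted(set(G1.nodes) & set(G2.nodes)):
        issues.extend(_diff(node, G1.nodes[node], G2.nodes[node]))
    for u, v in sorted(set(G1.edges) & set(G2.edges)):
        issues.extend(_diff((u, v), G1.edges[u, v], G2.edges[u, v]))
    return issues


G1 = build_study_graph(effect="reduces", affected_millions=250)
G2 = build_study_graph(effect="no effect", affected_millions=310)
merged = merge_graphs(G1, G2)
issues = detect_inconsistencies(G1, G2)
```

The resulting list of tuples has a fixed shape, so it can be passed to a pandas DataFrame for tabular review if desired.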


To enhance the robustness of the system, expert review and feedback can be integrated. This allows for the refinement of AI predictions and adjustments based on human expertise, particularly in complex cases where automated systems might struggle.

In our approach to detecting inconsistencies across multiple research papers, we emphasize the crucial role of integrating human expertise through our "Human-in-the-Loop" system. This component leverages our proprietary platform, which not only automates the detection of data inconsistencies using advanced AI technologies but also incorporates direct feedback from users—real human researchers who can review every paper uploaded to our platform.

Here’s how the human-in-the-loop system enhances our process:

Human Review: Each paper submitted to our platform is subject to scrutiny by experienced researchers. This allows for an immediate human perspective on the content, which can identify nuances and contextual inconsistencies that AI might overlook.

Feedback Mechanism: Users can provide feedback on each paper, contributing to a richer understanding of the research quality. This feedback is instrumental in refining the AI algorithms, as it provides real-world insights and corrections that are invaluable for training and adjusting the models.

Dynamic Improvement: By continuously integrating human feedback into our AI models, we ensure that our system not only adapts to evolving scientific standards but also aligns closely with the expert consensus in various research fields.
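As a rough illustration of how reviewer feedback could feed back into the automated checks, the sketch below treats each review as a label on an automated flag and nudges a flagging threshold accordingly. The Review record, the threshold, the target precision and the step size are all hypothetical and are not a description of the platform's actual mechanism.

```python
from dataclasses import dataclass


@dataclass
class Review:
    paper_id: str
    reviewer: str
    confirms_flag: bool  # True if the reviewer agrees the paper is inconsistent


def flag_precision(reviews):
    """Share of reviewed automated flags that reviewers confirmed."""
    if not reviews:
        return None
    return sum(r.confirms_flag for r in reviews) / len(reviews)


def adjust_threshold(threshold, precision, target=0.8, step=0.05):
    """Raise the threshold when too many flags are rejected, else lower it."""
    if precision is None:
        return threshold  # no feedback yet, keep the current setting
    if precision < target:
        return min(1.0, threshold + step)
    return max(0.0, threshold - step)


reviews = [
    Review("paper_1", "reviewer_a", True),
    Review("paper_2", "reviewer_b", False),
    Review("paper_3", "reviewer_a", False),
]
precision = flag_precision(reviews)
new_threshold = adjust_threshold(0.5, precision)
```

Here only one of three flags was confirmed, so the threshold moves up and the system flags less aggressively until reviewers confirm a larger share of its flags.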

This human-centered approach ensures that PrivateAI's platform remains sensitive to the complex nature of scientific research, enhancing the reliability and accuracy of our inconsistency detection processes. 

By leveraging these methods, PrivateAI aims to uphold the highest standards of data quality and integrity in scientific research. This comprehensive approach not only aids in identifying low-quality or inconsistent research but also supports the broader scientific community by ensuring that conclusions drawn from research are reliable and based on consistent data. Through these efforts, PrivateAI is committed to contributing to the advancement of scientific knowledge and the reliability of published research.

