Data Analytics @ QCRI

Rayyan aims to build tools to support the process of creating, analyzing, and maintaining systematic reviews, in terms of data extraction, cleaning, integration, and mining of published clinical trials and journal articles. A production system is available here and a demo here.

NADEEF ("clean" in Arabic) is a generalized data cleaning system. As a commodity data cleaning system, NADEEF aims to be extensible, generic, and easy to deploy. More can be found here.

KATARA aims to perform trusted data cleaning by using reliable knowledge bases, augmented with crowdsourcing for validation.

Rheem is a system that provides both platform independence and interoperability across multiple platforms. Rheem acts as a proxy between user applications and existing data processing systems. It is fully based on user-defined functions (UDFs) to provide adaptability as well as extensibility. For more details see here.

Users often encounter errors in the results of a query. We introduce DBRx, a system for discovering concise explanations of data anomalies.

Our Data Forensics with Analytics (DAFNA) project focuses on truth discovery and the veracity of Big Data. The main goal of the DAFNA project is to design a scalable and accurate truth discovery system that scores the veracity of conflicting information extracted from multiple online sources. Conflicting information, rumors, and erroneous content can easily be claimed and propagated by multiple online sources, making it hard to distinguish what is true from what is not. In the Data Analytics group of QCRI, our current research goal in Truth Discovery and Fact-Checking is to determine the veracity of multi-source data. We designed a truth discovery system, AllegatorTrack, that can discriminate true from false conflicting values and provide trustworthiness scores for the sources claiming them. AllegatorTrack's main features are: (i) the comparison of twelve existing methods for fact-checking, (ii) the Bayesian combination of their results, (iii) the generation of allegations to falsify true claims, (iv) the explanation of truth discovery results, and (v) the visualization of truth discovery results.
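To make the idea of scoring conflicting values and source trustworthiness concrete, here is a minimal sketch of a generic iterative truth discovery loop. It is not AllegatorTrack's algorithm or any of the twelve methods it compares; the function name `discover_truth`, the triple format, and the example data are all assumptions made for illustration.

```python
from collections import defaultdict


def discover_truth(claims, iterations=10):
    """Hypothetical helper: claims is a list of (source, item, value) triples.

    Returns (truths, trust), where truths maps each item to its most
    credible value and trust maps each source to a score in [0, 1].
    """
    sources = {s for s, _, _ in claims}
    trust = {s: 0.5 for s in sources}  # start from uniform trust
    truths = {}
    for _ in range(iterations):
        # Step 1: score each candidate value by the summed trust of the
        # sources that claim it, and keep the best value per item.
        scores = defaultdict(lambda: defaultdict(float))
        for s, item, value in claims:
            scores[item][value] += trust[s]
        truths = {item: max(vals, key=vals.get) for item, vals in scores.items()}

        # Step 2: a source's trust becomes the fraction of its claims
        # that agree with the current truth estimates.
        hits = defaultdict(int)
        total = defaultdict(int)
        for s, item, value in claims:
            total[s] += 1
            hits[s] += int(truths[item] == value)
        trust = {s: hits[s] / total[s] for s in sources}
    return truths, trust


# Made-up example: three sources, two items, one conflicting claim.
example_claims = [
    ("A", "x", 1), ("A", "y", 2),
    ("B", "x", 1), ("B", "y", 2),
    ("C", "x", 5), ("C", "y", 2),
]
truths, trust = discover_truth(example_claims)
```

In this toy run, source C disagrees with the other two on item `x`, so it ends up with a lower trust score than A and B. Real truth discovery methods refine both steps, for example by modeling source dependence or value similarity.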

The prototype geotagger identifies locations in documents from the World Bank Projects Data API using the Stanford Named Entity Recognizer (NER) and Alchemy, geocodes them with the Google Geocoder, Yahoo! Placefinder, and Geonames, and visualizes them on a map. A live demo can be found here.

Web data is a great opportunity, but using it in analytics requires new solutions to overcome its variety and volatility. In this project we exploit web data for data integration tasks.

Anomaly detection is an important step of data cleaning. Although a wide range of anomaly detection methods exists, such as statistical outlier detection and violation detection based on integrity constraints, it remains a hard problem to apply and combine multiple heterogeneous anomaly detection methods on a given dirty dataset. The PEARL project aims to efficiently combine multiple anomaly detection methods from different paradigms, using ensembling, transfer learning, and active learning to maximize detection quality.
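As a rough illustration of combining heterogeneous detectors, the sketch below pairs a statistical outlier test (z-score) with an integrity constraint check (a functional dependency) and merges their output by simple voting. This is a deliberately simplified stand-in, not PEARL's method, which relies on ensembling, transfer, and active learning. All function names, the threshold, and the sample rows are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_outliers(rows, column, threshold=2.0):
    """Flag row indices whose value lies more than `threshold` std devs from the mean."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold}


def fd_violations(rows, lhs, rhs):
    """Flag row indices involved in a violation of the functional dependency lhs -> rhs."""
    seen = {}
    for r in rows:
        seen.setdefault(r[lhs], set()).add(r[rhs])
    return {i for i, r in enumerate(rows) if len(seen[r[lhs]]) > 1}


def combine(detections, min_votes=1):
    """Keep the row indices flagged by at least `min_votes` detectors."""
    votes = Counter(i for flagged in detections for i in flagged)
    return {i for i, n in votes.items() if n >= min_votes}


# Made-up dirty dataset: one salary outlier and one zip/city conflict.
rows = [
    {"zip": "10001", "city": "NYC", "salary": 50},
    {"zip": "10001", "city": "NYC", "salary": 52},
    {"zip": "10001", "city": "Boston", "salary": 51},
    {"zip": "60601", "city": "Chicago", "salary": 49},
    {"zip": "60601", "city": "Chicago", "salary": 500},
    {"zip": "60601", "city": "Chicago", "salary": 48},
]
detections = [zscore_outliers(rows, "salary"), fd_violations(rows, "zip", "city")]
suspects = combine(detections, min_votes=1)
```

With `min_votes=1` the result is the union of both detectors; raising `min_votes` trades recall for precision. Choosing that trade-off well, per dataset and per detector, is the kind of decision a learned combination would make instead of a fixed vote count.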