Automatic Bad Data Detection and Constraint Inference

Test-driven data analysis (TDDA) is a methodology and toolset for reducing errors in data science through a combination of testing and constraints. Existing TDDA software can automatically generate constraints to characterize good data, including automatically finding regular expressions that identify patterns in structured string data. The generated constraints can then be used to verify new data as it arrives.
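
As a rough illustration of the discover-then-verify cycle described above, the following minimal Python sketch infers a few simple per-field constraints from records assumed to be good and then checks new records against them. It is not the TDDA library's API: the function names `discover` and `verify`, the constraint names, and the sample records are all hypothetical, and regular-expression inference is left out.

```python
def discover(rows):
    """Infer simple per-field constraints from records assumed to be good."""
    constraints = {}
    for field in sorted({key for row in rows for key in row}):
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None]
        # A field is nullable only if a missing value was seen in training.
        c = {"nullable": len(present) < len(values)}
        if present and all(isinstance(v, (int, float)) for v in present):
            c.update(type="number", min=min(present), max=max(present))
        elif present and all(isinstance(v, str) for v in present):
            c.update(type="string", max_length=max(len(v) for v in present))
        constraints[field] = c
    return constraints


def verify(rows, constraints):
    """Return (row_index, field, constraint_name) for every violation."""
    failures = []
    for i, row in enumerate(rows):
        for field, c in constraints.items():
            v = row.get(field)
            if v is None:
                if not c["nullable"]:
                    failures.append((i, field, "nullable"))
            elif c.get("type") == "number":
                if not isinstance(v, (int, float)):
                    failures.append((i, field, "type"))
                else:
                    if v < c["min"]:
                        failures.append((i, field, "min"))
                    if v > c["max"]:
                        failures.append((i, field, "max"))
            elif c.get("type") == "string":
                if not isinstance(v, str):
                    failures.append((i, field, "type"))
                elif len(v) > c["max_length"]:
                    failures.append((i, field, "max_length"))
    return failures


# Hypothetical training data (assumed good) and newly arrived data.
good = [
    {"age": 34, "code": "AB12"},
    {"age": 51, "code": "CD34"},
    {"age": 27, "code": "EF56"},
]
new = [
    {"age": 40, "code": "GH78"},
    {"age": 140, "code": "TOO-LONG"},
    {"age": None, "code": "IJ90"},
]

constraints = discover(good)
failures = verify(new, constraints)
print(failures)
```

Because every constraint here is derived directly from the training values, verifying the training data itself can never fail, which is the limitation the talk sets out to remove.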

Until now, the training datasets used for constraint generation have had to contain only good data, with the result that all generated constraints are guaranteed to be satisfied by the training data. In this talk, we will review constraint discovery in TDDA and then show how these approaches can be extended so that constraints can be inferred from datasets that include (possibly) bad data. This makes the method more widely applicable, since most datasets include bad data, and also means that verifying the training data against the generated constraints can automatically and immediately identify candidate bad data.
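
The abstract does not say how constraints are inferred in the presence of bad data. Purely to make the idea concrete, the sketch below uses one simple possibility, which is an assumption of this example and not necessarily the approach presented in the talk: range constraints are derived from the median and the median absolute deviation (MAD) rather than from the raw minimum and maximum, so that a few extreme values do not widen the range. The names `robust_range` and `candidates`, the multiplier `k`, and the sample values are all hypothetical.

```python
import statistics


def robust_range(values, k=3.0):
    """Infer (low, high) from the bulk of the values, ignoring extremes.

    Values further than k * MAD from the median are treated as suspect;
    the returned bounds are the smallest and largest remaining values.
    Note: if the MAD is zero, only values equal to the median are kept.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    kept = [v for v in values if abs(v - med) <= k * mad]
    return min(kept), max(kept)


def candidates(values, low, high):
    """Return (index, value) for training values outside the inferred range."""
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical training column containing two implausible ages.
ages = [34, 51, 27, 45, 38, 29, 999, 41, -1, 36]

low, high = robust_range(ages)
suspects = candidates(ages, low, high)
print((low, high), suspects)
```

Unlike constraints taken from the raw minimum and maximum, these bounds can be violated by the training data itself, so verifying the training data against them immediately yields a list of candidate bad values for review.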

Nick Radcliffe
CEO, Stochastic Solutions Limited

Nick Radcliffe is the founder of Stochastic Solutions Limited, a specialist Edinburgh-based data science company that focuses on high-quality, test-driven data analysis, from basic data extraction, manipulation and profiling through behaviour measurement, segmentation, anomaly detection, and predictive modelling and scoring.

Prior to founding Stochastic Solutions, Nick founded Quadstone Limited and served as its Chief Technology Officer. Quadstone was an Edinburgh-based software house that specialized in helping companies to improve their customer relationship management using a combination of advanced modelling and high-performance parallel processing. It was acquired by Portrait Software in late 2005.

Nick is also a Visiting Professor of Mathematics at the University of Edinburgh, working in the Operational Research group. His research has focused on the use of randomized (stochastic) approaches to optimization, and he was one of the early researchers in the now established field of genetic algorithms and evolutionary computation.