Applying machine learning for data quality
The quantity of data errors may have been increasing in parallel with today's explosion in the amount of data. Given this exponential growth, those errors simply cannot be identified manually using traditional tools. For this reason, methods that automate quality control are playing a prominent role, and the Statistics Directorate of the MNB is also developing and applying them.
A paradigm shift is taking place in central bank data collection: aggregated data series are gradually being replaced by granular, contract-level or customer-level data. Aggregated data can be reviewed manually with good results or, more formally, checked with rule-based functions that calculate how far newly received values differ from previous averages. Time series machine learning methods can also be applied to these data: they estimate a statistical model and calculate the difference between the reported value and the expected value. If this difference exceeds a certain range, the central bank can request an explanation from the data provider. Such an explanation may not only help in understanding financial processes better, but may also uncover errors.
With granular data, central banks can fall back on this aggregate-based time series check. However, more sophisticated contract-level methods can also be developed to detect suspicious cases. To illustrate, a loan reported as a consumer loan with a low, fixed interest rate and a relatively high amount of several million forints is more likely to be a mortgage. Machine learning methods can identify relationships of this kind in almost any data set, calculate how far each record differs from its expected value, and flag suspicious cases. If similar cases are found for the same data provider, not only data recording errors but also data processing errors can be detected.
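The two kinds of check described above can be illustrated with a minimal Python sketch. It is not the MNB's method: the function names, the tolerance band of three standard deviations, the nearest-centroid classifier used as a stand-in for a machine learning model, and all figures are assumptions made purely for illustration.

```python
from collections import Counter
from math import dist, log10
from statistics import mean, stdev


def check_aggregate(history, reported, k=3.0):
    """Compare a newly reported aggregate with the average of earlier values."""
    expected = mean(history)
    tolerance = k * stdev(history)  # assumed band: k sample standard deviations
    difference = reported - expected
    return expected, difference, abs(difference) > tolerance


def features(rate, amount):
    """Describe a contract by its interest rate and the magnitude of its amount."""
    return (rate, log10(amount))


def fit_centroids(reference):
    """Learn one centroid per loan type from (loan_type, rate, amount) records."""
    points = [features(rate, amount) for _, rate, amount in reference]
    # Scale each feature by its overall spread so neither dominates the distance.
    scales = [stdev(column) for column in zip(*points)]
    by_type = {}
    for (loan_type, _, _), point in zip(reference, points):
        by_type.setdefault(loan_type, []).append(point)
    centroids = {
        loan_type: tuple(mean(col) / s for col, s in zip(zip(*pts), scales))
        for loan_type, pts in by_type.items()
    }
    return centroids, scales


def flag_contracts(reports, centroids, scales):
    """Return reports whose features sit closer to another loan type's centroid."""
    flagged = []
    for provider, reported_type, rate, amount in reports:
        point = tuple(v / s for v, s in zip(features(rate, amount), scales))
        nearest = min(centroids, key=lambda t: dist(point, centroids[t]))
        if nearest != reported_type:
            flagged.append((provider, reported_type, nearest))
    return flagged


# Aggregate check: invented history of one reported series and a new value.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
expected, difference, needs_explanation = check_aggregate(history, 120.0)

# Contract-level check: invented, previously validated contracts (rate in %, HUF).
reference = [
    ("consumer", 14.0, 800_000), ("consumer", 16.5, 1_500_000),
    ("consumer", 12.9, 500_000), ("consumer", 15.2, 2_000_000),
    ("mortgage", 6.5, 18_000_000), ("mortgage", 7.1, 25_000_000),
    ("mortgage", 5.9, 12_000_000), ("mortgage", 6.8, 30_000_000),
]
# Newly reported contracts: (data provider, reported type, rate in %, amount in HUF).
reports = [
    ("A", "consumer", 15.0, 1_000_000),
    ("B", "consumer", 6.4, 15_000_000),
    ("B", "consumer", 6.9, 22_000_000),
    ("A", "mortgage", 6.6, 20_000_000),
    ("B", "mortgage", 6.2, 16_000_000),
]
centroids, scales = fit_centroids(reference)
flagged = flag_contracts(reports, centroids, scales)
# Several flags at one provider hint at a processing error rather than a typo.
flags_per_provider = Counter(provider for provider, _, _ in flagged)
```

In this toy data the new aggregate value falls outside the band and would trigger a request for an explanation, while both suspicious contracts belong to the hypothetical provider "B", which is the pattern that points to a data processing error. A production system would presumably use a richer model and many more features, but the flow of estimating an expected value, measuring the deviation and flagging the outliers stays the same.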
Like the reporting of granular data itself, the methods outlined above are still in their infancy and are continuously evolving. In any case, machine learning is here to stay and is playing an increasingly important role in quality assurance.

