Meaningfully evaluating large-scale machine learning under privacy constraints
With Dmitrii Usynin
Obtaining high-quality data to train machine learning models that generalise well can be challenging due to (a) regulatory constraints and (b) a lack of incentives for data owners to participate. The first issue can be addressed by combining distributed machine learning techniques (e.g. federated learning) with privacy-enhancing technologies (PETs), such as differentially private (DP) model training. The second can be addressed by rewarding participants for contributing data that benefits the trained model, which is particularly important in federated settings, where data is unevenly distributed. However, many of the PETs that make such collaborations compliant with data protection regulations can inadvertently affect the fairness of the reward distribution. Taking DP as a practical example, the randomised noise it injects disproportionately affects underrepresented and atypical (yet often informative) data samples, making it difficult to assess their usefulness for the final model and potentially reducing the monetary incentives for underrepresented subgroups. Resolving this problem requires answering the following questions: (a) why do we need PETs in the first place, (b) how can we apply PETs to make large-scale ML regulation-compliant, and (c) what adaptations are needed to make reward allocation meaningful in such settings?
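To make the tension concrete, here is a minimal sketch of DP-SGD-style gradient aggregation; the function name dp_aggregate, the clip_norm and noise_multiplier parameters, and the toy gradients are illustrative assumptions rather than part of any specific framework. Each per-sample gradient is clipped to a fixed norm and Gaussian noise calibrated to that norm is added to the sum, so the contribution of a rare, atypical sample is both capped and masked, which is precisely what makes its usefulness hard to attribute.

```python
import numpy as np

def dp_aggregate(per_sample_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """DP-SGD-style aggregation (illustrative sketch): clip each per-sample
    gradient to `clip_norm`, sum the clipped gradients, and add Gaussian
    noise scaled to the clipping bound before averaging.

    Atypical samples tend to produce large, distinctive gradients; clipping
    caps their influence and the added noise further masks their (already
    bounded) contribution.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_sample_grads)

# Toy illustration: 99 typical samples plus one atypical sample whose
# gradient points in a direction no other sample contributes to.
grads = [np.array([0.1, 0.0]) for _ in range(99)] + [np.array([0.0, 5.0])]
print(dp_aggregate(grads))  # the atypical direction is largely clipped and noised away
```

In this toy setting the second coordinate, carried only by the atypical sample, is reduced to roughly the noise level, which illustrates why per-sample contribution estimates (and any rewards derived from them) become unreliable under DP.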
Standardised, verified and versatile open-source frameworks that connect individual methods for trustworthy ML into unified pipelines. Currently, each research lab maintains its own benchmarks, libraries and custom connectors, making collaboration and verification of previous results significantly more challenging.