Hands-on decentralized privacy preserving Machine Learning

Note from the author
At Kick-start, we enjoy our day-to-day work, especially because we get to use state-of-the-art tech in all of our projects. Medical AI is a subject our team is particularly interested in, which is why we took pleasure in writing this article. You can take it as our honest recommendation for kick-starting your next ML-on-the-edge project. We thought that sharing bits of insight with tech-savvy mates would be a nice thing to do, so stay tuned for more tech articles!

Newcomers Intro

Machine Learning (ML) is a collection of techniques and strategies for developing data-driven systems. You may skip this section if you already have some ML background, but you might as well stick around for a quick run-through on the concepts and the terminology.

The ML processes generally consist of learning how to distinguish between the characteristic features of the underlying populations within the data. ML algorithms have to learn how to execute a given task without being explicitly programmed to carry it out.

The execution of those tasks is based on predictions the ML algorithms make by looking for patterns in data. To run efficiently, machine learning models often demand powerful hardware, but this situation has been changing steadily in recent years (we’ll talk about ML on the edge below).

Quick recap on the categories of ML techniques and the goals of the algorithms:

  • Classification: predicting the correct label of a sample (e.g. distinguishing different objects in pictures);
  • Regression: estimating an unknown function describing the behavior of the data population (e.g. estimating house prices based on specs like location, surface area, etc.);
  • Clustering: grouping samples based on their similarity to each other (e.g. behavior-based grouping of potential clients for developing marketing campaigns);
  • Dimensionality Reduction: discovering the variables most relevant to how the data behaves (e.g. finding which details of an email matter for automatically sending it to the spam folder);
  • Policy Search: finding the policy that maximizes long-term reward (e.g. a vacuum cleaner robot that learns how to perform its task better);
  • Density Estimation: finding the underlying population’s probability density function (e.g. finding the right parameters for your statistical models over the population).

and based on the training process:

  • Supervised learning: both the data and the correct labels are provided (e.g. an animal image collection backed by data on which animals are present in each picture);
  • Unsupervised learning: the labels are not given, so the focus is on discovering them; the aim is to identify the hidden underlying structure of the data (e.g. grouping people together based on their traits);
  • Reinforcement learning: the model is rewarded based on its current performance, and those rewards are used to improve the next iterations (e.g. a gaming AI that gradually learns how to score more points).
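To make the supervised classification case a bit more tangible, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is only an illustration, not a recommended model: the function names fit_centroids and predict, the two features (weight and height) and the four labelled samples are all made up for this example.

```python
from collections import defaultdict

def fit_centroids(samples, labels):
    """Supervised step: average the feature vectors of each labelled class."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return {
        label: tuple(sum(col) / len(points) for col in zip(*points))
        for label, points in groups.items()
    }

def predict(centroids, x):
    """Classification: return the label whose centroid is closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

# Toy data (made up): (weight in kg, height in cm) of animals, with labels.
samples = [(4.0, 25.0), (5.0, 23.0), (30.0, 60.0), (28.0, 55.0)]
labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(samples, labels)
print(predict(centroids, (4.5, 24.0)))   # cat
print(predict(centroids, (29.0, 58.0)))  # dog
```

The labels are what make this supervised: without them, the same points could only be grouped by similarity, which is the clustering setting.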

We should be good on the theory side for now, as this section has provided you with some good intuition on how these algorithms are being designed.

Machine Learning on the Edge

Moving away from the introductory theory, let’s talk about real world uses of ML. The data those algorithms use lies all around us, in the world we observe, in our actions and interactions. It’s never been easier for us to collect this information and this is mainly due to the way we have deeply integrated all sorts of technologies within our daily lives. We’re going to tell you how the recent approaches aim to take advantage of this growing IoT landscape.

Machine Learning on the edge (Edge ML) is a method of lowering dependency on central servers and cloud networks by allowing each device to analyze data locally (either using local servers or at the device level) using advanced machine learning techniques. This can be the case for computers, phones or any sort of smart devices. There’s a growing market for technical solutions that provide such decentralized ML training capabilities.

As for embedded applications, deploying ML algorithms to embedded systems is becoming a very straightforward process. More and more popular embedded development platforms are supported by the mainstream ML frameworks, e.g. TensorFlow Lite for Microcontrollers [1]. If you are interested in kick-starting your own embedded ML project, we have written an article [2] about a potential candidate for your tech stack, the ESP32, which is also supported by TensorFlow Lite. On the other hand, you might be interested not only in deploying inference models but also in using the embedded device to perform the model training. In this case we recommend a more powerful platform such as the Nvidia Jetson Nano.

Federated Learning

Federated learning (FL) is a collaborative approach to machine learning in which the model is trained across numerous decentralized edge devices while the local data samples are not transferred to any other party. Within the FL framework, users can improve the resulting prediction model by collaborating in the training process without having access to the data of the other participants, as this data remains private to each collaborator. The data does not leave the edge where it already exists; instead, the model is sent to each collaborator to be trained locally.
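To make this training loop more concrete, here is a minimal sketch of one common way to combine locally trained models: averaging the clients’ weights in proportion to their dataset sizes (often called federated averaging). It is a toy under stated assumptions, not a production recipe: the one-parameter model y = w * x, the function names local_update and federated_round, the learning rate and the three client datasets are all made up for illustration.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """One client's training: gradient descent on its private (x, y) pairs."""
    for _ in range(epochs):
        # Gradient of the mean squared error of the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    """Server step: send the model out, collect the locally trained weights,
    and average them weighted by each client's number of samples."""
    total = sum(len(d) for d in client_datasets)
    updates = [(local_update(global_w, d), len(d)) for d in client_datasets]
    return sum(w * n for w, n in updates) / total

# Made-up private datasets, all roughly following y = 3x.
# In a real deployment each list would live on a different device.
clients = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(1.5, 4.4), (3.0, 9.2), (2.5, 7.4)],
    [(0.5, 1.6), (4.0, 12.1)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(round(w, 2))  # lands close to 3
```

Note that only the weight w travels between the server and the clients; the (x, y) pairs are only ever read inside local_update.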

However, an in-depth exploration of the FL approach will be the subject of a future article (you might want to explore it on your own in the meantime). Here we focus on the privacy-preserving aspect accompanying this technique and on its real-world applications, in particular in the medical domain.

Electronic Health Records (EHR) have become a valuable source of real-world healthcare data that has been used in a variety of critical biomedical studies, including ML-based ones. FL is a possible solution for connecting EHR data from different medical institutions, allowing them to share valuable insight rather than their private data, thus maintaining patient data confidentiality. In these cases, iteratively learning from large and diverse medical data sets can substantially improve the performance of the ML model. Patient similarity learning, patient representation learning, phenotyping, and predictive modelling are some of the activities that have been studied in FL scenarios in healthcare.

One example of such an application is a privacy-preserving platform for patient similarity learning across institutions [3]. Without sharing patient-level information, the platform can find similar patients across several hospitals.

FL has also enabled the training of predictive models based on multiple data sources, which ultimately can give doctors more insight into the risks and benefits of treating patients earlier. A regularized sparse SVM classifier set in an FL environment was used to predict future hospitalizations for patients with heart-related disorders [4]. The EHR data that was used was dispersed across several data sources/agents.

Finally, companies such as Owkin are using collaborative learning for different use cases, such as anticipating how resilient patients will be to particular treatments and medications, as well as their chances of surviving certain diseases. For the prediction of preterm birth from distributed EHR, a federated uncertainty-conscious learning method was presented, in which the contribution to the final model is reduced for the members with high uncertainty levels.
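The idea of reducing the contribution of uncertain members can be sketched with a simple weighted average. This is only our illustration of the general principle and not the method from the work mentioned above: the inverse-uncertainty weighting rule, the function name uncertainty_weighted_average and the numbers are all assumptions.

```python
def uncertainty_weighted_average(updates):
    """Combine (value, uncertainty) pairs into one value.

    Each member is weighted by 1 / uncertainty (an assumed, illustrative
    rule), so members reporting high uncertainty contribute less.
    """
    weights = [1.0 / u for _, u in updates]
    total = sum(weights)
    return sum(v * w for (v, _), w in zip(updates, weights)) / total

# Made-up model parameters reported by three members; the third is very uncertain.
updates = [(1.0, 0.1), (1.2, 0.2), (5.0, 2.0)]

combined = uncertainty_weighted_average(updates)
plain = sum(v for v, _ in updates) / len(updates)
print(round(plain, 2), round(combined, 2))
```

With these numbers the plain average is pulled towards the uncertain outlier, while the weighted result stays close to the two confident members.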

Privacy preserving techniques to be used along FL

Misuse of private information is one of the greatest issues that arose along with the development of Big Data, being partially fuelled by data breaches suffered by multiple institutions. In this context, governments have taken measures to decrease the risks data-collecting institutions create for their clients, such as the United States’ HIPAA and Europe’s GDPR. In order to be compliant with those regulations, institutions and companies are required to implement protection methods against privacy threats.

Although the FL approach comes with the great security advantage that the data never leaves its original host during training, some privacy concerns remain. Much like in reverse engineering, the resulting trained model can be used to infer relevant information about the data it was trained on. To mitigate some of those risks, privacy-preserving techniques complementary to the original FL formula have been developed. The state-of-the-art techniques try to minimize the information loss that anonymization inevitably causes, while preserving analytical value.

Several privacy preserving approaches that suit FL environments:

  • Differential Privacy (DP), which applies random noise at different levels in order to anonymize the data;
  • Trusted Execution Environments (TEEs), which guarantee reliable code execution on remote machines and are implemented by restricting the permissions of all parties;
  • Secure Multi-Party Computation (SMPC), where a subset of the clients collaborate cryptographically, simulating a trusted third party;
  • Homomorphic Encryption, which operates on data without decrypting it, at a great computational cost;
  • Syntactic approaches, which generalize the identifying information within relational datasets.
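As an illustration of how clients can cryptographically stand in for a trusted third party, here is a minimal sketch of additive secret sharing, one of the basic building blocks used in SMPC. It is a simplified, single-process simulation under our own assumptions: the modulus, the function names share and secure_sum and the hospital counts are made up, and a real protocol would also need secure channels between parties.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus for all arithmetic (illustrative choice)

def share(value, n_parties):
    """Split an integer into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_values):
    """Simulate a secure sum: each party splits its value into shares and
    hands one share to every party; only sums of shares are combined."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the j-th share it received from every participant.
    partial_sums = [sum(s[j] for s in all_shares) % MODULUS for j in range(n)]
    # Combining the partial sums reveals the total, but no individual value.
    return sum(partial_sums) % MODULUS

hospital_counts = [120, 75, 310]  # made-up private patient counts
print(secure_sum(hospital_counts))  # 505
```

Each individual share is a uniformly random number, so a party that sees only some of the shares learns nothing about another party’s input; only the final total is revealed.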

Differential Privacy and the syntactic approaches are the least expensive techniques and have also been shown to keep information loss low while aiming for privacy preservation. Both approaches transform and anonymize the data before it is used in training, in order to provide anonymity to each individual data point. Applying the DP principle to FL relies on the fact that zero-mean random noise added to the data points largely averages out during training, so model performance is largely preserved. Analogously, the syntactic approach based on k-anonymity, which can be used for relational data, generalizes the data features such that each data point is indistinguishable from at least k − 1 other entries in the dataset (for example, a record containing a specific Age value, e.g. 15, can be mapped to a general Age group, e.g. [7–18]).
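The noise argument can be checked with a small experiment. The sketch below uses the Laplace mechanism, a standard way to add DP noise, applied to each data point before aggregation. Everything specific here is an assumption made for illustration: the function names laplace_noise and privatize, the sensitivity and epsilon values, and the synthetic data in [0, 1].

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize(value, sensitivity, epsilon, rng):
    """Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_values = [rng.uniform(0, 1) for _ in range(10_000)]  # made-up data in [0, 1]
noisy_values = [
    privatize(v, sensitivity=1.0, epsilon=1.0, rng=rng) for v in true_values
]

true_mean = sum(true_values) / len(true_values)
noisy_mean = sum(noisy_values) / len(noisy_values)

# Largest error on a single data point versus the error on the aggregate.
print(max(abs(n - t) for n, t in zip(noisy_values, true_values)))
print(abs(noisy_mean - true_mean))
```

Individual points can end up far from their original values, while the mean over many points moves very little. A smaller epsilon gives stronger privacy but more noise, so in practice there is a trade-off rather than a free lunch.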

Data anonymization is among the strategies that businesses and medical institutions may employ to comply with stringent data privacy rules that demand the protection of personally identifiable information (PII) such as medical records, contact details, and financial information. Anonymized data has been altered in such a way that the sensitive information cannot be restored. Since the original data is changed, this naturally comes at a price in terms of data retrieval and mining efficacy.
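Here is a minimal sketch of the generalization step behind k-anonymity, together with a check of whether a table satisfies it. The age bins, the function names generalize_age and is_k_anonymous, and the four records are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

def generalize_age(age, bins=((0, 6), (7, 18), (19, 40), (41, 65), (66, 120))):
    """Map an exact age to a coarse range; the bins are illustrative only."""
    for low, high in bins:
        if low <= age <= high:
            return f"[{low}-{high}]"
    raise ValueError(f"age out of range: {age}")

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Made-up records: age and zip are quasi-identifiers, diagnosis is sensitive.
records = [
    {"age": 15, "zip": "400", "diagnosis": "asthma"},
    {"age": 12, "zip": "400", "diagnosis": "flu"},
    {"age": 35, "zip": "400", "diagnosis": "diabetes"},
    {"age": 29, "zip": "400", "diagnosis": "flu"},
]
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

print(is_k_anonymous(records, ["age", "zip"], k=2))      # False
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True
```

The price of anonymity is visible here as well: after generalization, a model can no longer tell a 12-year-old from a 15-year-old.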

Getting started with Privacy-Preserving FL

We have presented the concepts; now we should also tell you how to put them into practice. Our recommendation comes in the form of a Python package: PySyft [5]. PySyft is an open-source package specifically designed for FL and privacy protection. Its main purpose is to provide users with a simple interface for implementing secure, private, collaborative deep learning, and it decouples private data from model training using Federated Learning. It was created as an extension of several deep learning packages, including PyTorch, Keras, and TensorFlow. You can find out more about it in [5]. Till next time, stay safe! And private ;)

[1] TensorFlow Lite (2022). Available at: https://www.tensorflow.org/lite/guide (Accessed: 11 February 2022).

[2] Kickstart Your Embedded Projects With ESP32 and PlatformIO (2022). Available at: https://blog.kick-start.ro/kickstart-your-embedded-projects-with-esp32-and-platformio-643925ffdd49 (Accessed: 11 February 2022).

[3] Lee, J. et al. (2018) “Privacy-Preserving Patient Similarity Learning in a Federated Environment: Development and Analysis”, JMIR Medical Informatics, 6(2), p. e20. doi: 10.2196/medinform.7744.

[4] Brisimi, T. et al. (2018) “Federated learning of predictive models from federated Electronic Health Records”, International Journal of Medical Informatics, 112, pp. 59–67. doi: 10.1016/j.ijmedinf.2018.01.007.

[5] PySyft — OpenMined Blog (2022). Available at: https://blog.openmined.org/tag/pysyft/ (Accessed: 11 February 2022).
