
Research Projects

This page highlights the main research projects coordinated by Daniel de Oliveira.

PROV4DistML: Gerência de Dados de Proveniência para Análise de Aplicações de Aprendizado de Máquina Distribuído 🇧🇷

PROV4DistML: Provenance Data Management for the Analysis of Distributed Machine Learning Applications 🇺🇸

📅 Period: 2025 - Current
✅ Status: Active
🔍 Nature: Research
💲 Funding: R$ 75,600.00

Description: Distributed Machine Learning (or simply Distributed Learning) and its branches, such as Federated Learning, represent approaches that enable collaboration among multiple devices or users in training Machine Learning models, including deep neural networks and clustering algorithms. These techniques have been widely applied in areas such as Astronomy, Medicine, and Biology due to their ability to accelerate the training of complex models by distributing the processing across multiple machines. In particular, Federated Learning allows training to occur without the need to centralize data, thereby preserving privacy and security, making it ideal for scenarios involving sensitive data, such as personal information. In a typical configuration, each worker (e.g., smartphones, computers, or clusters) trains a model locally and sends model updates to a central server. This server combines the contributions into a global model, which is redistributed for new training rounds. Distributed training is an iterative process that can be time-intensive, with its duration influenced by factors such as aggregation methods, hyperparameter choices, and the characteristics of the datasets used. The analysis of metrics such as accuracy and fine-tuning of configurations during distributed training enables important improvements, such as increased efficiency, better model interpretability, and higher fault tolerance. In this context, provenance data emerges as a promising solution to represent and track the lifecycle of artifacts generated during the distributed learning process, such as datasets, applied transformations, trained models, and adopted configurations. These data not only enable the monitoring and analysis of training but also allow interventions, such as dynamic parameter adjustments and configuration adaptations. 
The PROV4DistML project aims to develop solutions to capture, model, store, and manage provenance data in Distributed and Federated Learning applications. Additionally, the project seeks to leverage these data to support actions such as automatic parameter tuning, improving result interpretability, and increasing the robustness of training. In this way, PROV4DistML contributes to the development of more efficient, transparent, and resilient solutions in Distributed Learning.
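The aggregation step described above (workers train locally, a server combines the contributions into a global model) can be sketched as a weighted average in the style of FedAvg. This is an illustrative sketch only; the function and data shapes are assumptions, not PROV4DistML code.

```python
# Illustrative sketch of one aggregation round in Distributed/Federated
# Learning: each worker trains locally and the server combines the
# updates into a global model, weighted by local dataset size.
# All names here are hypothetical, not part of the project itself.

def fed_avg(updates):
    """Weighted average of local model parameters (FedAvg-style).

    `updates` is a list of (weights, n_examples) pairs, where `weights`
    is the flat parameter vector a worker produced from its local data.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    global_model = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            global_model[i] += w * n / total
    return global_model

# Two workers with unequal amounts of local data:
round_updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
print(fed_avg(round_updates))  # [2.5, 3.5]
```

The global model would then be redistributed to the workers for the next training round, repeating until convergence.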

Team:
Daniel Cardoso Moraes de Oliveira, Marcos Lage, Aline Marins Paes Carvalho, Marta Mattoso, Debora Pina, Liliane Neves, Camila Lopes, Lyncoln S. de Oliveira
Funding Agency:
CNPq

Execução e Análise de Workflows de Aprendizado de Máquina Federado para Astronomia em Ambientes de Computação de Alto Desempenho 🇧🇷

Execution and Analysis of Federated Machine Learning Workflows for Astronomy in High-Performance Computing Environments 🇺🇸

📅 Period: 2024 - Current
✅ Status: Active
🔍 Nature: Research
💲 Funding: R$ 346,495.70

Description: In recent decades, Machine Learning has received growing attention from both academia and industry. However, traditional Machine Learning approaches face critical challenges, especially regarding data privacy and the volume and transfer of information to be processed. To address these issues, Federated Learning has emerged as a promising solution. This approach allows different nodes in a network to train models collaboratively without the need for direct sharing of examples, thereby preserving privacy and minimizing the amount of data transmitted between parties. However, the training process in Federated Learning brings specific challenges, such as the heterogeneity of participating nodes, which vary in terms of computational capacity and data characteristics. The effective distribution of training tasks among these nodes is crucial to ensure the process is both efficient and fair. Another critical point involves the monitoring and analysis of workflows in Federated Learning. Although tasks generate and consume large amounts of data, the lack of clear provenance documentation makes it difficult to analyze the results obtained. Provenance data capture provides an effective solution, increasing reliability, reproducibility, and offering mechanisms for interpreting the resulting models. Nevertheless, in the specific context of Federated Learning, the use of provenance data is still at an early stage, with current solutions showing limitations in capturing relevant metadata and in leveraging it for training monitoring and optimization. This project aims to address these gaps by proposing the development of algorithms and techniques that enable the extraction, modeling, and management of provenance data in the Federated Learning environment. The goal is not only to improve the allocation and scheduling of training tasks, but also to provide robust analytical support to users by integrating provenance data into existing frameworks. 
As a case study, we will use an application for identifying outlier objects in large astronomical catalogs. This application is part of the Legacy Survey of Space and Time (LSST) project, which generates 20 TB of raw data per night and has a naturally distributed processing infrastructure. The project thus seeks to contribute to making Federated Learning more efficient and explainable, with a special focus on Astronomy applications, where scalability and proper data handling are essential for scientific advances.
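The provenance capture the project proposes, recording which configuration each federated round used and which metrics resulted so workflows can later be analyzed, might look like the following minimal sketch. The record fields and helper names are illustrative assumptions, not the project's actual schema.

```python
# Minimal sketch of per-round provenance capture for a federated
# workflow: each record ties a round to its aggregation method,
# hyperparameters, participating clients, and resulting metric.
# Fields and names are hypothetical.

from dataclasses import dataclass

@dataclass
class RoundProvenance:
    round_id: int
    aggregation: str   # e.g. "fedavg"
    hyperparams: dict  # learning rate, local epochs, ...
    client_ids: list   # which nodes participated in this round
    accuracy: float    # global-model metric after aggregation

log = [
    RoundProvenance(1, "fedavg", {"lr": 0.1}, ["c1", "c2"], 0.71),
    RoundProvenance(2, "fedavg", {"lr": 0.05}, ["c1", "c3"], 0.78),
]

def best_round(log):
    """Retrospective analysis: which round produced the best global model?"""
    return max(log, key=lambda r: r.accuracy).round_id

print(best_round(log))  # 2
```

Queries over such records are what make the results analyzable and reproducible, addressing the lack of provenance documentation noted above.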

Team:
Daniel Cardoso Moraes de Oliveira, Isabel Rosseti, Marcos Lage, Aline Marins Paes Carvalho, Renan Santos Souza, Yuri Frota, Luiz Alberto Nicolaci da Costa, Fábio Markus Nunes Miranda, Rafael Ferreira da Silva, Julia Gschwend
Funding Agency:
CNPq

FedProv - Gerência de Dados de Proveniência em Aplicações de Aprendizado de Máquina Federado 🇧🇷

FedProv - Provenance Data Management in Federated Machine Learning Applications 🇺🇸

📅 Period: 2024 - Current
✅ Status: Active
🔍 Nature: Research
💲 Funding: R$ 108,000.00

Description: Federated Machine Learning (or simply Federated Learning) is a distributed technique that enables collaboration among multiple users in training Machine Learning models (e.g., Deep Neural Networks). Federated Learning has been applied in various domains such as Medicine, Biology, and Pharmacy, as it eliminates the need to access the entire dataset for model training, since part of the data may be private or sensitive. In a Federated Learning application, each client node (e.g., a mobile phone, computer, or cluster) trains a model locally and then sends the model updates to a server node, where they are aggregated into a global model. This global model is redistributed for a new training round across client nodes. Training such models can require several iterations, making the process time-consuming, since each iteration depends on configuration choices such as the global aggregation method, hyperparameters, and the datasets used. Analyzing aggregation methods, hyperparameters, and metrics (e.g., accuracy) during distributed training allows for better understanding of the trained model and opens opportunities for improvements such as automated hyperparameter tuning and fault tolerance. Provenance data emerges as an interesting alternative to represent the derivation path of data during training, enabling analysis, monitoring, and necessary interventions. The FedProv project aims to develop algorithms and techniques for capturing, modeling, storing, and managing provenance data of artifacts involved in the lifecycle of a Federated Learning application. These artifacts include datasets, data transformations, and users associated with preprocessing, training, testing, and validation steps, in addition to the trained models themselves. 
Furthermore, FedProv intends to support additional actions through the captured provenance data, such as configuration adaptations, parameter tuning, and fault tolerance (since a distributed application is more susceptible to failures). By integrating provenance metadata and data into a database, we expect queries to this database to assist users during distributed training.
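The idea of integrating provenance data into a database and querying it during distributed training can be sketched as follows. Table and column names are assumptions for illustration; SQLite stands in for whatever DBMS is adopted.

```python
# Sketch of querying a provenance database during federated training,
# as FedProv intends: per-client metrics are stored each round, and a
# monitoring query aggregates them to support user interventions.
# Schema and names are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE round_metrics (
    round INTEGER, client TEXT, accuracy REAL)""")
conn.executemany(
    "INSERT INTO round_metrics VALUES (?, ?, ?)",
    [(1, "c1", 0.70), (1, "c2", 0.66), (2, "c1", 0.75), (2, "c2", 0.74)],
)

# Monitoring query: average accuracy per round, to detect stagnation
# or divergence and trigger actions such as parameter tuning.
rows = conn.execute(
    "SELECT round, AVG(accuracy) FROM round_metrics "
    "GROUP BY round ORDER BY round"
).fetchall()
for round_id, avg_acc in rows:
    print(round_id, avg_acc)
```

A similar query restricted to one client could flag a failing or straggling node, supporting the fault-tolerance goal mentioned above.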

Team:
Daniel Cardoso Moraes de Oliveira, Marcos Lage, Aline Marins Paes Carvalho, Marta Mattoso, Debora Pina, Liliane Neves, Camila Lopes, Lyncoln S. de Oliveira
Funding Agency:
FAPERJ