This is a collection of projects in various areas of data science that are of high current practical interest. The
solutions to be developed in this set of projects are machine learning applications
on meaningful problems: automated assistance for improving the efficiency of (large) software
project management; responsible data science, in the form of distributed machine learning over sensitive data;
accelerating scientific discovery; and investigating where LLMs can help in knowledge graph analytics.
Pre-assembled data is available for all projects (either curated as part of prior R&D projects, or
linked community datasets suited to the project). Either way, real data!
Just about any activity is more enjoyable and efficient if you understand it. To that end, I highly recommend having a textbook alongside,
so that you understand the machine learning algorithms and utilities you fire away at the data.
This project is about building a working system that assists with real-time, granular software QA.
More specifically, the system assesses each individual code commit, at the time it is committed into
an evolving software project (repository), for the risk of a fault down the road. That assessment is based on characteristics of
(i) the individual code commit itself, and (ii) the developers associated with the code and with that particular commit.
This fault risk assessment is arrived at using machine learning.
The overall objective is to employ machine learning and data to identify risky (i.e., likely to fault) pieces of code in a software project more efficiently, in a dynamic "as-the-software-project-evolves" manner.
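As a rough illustration of the commit-level risk assessment described above, the sketch below trains a small logistic regression on a handful of made-up commits and scores new ones. The feature names (lines_changed, files_touched, author_prior_commits), the toy data, and the choice of logistic regression are all assumptions for illustration; the project data and the model you end up using may well differ.

```python
import math

# Made-up training commits (illustration only, not project data).
# Features: (lines_changed, files_touched, author_prior_commits)
#   - the first two describe the commit, the third describes the developer.
# Label: 1 if the commit was later linked to a fault, 0 otherwise.
TOY_COMMITS = [
    ((500, 12, 3), 1),
    ((350, 8, 10), 1),
    ((420, 15, 1), 1),
    ((20, 1, 200), 0),
    ((45, 2, 150), 0),
    ((10, 1, 300), 0),
]


def scale(features):
    """Log-scale the raw counts so that no single feature dominates."""
    return tuple(math.log1p(value) for value in features)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(commits, epochs=2000, lr=0.05):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0, 0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in commits:
            x = scale(features)
            pred = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = pred - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def fault_risk(model, features):
    """Return the estimated probability that a commit leads to a fault."""
    weights, bias = model
    x = scale(features)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


model = train(TOY_COMMITS)
risky = fault_risk(model, (400, 10, 2))   # large change, inexperienced author
safe = fault_risk(model, (15, 1, 250))    # small change, experienced author
print(f"risky commit: {risky:.2f}, safe commit: {safe:.2f}")
```

In a real-time setting, fault_risk would be called on each incoming commit (for example from a repository hook), and the model would be retrained periodically as new fault labels become available.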
The objective is to realize a working privacy-preserving federated machine learning environment over a collection of related but distributed databases in the clinical domain.
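To make the idea of federated learning concrete, here is a minimal sketch of federated averaging over three hypothetical sites. The site names, the noise-free toy records (generated from y = 2x + 1), and the simple linear model are assumptions for illustration only. Each site trains on its own records and shares only model parameters with the server. Note that sharing parameters alone is not a full privacy guarantee; a real environment would likely add further safeguards.

```python
# Each hypothetical site holds its own (x, y) records; they never leave the site.
SITES = {
    "hospital_a": [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
    "hospital_b": [(4.0, 9.0), (5.0, 11.0)],
    "hospital_c": [(0.0, 1.0), (6.0, 13.0), (7.0, 15.0), (8.0, 17.0)],
}


def local_update(weights, records, lr=0.01, steps=20):
    """Run a few gradient steps of linear regression on one site's data.

    Returns the updated (w, b) and the number of local records, which the
    server uses as the averaging weight.
    """
    w, b = weights
    n = len(records)
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in records) / n
        grad_b = sum((w * x + b - y) for x, y in records) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(updates):
    """Combine site models, weighting each by its number of records."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return (w, b)


def train(sites, rounds=300):
    """Server loop: broadcast the global model, collect and average updates."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, recs) for recs in sites.values()]
        global_weights = federated_average(updates)
    return global_weights


w, b = train(SITES)
print(f"global model: y = {w:.3f} * x + {b:.3f}")
```

Only the (w, b) pairs and record counts cross site boundaries here; the server never sees an individual record.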
Concepts to apply or investigate:
The objective is to develop a machine reading solution: software that automatically curates
a structured database of materials science experimental data from the text of multiple research papers in the field.
Researchers require such aggregated and structured databases for data-driven scientific research; however, most of the curation is currently done manually.
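The step from free text to structured records can be sketched with a deliberately simple rule-based extractor. The sentences, the placeholder material names (compound_A, compound_B), the values, and the regular expression below are all made up for illustration; real papers phrase results in far more varied ways, which is why a learned extraction model would probably be needed in practice.

```python
import re

# Made-up snippets standing in for text from different papers (not real results).
PAPERS = {
    "paper_1": "We synthesised thin films. "
               "The band gap of compound_A was measured as 1.50 eV.",
    "paper_2": "The melting point of compound_B is 900 K. "
               "The band gap of compound_B was found to be 2.25 eV.",
}

# One hand-written pattern: "<property> of <material> is/was ... <value> <unit>".
PATTERN = re.compile(
    r"(?P<property>band gap|melting point) of (?P<material>[A-Za-z0-9_-]+)"
    r" (?:was measured as|was found to be|is|was)"
    r" (?P<value>\d+(?:\.\d+)?)\s*(?P<unit>eV|K)",
    re.IGNORECASE,
)


def extract_records(papers):
    """Turn every pattern match into one structured database row."""
    records = []
    for source, text in papers.items():
        for match in PATTERN.finditer(text):
            records.append({
                "source": source,
                "material": match.group("material"),
                "property": match.group("property").lower(),
                "value": float(match.group("value")),
                "unit": match.group("unit"),
            })
    return records


records = extract_records(PAPERS)
for row in records:
    print(row)
```

Keeping the source paper in each row makes it possible to trace an extracted value back to the text it came from, which matters when the database is later checked by hand.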
Concepts to apply or investigate:
The goal is to develop a solution for predicting events. The events, as well as contextual knowledge, are modeled as part of a Knowledge Graph.
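One simple way to picture event prediction over a knowledge graph is to store events as timestamped triples and rank candidate objects for a query by how often, and how recently, they occurred. The sketch below is only a frequency-and-recency baseline under assumed names; the entities, relations, time steps, and the decay parameter are hypothetical, and it is not the method the project prescribes.

```python
from collections import Counter

# Hypothetical timestamped events: (subject, relation, object, time_step).
# Static contextual knowledge could be stored in the same triple form.
EVENTS = [
    ("team_a", "releases", "product_x", 1),
    ("team_a", "releases", "product_y", 2),
    ("team_a", "releases", "product_y", 3),
    ("team_b", "acquires", "team_a", 3),
]


def predict_object(events, subject, relation, decay=0.5):
    """Rank candidate objects for the query (subject, relation, ?).

    Each past matching event contributes decay ** age, so recent events
    count more than old ones. Returns (object, score) pairs, best first.
    """
    latest = max(t for *_, t in events)
    scores = Counter()
    for s, r, o, t in events:
        if s == subject and r == relation:
            scores[o] += decay ** (latest - t)
    return scores.most_common()


ranking = predict_object(EVENTS, "team_a", "releases")
print(ranking)
```

A baseline like this is useful mainly as a reference point: any learned model, or any LLM-assisted approach, should be expected to beat it on held-out events.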
Concepts to apply or investigate: