Data Science Projects

This is a collection of projects in several areas of data science that are of high current practical interest. The solutions to be developed in this set of projects apply machine learning to meaningful problems: automated assistance for improving efficiency in (large) software project management; responsible data science, enabling distributed machine learning over sensitive data; accelerating scientific discovery; and investigating where LLMs can help in knowledge graph analytics.

Pre-assembled data is available for all projects, either curated as part of prior R&D projects or linked from community datasets suited to the project. Either way, real data!

Almost any activity is more enjoyable and efficient if you understand it. To that end, I highly recommend keeping a textbook alongside, so that you understand the machine learning algorithms and utilities you fire away at the data.

Last but not least, aim to develop a system that at least one other person can successfully run.

Project 1: Automated Software Risk Assessment with Machine Learning

This project is about building a working system that assists with real-time, granular software QA. More specifically, the system assesses each individual code commit, as it is committed into an evolving software project (repository), for the risk of a fault down the road. The assessment is based on characteristics of (i) the individual code commit itself, and (ii) the developers associated with the code and with that particular commit. The fault risk assessment is produced using machine learning.

The overall objective is to employ machine learning and data to identify risky (i.e., fault-prone) pieces of code in a software project more efficiently, in a dynamic "as-the-software-project-evolves" manner.
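
To make the idea concrete, here is a minimal sketch of commit-level fault risk scoring with scikit-learn. It is only an illustration under stated assumptions: the three features (lines_changed, files_touched, author_prior_commits), the synthetic labelling rule, and the helper fault_risk are all hypothetical and are not taken from the project data, which will have its own feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for historical commits (assumption: the real project
# data provides comparable commit-level and developer-level features).
rng = np.random.default_rng(0)
n = 500
lines_changed = rng.integers(1, 500, size=n)           # characteristic of the commit
files_touched = rng.integers(1, 20, size=n)            # characteristic of the commit
author_prior_commits = rng.integers(0, 1000, size=n)   # characteristic of the developer

# Made-up labelling rule: large commits by less experienced authors
# are more likely to be marked as fault-inducing.
logit = 0.01 * lines_changed + 0.1 * files_touched - 0.005 * author_prior_commits - 1.0
is_faulty = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X = np.column_stack([lines_changed, files_touched, author_prior_commits])
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, is_faulty)


def fault_risk(lines: int, files: int, prior_commits: int) -> float:
    """Return the estimated probability that a new commit is fault-inducing."""
    features = np.array([[lines, files, prior_commits]])
    return float(model.predict_proba(features)[0, 1])


small_commit = fault_risk(5, 1, 900)
large_commit = fault_risk(450, 18, 10)
print(f"small commit risk: {small_commit:.2f}, large commit risk: {large_commit:.2f}")
```

In a real-time setting, a function like fault_risk would be called on each incoming commit, and the model would be retrained periodically as new fault labels become available.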

Concepts to apply or investigate:
Skill stack: scikit-learn, AutoML, LLM of choice.

Project material:

Project 2: Privacy-preserving Federated Machine Learning

The objective is to realize a working privacy-preserving federated machine learning environment over a collection of related but distributed databases in the clinical domain.
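
As a rough illustration of the federated idea, the sketch below implements plain federated averaging for a one-parameter linear model in pure Python. Everything in it is an assumption made for illustration: the toy (x, y) records, the function names local_update and federated_average, and the learning rate. Note also that averaging model parameters only keeps raw records at their site; on its own it gives no formal privacy guarantee, so a real solution would likely add techniques such as secure aggregation or differential privacy.

```python
from typing import List, Tuple

# Each site holds (x, y) records that never leave that site (toy data).
Site = List[Tuple[float, float]]


def local_update(w: float, data: Site, lr: float = 0.01, steps: int = 5) -> float:
    """Run a few gradient-descent steps for y ~ w * x on one site's data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(sites: List[Site], rounds: int = 50) -> float:
    """Coordinate training: only model weights travel, never the records."""
    w = 0.0
    total = sum(len(site) for site in sites)
    for _ in range(rounds):
        # Every site starts from the current global weight and trains locally.
        local_weights = [local_update(w, site) for site in sites]
        # The server averages the results, weighted by site size.
        w = sum(lw * len(site) for lw, site in zip(local_weights, sites)) / total
    return w


sites = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2), (5.0, 9.8)],
    [(0.5, 1.1)],
]
w_global = federated_average(sites)
print(f"global weight after federated averaging: {w_global:.3f}")
```

The toy data roughly follows y = 2x, so the global weight should settle near 2 even though no site ever shares its records.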

Concepts to apply or investigate:
Skill stack:
Project material:

Project 3: Machine Reading of Scientific Research Literature

The objective is to develop a machine reading solution: essentially, software that can automatically curate a structured database of materials science experimental data from the text of multiple research papers in materials science. Researchers require such aggregated, structured databases for data-driven scientific research; however, most of the curation is currently done manually.
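
The simplest possible version of such text-to-record extraction can be sketched with a regular expression, as below. This is a deliberately naive baseline and not the intended solution: the sentence pattern, the Measurement record, and the example sentences are all hypothetical, and real papers would need far more robust methods (for instance, trained extraction models or an LLM).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Measurement:
    """One structured row extracted from free text (hypothetical schema)."""
    material: str
    quantity: str
    value: float
    unit: str


# Matches sentences shaped like "<Material> has a <quantity> of <value> <unit>".
PATTERN = re.compile(
    r"(?P<material>[A-Z][A-Za-z0-9]*)\s+(?:has|shows|exhibits)\s+an?\s+"
    r"(?P<quantity>[a-z ]+?)\s+of\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-z/%]+)"
)


def extract(text: str) -> List[Measurement]:
    """Return every measurement the pattern can find in the text."""
    return [
        Measurement(m["material"], m["quantity"], float(m["value"]), m["unit"])
        for m in PATTERN.finditer(text)
    ]


# Illustrative sentences, written for this example rather than quoted from a paper.
text = (
    "TiO2 exhibits a band gap of 3.2 eV. "
    "In contrast, GaN has a band gap of 3.4 eV."
)
records = extract(text)
for record in records:
    print(record)
```

A pattern like this breaks as soon as the phrasing varies, which is exactly why the project calls for a machine reading approach rather than hand-written rules.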

Concepts to apply or investigate:

Project material:

Project 4: Knowledge Graph Predictive Analytics: Event Prediction

The goal is to develop a solution for predicting events. The events, as well as contextual knowledge, are modeled as part of a Knowledge Graph.
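
One way to picture this setup is a graph of (subject, relation, object) triples in which events are nodes. The sketch below is a simple frequency baseline, assuming a hypothetical schema with "type" and "followedBy" relations and made-up event names; the project's actual Knowledge Graph and prediction method may look quite different.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Toy knowledge graph: events, their types, and which event followed which.
triples: List[Triple] = [
    ("e1", "type", "Storm"), ("e2", "type", "PowerOutage"), ("e1", "followedBy", "e2"),
    ("e3", "type", "Storm"), ("e4", "type", "PowerOutage"), ("e3", "followedBy", "e4"),
    ("e5", "type", "Storm"), ("e6", "type", "Flood"), ("e5", "followedBy", "e6"),
]


def transition_counts(graph: List[Triple]) -> Dict[str, Counter]:
    """Count how often an event of one type was followed by another type."""
    event_type = {s: o for s, r, o in graph if r == "type"}
    counts: Dict[str, Counter] = defaultdict(Counter)
    for s, r, o in graph:
        if r == "followedBy":
            counts[event_type[s]][event_type[o]] += 1
    return counts


def predict_next_type(graph: List[Triple], current_type: str) -> Optional[str]:
    """Predict the most frequent follow-up event type, or None if unseen."""
    counts = transition_counts(graph).get(current_type)
    return counts.most_common(1)[0][0] if counts else None


print(predict_next_type(triples, "Storm"))
```

A baseline like this ignores the contextual knowledge in the graph; it is mainly useful as a reference point against which a learned model can be compared.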

Concepts to apply or investigate:


Project material: