Data Science Projects

This is a collection of projects in several areas of data science that are of high current practical interest. The solutions to be developed in this set of projects cover machine learning applications to meaningful problems: automated assistance for improving efficiency in (large) software project management; responsible data science, enabling distributed machine learning over sensitive data; accelerating scientific discovery; and investigating where LLMs can help in knowledge graph analytics.

There is pre-assembled data for all projects (either curated as part of prior R&D projects, or links to community datasets that suit the project). Either way, it is real data!

Just about any activity is more enjoyable and efficient if you understand it. To that end, I highly recommend keeping a textbook alongside, so that you understand the machine learning algorithms and utilities you fire away at the data.

For this I find Deep Learning: Foundations and Concepts (Bishop & Bishop, free online version) very readable (there are others, of course).

Last but not least, aim to develop a system that at least one other person can successfully run.

Project 1: Automated Software Risk Assessment with Machine Learning

This project is about building a working system that assists with real-time, granular software QA. More specifically, the system assesses each individual code commit, as it is committed into an evolving software project (repository), for the risk of a fault down the road. The assessment is based on characteristics of (i) the code commit itself, and (ii) the developers associated with the code and with that particular commit. The fault risk assessment is produced using machine learning.

The overall objective is to use machine learning and data to improve the efficiency of identifying risky (i.e., likely to be faulty) pieces of code in a software project, in a dynamic "as-the-software-project-evolves" manner.

Concepts to apply or investigate:

Look at it as a sparse classification OR an anomaly detection problem.
Try out the gamut of machine learning classifiers, of varying complexity.
Investigate where LLMs can help further (a commit message is natural language text!).

Skill stack: scikit-learn, AutoML, LLM of choice.
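
To make the "classification OR anomaly detection" choice concrete, here is a minimal scikit-learn sketch that scores the same held-out data both ways. It runs on synthetic data, not on the project dataset: the feature count, the roughly 3% positive rate, and the two model choices are assumptions for illustration only. Average precision is used because plain accuracy says little when faulty commits are rare.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for commit features (NOT the project data):
# about 3% of samples belong to the rare "risky" class.
X, y = make_classification(
    n_samples=4000, n_features=12, n_informative=6,
    weights=[0.97, 0.03], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# View 1: supervised classification, re-weighting the rare class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
clf_scores = clf.predict_proba(X_test)[:, 1]

# View 2: unsupervised anomaly detection (labels are not used for fitting).
iso = IsolationForest(random_state=0).fit(X_train)
iso_scores = -iso.score_samples(X_test)  # negated so higher = more anomalous

# Compare both rankings against the rare-class base rate.
base_rate = y_test.mean()
ap_clf = average_precision_score(y_test, clf_scores)
ap_iso = average_precision_score(y_test, iso_scores)
print(f"base rate={base_rate:.3f}  AP classifier={ap_clf:.3f}  AP anomaly={ap_iso:.3f}")
```

On the real data you would swap the synthetic X and y for features derived from the two database tables, and then widen the comparison to other classifiers.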

Project material:

The (SERA 2022) Paper that comprehensively describes this project (accompanying Slides)
Here is the curated dataset for this project.

There are two .sql database dumps, comprising:

SQL dump of the "commit_unified_meta" database table, with data on over 1.5M software code commits and details of the associated code and code commit developers.
SQL dump of the "user_meta" database table, with details on over 18K developers.
All of the above has been curated from over 60 open software project repositories and open developer profiles on GitHub.

Project 2: Privacy-preserving Federated Machine Learning

The objective is to realize a working privacy-preserving federated machine learning environment over a collection of related but distributed databases, in the clinical domain.

Concepts to apply or investigate:

Achieving a federated machine learning environment, in the face of data privacy requirements.
Handling sensitive data in a data federation, using differential privacy.
If of further interest, exploring data encryption - via homomorphic encryption techniques.

Skill stack:

TensorFlow Federated (platform),
PySyft (includes utilities for differential-privacy), and
PySEAL (if you venture into homomorphic-encryption).
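
Before reaching for the platforms above, it can help to see the core loop in plain NumPy. The sketch below is a toy version of federated averaging with clipped, noised client updates, which is the general shape of differentially private federated training. Everything in it is an assumption for illustration: the three synthetic "sites", the linear model, and the clipping and noise parameters. It does not compute a privacy budget, so it makes no formal privacy guarantee; the libraries listed above provide proper implementations and accounting.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights

def make_client(n):
    """Synthetic local dataset for one site; raw data never leaves this tuple."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

clients = [make_client(n) for n in (80, 120, 200)]

def local_update(w, X, y, lr=0.05, steps=10):
    """A few gradient steps on the local least-squares loss."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def clip(delta, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, max_norm / (norm + 1e-12))

def federated_round(w_global, clients, max_norm=1.0, noise_std=0.01):
    """One round: clients send clipped updates, the server averages and adds noise."""
    deltas = [clip(local_update(w_global, X, y) - w_global, max_norm)
              for X, y in clients]
    mean_delta = np.mean(deltas, axis=0)  # equal weight per client
    noise = rng.normal(scale=noise_std * max_norm / len(clients),
                       size=w_global.shape)
    return w_global + mean_delta + noise

w = np.zeros(3)
for _ in range(30):
    w = federated_round(w, clients)
print(np.round(w, 2))
```

Raising noise_std or lowering max_norm strengthens the protection in spirit but degrades the learned weights, which is the utility-privacy trade-off you are asked to evaluate.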

Project material:

Data: Use the eICU Collaborative Research Database
- Here is a recommendation for fragmenting (part of) eICU into 9 separate databases
  (that form the federation); also includes a plan for evaluating the utility-privacy trade-off for your solution
- Some relevant papers on FML

Project 3: Machine Reading of Scientific Research Literature

The objective is to develop a machine reading solution: software that can automatically curate a structured database of materials science experimental data from the text of multiple materials science research papers. Researchers require such aggregated and structured databases for data-driven scientific research; however, most of the curation is currently done manually.

Concepts to apply or investigate:

Mostly NLP tasks in information extraction: entity extraction and relation extraction, with one/few-shot methods.
Integration of scientific ontologies/knowledge graphs for data extraction.
LLM application for information-extraction.
- Including investigating model distillation for achieving an efficient, cost-effective model.
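
As a rough picture of what few-shot LLM extraction into a structured database can look like, here is a small sketch of the two ends of the pipeline: building a few-shot prompt, and parsing the model's answer into typed records. The record schema, the example sentence, and the prompt wording are all assumptions made up for illustration, not taken from the dataset or the paper. The LLM call itself is deliberately left out and replaced by a hypothetical stubbed response.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    """One extracted row of the target database (hypothetical schema)."""
    material: str
    property: str
    value: float
    unit: str

# Illustrative few-shot example; real ones would come from annotated papers.
FEW_SHOT = [
    ("The band gap of TiO2 was measured as 3.2 eV.",
     [{"material": "TiO2", "property": "band gap", "value": 3.2, "unit": "eV"}]),
]

def build_prompt(sentence):
    """Assemble instruction + worked examples + the new sentence."""
    parts = ["Extract every (material, property, value, unit) record as a JSON list."]
    for text, records in FEW_SHOT:
        parts.append(f"Text: {text}\nJSON: {json.dumps(records)}")
    parts.append(f"Text: {sentence}\nJSON:")
    return "\n\n".join(parts)

def parse_records(raw):
    """Parse model output, skipping malformed entries instead of failing."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    records = []
    for item in items if isinstance(items, list) else []:
        try:
            records.append(Measurement(
                str(item["material"]), str(item["property"]),
                float(item["value"]), str(item["unit"]),
            ))
        except (KeyError, TypeError, ValueError):
            continue  # incomplete or mistyped record
    return records

prompt = build_prompt("ZnO films showed a band gap of 3.37 eV.")
# Hypothetical model response standing in for a real LLM call;
# the second entry is incomplete on purpose.
stub_response = ('[{"material": "ZnO", "property": "band gap", "value": 3.37, "unit": "eV"},'
                 ' {"material": "ZnO"}]')
records = parse_records(stub_response)
print(records)
```

Validating the output this way matters because model responses are not guaranteed to be well formed; the same typed records can later serve as training targets if you try distillation into a smaller model.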

Project material:

Dataset of 300 materials science research papers, with various sections delineated (JSON)
Relevant paper (materials science experimental data extraction with LLMs)

Project 4: Knowledge Graph Predictive Analytics: Event Prediction

The goal is to develop a solution for predicting events. The events, as well as contextual knowledge, are modeled as part of a Knowledge Graph.

Concepts to apply or investigate:

Knowledge Graph Analytics.
Investigating where and how effectively LLMs can help with reasoning over knowledge graphs, for predictive applications.
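
To fix ideas, events in a temporal knowledge graph are commonly written as (subject, relation, object, time) tuples, and prediction becomes "given the history, which object completes (subject, relation, ?) next". The sketch below shows that framing with a simple frequency baseline and a plain-text rendering of the history that could be handed to an LLM as context. The entities, relation labels, and prompt format are made-up assumptions; they are not the POLECAT schema and not the format used in the paper.

```python
from collections import Counter
from typing import NamedTuple

class Event(NamedTuple):
    subject: str
    relation: str
    obj: str
    day: int

# Toy history with hypothetical actors and relation labels.
events = [
    Event("Country_A", "MEET", "Country_B", 1),
    Event("Country_A", "MEET", "Country_C", 2),
    Event("Country_A", "MEET", "Country_B", 3),
    Event("Country_A", "SANCTION", "Country_C", 4),
    Event("Country_B", "MEET", "Country_A", 5),
    Event("Country_A", "MEET", "Country_B", 6),
]

def predict_object(history, subject, relation, before_day):
    """Rank candidate objects by how often they completed the query in the past."""
    counts = Counter(
        e.obj for e in history
        if e.subject == subject and e.relation == relation and e.day < before_day
    )
    return [obj for obj, _ in counts.most_common()]

def history_as_prompt(history, subject, relation, query_day):
    """Render matching past events as text, ending with the open query."""
    lines = [
        f"{e.day}: [{e.subject}, {e.relation}, {e.obj}]"
        for e in history
        if e.subject == subject and e.day < query_day
    ]
    lines.append(f"{query_day}: [{subject}, {relation},")
    return "\n".join(lines)

ranked = predict_object(events, "Country_A", "MEET", before_day=6)
prompt = history_as_prompt(events, "Country_A", "MEET", query_day=6)
print(ranked)
print(prompt)
```

A baseline like this is worth keeping around: any LLM-based or graph-based predictor you build should be compared against it on held-out time steps.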

Project material:

I recommend using the POLECAT dataset: an open dataset of millions of time-stamped, geolocated political events.
Selected relevant literature:

Paper on temporal knowledge graph forecasting using in-context learning.