Semantic-Aware Data Deduplication for Efficient and Reliable Machine Learning Training - PHD

Loughborough University

Course options

  • Qualification

    PhD/DPhil - Doctor of Philosophy

  • Location

    Loughborough University

  • Study mode

    Full time

  • Start date

    January

  • Duration

    3 Years

Course summary

The exponential growth of training datasets in machine learning, particularly in natural language processing (NLP), has highlighted the critical challenge of data duplication. Duplicate or near-duplicate content, repetitive substrings, and redundant information in datasets can lead to biased models, inefficient training processes, and inflated evaluation metrics, ultimately undermining the reliability and generalisability of machine learning systems.

While data deduplication is essential for improving dataset quality, current methods are limited in their ability to capture semantic similarities and are often computationally expensive, making them impractical for large-scale applications.

This PhD project aims to address these limitations by developing novel, semantic-aware deduplication techniques that improve dataset quality while maintaining computational efficiency.

Research Objectives:

The proposed research will focus on four key objectives.

1. Develop and evaluate frameworks for semantic-aware deduplication that can identify both exact and near-duplicate content. A critical aspect will be preserving contextually important variations whilst removing truly redundant data. The effectiveness of these approaches will be evaluated against existing methods using standard benchmarks.

2. Examine how different deduplication strategies affect model performance, memory usage and training efficiency. This will involve carefully quantifying the relationships between deduplication levels and various aspects of model output quality. Understanding these relationships is crucial for developing practical solutions that can be deployed at scale.

3. Explore approaches to deduplication, including active learning, that can efficiently process large-scale datasets. A key focus will be minimising both computational resources and manual labelling requirements through intelligent sample selection and automated processing techniques.

4. Conduct case studies on benchmark datasets to validate the proposed methods in real-world scenarios. This will involve applying the developed frameworks to diverse datasets, analysing their performance, and providing insights into their applicability across different domains and use cases.
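
To make the distinction between exact and near-duplicate content in objective 1 concrete, here is a minimal sketch of a purely lexical baseline. It is an illustration only, not the project's method: the function names, the word-trigram shingles and the 0.5 Jaccard threshold are all assumptions chosen for the example. A baseline like this compares surface wording, so it would not catch paraphrases, which is the kind of semantic similarity the project aims to address.

```python
import hashlib


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_key(text: str) -> str:
    """Hash of the normalised text; equal keys indicate exact duplicates."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams (shingles); n=3 is an assumed, illustrative choice."""
    words = normalise(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.5):
    """Keep the first occurrence of each document; drop exact and near duplicates.

    The threshold is a hypothetical value, not a recommended setting.
    This pairwise comparison is quadratic in the number of kept documents,
    so it only suits small collections.
    """
    kept = []
    seen_keys = set()
    for doc in docs:
        key = exact_key(doc)
        if key in seen_keys:
            continue  # exact duplicate after normalisation
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, shingles(k)) >= threshold for k in kept):
            continue  # near duplicate of a document already kept
        seen_keys.add(key)
        kept.append(doc)
    return kept


docs = [
    "The cat sat on the mat",
    "the cat  sat on the mat",        # exact duplicate after normalisation
    "The cat sat on the mat today",   # near duplicate (one extra word)
    "Dogs bark loudly at night",      # unrelated
]
unique_docs = deduplicate(docs)
```

In this toy input, the second document is dropped by the hash check and the third by the Jaccard check, leaving the first and fourth documents. The quadratic comparison also shows why efficiency at scale, as raised in objectives 2 and 3, is a concern.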

This research has the potential to make significant contributions to the field of machine learning by addressing fundamental challenges in dataset quality and model training efficiency. The findings could have broad implications for improving the reliability and performance of language models across various applications.

The project will require expertise in machine learning and natural language processing, with opportunities to develop novel theoretical frameworks as well as practical implementations. The successful candidate will join a dynamic research environment with access to substantial computational resources and real-world datasets for evaluation.

This PhD programme offers an exciting opportunity to tackle important challenges at the intersection of machine learning, data quality and computational efficiency.

Tuition fees

Students living in United States
(International fees)

£28,600 per year

Tuition fees shown are for indicative purposes and may vary. Please check with the institution for the most up-to-date details.

University information

Loughborough University

  • University League Table

    7th

  • Campus address

    Loughborough University, Epinal Way, Loughborough, Leicestershire, LE11 3TU, United Kingdom

The university's two campuses are located in Loughborough and London, both close to airports, making it easy to explore Europe and beyond.
Loughborough's research community has over 1,200 research students spanning 90 nationalities, with more than 800 members of staff supporting them.
The university is home to over 18,000 students and staff from more than 130 different countries. Of the students based at the London campus, 80% are international students.

Subject rankings

  • Subject ranking

    20th out of 117

    Entry standards: 165 out of a maximum of 223 (74%), ranked 26th
    Graduate prospects: 90.0 out of a maximum of 100 (90%), ranked 29th
    Student satisfaction: 3.06 out of a maximum of 4 (77%), ranked 52nd

  • Subject ranking

    29th out of 48

    Entry standards: 148 out of a maximum of 224 (66%), ranked 31st
    Graduate prospects: 83.0 out of a maximum of 100 (83%), ranked 27th
    Student satisfaction: 3.31 out of a maximum of 4 (83%), ranked 10th
