Semantic-Aware Data Deduplication for Efficient and Reliable Machine Learning Training - PHD
Course options
-
Qualification
PhD/DPhil - Doctor of Philosophy
-
Location
Loughborough University
-
Study mode
Full time
-
Start date
JAN
-
Duration
3 Years
Course summary
The exponential growth of training datasets in machine learning, particularly in natural language processing (NLP), has highlighted the critical challenge of data duplication. Duplicate or near-duplicate content, repetitive substrings, and redundant information in datasets can lead to biased models, inefficient training processes, and inflated evaluation metrics, ultimately undermining the reliability and generalisability of machine learning systems.
While data deduplication is essential for improving dataset quality, current methods are limited in their ability to capture semantic similarities and are often computationally expensive, making them impractical for large-scale applications.
This PhD project aims to address these limitations by developing novel, semantic-aware deduplication techniques that improve dataset quality while maintaining computational efficiency.
Research Objectives:
The proposed research will focus on four key objectives.
1. Develop and evaluate frameworks for semantic-aware deduplication that can identify both exact and near-duplicate content. A critical aspect will be preserving contextually important variations whilst removing truly redundant data. The effectiveness of these approaches will be evaluated against existing methods using standard benchmarks.
2. Examine how different deduplication strategies affect model performance, memory usage and training efficiency. This will involve carefully quantifying the relationships between deduplication levels and various aspects of model output quality. Understanding these relationships is crucial for developing practical solutions that can be deployed at scale.
3. Explore approaches to deduplication, including active learning, that can efficiently process large-scale datasets. A key focus will be minimising both computational resources and manual labelling requirements through intelligent sample selection and automated processing techniques.
4. Conduct case studies on benchmark datasets to validate the proposed methods in real-world scenarios. This will involve applying the developed frameworks to diverse datasets, analysing their performance, and providing insights into their applicability across different domains and use cases.
This research has the potential to make significant contributions to the field of machine learning by addressing fundamental challenges in dataset quality and model training efficiency. The findings could have broad implications for improving the reliability and performance of language models across various applications.
The project will require expertise in machine learning and natural language processing, with opportunities to develop novel theoretical frameworks as well as practical implementations. The successful candidate will join a dynamic research environment with access to substantial computational resources and real-world datasets for evaluation.
This PhD programme offers an exciting opportunity to tackle important challenges at the intersection of machine learning, data quality and computational efficiency.
Tuition fees
£28,600 per year
Tuition fees shown are indicative and may vary. Please check with the institution for the most up-to-date details.
University information
-
University League Table
7th
-
Campus address
Loughborough University, Epinal Way, Loughborough, Leicestershire, LE11 3TU, United Kingdom
Subject rankings
-
Subject ranking
20th out of 117
29th out of 48
-
Entry standards
165 / Max 223 (74%), 26th
-
Graduate prospects
90.0 / Max 100 (90%), 29th
-
Student satisfaction
3.06 / Max 4 (77%), 52nd
-
Entry standards
148 / Max 224 (66%), 31st
-
Graduate prospects
83.0 / Max 100 (83%), 27th
-
Student satisfaction
3.31 / Max 4 (83%), 10th