Data-centric AI is emerging as a discipline of “systematically engineering the data” for AI. The goal is to make it easier for practitioners to understand, program, and iterate on datasets, rather than spending their time only on models. In this project, we focus on the effect of “bad data” on different ML/DL pipelines and on what can be done to improve the data. We will study the effect of training data on ML/DL models in the following steps:
- Train models on data with different levels of dirtiness
- Evaluate how robust “robust ML/DL” models are under different dirtiness scenarios
- Come up with optimized data preparation pipelines
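
As a rough illustration of the first step, the sketch below shows one way to produce copies of a dataset at several dirtiness levels. It assumes that “dirtiness” means label noise and missing values, which is only one possible reading; the function names (`flip_labels`, `blank_cells`, `dirtiness_grid`) and the chosen rates are hypothetical and not part of the project description.

```python
import random


def flip_labels(labels, rate, classes, seed=0):
    """Return a copy of `labels` with round(rate * n) entries set to a different class."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        # Pick any class other than the current one, so every chosen label changes.
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy


def blank_cells(rows, rate, seed=0):
    """Return a copy of `rows` with round(rate * n_cells) cells replaced by None."""
    rng = random.Random(seed)
    dirty = [list(r) for r in rows]
    cells = [(i, j) for i, r in enumerate(dirty) for j in range(len(r))]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        dirty[i][j] = None
    return dirty


def dirtiness_grid(rows, labels, classes, rates=(0.0, 0.1, 0.3), seed=0):
    """Map each dirtiness rate (an assumed grid) to a (features, labels) pair."""
    return {
        rate: (blank_cells(rows, rate, seed), flip_labels(labels, rate, classes, seed))
        for rate in rates
    }
```

Each entry of the returned dictionary can then be fed to the same training pipeline, so that differences in model quality can be attributed to the dirtiness level rather than to other changes.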