Data augmentation
Data augmentation is a technique for expanding an existing dataset by creating modified versions of the original data. Training on these variations helps machine learning models generalize better and become more robust, especially when collecting large amounts of real-world data is difficult, time-consuming, or limited by privacy concerns.
For image data, augmentation often involves simple transformations such as rotating, flipping, cropping, or changing colors. More advanced approaches include adding random noise, combining sections of different images, or copying objects from one image into another to create new contexts.
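The simple image transformations mentioned above can be sketched in a few lines. The example below is a minimal illustration, assuming NumPy is available and that an image is an H x W x C array of 8-bit values; the function name augment_image, the 80% crop size, and the noise level are hypothetical choices made for this sketch, not part of any particular library.

```python
import numpy as np


def augment_image(image, rng):
    """Return a few simple augmented variants of an H x W x C uint8 image."""
    h, w = image.shape[:2]

    # Horizontal flip: reverse the column (width) axis.
    flipped = image[:, ::-1]

    # Rotate by 90 degrees counterclockwise in the height/width plane.
    rotated = np.rot90(image, k=1)

    # Random crop covering 80% of each side (illustrative choice).
    crop_h, crop_w = int(h * 0.8), int(w * 0.8)
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    cropped = image[top:top + crop_h, left:left + crop_w]

    # Additive Gaussian noise, clipped back to the valid 0..255 range.
    noise = rng.normal(0.0, 10.0, size=image.shape)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)

    return {
        "flipped": flipped,
        "rotated": rotated,
        "cropped": cropped,
        "noisy": noisy,
    }


# Usage with a random stand-in image (a real pipeline would load image files).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(10, 20, 3), dtype=np.uint8)
variants = augment_image(image, rng)
for name, variant in variants.items():
    print(name, variant.shape)
```

In practice, such transformations are usually applied randomly at training time so that the model sees a slightly different version of each image in every epoch.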
Text data can also be augmented by creating new variations of existing text. Simple approaches include replacing words with synonyms, deleting or inserting words, or altering sentence structure. Neural methods, such as back-translation (translating text into another language and back again) or the use of embeddings from pre-trained large language models (LLMs), generate new text samples that aim to preserve the original meaning while adding diversity.
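The simple text approaches (synonym replacement, word deletion, word insertion) can be sketched using only the Python standard library. This is a minimal illustration under stated assumptions: the SYNONYMS table is a hypothetical toy dictionary, and the function names and probabilities are chosen for this example only. A real pipeline would typically draw synonyms from a thesaurus or a lexical database.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}


def replace_synonyms(words, rng, prob=0.5):
    """Replace each word that has a known synonym with probability `prob`."""
    result = []
    for word in words:
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            result.append(rng.choice(options))
        else:
            result.append(word)
    return result


def delete_words(words, rng, prob=0.2):
    """Drop each word with probability `prob`, always keeping at least one."""
    kept = [word for word in words if rng.random() >= prob]
    return kept if kept else [rng.choice(words)]


def insert_word(words, rng):
    """Insert a synonym of a randomly chosen known word at a random position."""
    candidates = [word for word in words if word.lower() in SYNONYMS]
    if not candidates:
        return list(words)  # nothing to insert; return an unchanged copy
    source = rng.choice(candidates)
    new_word = rng.choice(SYNONYMS[source.lower()])
    position = rng.randrange(len(words) + 1)
    return words[:position] + [new_word] + words[position:]


# Usage: generate three variants of one sentence.
rng = random.Random(0)
sentence = "the quick dog is happy".split()
print(" ".join(replace_synonyms(sentence, rng)))
print(" ".join(delete_words(sentence, rng)))
print(" ".join(insert_word(sentence, rng)))
```

Because these edits operate on individual words without regard to context, they can occasionally change the meaning of a sentence, which is one reason the neural methods are attractive.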