[Overseas DS] MIT and Google Use 'Synthetic Images' to Train Image Generation Models

Author: 김광재 (Researcher)


Published: 2023-11-28 17:09
Updated: 2025-09-05 11:35

StableRep uses AI-generated images as training data
Outperforms other models trained on real images
But image generation becomes slower and more expensive

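[Overseas DS] presents the views of industry experts as reported by leading overseas data science publications. Our Data Science Management Research Institute (GIAI R&D Korea) has a content partnership in place on the condition that the original English text is published alongside the translation.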

synthetic_training_data (Source: Microsoft Bing Image Creator)

OpenAI's DALL-E 3 wowed users as soon as it launched. OpenAI explained that the improvement came from training the model on synthetic images. A research team from MIT and Google has extended this idea, applying it to Stable Diffusion, a popular open-source text-to-image model.

'Multi-positive contrastive learning' teaches the model to see the forest, not just the trees

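The researchers developed a model called StableRep using a "multi-positive contrastive learning method." This method treats multiple images generated from the same text prompt as positives for each other, which supplies additional information during training: it adds diversity and also teaches the vision system which images are alike and which are different. In other words, the AI image generation model looks at several variations of, say, a particular landscape, cross-references them with all the descriptions related to that landscape, and picks up nuances from those images. It then applies them in the final output, which yields a highly detailed and realistic image.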

The MIT and Google research team applied StableRep to Stable Diffusion so that it outperformed rival methods such as SimCLR and CLIP. As a result, StableRep achieved 76.7% linear accuracy on ImageNet classification, and with language supervision added (StableRep+), the model trained on 20 million synthetic images outperformed CLIP trained on 50 million real images.

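Lijie Fan, an MIT doctoral candidate and the lead researcher, said the technique is superior because it is "not just feeding it data." "When multiple images, all generated from the same text, [are] all treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels."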

Reliance on Stable Diffusion leaves speed, cost, and bias unresolved

StableRep also has drawbacks. For example, image generation is slow, and because StableRep's underlying model, Stable Diffusion, still has to go through an initial round of training on real data, using StableRep to generate images may take longer and cost more.

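The approach is meaningful in that it reduces reliance on collecting vast amounts of real images, which improves cost efficiency, and can minimize bias introduced by human curation. However, text-to-image generation models trained on large-scale uncurated web data may still carry latent social biases and errors. The choice of text in the prompts also affects the generated images and can introduce further bias.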

StableRep can be accessed via GitHub and is available for commercial use. It is released under the Apache 2.0 License, which permits use and the creation of derivative works. However, a copy of the Apache License must be provided with any redistributed or derivative work, along with a notice of the changes. The license also includes a limitation of liability, under which contributors are not liable for any damages arising from use of the licensed work.

MIT, Google: Using Synthetic Images to Train AI Image Models

Researchers describe a new method for creating highly detailed AI images, using training data made up of AI-generated images.

At a Glance

MIT and Google researchers developed a new technique for generating highly detailed images with image generation models.
Called StableRep, it uses AI-generated images to train AI models.
Researchers applied it to open-source Stable Diffusion.
But StableRep has flaws that make image generation slower and likely costlier to do.

Upon launch, OpenAI’s DALL-E 3 wowed users with its ability to generate highly detailed images compared to prior versions. OpenAI said the model's improved ability to do so came from using synthetic images to train the model. Now, a team of researchers from MIT and Google is expanding on this concept, applying it to the popular open-source text-to-image model Stable Diffusion.

In a newly published paper, the researchers described a new approach to using AI-generated images to train image generation models that they call StableRep. It uses millions of labeled synthetic images to generate high-quality images.

The researchers said StableRep is a “multi-positive contrastive learning method” where multiple images generated from the same text prompt are treated as positives for each other, which enhances the learning process. That means an AI image generation model would view several variations of, for example, a landscape and cross-reference them with all descriptions related to that landscape to recognize nuances based on those images. It would then apply them in the final output. This is what creates a highly detailed image.
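The "positives for each other" idea can be made concrete with a small loss function. What follows is a minimal NumPy sketch, not the authors' implementation: it assumes each image already has an embedding vector and an integer `prompt_ids` label saying which prompt produced it, and it scores every image against all the others in the batch. The function name, the `temperature` default, and the toy vectors are illustrative assumptions.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Cross-entropy between a uniform target over same-prompt images and
    the softmax of pairwise cosine similarities (self-pairs excluded).

    Illustrative sketch only; names and defaults are assumptions.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(prompt_ids)
    # L2-normalise so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    self_mask = np.eye(len(ids), dtype=bool)
    # Positives: other images in the batch generated from the same prompt.
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    if not positives.any(axis=1).all():
        raise ValueError("every image needs at least one same-prompt partner")

    # An image is never contrasted with itself.
    logits = np.where(self_mask, -np.inf, logits)
    # Row-wise log-softmax, shifted by the row maximum for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Target distribution: uniform over each image's positives.
    p = positives / positives.sum(axis=1, keepdims=True)
    per_image = -(p * np.where(positives, log_q, 0.0)).sum(axis=1)
    return per_image.mean()

if __name__ == "__main__":
    # Two prompts, two images each; same-prompt embeddings point the same way.
    toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(multi_positive_contrastive_loss(toy, [0, 0, 1, 1]))
```

The loss is small when images from the same prompt sit close together in embedding space and far from images of other prompts, which is the signal the article describes as learning "which images are alike and which are different."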

Outperforms rivals
The MIT and Google researchers applied StableRep to Stable Diffusion to make it outperform rival methods such as SimCLR and CLIP, which were trained with the same text prompts and corresponding real images.

StableRep achieved 76.7% linear accuracy on ImageNet classification with a Vision Transformer model. Adding language supervision, the researchers found that StableRep, trained on 20 million synthetic images, outperformed CLIP, which was trained on 50 million real images.
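"Linear accuracy" usually refers to a linear probe: the trained encoder is frozen, and only a linear classifier is fitted on top of its output features. The sketch below illustrates that evaluation idea under loud assumptions: the features are tiny hand-made vectors standing in for frozen encoder outputs, and a closed-form ridge regression on one-hot labels is swapped in for the gradient-trained classifier a real benchmark would use. The function names are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a linear classifier on frozen features.

    Assumption: ridge regression onto one-hot labels, chosen for brevity.
    """
    # Append a constant column so the classifier has a bias term.
    x = np.hstack([features, np.ones((len(features), 1))])
    y = np.eye(num_classes)[labels]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def linear_accuracy(weights, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    x = np.hstack([features, np.ones((len(features), 1))])
    predictions = np.argmax(x @ weights, axis=1)
    return float((predictions == labels).mean())

if __name__ == "__main__":
    # Stand-ins for frozen encoder features of two classes.
    train_x = np.array([[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-3.0, 0.0]])
    train_y = np.array([0, 0, 1, 1])
    probe = fit_linear_probe(train_x, train_y, num_classes=2)
    print(linear_accuracy(probe, train_x, train_y))
```

Because the encoder is never updated, the score reflects how linearly separable the learned representation already is, rather than how well the whole network can be fine-tuned.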

Lijie Fan, a doctoral candidate at MIT and lead researcher, said that their technique is superior as it is “not just feeding it data.” “When multiple images, all generated from the same text, [are] all treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels.”

StableRep does have its flaws. For example, it is slow to generate images. It also struggles with semantic mismatches between text prompts and the resulting images.

StableRep’s underlying model, Stable Diffusion, also needed to go through an initial round of training on real data – so using StableRep to create images will take longer and likely be costlier.

Access StableRep
StableRep can be accessed via GitHub.

It is available for commercial use – StableRep is under an Apache 2.0 License, meaning you can use it and produce derivative works.

However, you would have to provide a copy of the Apache License with any redistributed work or derivative works and include a notice of the changes. The license also includes a limitation of liability, where contributors are not liable for any damages arising from the use of the licensed work.
