[Overseas DS] Multimodal Chatbots That See, Speak and Write Have Arrived

Author: Hyojung Lee

Bio: Knowledge is not information to be handed over; it is a language made for thinking things through together.

Published: 2023-10-13 09:00

Updated: 2025-09-05 11:36

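[Overseas DS] presents the views of industry experts as reported by leading overseas data science publications. Our Data Science Management Research Institute (GIAI R&D Korea) runs this content partnership on the condition that the original English text is published alongside.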

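About 10 months ago, OpenAI's ChatGPT was first released to the public, and the competing development of large language models (LLMs) by Google, Meta and other tech giants accelerated. Now multimodal AI that can analyze not only text but also images, audio and more is emerging.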

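OpenAI has released a multimodal version of ChatGPT, based on its LLM GPT-4, to paying subscribers. Google began integrating image and audio features into some versions of Bard, its LLM-based chatbot, in May, and Meta likewise announced major progress in multimodality this past spring. Although still at an early stage, this fast-growing technology can be put to work on a variety of tasks.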

What Can Multimodal AI Do?

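Scientific American tested two chatbots that use multimodal LLMs: GPT-4V (GPT-4 with vision) and Bard, which runs on the PaLM 2 model. Both can hold hands-free voice conversations using audio alone, describe scenes in images and recognize text in photos. Given a simple prompt and a photo of a receipt, ChatGPT calculated what each of four people owed, tip and tax included, in under 30 seconds. Bard read one "9" as a "0" and so got the total wrong. In another experiment, given a photo of a full bookshelf, both chatbots offered detailed descriptions of the owner's personality and interests. Both also identified the Statue of Liberty from a single photo, inferred that it had been taken from an office in lower Manhattan and gave directions from the photographer's location to the statue (ChatGPT's directions were more detailed than Bard's). ChatGPT also outperformed Bard at accurately identifying insects in photos.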

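Meanwhile, OpenAI tested GPT-4V through Be My Eyes, a company that provides a free description service for blind and low-vision people. The early trials went well, and Be My Eyes is now rolling out the AI-powered version of its app to all users. At first there were reportedly many problems, such as poor-quality text descriptions and inaccurate descriptions caused by AI hallucinations, but many of these shortcomings have since been reduced. People using the app say they are regaining their independence, said Jesper Hvirring Henriksen, chief technology officer of Be My Eyes.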

How Multimodal AI Works

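Individual companies are reluctant to share the underpinnings of their models, but they are not the only groups working on multimodal AI, and other AI researchers have a good sense of what is happening behind the scenes. Douwe Kiela, an adjunct professor at Stanford University, where he teaches courses on machine learning, and CEO of Contextual AI, said there are two main ways to get from a text-only LLM to an AI that also responds to visual and audio prompts. In the more basic approach, Kiela explained, AI models are stacked on top of one another. When a user inputs an image into the chatbot, the picture first passes through a separate image AI built specifically to output detailed image captions (Google has had algorithms like this for years). That text description is then passed to the chatbot, which responds to the translated prompt.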
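The stacked approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_captioner` and `toy_chatbot` are hypothetical stand-ins invented here, not real vendor APIs, and the article does not say how any company wires its models together.

```python
from typing import Callable


def toy_captioner(image_bytes: bytes) -> str:
    """Hypothetical stand-in for a separate image-captioning model."""
    return f"a photo ({len(image_bytes)} bytes) of a receipt on a table"


def toy_chatbot(prompt: str) -> str:
    """Hypothetical stand-in for a text-only LLM; it just echoes its prompt."""
    return f"[reply to] {prompt}"


def stacked_answer(
    image_bytes: bytes,
    question: str,
    captioner: Callable[[bytes], str] = toy_captioner,
    chatbot: Callable[[str], str] = toy_chatbot,
) -> str:
    """Stacked pipeline: image -> caption text -> text-only chatbot."""
    # Step 1: the image model turns pixels into a text description.
    caption = captioner(image_bytes)
    # Step 2: the caption is placed in the prompt, so the chatbot only sees text.
    prompt = f"Image description: {caption}\nUser question: {question}"
    return chatbot(prompt)


print(stacked_answer(b"\x00\x01\x02", "How much is the total?"))
```

Note that in this design the chatbot never sees the image itself, so anything the caption omits is lost to it.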

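The other approach, by contrast, requires much tighter coupling: engineers graft parts of one AI algorithm onto another by combining the code that underlies each model. The grafted model is then retrained on a multimedia data set so that the AI can find patterns linking visual representations and words. This is more resource-intensive than the first strategy, but it can yield a far more capable AI. Kiela speculated that Google used the first method for Bard, while OpenAI may have relied on the second to build GPT-4V. That would offer one possible explanation for the differences in functionality between the two models.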
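One way to picture this tighter coupling is a small projection layer that maps image features into the same embedding space the language model uses for words, so that an image becomes one more item in the input sequence. The sketch below is a toy under loud assumptions: the dimensions, the `projection` matrix and the `token_embeddings` table are made up for illustration, the weights would normally be learned during the retraining step, and nothing here is confirmed as the design behind GPT-4V or Bard.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical word embeddings of dimension 4 (made-up values).
token_embeddings: Dict[str, Vector] = {
    "what": [0.1, 0.0, 0.2, 0.0],
    "is": [0.0, 0.3, 0.0, 0.1],
    "this": [0.2, 0.1, 0.0, 0.4],
}

# Hypothetical 3 x 4 projection matrix; in practice it is learned in retraining.
projection: List[Vector] = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.5],
]


def project(features: Vector, weights: List[Vector]) -> Vector:
    """Multiply a length-n feature vector by an n x d matrix."""
    d = len(weights[0])
    return [sum(f * row[j] for f, row in zip(features, weights)) for j in range(d)]


def build_sequence(image_features: Vector, words: List[str]) -> List[Vector]:
    """Place the projected image vector in front of the word embeddings."""
    image_token = project(image_features, projection)
    return [image_token] + [token_embeddings[w] for w in words]


sequence = build_sequence([1.0, 2.0, 3.0], ["what", "is", "this"])
print(len(sequence), "vectors, each of length", len(sequence[0]))
```

The point of the sketch is only that image and text end up as vectors of the same length in one sequence, which a single model can then process together.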

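Regardless of how the different AI models are fused, the same general process runs under the hood. LLMs work on the basic principle of predicting the next word or syllable from the words given. To do so, they rely on a "transformer" architecture. This type of neural network represents text as vectors, turning it into a series of mathematical relationships. A transformer network treats a sentence not as a simple string of words but as a web of connections that maps out context. This is what makes possible humanlike chatbots that can grasp multiple meanings, follow grammatical rules and imitate style. To combine or stack AI models, the algorithms must convert different inputs, whether visual, audio or text, into the same type of vector data on the path to an output. Because the different AIs can communicate through this shared vector data, they can ultimately offer users a multimodal service.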
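The "web of connections" can be made a little more concrete with scaled dot-product attention, the mechanism at the core of transformer networks: each word's vector is compared with every other word's vector, and the comparison scores are turned into weights. The two-dimensional vectors below are made-up toy values, and a real model would also apply learned query, key and value projections that this sketch leaves out.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical toy word vectors (not taken from any trained model).
words = ["the", "bank", "river"]
vectors: Dict[str, Vector] = {
    "the": [0.1, 0.1],
    "bank": [1.0, 0.2],
    "river": [0.9, 0.4],
}

weights = attention_weights(vectors["bank"], [vectors[w] for w in words])
for word, weight in zip(words, weights):
    print(f"bank -> {word}: {weight:.3f}")
```

With these toy values, "bank" attends more strongly to "river" than to "the", which is the kind of context link the article describes.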

Possibilities and Limits

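Once different types of AI begin to be aligned, integrated and improved together, rapid progress is likely to continue. One can imagine a near future in which machine learning models analyze and generate smells. A joint research team from the Google Research Brain Team, Osmo and the Monell Chemical Senses Center published a study in the journal Science on September 1 (local time) reporting that AI can smell as well as people can. Multimodal AI is not the same as artificial general intelligence. But as computers gain an array of senses similar to those of humans, they appear likely to approach that level gradually.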

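Industry insiders pointed to hallucination as the biggest problem for multimodal AI as well. It is hard to trust an AI assistant that could fabricate information at any moment. Because of the complexity of LLMs, there is currently no reliable way to fine-tune the problem away, and developers have avoided answering questions on this point. There is also the problem of privacy. Information-dense inputs such as voice and video are more likely to be leaked or compromised in hacks. Chatbots are particularly vulnerable to a type of attack called indirect prompt injection, which is very simple to carry out and has no known countermeasure. AI models connected to social media and email offer personalized convenience but have also brought security concerns to the fore. Experts therefore agree in stressing that users should avoid entering sensitive personal information when using multimodal AI services.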

The Latest AI Chatbots Can Handle Text, Images and Sound. Here’s How
New “multimodal” AI programs can do much more than respond to text—they also analyze images and chat aloud

Slightly more than 10 months ago OpenAI’s ChatGPT was first released to the public. Its arrival ushered in an era of nonstop headlines about artificial intelligence and accelerated the development of competing large language models (LLMs) from Google, Meta and other tech giants. Since that time, these chatbots have demonstrated an impressive capacity for generating text and code, albeit not always accurately. And now multimodal AIs that are capable of parsing not only text but also images, audio, and more are on the rise.

OpenAI released a multimodal version of ChatGPT, powered by its LLM GPT-4, to paying subscribers for the first time last week, months after the company first announced these capabilities. Google began incorporating similar image and audio features to those offered by the new GPT-4 into some versions of its LLM-powered chatbot, Bard, back in May. Meta, too, announced big strides in multimodality this past spring. Though it is in its infancy, the burgeoning technology can perform a variety of tasks.

WHAT CAN MULTIMODAL AI DO?
Scientific American tested out two different chatbots that rely on multimodal LLMs: a version of ChatGPT powered by the updated GPT-4 (dubbed GPT-4 with vision, or GPT-4V) and Bard, which is currently powered by Google’s PaLM 2 model. Both can hold hands-free vocal conversations using only audio, and they can describe scenes within images and decipher lines of text in a picture.

These abilities have myriad applications. In our test, using only a photograph of a receipt and a two-line prompt, ChatGPT accurately split a complicated bar tab and calculated the amount owed for each of four different people—including tip and tax. Altogether, the task took less than 30 seconds. Bard did nearly as well, but it interpreted one “9” as a “0,” thus flubbing the final total. In another trial, when given a photograph of a stocked bookshelf, both chatbots offered detailed descriptions of the hypothetical owner’s supposed character and interests that were almost like AI-generated horoscopes. Both identified the Statue of Liberty from a single photograph, deduced that the image was snapped from an office in lower Manhattan and offered spot-on directions from the photographer’s original location to the landmark (though ChatGPT’s guidance was more detailed than Bard’s). And ChatGPT also outperformed Bard in accurately identifying insects from photographs.
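The arithmetic behind the receipt test is simple once the amounts have been read correctly. The sketch below uses made-up subtotals and rates, not the figures from the actual test, and it assumes that tip is computed on the pre-tax subtotal; the article does not say how the tab was actually split.

```python
from typing import Dict


def split_bill(subtotals_cents: Dict[str, int], tax_pct: int, tip_pct: int) -> Dict[str, int]:
    """Return what each person owes in cents, including tax and tip.

    Assumption: both tax and tip are percentages of each person's
    pre-tax subtotal. Amounts are kept in integer cents.
    """
    owed = {}
    for person, subtotal in subtotals_cents.items():
        owed[person] = round(subtotal * (100 + tax_pct + tip_pct) / 100)
    return owed


# Hypothetical example: four people, 8% tax, 20% tip.
shares = split_bill({"A": 2000, "B": 1500, "C": 3000, "D": 1000}, tax_pct=8, tip_pct=20)
for person, cents in shares.items():
    print(f"{person} owes ${cents / 100:.2f}")
```

A single misread digit, such as the "9" that Bard took for a "0", changes a subtotal before this calculation even starts, which is why the final total comes out wrong.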

For disabled communities, the applications of such tech are particularly exciting. In March OpenAI started testing its multimodal version of GPT-4 through the company Be My Eyes, which provides a free description service through an app of the same name for blind and low-sighted people. The early trials went well enough that Be My Eyes is now in the process of rolling out the AI-powered version of its app to all its users. “We are getting such exceptional feedback,” says Jesper Hvirring Henriksen, chief technology officer of Be My Eyes. At first there were lots of obvious issues, such as poorly transcribed text or inaccurate descriptions containing AI hallucinations. Henriksen says that OpenAI has improved on those initial shortcomings, however; errors are still present but less common. As a result, “people are talking about regaining their independence,” he says.

HOW DOES MULTIMODAL AI WORK?
In this new wave of chatbots, the tools go beyond words. Yet they’re still based around artificial intelligence models that were built on language. How is that possible? Although individual companies are reluctant to share the exact underpinnings of their models, these corporations aren’t the only groups working on multimodal artificial intelligence. Other AI researchers have a pretty good sense of what’s happening behind the scenes.

There are two primary ways to get from a text-only LLM to an AI that also responds to visual and audio prompts, says Douwe Kiela, an adjunct professor at Stanford University, where he teaches courses on machine learning, and CEO of the company Contextual AI. In the more basic method, Kiela explains, AI models are essentially stacked on top of one another. A user inputs an image into a chatbot, but the picture is filtered through a separate AI that was built explicitly to spit out detailed image captions. (Google has had algorithms like this for years.) Then that text description is fed back to the chatbot, which responds to the translated prompt.

In contrast, “the other way is to have a much tighter coupling,” Kiela says. Computer engineers can insert segments of one AI algorithm into another by combining the computer code infrastructure that underlies each model. According to Kiela, it’s “sort of like grafting one part of a tree onto another trunk.” From there, the grafted model is retrained on a multimedia data set—including pictures, images with captions and text descriptions alone—until the AI has absorbed enough patterns to accurately link visual representations and words together. It’s more resource-intensive than the first strategy, but it can yield an even more capable AI. Kiela theorizes that Google used the first method with Bard, while OpenAI may have relied on the second to create GPT-4. This idea potentially accounts for the differences in functionality between the two models.

Regardless of how developers fuse their different AI models together, under the hood, the same general process is occurring. LLMs function on the basic principle of predicting the next word or syllable in a phrase. To do that, they rely on a “transformer” architecture (the “T” in GPT). This type of neural network takes something such as a written sentence and turns it into a series of mathematical relationships that are expressed as vectors, says Ruslan Salakhutdinov, a computer scientist at Carnegie Mellon University. To a transformer neural net, a sentence isn’t just a string of words—it’s a web of connections that map out context. This gives rise to much more humanlike bots that can grapple with multiple meanings, follow grammatical rules and imitate style. To combine or stack AI models, the algorithms have to transform different inputs (be they visual, audio or text) into the same type of vector data on the path to an output. In a way, it’s taking two sets of code and “teaching them to talk to each other,” Salakhutdinov says. In turn, human users can talk to these bots in new ways.

WHAT COMES NEXT?
Many researchers view the present moment as the start of what’s possible. Once you begin aligning, integrating and improving different types of AI together, rapid advances are bound to keep coming. Kiela envisions a near future where machine learning models can easily respond to, analyze and generate videos or even smells. Salakhutdinov suspects that “in the next five to 10 years, you’re just going to have your personal AI assistant.” Such a program would be able to navigate everything from full customer service phone calls to complex research tasks after receiving just a short prompt.

Multimodal AI is not the same as artificial general intelligence, a holy grail goalpost of machine learning wherein computer models surpass human intellect and capacity. Multimodal AI is an “important step” toward it, however, says James Zou, a computer scientist at Stanford University. Humans have an interwoven array of senses through which we understand the world. Presumably, to reach general AI, a computer would need the same.

As impressive and exciting as they are, multimodal models have many of the same problems as their singly focused predecessors, Zou says. “The one big challenge is the problem of hallucination,” he notes. How can we trust an AI assistant if it might falsify information at any moment? Then there’s the question of privacy. With information-dense inputs such as voice and visuals, even more sensitive information might inadvertently be fed to bots and then regurgitated in leaks or compromised in hacks.

Zou still advises people to try out these tools—carefully. “It’s probably not a good idea to put your medical records directly into the chatbot,” he says.
