Commit f11ab45

MultimodalQnA image query, pdf, dynamic ports, and UI updates (#1381)
Per the proposed changes in this [RFC](https://github.com/opea-project/docs/blob/main/community/rfcs/24-10-02-GenAIExamples-001-Image_and_Audio_Support_in_MultimodalQnA.md)'s Phase 2 plan, this PR adds support for image queries, PDF ingestion and display, and dynamic ports. There are also some bug fixes. This PR goes with [this one in GenAIComps](opea-project/GenAIComps#1134).

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>
Co-authored-by: Liang Lv <liang1.lv@intel.com>
1 parent f3562be commit f11ab45
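
The commit message mentions dynamic ports without showing how a port gets resolved. As a rough illustration only, the sketch below reads a per-service port from the environment and falls back to the defaults listed in the README table touched by this commit (6000, 7000, 9399, 6007). The `<SERVICE>_PORT` variable naming and the helper names `resolve_port` and `endpoint_url` are assumptions made for this example, not names taken from the PR.

```python
import os

# Default ports as listed in the README table changed by this commit.
DEFAULT_PORTS = {
    "embedding": 6000,
    "retriever": 7000,
    "lvm": 9399,
    "dataprep": 6007,
}


def resolve_port(service, env=None):
    """Return the port for a service, preferring an environment override.

    The <SERVICE>_PORT naming scheme is hypothetical, chosen for illustration.
    """
    env = os.environ if env is None else env
    raw = env.get(f"{service.upper()}_PORT")
    if raw is None or raw == "":
        # No override set: fall back to the documented default.
        return DEFAULT_PORTS[service]
    port = int(raw)
    if not 1 <= port <= 65535:
        raise ValueError(f"{service} port out of range: {port}")
    return port


def endpoint_url(host, service, path, env=None):
    """Build a service URL from a host, a resolved port, and an endpoint path."""
    return f"http://{host}:{resolve_port(service, env)}{path}"
```

Passing an explicit `env` mapping keeps the helper easy to test; in normal use it would read `os.environ`.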

26 files changed, +802 -289 lines changed

MultimodalQnA/Dockerfile

Lines changed: 0 additions & 2 deletions
@@ -16,14 +16,12 @@ RUN useradd -m -s /bin/bash user && \
 
 WORKDIR $HOME
 
-
 # Stage 2: latest GenAIComps sources
 FROM base AS git
 
 RUN apt-get update && apt-get install -y --no-install-recommends git
 RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git
 
-
 # Stage 3: common layer shared by services using GenAIComps
 FROM base AS comps-base
 
MultimodalQnA/README.md

Lines changed: 41 additions & 11 deletions
@@ -1,8 +1,8 @@
 # MultimodalQnA Application
 
-Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.
+Suppose you possess a set of videos, images, audio files, PDFs, or some combination thereof and wish to perform question-answering to extract insights from these documents. To respond to your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.
 
-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
 
 The MultimodalQnA architecture shows below:
 
@@ -87,12 +87,12 @@ In the below, we provide a table that describes for each microservice component
 <details>
 <summary><b>Gaudi default compose.yaml</b></summary>
 
-| MicroService | Open Source Project | HW | Port | Endpoint |
-| ------------ | --------------------- | ----- | ---- | ----------------------------------------------- |
-| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
-| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
-| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
-| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions |
+| MicroService | Open Source Project | HW | Port | Endpoint |
+| ------------ | --------------------- | ----- | ---- | --------------------------------------------------------------------- |
+| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
+| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
+| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
+| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest_with_text |
 
 </details>
 
@@ -172,8 +172,38 @@ docker compose -f compose.yaml up -d
 
 ## MultimodalQnA Demo on Gaudi2
 
-![MultimodalQnA-upload-waiting-screenshot](./assets/img/upload-gen-trans.png)
+### Multimodal QnA UI
 
-![MultimodalQnA-upload-done-screenshot](./assets/img/upload-gen-captions.png)
+![MultimodalQnA-ui-screenshot](./assets/img/mmqna-ui.png)
 
-![MultimodalQnA-query-example-screenshot](./assets/img/example_query.png)
+### Video Ingestion
+
+![MultimodalQnA-ingest-video-screenshot](./assets/img/video-ingestion.png)
+
+### Text Query following the ingestion of a Video
+
+![MultimodalQnA-video-query-screenshot](./assets/img/video-query.png)
+
+### Image Ingestion
+
+![MultimodalQnA-ingest-image-screenshot](./assets/img/image-ingestion.png)
+
+### Text Query following the ingestion of an image
+
+![MultimodalQnA-video-query-screenshot](./assets/img/image-query.png)
+
+### Audio Ingestion
+
+![MultimodalQnA-audio-ingestion-screenshot](./assets/img/audio-ingestion.png)
+
+### Text Query following the ingestion of an Audio Podcast
+
+![MultimodalQnA-audio-query-screenshot](./assets/img/audio-query.png)
+
+### PDF Ingestion
+
+![MultimodalQnA-upload-pdf-screenshot](./assets/img/ingest_pdf.png)
+
+### Text query following the ingestion of a PDF
+
+![MultimodalQnA-pdf-query-example-screenshot](./assets/img/pdf-query.png)
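
The README text in this diff describes the flow as: embed content into a shared semantic space, store the embeddings in a vector database, then fetch the most relevant items as context for the LVM. A toy sketch of the retrieval step follows. It is not the project's implementation: hand-written 3-dimensional vectors stand in for BridgeTower embeddings, an in-memory list stands in for the vector database, and the class name `ToyVectorStore` and the file names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._items = []

    def add(self, embedding, metadata):
        # Ingestion: keep the embedding alongside a description of its source.
        self._items.append((list(embedding), metadata))

    def search(self, query_embedding, k=1):
        # Retrieval: rank stored items by similarity to the query embedding.
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_embedding, item[0]),
            reverse=True,
        )
        return [metadata for _, metadata in ranked[:k]]


store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], {"source": "video.mp4", "text": "transcript segment"})
store.add([0.1, 0.9, 0.0], {"source": "photo.png", "text": "image caption"})
store.add([0.0, 0.2, 0.9], {"source": "report.pdf", "text": "page text"})

# A made-up query embedding that lies closest to the PDF entry.
query = [0.05, 0.25, 0.95]
context = store.search(query, k=1)
```

In the flow the README describes, the retrieved entries would then be passed to the LVM as input context together with the user's question.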