 # MultimodalQnA Application

-Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.
+Suppose you possess a set of videos, images, audio files, PDFs, or some combination thereof and wish to perform question-answering to extract insights from these documents. To answer your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.

-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as text, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.

 The MultimodalQnA architecture is shown below:

@@ -87,12 +87,12 @@ In the below, we provide a table that describes for each microservice component |
 <details>
 <summary><b>Gaudi default compose.yaml</b></summary>

-| MicroService | Open Source Project   | HW    | Port | Endpoint                                        |
-| ------------ | --------------------- | ----- | ---- | ----------------------------------------------- |
-| Embedding    | Langchain             | Xeon  | 6000 | /v1/embeddings                                  |
-| Retriever    | Langchain, Redis      | Xeon  | 7000 | /v1/multimodal_retrieval                        |
-| LVM          | Langchain, TGI        | Gaudi | 9399 | /v1/lvm                                         |
-| Dataprep     | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions |
+| MicroService | Open Source Project   | HW    | Port | Endpoint                                                              |
+| ------------ | --------------------- | ----- | ---- | --------------------------------------------------------------------- |
+| Embedding    | Langchain             | Xeon  | 6000 | /v1/embeddings                                                        |
+| Retriever    | Langchain, Redis      | Xeon  | 7000 | /v1/multimodal_retrieval                                              |
+| LVM          | Langchain, TGI        | Gaudi | 9399 | /v1/lvm                                                               |
+| Dataprep     | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest_with_text |

 </details>

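As a sketch of how a client might feed media to the Dataprep endpoints listed above, the snippet below builds a `multipart/form-data` upload. This is a hedged illustration under stated assumptions: the form field name `files` and the exact upload contract are guesses, and only the port (6007) and route (`/v1/generate_transcripts`) come from the table; consult the dataprep microservice documentation for the real schema.

```python
# Hedged sketch: build a multipart/form-data upload for the dataprep
# service (port 6007 in the table above). The form field name "files"
# is an assumption, not the service's documented contract.
import mimetypes
import urllib.request
import uuid

DATAPREP_URL = "http://localhost:6007/v1/generate_transcripts"


def build_upload_request(url: str, filename: str, data: bytes) -> urllib.request.Request:
    """Package file bytes as a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode("utf-8") + data + f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With the Gaudi compose stack running, passing the returned request to `urllib.request.urlopen` would submit the file for transcript generation; swapping the route for `/v1/generate_captions` or `/v1/ingest_with_text` follows the same pattern.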
@@ -172,8 +172,38 @@ docker compose -f compose.yaml up -d |

 ## MultimodalQnA Demo on Gaudi2

-
+### MultimodalQnA UI

-
+

-
+### Video Ingestion
+
+### Text Query Following the Ingestion of a Video
+
+### Image Ingestion
+
+### Text Query Following the Ingestion of an Image
+
+### Audio Ingestion
+
+### Text Query Following the Ingestion of an Audio Podcast
+
+### PDF Ingestion
+
+### Text Query Following the Ingestion of a PDF