The term multimodal embedding spaces comes from the field of artificial intelligence and is closely tied to big data and digital transformation. It describes a shared digital space, the so-called embedding space, in which information from different sources (such as text, images, speech or even video) is represented as numerical vectors and linked together.
Imagine this embedding space as a huge cupboard in which different types of information are sorted according to one standardised system, so that items with a similar meaning end up close to each other. This allows an AI to recognise, for example, that a dog in a photo and the word "dog" in a text belong together. As a result, it can capture connections between images, words and even sounds far better than a system that handles each data type separately.
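The idea that a photo of a dog and the word "dog" belong together can be sketched as a similarity measurement between vectors. The following is a minimal illustration, not how any particular model works: the three-dimensional vectors are made up by hand, whereas a real system would obtain them from a trained model and they would typically have hundreds of dimensions. The names `dog_photo`, `dog_word` and `car_word` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings (assumption: invented for illustration only).
dog_photo = [0.9, 0.1, 0.2]  # stands in for an image embedding
dog_word = [0.8, 0.2, 0.1]   # stands in for the text embedding of "dog"
car_word = [0.1, 0.9, 0.3]   # stands in for the text embedding of "car"

sim_dog = cosine_similarity(dog_photo, dog_word)
sim_car = cosine_similarity(dog_photo, car_word)

# The photo is scored as much closer to "dog" than to "car".
print(f"photo vs 'dog': {sim_dog:.2f}")
print(f"photo vs 'car': {sim_car:.2f}")
```

Cosine similarity is only one common choice of closeness measure. What matters is that both the image and the text are mapped into the same space, so that a single measure can compare them.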
A concrete example: in modern product search in online shops, a customer can upload a photo of a pair of trainers and be shown matching results, just as if they had typed a detailed search query. Multimodal embedding spaces thus make it possible for artificial intelligence to analyse different types of data together and offer more intelligent services, whether in online shopping, image search or assistance systems in everyday life.
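The product search example can be sketched as a nearest-neighbour lookup: the uploaded photo is turned into a vector, and the catalogue entries whose vectors lie closest to it are returned. This is a simplified sketch under stated assumptions. The catalogue, its vectors and the `search` function are hypothetical, and the query vector is written by hand where a real shop would compute it from the photo with a trained multimodal model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, catalog, top_k=2):
    """Return the names of the top_k catalogue items most similar to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in catalog.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:top_k]]


# Hypothetical catalogue with hand-made toy embeddings (assumption).
catalog = {
    "white running trainers": [0.9, 0.1, 0.1],
    "black leather boots": [0.2, 0.9, 0.1],
    "red canvas trainers": [0.7, 0.3, 0.4],
    "wool scarf": [0.1, 0.1, 0.9],
}

# Stand-in for the embedding of the customer's uploaded trainer photo.
photo_query = [0.9, 0.1, 0.15]

results = search(photo_query, catalog)
print(results)
```

With these toy vectors, both pairs of trainers rank ahead of the boots and the scarf, even though the query was a photo and not a typed search phrase. A production system would usually replace the linear scan with an approximate nearest-neighbour index to cope with large catalogues.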















