OpenAI vs Open-Source Multilingual Embedding Models | by Yann-Aël Le Borgne | Feb, 2024

From Medium:

OpenAI has released a new generation of embedding models, embedding v3, described as its most performant embedding models to date, with notably improved multilingual performance. The models come in two variants: text-embedding-3-small and text-embedding-3-large. Little information has been disclosed about their design and training, and they are accessible only through a paid API.
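For reference, the new models are called through the standard embeddings endpoint. A minimal sketch using the OpenAI Python SDK (v1 client, with OPENAI_API_KEY assumed to be set in the environment; the example sentences are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embed a couple of sentences with the small v3 model
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "The AI Act lays down harmonised rules on artificial intelligence.",
        "Les systèmes d'IA à haut risque sont soumis à des exigences strictes.",
    ],
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions for the small model
```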

To compare the performance of these models with open-source counterparts, the EU AI Act is used as the data corpus. The act is available in 24 languages, which makes it possible to compare retrieval accuracy across different language families. The evaluation pipeline involves generating a custom synthetic question/answer dataset from the corpus and measuring the retrieval accuracy of the OpenAI models on this dataset.
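The question/answer dataset is built by prompting an LLM over chunks of the corpus. The exact prompt and tooling from the article are not reproduced here, but a rough sketch of the idea, assuming the chunks have already been extracted and using an illustrative generation model and prompt:

```python
from openai import OpenAI

client = OpenAI()

def generate_question(chunk: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask an LLM to write one question answerable from the given chunk."""
    prompt = (
        "Write a single factual question that can be answered using only the "
        "following passage. Return only the question.\n\nPassage:\n" + chunk
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Each chunk of the corpus becomes the "answer" paired with its generated question
chunks = ["High-risk AI systems shall be subject to a conformity assessment ..."]
qa_pairs = [(generate_question(c), c) for c in chunks]
```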

The dataset is generated using a chunk-based approach, where a synthetic question is generated for each chunk of the corpus. The resulting dataset contains question/answer pairs for evaluation, with each chunk serving as the answer to its question. The evaluation function embeds all answer chunks and, for each question, retrieves the top-k most similar chunks; the rank of the source chunk is used to compute the Mean Reciprocal Rank (MRR). Four different OpenAI embedding models were evaluated on four languages: English, French, Czech, and Hungarian.
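To make the retrieval metric concrete, here is a minimal sketch of an MRR evaluation over precomputed embeddings; the function name, k=5, and the NumPy-only setup are assumptions for illustration, not the article's exact code:

```python
import numpy as np

def evaluate_mrr(question_vecs: np.ndarray, chunk_vecs: np.ndarray,
                 true_idx: np.ndarray, k: int = 5) -> float:
    """Mean Reciprocal Rank of the source chunk within the top-k retrieved chunks.

    question_vecs: (n_questions, d) embeddings of the synthetic questions
    chunk_vecs:    (n_chunks, d) embeddings of the corpus chunks
    true_idx:      index of the chunk each question was generated from
    """
    # Cosine similarity between every question and every chunk
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = q @ c.T

    reciprocal_ranks = []
    for i, row in enumerate(sims):
        top_k = np.argsort(-row)[:k]              # indices of the k most similar chunks
        hits = np.where(top_k == true_idx[i])[0]  # position of the source chunk, if retrieved
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))
```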

Read more at Medium: OpenAI vs Open-Source Multilingual Embedding Models | by Yann-Aël Le Borgne | Feb, 2024