At the time, Ollama probably did not support multimodal model inference. They built the toolkit to cover text, audio, image generation, and other multimodal models, in both ONNX and GGML formats.
A straightforward tool for running inference on multimodal models must have been in high demand when they started out.