Model Optimization Strategies

The title can be a little misleading: this subtheme covers not only strategies for optimizing models themselves (i.e. how they are trained), but more importantly how to optimize inference. At the end of the day, though, both serve the same purpose of squeezing the most performance out of limited compute.

Model optimization

  • Mixture-of-experts models activate only a subset of their weights for any given total model size, which makes them an efficient way to run inference on a large model with limited compute. (mixture_of_experts, 5) ^mixtureofexperts5
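
A minimal sketch of what "fewer activated weights" means in practice, assuming a simple top-k gated layer; the expert count, hidden size, and `top_k` value here are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: only top_k of num_experts run per token."""

    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # router that scores experts per token
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.gate(x)                        # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, -1)   # keep only the best top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Each token pays for top_k expert matmuls instead of all num_experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 64])
```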

Inference optimization

  • RAG can offload tokens from the context window at inference: only the retrieved chunks go into the prompt rather than whole documents, which relieves the compute burden (see the retrieval sketch after this list). (rag_context_window, 6)
  • Local LLMs can be run on the CPU when a high-performance GPU is not available (see the CPU sketch after this list). (cpu_llm_inference, 3)
  • Batch inference tooling like vLLM can help make batch tasks more efficient (see the vLLM sketch after this list). (batching_llm_inference, 3)
  • Training models at lower floating-point precision helps reduce inference compute requirements. (training_less_precision, 3)
  • Running models locally with tensor parallelism splits each layer's work across GPUs for better single-stream performance (the vLLM sketch below also shows this knob). (llm_tensor_parallel, 2)
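
To make the RAG point concrete, here is a toy retrieval step; word overlap stands in for a real embedding model, and the chunks, query, and `top_k` value are made up for illustration. The point is that only the few retrieved chunks land in the context window, not the whole corpus.

```python
def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by naive word overlap with the query (stand-in for embedding search)."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:top_k]

chunks = [
    "vLLM batches many requests together with continuous batching.",
    "Tensor parallelism splits each layer's matrices across GPUs.",
    "Bananas are rich in potassium.",
]
query = "How does vLLM batch requests?"
context = "\n".join(retrieve(query, chunks))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # the prompt carries only the relevant chunks, not every document
```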
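
For CPU-only inference, one common route is a GGUF-quantized model through llama.cpp's Python bindings; this is a sketch assuming the `llama-cpp-python` package is installed, and the model path, thread count, and context size are placeholders.

```python
from llama_cpp import Llama

# Placeholder path: any GGUF-quantized checkpoint downloaded locally will do.
llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,    # context window size
    n_threads=8,   # number of CPU threads to spread the matmuls over
)

out = llm("Explain in one sentence why CPU inference is useful.", max_tokens=64)
print(out["choices"][0]["text"])
```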
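
The batching, precision, and tensor-parallelism points can be sketched together with vLLM, since its engine exposes knobs for all three; the model id, dtype, and `tensor_parallel_size` below are assumptions for illustration (the dtype knob shows reduced precision on the inference side rather than during training).

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    dtype="float16",          # run at reduced precision: less memory, faster matmuls
    tensor_parallel_size=2,   # split each layer across 2 GPUs for better single-stream speed
)

prompts = [
    "Summarize mixture-of-experts in one sentence.",
    "Why does batching improve GPU utilization?",
    "What does tensor parallelism split across GPUs?",
]
params = SamplingParams(max_tokens=64, temperature=0.7)

# One call submits the whole batch; vLLM schedules the requests together on the GPU(s).
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```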