OP shared a comparison of open-source Whisper packages that support long-form transcription. Some comments:
- “I have been using whisper.cpp for a while. I guess I should try faster whisper and whisperX” (based on the benchmark results posted by OP)
- “I love that you shared the notebook for running these benchmarks” (which isn’t standard practice)
- “Have you tried distilled whisper v2? It was more accurate for me.”
- OP had skipped it, assuming whisper-large-v3 would be at least as accurate, but the commenter found that, contrary to intuition, distilled whisper v2 was more accurate.
- The OP follows up: https://www.reddit.com/r/LocalLLaMA/comments/1brqwun/comment/kxfts9p/
One commenter’s reason for being in the thread is interesting: “I research whisper for a company project I work. We use it for subtitling.”
OP also does a good job of updating the thread with more models: “Update: I benchmarked large-v3 and distill-large-v2. Here are the updated results with color formatting https://preview.redd.it/iv60rvqa1qrc1.png?width=1337&format=png&auto=webp&s=4954ababfbd98bffea555285bc048b437e513f98 You can find all the results as a csv file in the blog post.”
The most notable comment in this thread is from a Huggingface Transformers maintainer, who found in their own benchmarks that the chunked algorithm can come within 1.5% absolute WER of OpenAI’s sequential algorithm, and suggested that OP may have set the hyperparameters chunk_length_s and return_timestamps incorrectly. They are looking out for the community, but also thanking the OP for providing a useful resource.
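The maintainer’s point can be illustrated with a minimal sketch of the Transformers pipeline API, where chunk_length_s and return_timestamps are the two hyperparameters in question. The helper names, the default values, and the model name below are illustrative assumptions, not the settings OP actually used:

```python
def chunked_asr_config(chunk_length_s=30, return_timestamps=True):
    """Keyword args for Transformers' chunked long-form algorithm.

    chunk_length_s=30 matches Whisper's 30-second receptive field, and
    return_timestamps=True is the flag the maintainer singled out as easy
    to misconfigure. Defaults here are assumptions, not OP's settings.
    """
    return {
        "chunk_length_s": chunk_length_s,
        "return_timestamps": return_timestamps,
    }


def transcribe(audio_path, model_name="openai/whisper-large-v3"):
    # Imported lazily so the sketch stays importable without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_name,
        **chunked_asr_config(),
    )
    return asr(audio_path)["text"]
```

Keeping the two hyperparameters in one place makes it easy to rerun a benchmark with different chunk lengths and see how far the chunked algorithm lands from the sequential one.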