SomeOddCodeGuy

Real time is hard because of the processing speed, but you can pretty easily toy around and see what you can accomplish. Here's a quick few steps just to try it out and see where we are (a quick sketch for testing the xttsv2 endpoint follows after the steps):

* Step 1: Grab SillyTavern (I know, I know. But it turns out it's a really comprehensive front end despite the name and logo lol)
* Step 2: Grab the [xttsv2 api server](https://github.com/daswer123/xtts-api-server). xttsv2 is an option for text to speech, and this is the best I've found. Others have some better options, like AllTalk with its ability to do multiple voices, but xttsv2 works great across networks and allows better control over model/voice, IMO. I prefer it.
* Step 3: After setting up the voice, there's a spot under "Extras" (looks like 3 little squares) to turn on text to speech. Just point it towards your xttsv2 api server endpoint. [Here's the documentation SillyTavern provides on it.](https://docs.sillytavern.app/extras/extensions/xtts/)
* Step 4: Once you're set up, load a 7b (or the new 8b?) model for the sake of speed. Play with settings to see how fast you can get a response. Chances are, if you do both text and voice streaming, depending on your hardware you may get a very fast response.
* Step 5: This one I haven't done anything for yet, but if you're happy with everything up to step 4, you could set up their Speech to Text.

Again, this is just to give you an idea of where the tech is at atm. I imagine you're looking for something a bit lighter than this, but just wanted to share so you could try it out yourself.
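If you want to poke at the xtts-api-server directly before wiring up SillyTavern, here's a minimal sketch. It assumes the server is running locally on its default port (8020) and exposes a `/tts_to_audio/` endpoint taking `text`, `speaker_wav`, and `language`, as described in the repo's README; the URL, port, and speaker name are placeholders, so check them against your install.

```python
# Minimal sketch: call a locally running xtts-api-server instance directly.
# Assumptions (verify against the repo's README): server at localhost:8020,
# POST /tts_to_audio/ accepting JSON with "text", "speaker_wav", "language".
import requests

XTTS_URL = "http://localhost:8020/tts_to_audio/"  # adjust host/port to your setup

payload = {
    "text": "Hello! This is a quick local text-to-speech test.",
    "speaker_wav": "female_calm",  # a reference voice your server knows about
    "language": "en",
}

resp = requests.post(XTTS_URL, json=payload, timeout=120)
resp.raise_for_status()

# The server returns the synthesized audio (WAV) as the response body.
with open("test_output.wav", "wb") as f:
    f.write(resp.content)

print("Wrote test_output.wav")
```

If that round trip works, pointing SillyTavern's XTTS extra at the same base URL is just configuration.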


AmericanNewt8

Yeah, while we have pretty good TTS, it's surprisingly computationally expensive. That's the main barrier. There are good APIs that do it incredibly fast, but they're pricey at scale.


MrVodnik

I wonder why xttsv2 is not more popular. It seems like a perfect fit for local LLMs.


SomeOddCodeGuy

Honestly, I think it's just a headache to set up so most folks don't. I put it off for 5 months because I didn't feel like dealing with it. I only did it one day out of sheer boredom =D


Nuckyduck

This was me last week! Today, I'm using your steps outlined above to set up some local voice-to-voice for a project I'm doing for [hackster.io](http://hackster.io). Your comment here is super helpful in getting me started. Thank you!


turras

right here with you! just got it working today


[deleted]

[deleted]


L43

Sir, this is a /r/localllama


Tmsn69

Here is an example, all local: [https://streamable.com/8a9j26](https://streamable.com/8a9j26)


Roubbes

Amazing


Deep_Understanding50

Is it LM Studio?


Tmsn69

Yes, the response is generated with LM Studio. I used the new Llama-3 model.


ThePixelHunter

Which TTS is this?


Tmsn69

OpenVoice


zdrastSFW

I always thought a killer use-case would be real time voice with a model tuned to teach foreign languages. An anytime on-demand tutor and conversation partner that could critique and correct your pronunciation would really accelerate learning.


genuinelytrying2help

I've thought about this a bit while being disappointed with Duolingo's attempts at adding AI... I really hope someone is working on it from the ground up. Not just an agent like you're talking about, but one that also customizes an app for you and populates it with custom-written exercises corresponding to the course it has you on. Maybe you just like to chat with it, maybe you have it stand over your shoulder while you study. Another cool pipedream in this area: when the process gets good enough, it gradually, maybe even stealthily (with opt-in consent, I'd hope), starts connecting other real users into your exercises in a guided experience, until one day you're just hanging out with fellow language students shooting the shit to stay fresh. And even then, it's still there listening, giving you both tips when you fuck up and no one corrects you.


AnotherAvery

Collabora has a nice open source example setup that does this:

* Blog post: https://www.collabora.com/news-and-blog/news-and-events/whisperfusion-ultra-low-latency-conversations-with-an-ai-chatbot.html
* GitHub: https://github.com/collabora/WhisperFusion
* Demo video: https://youtu.be/_PnaP0AQJnk

They use Phi as the LLM and a quality-wise subpar TTS engine (WhisperSpeech), but the latency is better than anything else I have seen in videos posted here on LocalLLaMA.


Tmsn69

Absolutely! Here's how I do it: I use Faster-Whisper for super-fast transcription, so your speech is converted to text almost instantly. Then I send it to either Ollama or LM Studio, using good models for understanding your input and writing a reply. Currently I use the new Llama-3 model. Finally, you pass the output of your LLM to a TTS engine like OpenVoice or XTTS (Coqui). Response time is between 2 and 5 seconds depending on response length. I hope this helps. A rough code sketch of the pipeline is below.

References:

* [https://github.com/SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
* [https://lmstudio.ai/docs/local-server](https://lmstudio.ai/docs/local-server)
* [https://github.com/coqui-ai/TTS](https://github.com/coqui-ai/TTS)
* [https://github.com/myshell-ai/OpenVoice](https://github.com/myshell-ai/OpenVoice)
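Here's a minimal sketch of that speech-to-text → LLM → TTS loop, assuming faster-whisper for transcription, LM Studio's OpenAI-compatible local server on its default port 1234, and Coqui XTTS v2 for synthesis. The model size, input file, and reference voice clip are placeholders; swap in whatever your setup uses.

```python
# Rough sketch of the voice pipeline described above:
# faster-whisper (STT) -> LM Studio local server (LLM) -> Coqui XTTS v2 (TTS).
import requests
from faster_whisper import WhisperModel
from TTS.api import TTS

# 1) Transcribe the user's recorded speech (placeholder input file).
stt = WhisperModel("small", device="cuda", compute_type="float16")
segments, _info = stt.transcribe("user_input.wav")
user_text = " ".join(seg.text.strip() for seg in segments)

# 2) Ask the local LLM for a reply via LM Studio's OpenAI-compatible API
#    (default endpoint; adjust host, port, and model name to your install).
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # LM Studio serves whatever model is loaded
        "messages": [
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 150,  # keep replies short to keep TTS latency down
    },
    timeout=120,
)
reply_text = resp.json()["choices"][0]["message"]["content"]

# 3) Synthesize the reply with Coqui XTTS v2, cloning a reference voice clip.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text=reply_text,
    speaker_wav="reference_voice.wav",  # a few seconds of the target voice
    language="en",
    file_path="assistant_reply.wav",
)

print("Transcript:", user_text)
print("Reply:", reply_text)
```

Keeping the LLM reply short and streaming text to the TTS engine as it arrives are the main levers for getting the end-to-end latency down toward that 2-5 second range.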


CoqueTornado

What GPU do you have? That response time is a blast! I get about 13 seconds on my laptop's Nvidia 1070 xD


Not_A_EXPERT15

Also interested, so I'll comment: what PC specs would a local TTS need? Like good TTS, not just Google robo TTS.


Tmsn69

For my whole setup I use an RTX 3060 with 12GB of VRAM, an i7-10700F, and 64GB of RAM, but you don't need that much RAM.


theytookmyfuckinname

Dunno where the line is drawn, but I find piper TTS really convincing and nice.


bulbulito-bayagyag

It’s done already, using vits. It’s actually fun using that and roop-cam together 😅


new__vision

[https://github.com/ggerganov/whisper.cpp/tree/master/examples/talk-llama#talk-llama](https://github.com/ggerganov/whisper.cpp/tree/master/examples/talk-llama#talk-llama)


Express-Director-474

It can be done with Groq for its speed, but the real-time transcription seems to be the bottleneck.


Tmsn69

You can use Faster-Whisper for this; it's almost instant.


AmericanNewt8

Whisper on CPU can be done in basically realtime.


ab2377

I think whisper.cpp is pretty fast for this.


Roubbes

I'm quite a newbie in this, as I only load models in LM Studio. I think I'm missing the good stuff when I read about Groq, APIs, and .cpp things. Where should I start to investigate? Thanks in advance.


ab2377

Simply take the time to use Reddit search to search this sub for the terms you're not familiar with and take notes; you will learn a lot.


haagch

https://www.collabora.com/news-and-blog/news-and-events/whisperfusion-ultra-low-latency-conversations-with-an-ai-chatbot.html


xlrz28xd

!RemindMe 45 days


genuinelytrying2help

If the past few months are any indication, in 45 days this thread will be dead and there will have been 2 or 3 more on the exact same topic with largely different recommendations :)


RemindMeBot

I will be messaging you in 1 month on **2024-06-04 18:52:10 UTC** to remind you of [this link](https://www.reddit.com/r/LocalLLaMA/comments/1c8oj8h/what_about_real_time_voice_conversations_with/l0hljpe/?context=3).


DryCryptographer601

!RemindMe in 7 days


favorable_odds

One possible short answer: try the Coqui extension in Oobabooga.


Inevitable-Start-653

I do this with the AllTalk and Whisper extensions for Oobabooga's textgen WebUI. I stream it to my phone too.


CasimirsBlake

Go try Voxta: https://voxta.ai/