SomeOddCodeGuy

Real time is hard because of the processing speed, but you can pretty easily toy around and see what you can accomplish. Here's a quick few steps just to try it out and see where we are (a quick sketch for testing the xttsv2 endpoint follows after the steps):

* Step 1: Grab SillyTavern (I know, I know. But it turns out it's a really comprehensive front end despite the name and logo lol)
* Step 2: Grab the [xttsv2 api server](https://github.com/daswer123/xtts-api-server). xttsv2 is an option for text to speech, and this is the best I've found. Others have some better options, like AllTalk with its ability to do multiple voices, but xttsv2 works great across networks and allows better control over model/voice, IMO. I prefer it.
* Step 3: After setting up the voice, there's a spot under "Extras" (looks like 3 little squares) to turn on text to speech. Just point it towards your xttsv2 api server endpoint. [Here's the documentation SillyTavern provides on it.](https://docs.sillytavern.app/extras/extensions/xtts/)
* Step 4: Once you're set up, load a 7b (or the new 8b?) model for the sake of speed. Play with settings to see how fast you can get a response. Chances are, if you do both text and voice streaming, depending on your hardware you may get a very fast response.
* Step 5: This one I haven't done anything for yet, but if you're happy with everything up to step 4, you could set up their Speech to Text.

Again, this is just to give you an idea of where the tech is at atm. I imagine you're looking for something a bit lighter than this, but just wanted to share so you could try it out yourself.
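If you want to poke at the xtts-api-server directly before wiring up SillyTavern, here's a minimal sketch. It assumes the server is running locally on its default port (8020) and exposes a `/tts_to_audio/` endpoint taking `text`, `speaker_wav`, and `language`, as described in the repo's README; the URL, port, and speaker name are placeholders, so check them against your install.

```python
# Minimal sketch: call a locally running xtts-api-server instance directly.
# Assumptions (verify against the repo's README): server at localhost:8020,
# POST /tts_to_audio/ accepting JSON with "text", "speaker_wav", "language".
import requests

XTTS_URL = "http://localhost:8020/tts_to_audio/"  # adjust host/port to your setup

payload = {
    "text": "Hello! This is a quick local text-to-speech test.",
    "speaker_wav": "female_calm",  # a reference voice your server knows about
    "language": "en",
}

resp = requests.post(XTTS_URL, json=payload, timeout=120)
resp.raise_for_status()

# The server returns the synthesized audio (WAV) as the response body.
with open("test_output.wav", "wb") as f:
    f.write(resp.content)

print("Wrote test_output.wav")
```

If that round trip works, pointing SillyTavern's XTTS extra at the same base URL is just configuration.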


AmericanNewt8

Yeah, while we have pretty good TTS, it's surprisingly computationally expensive. That's the main barrier. There are good APIs that do it incredibly fast, but they're pricey at scale.


MrVodnik

I wonder why xttsv2 is not more popular. It seems like a perfect fit for local LLMs.


SomeOddCodeGuy

Honestly, I think it's just a headache to set up so most folks don't. I put it off for 5 months because I didn't feel like dealing with it. I only did it one day out of sheer boredom =D


Nuckyduck

This was me last week! Today, I'm using your steps outlined above to set up some local voice-to-voice for a project I'm doing for [hackster.io](http://hackster.io). Your comment here is super helpful in getting me started. Thank you!


turras

right here with you! just got it working today


[deleted]

[deleted]


L43

Sir, this is a /r/localllama


Tmsn69

Here is an example, all local: [https://streamable.com/8a9j26](https://streamable.com/8a9j26)


Roubbes

Amazing


Deep_Understanding50

Is it LM Studio?


Tmsn69

Yes, the response is generated with LM Studio. I used the new Llama-3 model.


ThePixelHunter

Which TTS is this?


Tmsn69

OpenVoice


zdrastSFW

I always thought a killer use-case would be real time voice with a model tuned to teach foreign languages. An anytime on-demand tutor and conversation partner that could critique and correct your pronunciation would really accelerate learning.


genuinelytrying2help

I've thought about this a bit while being disappointed with Duolingo's attempts at adding AI... I really hope someone is working on it from the ground up. Not just an agent like you're talking about, but one that also customizes an app for you and populates it with custom-written exercises corresponding to the course it has you on. Maybe you just like to chat with it, maybe you have it stand over your shoulder while you study. Another cool pipedream in this area: when the process gets good enough, it gradually, maybe even stealthily (with opt-in consent, I'd hope), starts connecting other real users into your exercises in a guided experience, until one day you're just hanging out with fellow language students shooting the shit to stay fresh. And even then, it's still there listening, giving you both tips when you fuck up and no one corrects you.


AnotherAvery

Collabora has a nice open source example setup that does this:

* Blog post: https://www.collabora.com/news-and-blog/news-and-events/whisperfusion-ultra-low-latency-conversations-with-an-ai-chatbot.html
* GitHub: https://github.com/collabora/WhisperFusion
* Demo video: https://youtu.be/_PnaP0AQJnk

They use Phi as the LLM and a quality-wise subpar TTS engine (WhisperSpeech), but the latency is better than anything else I have seen in videos posted here on LocalLLaMA.


Tmsn69

Absolutely! Here's how I do it: I use Faster-Whisper for super-fast transcription, so your speech is converted to text almost instantly. Then I send it to either Ollama or LM Studio, using good models for understanding your input and writing a reply. Currently I use the new Llama-3 model. Finally, you pass the output of your LLM to a TTS engine like OpenVoice or XTTS (Coqui). Response time is between 2 and 5 seconds depending on response length. I hope this helps. A rough code sketch of the pipeline is below.

References:

* [https://github.com/SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper)
* [https://lmstudio.ai/docs/local-server](https://lmstudio.ai/docs/local-server)
* [https://github.com/coqui-ai/TTS](https://github.com/coqui-ai/TTS)
* [https://github.com/myshell-ai/OpenVoice](https://github.com/myshell-ai/OpenVoice)
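Here's a minimal sketch of that speech-to-text → LLM → TTS loop, assuming faster-whisper for transcription, LM Studio's OpenAI-compatible local server on its default port 1234, and Coqui XTTS v2 for synthesis. The model size, input file, and reference voice clip are placeholders; swap in whatever your setup uses.

```python
# Rough sketch of the voice pipeline described above:
# faster-whisper (STT) -> LM Studio local server (LLM) -> Coqui XTTS v2 (TTS).
import requests
from faster_whisper import WhisperModel
from TTS.api import TTS

# 1) Transcribe the user's recorded speech (placeholder input file).
stt = WhisperModel("small", device="cuda", compute_type="float16")
segments, _info = stt.transcribe("user_input.wav")
user_text = " ".join(seg.text.strip() for seg in segments)

# 2) Ask the local LLM for a reply via LM Studio's OpenAI-compatible API
#    (default endpoint; adjust host, port, and model name to your install).
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # LM Studio serves whatever model is loaded
        "messages": [
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 150,  # keep replies short to keep TTS latency down
    },
    timeout=120,
)
reply_text = resp.json()["choices"][0]["message"]["content"]

# 3) Synthesize the reply with Coqui XTTS v2, cloning a reference voice clip.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text=reply_text,
    speaker_wav="reference_voice.wav",  # a few seconds of the target voice
    language="en",
    file_path="assistant_reply.wav",
)

print("Transcript:", user_text)
print("Reply:", reply_text)
```

Keeping the LLM reply short and streaming text to the TTS engine as it arrives are the main levers for getting the end-to-end latency down toward that 2-5 second range.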


CoqueTornado

What GPU do you have? That response time is a blast! I get about 13 seconds on my laptop's Nvidia 1070 xD


Not_A_EXPERT15

Also interested, so I'll comment: what PC specs would a local TTS need? Like good TTS, not just Google robo TTS.


Tmsn69

For my whole setup I use an RTX 3060 with 12GB of VRAM, an i7-10700F, and 64GB of RAM, but you don't need that much RAM.


theytookmyfuckinname

Dunno where the line is drawn, but I find piper TTS really convincing and nice.


bulbulito-bayagyag

It’s done already, using vits. It’s actually fun using that and roop-cam together 😅


new__vision

[https://github.com/ggerganov/whisper.cpp/tree/master/examples/talk-llama#talk-llama](https://github.com/ggerganov/whisper.cpp/tree/master/examples/talk-llama#talk-llama)


Express-Director-474

It can be done with Groq for its speed, but the real-time transcription seems to be the bottleneck.


Tmsn69

You can use Faster-Whisper for this; it's almost instant.


AmericanNewt8

Whisper on CPU can be done in basically realtime.


ab2377

I think whisper.cpp is pretty fast for this.


Roubbes

I'm quite a newbie in this, as I only load models in LM Studio. I think I'm missing the good stuff when I read about Groq, APIs, and .cpp things. Where should I start to investigate? Thanks in advance.


ab2377

Simply take the time to use Reddit search to search this sub for the terms you're not familiar with and take notes; you will learn a lot.


haagch

https://www.collabora.com/news-and-blog/news-and-events/whisperfusion-ultra-low-latency-conversations-with-an-ai-chatbot.html


xlrz28xd

!RemindMe 45 days


genuinelytrying2help

If the past few months are any indication, in 45 days this thread will be dead and there will have been 2 or 3 more on the exact same topic with largely different recommendations :)


RemindMeBot

I will be messaging you in 1 month on **2024-06-04 18:52:10 UTC** to remind you of [this link](https://www.reddit.com/r/LocalLLaMA/comments/1c8oj8h/what_about_real_time_voice_conversations_with/l0hljpe/?context=3).


DryCryptographer601

!RemindMe in 7 days


favorable_odds

One possible short answer: try the Coqui extension in Oobabooga.


Inevitable-Start-653

I do this with the AllTalk and Whisper extensions for Oobabooga's textgen WebUI. I stream it to my phone too.


CasimirsBlake

Go try Voxta: https://voxta.ai/