emacspeak - Re: [Emacspeak] TTS Server Implementation Questions

From: Rob Hill <robmichill AT gmail.com>
To: Tim Cross <theophilusx AT gmail.com>
Cc: Victor Tsaran <vtsaran AT gmail.com>, Parham Doustdar <parham90 AT gmail.com>, John Covici <covici AT ccs.covici.com>, Robert Melton <lists AT robertmelton.com>, Emacspeaks <emacspeak AT emacspeak.net>, "T.V Raman" <raman AT google.com>
Subject: Re: [Emacspeak] TTS Server Implementation Questions
Date: Wed, 10 Apr 2024 09:52:37 +1200

Speaking as someone who has relied on emacspeak for many and varied
tasks, both work and pleasure, though I'm not from the programming
world, I have never missed indexing. The ability to easily navigate
by chunks does the job, and also prevents me from falling asleep if
reading late at night. It's just a different way of doing the job.

Rob

"Tim Cross" (via emacspeak Mailing List) writes:
>
> I could be missing something, but as I see it, what voice indexing would
> provide is the ability to have a 'voice cursor' (which may or may not be
> the same as your emacs cursor) that tracks where the TTS engine is up to
> when generating speech from the submitted text.
>
> This would, for example, enable pausing and subsequently resuming
> speech, with the resumed speech starting from where it was previously
> paused. In some systems, this is very important because the system only
> sends large chunks of speech at a time. For example, I've seen a simple
> TTS interface for reading files where it will just start reading the
> file. You don't have the ability to ask for just a page, paragraph,
> sentence, line or word. You just ask for it to start speaking and then
> you can pause and resume speech. The other thing you may get is cursor
> tracking of speech. A cursor might move through the text as it is spoken
> so that when you pause speech, the cursor is at that point in your
> text. This can be useful for people who want to read along with the
> speech, i.e. the speech is an aid to visual reading.
>
> While I can see the potential benefits in having the ability to get and
> use speech index information, I've not found it very high on my wishlist
> for emacspeak. This is primarily because emacspeak provides very
> fine-grained control over the size of the chunks of speech I send at a
> time. Depending on what I'm doing, I'll read/browse the data using a
> movement/chunk size which suits my need. For example, if I have a large
> buffer of text I want to read, I'm unlikely to ask emacspeak to just
> read the whole buffer. Instead, I'm more likely to ask it to read by
> page, paragraph or perhaps sentence.
>
> With emacspeak, I find it is very much about moving around using the
> unit (letter, word, sentence, paragraph, page, buffer) best suited for
> what I'm doing. I find this provides an adequate balance between my use
> case and complexity/consistency across speech servers. This has also
> enabled me to experiment with different TTS engines. For example, many
> years ago, I wrote speech servers for the Cepstral TTS engines. These
> were commercial TTS engines that, at the time, had high quality
> voices. The additional complexity and overheads involved in a TTS
> interface model which supported voice indexing would likely have made
> this much harder to implement and discouraged the type of experimentation
> which is at the heart of emacspeak. Likewise, I wonder if we would have
> had the other TTS engines, some of which have come and gone, like the
> flite and festival servers or the server written in C, or the existing
> mac, swiftmac servers or the experimental windows, speech-dispatcher and
> JS servers that are out there currently in various stages of
> development.
>
> I personally don't see the benefits justifying the amount of effort
> required, given we already have the capability to work with varying
> chunks of speech. Yes, it would provide some convenience, but at a high
> cost which I feel is hard to justify. However, provided someone can
> implement something which does not require changes to the existing
> servers or their design, I would say go for it. A lot can be learnt from
> implementing a TTS server. In fact, I've learnt a lot from failed
> attempts to implement TTS servers, as there are a considerable number of
> subtle and non-obvious aspects to a TTS server which only become clear
> when you try implementing one, making it a great learning experience. At
> least it was for me.
>
>
> Victor Tsaran <vtsaran AT gmail.com> writes:
>
> > I guess, the question stands: what user-facing problem are we trying to
> > solve?
> >
> > On Tue, Apr 9, 2024 at 3:14 AM Parham Doustdar <emacspeak AT emacspeak.net>
> > wrote:
> >
> > That's true, Emacspeak doesn't currently "read" from the speech server
> > process as far as I've seen, it only "writes" to it.
> > Fixing that isn't impossible, but definitely time consuming.
> > The other concrete issue is that last time I checked, console screen
> > readers read all the text in one chunk. They don't use the
> > audio CSS (forgive me if I don't use the correct name here) that
> > Emacspeak has, which requires you to play audio icons,
> > speak text with different pitch, and pauses. All of this means that you
> > have to do extra heavy-lifting to really track the index,
> > because the index you get back from the TTS engine isn't simply a
> > position in the buffer -- it is just the position in the
> > current chunk of text it has recently received.
> > So that's why I'm curious if we really think it's worth it. It could
> > be, or not, I'm not opinionated, but I'm also realizing that in
> > our community, we don't really have a good mechanism to discuss and
> > decide on things like this.
> >
> > On Tue, Apr 9, 2024 at 8:35 AM Tim Cross <theophilusx AT gmail.com> wrote:
> >
> > > You are overlooking one critical component which explains why adding
> > > indexing support is a non-trivial exercise which would require a complete
> > > redesign of the existing TTS interface model.
> > >
> > > For indexing information to be of any use, it has to be fed back into
> > > the client and used by the client. For example, tell the client to
> > > update/move the cursor to the last position spoken.
> > >
> > > There is absolutely no support for this data to be fed back into the
> > > current system. The current TTS interface has data flowing in only one
> > > direction, from emacs to emacspeak, from emacspeak to the TTS server,
> > > and from the TTS server to the TTS synthesizer. There is no existing
> > > mechanism to feed information (i.e. index positions) back from the TTS
> > > engine to emacs. While getting this information from the TTS engine into
> > > the TTS server is probably reasonably easy, there is no existing channel
> > > to feed that information up into Emacspeak.
> >
> > Not only would it be necessary to define and implement a whole new model
> > to incorporate this feedback, in addition to also working with TTS
> > engines which do not provide indexing information, you would also likely
> > need to implement some sort of multi speech cursor tracking so that the
> > system can track cursor positions in different buffers.
> >
> > > The reason this sort of functionality seems easy in systems like speakup
> > > or speech-dispatcher is because those systems were designed with this
> > > functionality. It is incorporated into the base design and is part of the
> > > various communication protocols the design implements. Adding this
> > > functionality is not something which can just be 'tacked on'.
> >
> > The good news of course is that being open source, anyone can go ahead
> > and define a new interface model and add indexing capability. However,
> > it may be worth considering that it has taken 30 years of development to
> > get the current model to where it is at, so I think you can expect a
> > pretty steep climb initially!
> >
> > John Covici <covici AT ccs.covici.com> writes:
> >
> > > > It's a lot simpler -- indexing is supposed to simply arrange things so
> > > > that when reading a buffer, and you stop reading, the cursor will be
> > > > at or near the point where you stopped. Speakup has had this for a
> > > > long time and that is why I use it on Linux. But it's only good for
> > > > the virtual console. Now speech dispatcher has indexing built in, so
> > > if you connect to that and use one of the supported synthesizers,
> > > indexing works correctly and I don't see any performance hit. I think
> > > all the client has to do is connect to speech dispatcher, but check me
> > > on this.
> > >
> > > On Mon, 08 Apr 2024 08:25:27 -0400,
> > > Robert Melton wrote:
> > >>
> > >> Is indexing supposed to be like per reading block, or like one
> > >> global? Is the idea that you can be reading a buffer, go to
> > >> another buffer, read some of it, then come back and continue?
> > >> IE: Index per "reading block"?
> > >>
> > >> Assuming it is global for simplicity, it is still a heavy lift
> > >> for implementation on Mac and Windows.
> > >>
> > >> Those platforms do not natively report back as words are spoken,
> > >> though you can get this behavior at an "Utterance" level by
> > >> installing hooks and callbacks and tracking those. With that you
> > >> would additionally need to keep copies of the future utterances,
> > >> even if they were already queued with the TTS.
> > >>
> > >> Considered from the POV of index per reading block, you need to
> > >> find ways to identify each one and its position, index them, and
> > >> continue reading.
> > >>
> > >> Sounds neat, but at least for my servers, right now, the juice
> > >> isn't worth the squeeze. I am still trying to get basic stuff
> > >> like pitch multipliers working on Windows via wave mangling and
> > >> other basic features, hehe.
> > >>
> > >> > On Apr 8, 2024, at 05:20, Parham Doustdar <parham90 AT gmail.com> wrote:
> > >> >
> > >> > I understand. My question isn't whether it's possible though,
> > >> > or how difficult it would be, or the steps we'd have to take
> > >> > to implement it. My question is more about whether the use
> > >> > cases we have today make it worth it to reconsider. All other
> > >> > questions we can apply the wisdom of the community to solve,
> > >> > if we were convinced that the effort would be worth it.
> > >> > For me, the way I've got around this is to use the
> > >> > next/previous paragraph commands. The chunks are small enough
> > >> > that I can "zoom in" if I want, and yet large enough that I
> > >> > don't have to constantly hit next-line.
> > >> > Sent from my iPhone
> > >> >
> > >> >> On 8 Apr 2024, at 11:13, Tim Cross <theophilusx AT gmail.com> wrote:
> > >> >>
> > >> >> This is extremely unlikely to be implemented. It is non-trivial
> > >> >> and would require a significant re-design of the whole
> > >> >> interface and model of operation. It isn't as simple as just
> > >> >> getting index information from the TTS servers which support
> > >> >> it. That information then has to be fed backwards to Emacs
> > >> >> through some mechanism which currently does not exist and
> > >> >> would result in a far more complicated interface/model.
> > >> >>
> > >> >> As Raman said, the decision not to have this was not simply an
> > >> >> oversight or due to lack of time. It was a conscious design
> > >> >> decision. What you're asking for isn't simply an enhancement,
> > >> >> it is a complete redesign of the TTS interface model.
> > >> >>
> > >> >> "Parham Doustdar" (via emacspeak Mailing List) <emacspeak AT emacspeak.net> writes:
> > >> >>
> > >> >>> I agree. I'm not sure which TTS engines support it. Maybe,
> > >> >>> just like notification streams are supported in some servers,
> > >> >>> we can implement this feature for engines that support it?
> > >> >>> Sent from my iPhone
> > >> >>>
> > >> >>>>> On 8 Apr 2024, at 10:24, John Covici <emacspeak AT emacspeak.net> wrote:
> > >> >>>>
> > >> >>>> I know this might be controversial, but indexing would be very
> > >> >>>> useful to me. Sometimes I read long buffers, and when I stop
> > >> >>>> the reading, the cursor is still where I started, so there is
> > >> >>>> no real way to do this adequately -- I would not mind if it
> > >> >>>> were just down to the line, rather than individual words, but
> > >> >>>> it would make emacspeak lots nicer for me.
> > >> >>>>
> > >> >>>>> On Fri, 05 Apr 2024 15:39:15 -0400,
> > >> >>>>> "T.V Raman" (via emacspeak Mailing List) wrote:
> > >> >>>>>
> > >> >>>>> note that the other primary benefit of tts_sync_state as a
> > >> >>>>> single call is that it ensures atomicity, i.e. all of the
> > >> >>>>> state gets set at one shot from the perspective of the elisp
> > >> >>>>> layer, so you hopefully never get TTS that has its state
> > >> >>>>> partially set.
> > >> >>>>>
> > >> >>>>> Robert Melton writes:
> > >> >>>>>> On threading. It is all concurrent, lots of fun protecting
> > >> >>>>>> of the state.
> > >> >>>>>>
> > >> >>>>>> On language and voice, I was thinking of them as a tree,
> > >> >>>>>> language/voice, as this is how Windows and MacOS seem to
> > >> >>>>>> provide them.
> > >> >>>>>>
> > >> >>>>>> ----
> > >> >>>>>>
> > >> >>>>>> Oh, one last thing. Should TTS Server implementations be
> > >> >>>>>> returning a \n after a command is complete, or is just
> > >> >>>>>> returning nothing acceptable?
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>> On Apr 5, 2024, at 14:01, T.V Raman <raman AT google.com> wrote:
> > >> >>>>>>>
> > >> >>>>>>> And do spend some time thinking of atomicity and
> > >> >>>>>>> multithreaded systems, e.g. ask yourself the question "how
> > >> >>>>>>> many threads of execution are active at any given time";
> > >> >>>>>>> Hint: the answer isn't as simple as "just one because my
> > >> >>>>>>> server doesn't use threads".
> > >> >>>>>>>
> > >> >>>>>>>> Raman--
> > >> >>>>>>>>
> > >> >>>>>>>> Thanks so much, that clarifies a bunch. A few questions on
> > >> >>>>>>>> the language / voice support.
> > >> >>>>>>>>
> > >> >>>>>>>> Does the TTS server maintain an internal list and switch
> > >> >>>>>>>> through it, or does it send the list to the lisp in a way I
> > >> >>>>>>>> have missed?
> > >> >>>>>>>>
> > >> >>>>>>>> Would it be useful to have a similar feature for voices,
> > >> >>>>>>>> where first you pick the right language, then you pick a
> > >> >>>>>>>> preferred voice, then maybe it is stored in a defcustom and
> > >> >>>>>>>> sent next time as
> > >> >>>>>>>> (set_lang lang:voice t)
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>> On Apr 5, 2024, at 13:10, T.V Raman <raman AT google.com> wrote:
> > >> >>>>>>>>>
> > >> >>>>>>>>> If your TTS supports more than one language, the TTS API
> > >> >>>>>>>>> exposes these as a list; these calls loop through the list
> > >> >>>>>>>>> (dectalk, espeak, outloud)
> > >> >>>>>>>>
> > >> >>>>>>>> --
> > >> >>>>>>>> Robert "robertmeta" Melton
> > >> >>>>>>>> lists AT robertmelton.com
> > >> >>>>>>>>
> > >> >>>>>>>
> > >> >>>>>>
> > >> >>>>>> --
> > >> >>>>>> Robert "robertmeta" Melton
> > >> >>>>>> lists AT robertmelton.com
> > >> >>>>>
> > >> >>>>> --
> > >> >>>>> Emacspeak discussion list -- emacspeak AT emacspeak.net
> > >> >>>>> To unsubscribe send email to:
> > >> >>>>> emacspeak-request AT emacspeak.net with a subject of: unsubscribe
> > >> >>>>
> > >> >>>> --
> > >> >>>> Your life is like a penny. You're going to lose it. The
> > >> >>>> question is: How do you spend it?
> > >> >>>>
> > >> >>>> John Covici wb2una
> > >> >>>> covici AT ccs.covici.com
> > >> >>>
> > >>
> > >> --
> > >> Robert "robertmeta" Melton
> > >> lists AT robertmelton.com
> > >>
> > >>
> >
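
One point made in the thread is that an index reported by a TTS engine is only a position within the chunk of text it was last handed, not a position in the buffer, so the client has to keep its own bookkeeping to translate one into the other. The following is a minimal Python sketch of that bookkeeping, not anything that exists in Emacspeak or its speech servers. The names `ChunkIndexMap`, `register`, `buffer_position` and `split_paragraphs` are hypothetical, and the sketch assumes the engine can report a chunk identifier together with a character offset into that chunk.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkIndexMap:
    """Hypothetical client-side table: chunk id -> (buffer start, chunk length)."""

    chunks: dict = field(default_factory=dict)
    next_id: int = 0

    def register(self, buffer_start: int, text: str) -> int:
        """Record where a chunk begins in the buffer before sending it to the TTS."""
        chunk_id = self.next_id
        self.chunks[chunk_id] = (buffer_start, len(text))
        self.next_id += 1
        return chunk_id

    def buffer_position(self, chunk_id: int, offset_in_chunk: int) -> int:
        """Translate a chunk-relative index (assumed engine output) to a buffer offset."""
        start, length = self.chunks[chunk_id]
        # Clamp, since an engine might report an offset at or past the chunk end.
        return start + max(0, min(offset_in_chunk, length))


def split_paragraphs(buffer_text: str) -> list:
    """Return (buffer_offset, paragraph_text) pairs, splitting on blank lines."""
    result = []
    pos = 0
    for para in buffer_text.split("\n\n"):
        if para.strip():
            result.append((pos, para))
        pos += len(para) + 2  # account for the two-newline separator
    return result


if __name__ == "__main__":
    buffer_text = "First paragraph here.\n\nSecond one."
    index_map = ChunkIndexMap()
    ids = [index_map.register(start, text)
           for start, text in split_paragraphs(buffer_text)]
    # Pretend the engine reported offset 7 within the second chunk.
    pos = index_map.buffer_position(ids[1], 7)
    print(pos, repr(buffer_text[pos:]))
```

Even this toy version ignores audio icons, voice changes and pauses interleaved with the text, and says nothing about how the index would travel back from the speech server to Emacs, which is the missing channel described above.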
