
Re: [Emacspeak] TTS Server Implementation Questions

To: emacspeak@xxxxxxxxxxxxx
Subject: Re: [Emacspeak] TTS Server Implementation Questions
From: Devin Prater <r.d.t.prater@xxxxxxxxx>
Date: Tue, 9 Apr 2024 15:02:29 -0500
In-reply-to: <CABUWEtesZ6UXrOYY6CkjwuhGaJ7mwFxsevRkwgGS=O9SXxWbqw@mail.gmail.com>

Whenever I read books, with nov.el, info manuals, all that, I move by page using SPC. It works well for me, and I've successfully read a book with it. Well, I technically used a game controller with a button mapped to SPC but I did it. <smile>

If I need to pause, I take note of the word I stopped at, hit C-e s, then C-s, type until I reach that line, then press RET.


On 4/9/2024 2:56 PM, Victor Tsaran (via emacspeak Mailing List) wrote:

Thanks John.
Makes sense!

Incremental search usually saves me in such situations! For some reason, I just have not seen it as an issue. But thanks for an interesting take!

On Tue, Apr 9, 2024 at 12:48 PM Tim Cross <theophilusx@xxxxxxxxx> wrote:



    I could be missing something, but as I see it, what voice indexing would
    provide is the ability to have a 'voice cursor' (which may or may
    not be the same as your emacs cursor) that tracks the location the TTS
    engine is up to when generating speech from the submitted text.

    This would, for example, enable pausing and then subsequently
    resuming speech, whereby the resumed speech would start from where the
    speech was previously paused. In some systems, this is very important
    because the system only sends large chunks of speech at a time. For
    example, I've seen a simple TTS interface for reading files where it
    will just start reading the file. You don't have the ability to ask for
    just a page, paragraph, sentence, line or word. You just ask for it to
    start speaking and then you can pause and resume speech. The other
    thing you may get is cursor tracking of speech. A cursor might move
    through the text as it is spoken so that when you pause speech, the
    cursor is at that point in your text. This can be useful for people who
    want to read along with the speech, i.e. the speech is an aid to visual
    reading.
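A minimal sketch of the 'voice cursor' idea described above, in Python. It assumes a hypothetical engine that reports each word as it finishes speaking it; `VoiceCursor`, `on_word_spoken` and `word_spans` are illustrative names only and are not part of emacspeak or of any real TTS API.

```python
from dataclasses import dataclass


@dataclass
class VoiceCursor:
    """Tracks how far a (hypothetical) TTS engine has spoken into a text."""
    text: str
    position: int = 0  # character offset just past the last spoken word

    def on_word_spoken(self, start: int, length: int) -> None:
        # Assumed callback: the engine reports the span of each spoken word.
        self.position = max(self.position, start + length)

    def remaining(self) -> str:
        # Text that would be resubmitted when speech is resumed.
        return self.text[self.position:].lstrip()


def word_spans(text):
    """Yield (start, length) for each whitespace-separated word in text."""
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        yield start, len(word)
        pos = start + len(word)


cursor = VoiceCursor("The quick brown fox jumps")
spans = list(word_spans(cursor.text))
# Simulate the engine speaking the first two words before a pause.
for start, length in spans[:2]:
    cursor.on_word_spoken(start, length)
```

After the simulated pause, `cursor.position` is the offset a client could move its own cursor to, and `cursor.remaining()` is what it would send to resume.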

    While I can see the potential benefits in having the ability to get and
    use speech index information, I've not found it very high on my wishlist
    for emacspeak. This is primarily because emacspeak provides very fine
    grained control over the size of the chunks of speech I send at a
    time. Depending on what I'm doing, I'll read/browse the data using a
    movement/chunk size which suits my need. For example, if I have a large
    buffer of text I want to read, I'm unlikely to ask emacspeak to just
    read the whole buffer. Instead, I'm more likely to ask it to read by
    page, paragraph or perhaps sentence.

    With emacspeak, I find it is very much about moving around using the
    unit (letter, word, sentence, paragraph, page, buffer) best suited for
    what I'm doing. I find this provides an adequate balance between my use
    case and complexity/consistency across speech servers. This has also
    enabled me to experiment with different TTS engines. For example, many
    years ago, I wrote speech servers for the Cepstral TTS engines. These
    were commercial TTS engines that, at the time, had high quality
    voices. The additional complexity and overheads involved in a TTS
    interface model which supported voice indexing would likely have made
    this much harder to implement and discouraged the type of
    experimentation which is at the heart of emacspeak. Likewise, I wonder
    if we would have
    had the other TTS engines, some of which have come and gone, like the
    flite and festival servers or the server written in C, or the existing
    mac, swiftmac servers or the experimental windows, speech-dispatcher and
    JS servers that are out there currently in various stages of
    development.

    I personally don't see the benefits justifying the amount of required
    effort, given we already have the capability to work with varying
    chunks of speech. Yes, it would provide some convenience, but at a high
    cost which I feel is hard to justify. However, provided someone can
    implement something which does not require changes to the existing
    servers or their design, I would say go for it. A lot can be learnt from
    implementing a TTS server. In fact, I've learnt a lot from failed
    attempts to implement TTS servers as there is a considerable amount of
    subtle and non-obvious aspects to a TTS server which only become clear
    when you try implementing one, making it a great learning experience. At
    least it was for me.


    Victor Tsaran <vtsaran@xxxxxxxxx> writes:

     > I guess, the question stands: what user-facing problem are we trying to solve?
     >
     > On Tue, Apr 9, 2024 at 3:14 AM Parham Doustdar <emacspeak@xxxxxxxxxxxxx> wrote:
     >
     >  That's true, Emacspeak doesn't currently "read" from the speech server process as far as I've seen, it only "writes" to it.
     >  Fixing that isn't impossible, but definitely time consuming.
     >  The other concrete issue is that last time I checked, console screen readers read all the text in one chunk. They don't use the
     >  audio CSS (forgive me if I don't use the correct name here) that Emacspeak has, which requires you to play audio icons,
     >  speak text with different pitch, and pauses. All of this means that you have to do extra heavy-lifting to really track the index,
     >  because the index you get back from the TTS engine isn't simply a position in the buffer -- it is just the position in the
     >  current chunk of text it has recently received.
     >  So that's why I'm curious if we really think it's worth it. It could be, or not, I'm not opinionated, but I'm also realizing that in
     >  our community, we don't really have a good mechanism to discuss and decide on things like this.
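To make the point about chunk-relative indexes concrete, here is a rough Python sketch of the extra bookkeeping a client would need. It assumes a hypothetical engine that reports a `(chunk_id, offset_in_chunk)` pair; `Chunk`, `split_into_chunks` and `buffer_position` are made-up names for illustration and do not describe how emacspeak or any speech server actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    buffer_start: int  # offset of the chunk's first character in the buffer
    text: str


def split_into_chunks(buffer_text, boundaries):
    """Split buffer_text at the given offsets (e.g. where voice or pitch changes)."""
    edges = [0, *sorted(boundaries), len(buffer_text)]
    return [
        Chunk(i, start, buffer_text[start:end])
        for i, (start, end) in enumerate(zip(edges, edges[1:]))
        if end > start  # skip empty chunks
    ]


def buffer_position(chunks, chunk_id, offset_in_chunk):
    """Translate an index reported relative to one chunk into a buffer offset."""
    by_id = {c.chunk_id: c for c in chunks}
    chunk = by_id[chunk_id]
    clamped = min(max(offset_in_chunk, 0), len(chunk.text))
    return chunk.buffer_start + clamped


text = "Heading. Body text in another voice."
chunks = split_into_chunks(text, [9])  # the voice changes after "Heading. "
# Hypothetical report from the engine: 5 characters into the second chunk.
position = buffer_position(chunks, 1, 5)
```

The sketch only covers plain text chunks. Audio icons and pauses queued between chunks would need their own entries in whatever table the client keeps, which is part of the heavy lifting mentioned above.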
     >
     >  On Tue, Apr 9, 2024 at 8:35 AM Tim Cross <theophilusx@xxxxxxxxx> wrote:
     >
     >  You are overlooking one critical component which explains why adding
     >  indexing support is a non-trivial exercise which would require a complete
     >  redesign of the existing TTS interface model.
     >
     >  For indexing information to be of any use, it has to be fed back into the
     >  client and used by the client. For example, tell the client to
     >  update/move the cursor to the last position spoken.
     >
     >  There is absolutely no support for this data to be fed back into the
     >  current system. The current TTS interface has data flowing in only one
     >  direction, from emacs to emacspeak and from emacspeak to the TTS server
     >  and from the TTS server to the TTS synthesizer. There is no existing
     >  mechanism to feed information (i.e. index positions) back from the TTS
     >  engine to emacs. While getting this information from the TTS engine into
     >  the TTS server is probably reasonably easy, there is no existing channel
     >  to feed that information up into Emacspeak.
     >
     >  Not only would it be necessary to define and implement a whole new model
     >  to incorporate this feedback, in addition to also working with TTS
     >  engines which do not provide indexing information, you would also likely
     >  need to implement some sort of multi speech cursor tracking so that the
     >  system can track cursor positions in different buffers.
     >
     >  The reason this sort of functionality seems easy in systems like speakup
     >  or speech-dispatcher is because those systems were designed with this
     >  functionality. It is incorporated into the base design and is part of the
     >  various communication protocols the design implements. Adding this
     >  functionality is not something which can just be 'tacked on'.
     >
     >  The good news of course is that being open source, anyone can go ahead
     >  and define a new interface model and add indexing capability. However,
     >  it may be worth considering that it has taken 30 years of development to
     >  get the current model to where it is at, so I think you can expect a
     >  pretty steep climb initially!
     >
     >  John Covici <covici@xxxxxxxxxxxxxx> writes:
     >
     >  > It's a lot simpler -- indexing is supposed to simply arrange things so
     >  > that when reading a buffer, and you stop reading, the cursor will be
     >  > at or near the point where you stopped.  Speakup has had this for a
     >  > long time and that is why I use it on Linux.  But it's only good for
     >  > the virtual console.  Now speech dispatcher has indexing built in, so
     >  > if you connect to that and use one of the supported synthesizers,
     >  > indexing works correctly and I don't see any performance hit. I think
     >  > all the client has to do is connect to speech dispatcher, but check me
     >  > on this.
     >  >
     >  > On Mon, 08 Apr 2024 08:25:27 -0400,
     >  > Robert Melton wrote:
     >  >>
     >  >> Is indexing supposed to be like per reading block, or like one global?  Is the idea
     >  >> that you can be reading a buffer, go to another buffer, read some of it, then come
     >  >> back and continue? IE: Index per "reading block"?
     >  >>
     >  >> Assuming it is global for simplicity, it is still a heavy lift for implementation on
     >  >> Mac and Windows.
     >  >>
     >  >> They do not natively report back as words are spoken. Now,
     >  >> you can get this behavior at an "Utterance" level, by installing hooks and callbacks
     >  >> and tracking those. With that you would need to additionally keep copies of the future
     >  >> utterances, even if they were already queued with the TTS.
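A rough Python sketch of the utterance-level bookkeeping described above. It assumes a hypothetical engine that only fires a callback when a whole utterance finishes; `UtteranceTracker` and its methods are illustrative names and are not taken from the Mac or Windows speech APIs or from any existing server.

```python
from collections import deque


class UtteranceTracker:
    """Keeps copies of utterances that were queued but not yet fully spoken."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, utterance):
        # Keep our own copy; the text is also handed to the engine's queue.
        self._pending.append(utterance)

    def on_utterance_finished(self):
        # Assumed engine callback: the oldest queued utterance has been spoken.
        if self._pending:
            return self._pending.popleft()
        return None

    def stop(self):
        """Speech was interrupted: return what was never finished, then clear."""
        unspoken = list(self._pending)
        self._pending.clear()
        return unspoken


tracker = UtteranceTracker()
for sentence in ["First sentence.", "Second sentence.", "Third sentence."]:
    tracker.enqueue(sentence)

finished = tracker.on_utterance_finished()  # engine finished the first one
unspoken = tracker.stop()                   # user stops during the second
```

With this granularity the server only knows that speech stopped somewhere inside the first entry of `unspoken`, so resuming would restart that utterance from its beginning.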
     >  >>
     >  >> Considered from the POV of index per reading block, then you need to find ways to identify
     >  >> each one and its position and index them and continue reading.
     >  >>
     >  >> Sounds neat, but at least for my servers, right now, the juice isn't worth the squeeze, I
     >  >> am still trying to get basic stuff like pitch multipliers working on windows via wave
     >  >> mangling and other basic features, hehe.
     >  >>
     >  >> > On Apr 8, 2024, at 05:20, Parham Doustdar <parham90@xxxxxxxxx> wrote:
     >  >> >
     >  >> > I understand. My question isn't whether it's possible though, or how difficult it
     >  >> > would be, or the steps we'd have to take to implement it.
     >  >> > My question is more about whether the use cases we have today make it worth it to
     >  >> > reconsider. All other questions we can apply the wisdom of the community to solve, if
     >  >> > we were convinced that the effort would be worth it.
     >  >> > For me, the way I've got around this is to use the next/previous paragraph
     >  >> > commands. The chunks are small enough that I can "zoom in" if I want, and yet
     >  >> > large enough that I don't have to constantly hit next-line.
     >  >> > Sent from my iPhone
     >  >> >
     >  >> >> On 8 Apr 2024, at 11:13, Tim Cross <theophilusx@xxxxxxxxx> wrote:
     >  >> >>
     >  >> >>
     >  >> >> This is extremely unlikely to be implemented. It is non-trivial and
     >  >> >> would require a significant re-design of the whole interface and model
     >  >> >> of operation. It isn't as simple as just getting index information from
     >  >> >> the TTS servers which support it. That information has to then be fed
     >  >> >> backwards to Emacs through some mechanism which currently does not
     >  >> >> exist and would result in a far more complicated interface/model.
     >  >> >>
     >  >> >> As Raman said, the decision not to have this was not simply an oversight
     >  >> >> or due to lack of time. It was a conscious design decision. What you're
     >  >> >> asking for isn't simply an enhancement, it is a complete redesign of the
     >  >> >> TTS interface model.
     >  >> >>
     >  >> >> "Parham Doustdar" (via emacspeak Mailing List) <emacspeak@xxxxxxxxxxxxx> writes:
     >  >> >>
     >  >> >>> I agree. I'm not sure which TTS engines support it. Maybe, just like notification streams
     >  >> >>> are supported in some servers, we can implement this feature for engines that support it?
     >  >> >>> Sent from my iPhone
     >  >> >>>
     >  >> >>>>> On 8 Apr 2024, at 10:24, John Covici <emacspeak@xxxxxxxxxxxxx> wrote:
     >  >> >>>>
     >  >> >>>> I know this might be controversial, but indexing would be very useful
     >  >> >>>> to me. Sometimes I read long buffers and when I stop the reading, the
     >  >> >>>> cursor is still where I started, so there is no real way to do this adequately
     >  >> >>>> -- I would not mind if it were just down to the line, rather than
     >  >> >>>> individual words, but it would make emacspeak lots nicer for me.
     >  >> >>>>
     >  >> >>>>> On Fri, 05 Apr 2024 15:39:15 -0400,
     >  >> >>>>> "T.V Raman" (via emacspeak Mailing List) wrote:
     >  >> >>>>>
     >  >> >>>>> note that the other primary benefit of tts_sync_state
     >  >> >>>>> as a single call is that it ensures atomicity, i.e. all of the state
     >  >> >>>>> gets set at one shot from the perspective of the elisp layer, so you
     >  >> >>>>> hopefully never get TTS that has its state partially set.
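One way to picture the atomicity argument, as a hedged Python sketch rather than a description of any existing server: the whole state is built first and then swapped in with a single assignment under a lock, so a concurrent reader sees either the old state or the new one. The `SpeechState` fields and the `StateHolder` class are assumptions made for illustration and are not the actual argument list of `tts_sync_state`.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechState:
    # Illustrative fields only; a real server's state will differ.
    punctuation: str = "some"
    rate: int = 100
    split_caps: bool = False


class StateHolder:
    def __init__(self):
        self._lock = threading.Lock()
        self._state = SpeechState()

    def sync_state(self, punctuation, rate, split_caps):
        # Build the complete new state first, then publish it in one step.
        new_state = SpeechState(punctuation, rate, split_caps)
        with self._lock:
            self._state = new_state

    def snapshot(self):
        # Readers (e.g. a speaking thread) always get a consistent state.
        with self._lock:
            return self._state


holder = StateHolder()
before = holder.snapshot()
holder.sync_state("all", 225, True)
after = holder.snapshot()
```

With three separate setter commands, a speaking thread could pick up the new rate together with the old punctuation mode. The single call avoids that window.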
     >  >> >>>>>
     >  >> >>>>> Robert Melton writes:
     >  >> >>>>>> On threading. It is all concurrent, lots of fun protecting of the state.
     >  >> >>>>>>
     >  >> >>>>>> On language and voice, I was thinking of them as a tree, language/voice,
     >  >> >>>>>> as this is how Windows and MacOS seem to provide them.
     >  >> >>>>>>
     >  >> >>>>>> ----
     >  >> >>>>>>
     >  >> >>>>>> Oh, one last thing. Should TTS Server implementations be returning a \n
     >  >> >>>>>> after a command is complete, or is just returning nothing acceptable?
     >  >> >>>>>>
     >  >> >>>>>>
     >  >> >>>>>>> On Apr 5, 2024, at 14:01, T.V Raman <raman@xxxxxxxxxx> wrote:
     >  >> >>>>>>>
     >  >> >>>>>>> And do spend some time thinking of atomicity and multithreaded systems,
     >  >> >>>>>>> e.g. ask yourself the question "how many threads of execution are active
     >  >> >>>>>>> at any given time"; Hint: the answer isn't as simple as "just one
     >  >> >>>>>>> because my server doesn't use threads".
     >  >> >>>>>>>>
     >  >> >>>>>>>> Raman--
     >  >> >>>>>>>>
     >  >> >>>>>>>> Thanks so much, that clarifies a bunch. A few questions on the
     >  >> >>>>>>>> language / voice support.
     >  >> >>>>>>>>
     >  >> >>>>>>>> Does the TTS server maintain an internal list and switch through
     >  >> >>>>>>>> it or does it send the list to the lisp in a way I have missed?
     >  >> >>>>>>>>
     >  >> >>>>>>>> Would it be useful to have a similar feature for voices, where
     >  >> >>>>>>>> first you pick the right language, then you pick the preferred voice,
     >  >> >>>>>>>> then maybe it is stored in a defcustom and sent next time as
     >  >> >>>>>>>> (set_lang lang:voice t)
     >  >> >>>>>>>>
     >  >> >>>>>>>>
     >  >> >>>>>>>>> On Apr 5, 2024, at 13:10, T.V Raman <raman@xxxxxxxxxx> wrote:
     >  >> >>>>>>>>>
     >  >> >>>>>>>>> If your TTS supports more than one language, the TTS API exposes these
     >  >> >>>>>>>>> as a list; these calls loop through the list (dectalk, espeak, outloud)
     >  >> >>>>>>>>
     >  >> >>>>>>>> --
     >  >> >>>>>>>> Robert "robertmeta" Melton
     >  >> >>>>>>>> lists@xxxxxxxxxxxxxxxx
     >  >> >>>>>>>>
     >  >> >>>>>>>
     >  >> >>>>>>
     >  >> >>>>>> --
     >  >> >>>>>> Robert "robertmeta" Melton
     >  >> >>>>>> lists@xxxxxxxxxxxxxxxx <mailto:lists@xxxxxxxxxxxxxxxx>
     >  >> >>>>>
     >  >> >>>>> --
     >  >> >>>>> Emacspeak discussion list -- emacspeak@xxxxxxxxxxxxx
     >  >> >>>>> To unsubscribe send email to:
     >  >> >>>>> emacspeak-request@xxxxxxxxxxxxx with a subject of: unsubscribe
     >  >> >>>>
     >  >> >>>> --
     >  >> >>>> Your life is like a penny. You're going to lose it. The question is:
     >  >> >>>> How do
     >  >> >>>> you spend it?
     >  >> >>>>
     >  >> >>>>       John Covici wb2una
     >  >> >>>> covici@xxxxxxxxxxxxxx
     >  >> >>>
     >  >>
     >  >> --
     >  >> Robert "robertmeta" Melton
     >  >> lists@xxxxxxxxxxxxxxxx
     >  >>
     >  >>
     >



--

--- --- --- ---
Find my music on

Youtube: http://www.youtube.com/c/victortsaran
Spotify: https://open.spotify.com/artist/605ZF2JPei9KqgbXBqYA16
Band Camp: http://victortsaran.bandcamp.com





--
Devin Prater

