How can we improve Alexa Skill development services and tools?

Provide an endpoint to test the voice model

Currently, there is no way to do accurate, automated testing of how user audio resolves (in terms of intents and slots) for a given skill.

At present, this testing is done more or less manually. There are several use cases where automation would be vitally important:

- When testing a large skill, it is extremely cumbersome to manually test a wide array of sample utterances for every change we make to the voice model. This discourages iterative design.
- The overall Alexa voice model is constantly shifting. This is a good thing, but it often changes how our skills' launch utterances, one-shots, etc. are resolved, and right now we have no way to do this testing on a regular, automated cadence.
- Testing a skill across a variety of voice types (varying gender, age, accent) is not something we see done today, because the effort of managing something like that is considerable.

On a point unrelated to test automation, it would also be beneficial if users who are struggling with misfiring intents could send us an audio clip of their voice, which we could then submit to this endpoint to see what's happening. Trying to reproduce a user's accent/pitch/prosody as a developer is (often comical, but) not particularly helpful.

All of these cases could be resolved by:
1. An endpoint within SMAPI or some other similar REST service, which
2. Accepts as a POST request an audio file in a specified format,
3. Along with whatever request metadata might be relevant, as well as
4. A parameter in the path or body that specifies the ID of the skill under test, and
5. Returns a JSON object with the exact Alexa request that would have been made to the specified skill (a rough sketch of such a call follows below).
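
To illustrate the shape of what we're asking for, here is a minimal Python sketch. The endpoint URL, payload fields, and auth scheme are all assumptions made up for illustration (no such SMAPI endpoint exists today); only the overall request/response flow matters:

```python
# Hypothetical sketch only -- the URL, payload fields, and auth header below
# are assumptions, not an existing SMAPI API.
import requests

SKILL_ID = "amzn1.ask.skill.00000000-0000-0000-0000-000000000000"  # skill under test
ENDPOINT = f"https://api.amazonalexa.com/v1/skills/{SKILL_ID}/voiceModelTests"  # assumed path


def resolve_audio(audio_path: str, access_token: str,
                  locale: str = "en-US", is_new_session: bool = True) -> dict:
    """POST an audio clip and return the Alexa request JSON that the voice
    model would have produced for the specified skill."""
    with open(audio_path, "rb") as audio:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {access_token}"},
            files={"audio": ("utterance.wav", audio, "audio/wav")},
            # "locale" stands in for whatever request metadata is relevant;
            # "isNewSession" is the flag proposed in the next paragraph for
            # emulating one-shots versus open-session requests.
            data={"locale": locale, "isNewSession": str(is_new_session).lower()},
            timeout=30,
        )
    response.raise_for_status()
    return response.json()
```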

Ideally this API could also have a mechanism for dealing with things like sessions and dialog management (and how the intent resolution may change in that state), but the 90% case could be easily covered by a simple "isNewSession" flag that lets us emulate both one-shots and open session requests.
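
Building on the sketch above (and still purely hypothetical), an automated regression test against such an endpoint might look like the following; the sample file, intent name, and slot values are placeholders, and the response is assumed to be shaped like a standard Alexa request envelope:

```python
def test_one_shot_resolves_to_order_intent(access_token: str) -> None:
    # Emulate a one-shot: the launch phrase and the request in a single utterance.
    envelope = resolve_audio("samples/order_pizza.wav", access_token,
                             is_new_session=True)

    # The returned envelope should mirror what the skill endpoint would receive.
    assert envelope["session"]["new"] is True
    request = envelope["request"]
    assert request["type"] == "IntentRequest"
    assert request["intent"]["name"] == "OrderPizzaIntent"
    assert request["intent"]["slots"]["topping"]["value"] == "pepperoni"
```

A suite of tests like this could run on every change to the voice model, or on a schedule to catch shifts in the overall Alexa model.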

It's worth noting that we've actually tried to solve this problem on our own - with middling success - via AVS. The functionality for this to be properly addressed by third parties simply does not exist, unfortunately. A description of our feeble attempt can be found here: http://www.3po-labs.com/blog/a-proof-of-concept-and-its-roadmap

26 votes

Eric Olson shared this idea

4 comments

  • [Deleted User] commented

    I am new to Alexa development. I am not sure if this exists, but I think it would be nice to have an API call that sends a text command to Alexa and gets the response back - the programmatic equivalent of entering a command into the utterance box of the Test page. The purpose of this feature would be to provide an easy way to get the JSON responses used for testing the backend/endpoint.

  • Anonymous commented

    Please let me have access to the whole user utterance, since I have my own algorithms for generating proper answers. I think the proposed method that requires me to define a language model is unfeasible. In my scenario, I work with food names. I can't model every single food name or potentially new names. The model builder asks me to use one of the existing slots (AMAZON.Food is actually available for the en_US language only, while I work with all English variants) or to define a new custom slot. Is this a serious and professional thing? How can I possibly define millions of food names? That's why I'm asking: let developers have access to the raw user utterance instead of trying to model each and every single intent grammar.

  • Mark Tucker commented

    This API call would return the following for the audio request: 1) the intent name and slot values, and 2) the full speech-to-text transcription.
