Fundamentals for programming Amazon Echo Show

Marcel Naujeck

Senior Software Engineer, hmmh

The new Echo Show is the result of what Amazon has learned, and it is designed to overcome the problems and obstacles that all voice interfaces face when communicating information. Echo Show compensates for the limitations of a voice interface with a classic display. A no-interface device is thus transformed into a full-service touchpoint in the digital ecosystem, opening up hitherto undreamt-of possibilities. In this article we will examine how we can address this new component, the screen, via the Amazon endpoint. Reverse engineering is the key.

Echo Show does not look particularly exciting, and reminds you of something from a 1970s sci-fi movie. At the front the surface is sloped, with the display at the centre, a camera at the top and the speaker grille at the bottom. The device won’t look out of place in your living room. We had to use an address in the United States to buy our Echo Show, as it is not yet available in Germany. Amazon has yet to announce a release date for the German market.

Currently no developer guidelines

Information for developers on building skills for Echo Show is also lacking. The Amazon Developer Portal does describe how the JSON communication between the skill and the endpoint must look for the new functionality: parameters are described, templates are shown and callbacks are explained. However, all of this information exists solely in the form of communication protocols and not as guidelines. As traditional users of the Alexa Skills Kit framework for Java, we feel left out in the cold. A note in the latest framework version on GitHub tells us that Version 1.4 was prepared for Echo Show, but there are neither documents nor code examples available.

What display options does Echo Show offer?

We have an Echo Show, and we want to develop a skill for it. So, let’s take a deep breath and dive into the framework code that was committed in the last release. We must begin by asking ourselves what our actual expectations are. Echo Show can present information in a variety of ways. It must therefore be possible to describe a layout and send it as part of the response to the Alexa endpoint. If we take a look at the response object, we see that virtually nothing has changed since the last release. The only place where we could transmit dynamic data is the list of so-called directives. If we search a little in the framework, we will find the RenderTemplateDirective class, and this is where we can pass in a template.

We already have access to templates on the Amazon Developer pages (https://developer.amazon.com). At present there are six fixed templates: two for displaying content lists and four for displaying single content. The difference between the two content list templates is that one is intended for horizontal lists, while the other is for vertical lists. The four templates for single content differ in their display options as follows (see Fig. 1):

  • BodyTemplate1
      • Title (optional)
      • Skill icon (provided in the developer portal)
      • Rich or plain text
      • Background image (optional)
  • BodyTemplate2
      • Title (optional)
      • Skill icon (provided in the developer portal)
      • Image (optional, can be a rectangle or square)
      • Rich or plain text
      • Background image (optional)
  • BodyTemplate3
      • Title (optional)
      • Skill icon
      • Image (optional, can be a rectangle or square)
      • Rich or plain text
      • Background image (optional)
  • BodyTemplate4
      • No title
      • Skill icon
      • One full-screen image (1024 x 600 for background)
      • Rich or plain text

Fig. 1: Differences between templates for single content

If we want to display information on Echo Show, we must first know which template is suitable for which kind of information. A concept designer must plan precisely to ensure that the user experience meets expectations and is above all intuitive. Even the cleanest programming is of no use if the user cannot make sense of the displayed information. Qualitative user surveys are a useful means of obtaining initial indicators and feedback.

How do I create image and text content?

From a technical perspective, we simply instantiate one of the templates that we find under com.amazon.speech.speechlet.interfaces.display.template. For properties such as Title, Icon, Background, etc. there are getters and setters. For images there is an Image and an ImageInstance class. Images are transmitted in the form of URLs for the corresponding image source. Text content can be transferred as plain or rich text. In the latter case, there is the option of using markup tags for formatting, which we are also familiar with from HTML. For example, there are <br/>, <b>, <i>, <u> and <font size="n">. The text content itself is divided into several areas (primary, secondary and tertiary text). Once we have defined the images and text and entered them in the properties of the corresponding template, the next step is to pass this template to a RenderTemplateDirective instance. All we have to do now is add our new directive to the list of directives and set this list on the response object. When we now call the skill, we can see the newly created content.
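What the framework ultimately sends to the endpoint is the JSON directive described on the Developer Portal. The following is a minimal sketch of that payload for a BodyTemplate1 with rich text, built by hand so that it runs without the framework. The class RenderTemplateJson and its methods are hypothetical helpers and not part of the Alexa Skills Kit; the field names are written from our reading of the protocol description and should be checked against the Developer Portal before use.

```java
/** Hypothetical helper, not part of the Alexa Skills Kit. */
final class RenderTemplateJson {

    /** Wraps a string in quotes and escapes the characters this sketch expects. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Builds a render directive for BodyTemplate1 (field names are assumptions). */
    static String bodyTemplate1(String token, String title, String richText) {
        return "{"
            + "\"type\":\"Display.RenderTemplate\","
            + "\"template\":{"
            +   "\"type\":\"BodyTemplate1\","
            +   "\"token\":" + quote(token) + ","
            +   "\"title\":" + quote(title) + ","
            +   "\"textContent\":{\"primaryText\":{"
            +     "\"type\":\"RichText\","
            +     "\"text\":" + quote(richText)
            +   "}}"
            + "}}";
    }
}
```

A call such as RenderTemplateJson.bodyTemplate1("welcome", "Hello", "<b>Hi</b><br/>there") yields one directive object. In a real skill the framework's template classes and RenderTemplateDirective produce this structure for us, so the sketch is mainly useful for understanding and debugging what goes over the wire.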

How do I define content for the touchscreen?

The Echo Show display is a 7-inch touchscreen. This means that you can select elements of a content list or single content. Each single content template and each element of a list template has a token. From the point of view of the framework, this token is a property of type “String” and is used to identify the touched element for callback. If we look at the way in which we previously developed skills, we can see that we only received a callback for a recognised intent: in other words, only when the user said something. This is adequate for the voice interface, but not for the display. However, the SpeechletV2 interface exclusively supports voice callbacks.

If we take a closer look at the SpeechletRequestDispatcher in the framework, we see that the dispatcher can respond to a wide variety of requests. For example, we have AudioPlayerRequest, PlaybackController, SystemRequests, as well as DisplayRequests. When a DisplayRequest is recognised, the dispatcher attempts to call the onElementSelected method from the Display interface. To receive this callback, our Speechlet class needs to implement not only SpeechletV2 but also the Display interface. Once we have done this, we can override the following method:

 

public SpeechletResponse onElementSelected(SpeechletRequestEnvelope<ElementSelectedRequest> requestEnvelope)

Fig. 2: Overriding the callback

This callback method is called whenever an element is selected on the display. Inside the callback, we can read the token (in other words, the identifier of the element that was pressed) from the requestEnvelope with requestEnvelope.getRequest().getToken() and respond accordingly. We are completely free to choose the identifier.
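Because tokens are free-form strings, one way to keep the callback small is to route each token to its own handler. The following is a minimal sketch of that idea; TokenRouter, its methods and the fallback text are hypothetical and not part of the framework, and the handlers return plain strings here instead of a SpeechletResponse so that the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical router: maps element tokens to the text a skill would answer with. */
final class TokenRouter {
    private final Map<String, Supplier<String>> handlers = new HashMap<>();

    /** Registers the handler to run when the element with this token is touched. */
    void on(String token, Supplier<String> handler) {
        handlers.put(token, handler);
    }

    /** Returns the answer for the touched element, or a fallback for unknown tokens. */
    String route(String token) {
        Supplier<String> handler = handlers.get(token);
        return handler != null ? handler.get() : "Sorry, I cannot show that.";
    }
}
```

Inside onElementSelected, the value of requestEnvelope.getRequest().getToken() would be passed to route, and the result turned into speech and, if desired, a new template.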

The response to an ElementSelectedRequest is a normal SpeechletResponse. We can therefore return both speech and an additional display template. It is thus also possible to implement the Master/Detail views commonly used for mobile devices. It is precisely for these mechanisms that the Back button, which can be activated by default for every template, is intended. However, it is up to the developer to implement the functionality for a “go back”.
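One possible way to implement a “go back” is to remember the tokens of the templates that have been shown and to re-render the previous one on request. The sketch below is an assumption about how this could be done, not something the framework provides: ViewHistory is a hypothetical class, and in a real skill its contents would have to be kept somewhere that survives between requests, for example in the session.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-session history of the tokens of displayed templates. */
final class ViewHistory {
    private final Deque<String> shown = new ArrayDeque<>();

    /** Records the token of the template that was just rendered. */
    void show(String token) {
        shown.push(token);
    }

    /** Drops the current view and returns the previous token, or null if there is none. */
    String back() {
        if (!shown.isEmpty()) {
            shown.pop();
        }
        return shown.peek();
    }
}
```

After show("list") and show("detail-3"), a call to back() returns "list", which tells the skill to render the master view again. A second call returns null, at which point the skill could fall back to its start screen.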

Conclusion

At present, it is somewhat difficult for Java developers to get to grips with Echo Show. Searches on Google and Stack Overflow turn up neither examples nor documentation. Apart from the small amount of information provided directly by Amazon, there isn’t much else available. If you don’t want to spend your time analysing the framework, you will have to wait until the developer community or Amazon provides more information. That said, with the expanded framework, skill development for Echo Show is impressive and well thought out.

There are only a few minor negative points. The pre-loading of images in content lists doesn’t work very well. It is not good to see images of list elements appear gradually. In this case, you must consider the access time for content servers when designing skills or hope that Amazon improves the corresponding mechanisms. It remains to be seen what kind of enhancements Amazon will come up with.

It will be interesting to see what developers can do with the combination of voice interface and touchscreen. Close cooperation between design, creation and development is crucial. In summary, Amazon Echo Show will without doubt bring about big changes in the market.

 
