In the second quarter of the MSTI program at the Global Innovation Exchange, University of Washington, we took a course called the History and Future of Technology. The course aims to give us an understanding and appreciation of past trends in technology, so that we can understand the trajectories of technology and how they change over time. This paper was my final assignment. In it, I explore the development path of Amazon Echo/Alexa and its impact on society.
In 2015, the Seattle-based company Amazon released Amazon Echo, a voice command device that connects to the company’s voice-controlled virtual assistant service, Alexa. This research paper introduces the development path of Amazon Echo/Alexa: how it was conceived and what critical decisions were made. Because voice-command products belong to a relatively new category, the optimal research and development approach has not been well defined. Amazon launched the Alexa Prize to challenge university talent with specific topics, and the competition offers a glimpse into the crucial technical challenges of a conversational AI product. This paper also covers the ecosystem of the voice-command service industry, its impact on society, and the debates and controversies around such services.
Keywords: Amazon Alexa, conversational AI, natural language processing, machine learning, virtual assistant
“Alexa, who are you?”
“I’m Alexa, and I’m designed around your voice. You can ask me to play music, answer questions, tell jokes and much more.”
When you ask an Amazon Echo who it is, the sentence above might be the answer you get from the cylindrical device. Since the first-generation Amazon Echo was released in 2015, the Echo product line has been a big hit in the consumer market. In the 2017 holiday season alone, Amazon reported “tens of millions of Alexa-enabled devices* sold worldwide”.
As computer scientists Julia Hirschberg and Christopher Manning put it, a vast increase in computing power, the availability of very large amounts of linguistic data, the development of highly successful machine learning methods, and a much richer understanding of the structure of human language and its deployment in social contexts have together enabled voice assistants to respond to voice commands meaningfully and quickly. Conversational AI products from Amazon, Google, Apple, and Microsoft are competing for the chance to converse with customers, and home appliance makers are tuning their products to bear a “Compatible with Alexa / Google Home” label. While this business sector is booming, concerns over security and privacy have arisen. For instance, calendar contents may be divulged unexpectedly, or unauthorized purchases may be placed. Meanwhile, how can we keep smart speakers from becoming probes for the tech giants? While security and privacy are the most discussed issues, I would also like to raise awareness of a hidden social problem, cultural bias: the disadvantage faced by minorities or specific demographic groups when technology products are designed and operated by dominant, mainstream cultural groups.
This research paper introduces the making of the Alexa service and the Echo device, and tells the hidden story behind the decisions that shaped the cylindrical speaker. The Alexa Prize for university teams has been a hot topic in the natural language processing (NLP) community, and the paper examines how Amazon successfully organized and benefited from the competition. Beyond Amazon Alexa itself, this paper also observes and analyzes the impact of conversational AI products on society along two dimensions: social concerns and application possibilities.
2. The Making of Alexa
2.1. The Conception and Key Decisions
It may be surprising to learn that the hugely successful Amazon Echo came into being as an offshoot of a halted project. In 2004, Amazon founded Amazon Lab126 in Sunnyvale, California. The number 126 in the name was derived from the arrow in Amazon’s logo, which points from A to Z, the first and last letters of the English alphabet. The lab was founded by Gregg Zehr, who was previously the Vice President of Hardware Engineering at Palm and Vice President of PowerBook Engineering at Apple Computer. Lab126 has since been charged with developing hardware products for Amazon.
After three years of research and development, Lab126 delivered its first product in 2007: the Amazon Kindle e-reader, which has been a great success and changed the publishing landscape. In the following years, Lab126 released the Kindle Fire tablet, Amazon Fire TV, and the Fire Phone. Inside Lab126, each project has an anonymous code name, projects are run separately, and members of one project have no communication with members of other projects. The Kindle project was code-named Project A, while the Fire Phone was code-named Project B. In the meantime, there was a Project C. According to patents filed by the Amazon-related entity Rawles LLC, Project C was a technologically aggressive plan. “One of the initial patent applications described a device that would display augmented-reality images that people could interact with; another proposed tracking people’s movements and responding when they clapped, whistled, sang, or spoke. Taken together, Amazon’s patents during this period point toward a vision of a home where virtual displays follow people around as they wander from room to room, offering a range of services in response to voice commands and physical gestures.” However, Project B, the Fire Phone, failed, which led the leadership at Lab126 to question whether they could successfully launch the far more complex Project C. In the end, an offshoot of Project C emerged: Project D.
In Project D, all the visual and kinetic parts of Project C were removed, and voice became the sole interaction channel for the user. At that time, however, there was no precise product definition for the device. Jeff Bezos expected it to integrate with the shopping experience. There were debates over the design of the product before it shipped to customers. One disagreement was about how to hear the user’s commands clearly while the device was emitting sound at the same time. The engineers recommended building small hockey-puck-shaped devices to place around the room, which would help pick up commands when the user strayed out of range of the main speaker’s microphones. This idea was rejected by the leadership, who believed there should be no additional devices to enhance performance. On another occasion, some engineers insisted there should be a remote that the user could speak into from anywhere in the house. The leadership opposed this idea as well: they believed voice should be the only form of input, with no physical interaction between the user and the device. To settle the dispute, they decided to let the users decide. The first batch of speakers shipped with a remote, and usage analytics were sent back to Amazon’s servers. The data indicated that the remotes were almost never used, so no additional gadgets have been bundled with the main device since then, and voice remains the only interaction channel.
Unlike its counterparts Apple Siri, Google Home, and Microsoft Cortana, Amazon Alexa is a screen-less service. It is a more significant departure from the current computing platforms (iOS, Android, and Windows). The departure is so complete that Alexa has become the new computing platform after mobile devices.
In late 2014, an engineer rigged the speaker to control a streaming TV device. This inspired Bezos to envision the Echo as the hub of the smart home. Amazon opened its APIs for Alexa development and joined the smart home ecosystem. Today, users can control appliances ranging from light bulbs to thermostats by speaking to their Echo devices.
2.2. The Alexa Prize
For the research and development of conversational products like Alexa, Amazon had several hundred employees working on the project in its Seattle, San Francisco Bay Area, and Cambridge, MA offices. But this is not enough. Alexa often fails to comprehend the apparent meaning of a user’s speech, and consumers are looking for something that no voice assistant can deliver yet. In addition to recruiting PhDs from top-tier schools, Amazon announced the first $2.5 million Alexa Prize in September 2016. The goal was to enable users to have the best conversations possible: conversations that were coherent, relevant, and interesting, and that kept users engaged as long as possible. Amazon expected the challenge of 20-minute conversations to take multiple years to achieve, so the Alexa Prize was set up as a multi-year competition to enable sustained research on this problem.
Twelve university teams were selected to compete with Amazon sponsorship. They were provided with an API for news and current topics data, Automatic Speech Recognition data, customer feedback data, infrastructure support such as AWS services, and other internal Amazon support. The Sounding Board team from the University of Washington won the competition with a 19-minute dialog, almost hitting the target of a coherent and consistent 20-minute dialog. Amazon killed several birds with one stone through the Alexa Prize. First, the long-running competition drew attention from academia. Graduate students from top-tier schools worked toward the target for months, and during this period Amazon was able to attack engineering problems with different approaches at a relatively low cost. For instance, the winning team from UW demonstrated that a well-balanced hybrid architecture, combining handcrafted categorized inputs with machine learning models, could outperform architectures that rely heavily on either traditional software engineering or machine learning alone. Second, the competition was open to Alexa users, who could chat with the university social bots. By the end of the competition, Amazon had collected 100,000 hours of chats. Such an abundance of data is crucial to building effective machine learning models. Third, the competition gave Amazon an opportunity to identify prospective employees. The fight for top AI talent is fierce. For instance, at the 2017 Neural Information Processing Systems conference†, half of the attendees were recruiters. Finally, like Amazon’s HQ2 city selection campaign, the Alexa Prize had influence beyond the technology community and was also a marketing success.
3.1. Social Concerns
With massive numbers of devices housed in millions of homes, Alexa has received varied comments from its users. Some regard it as a “useless gimmick,” while others take it as evidence of Amazon’s “Orwellian tendencies.” The core concerns over conversational AI products like Alexa are security and privacy protection.
To make full use of Alexa, the user needs to grant it access to his or her calendar and other accounts or services, all of which hold highly personal information. If that access is breached, the user’s sensitive information will be exposed. Even in daily use, if reminder and calendar data have been synced to Alexa, anyone standing in front of an Alexa device can query them by asking, “Alexa, what’s on my calendar today?” Also, because Alexa relies on the voice channel to receive commands, it is possible to spread malicious commands to large numbers of users by embedding them in inaudible frequency ranges, e.g., ultrasound, via broadcasting services such as radio or TV programs. In mid-2017, Amazon enabled users to place orders by talking to their Alexa speakers. If a user has enabled this function without an extra authorization measure, anyone who has access to the device can place an order on the owner’s behalf. If advertisements broadcast on radio or TV contain sentences that instruct Alexa devices to add a certain product to the shopping cart, sales of that product will increase in aggregate.
There are two categories of security risk: (1) product design blind spots and (2) data access authorization. The second is the classic security threat to IT systems, shared by all information artifacts, while the first is a novel threat unique to conversational AI systems. In one case, a man built a “smart” mechanism to unlock his front door by asking Siri to do the job. But it turned out that anyone standing outside the man’s living room could tell Siri to unlock the door and let them in. The essence of this case is similar to authorizing purchases via Alexa. Fortunately, conversational AI service providers have taken action to deal with this issue. For instance, Google has upgraded its Assistant to include voiceprint verification technology, which enables Assistant devices to verify the identity of the speaker behind each utterance they receive; Apple is teaching Siri to recognize a user’s voice; and Amazon is working to deploy voiceprint verification to prevent unauthorized purchases.
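The voiceprint check described above can be sketched as a gate in front of sensitive commands. This is only an illustrative sketch, not how Google, Apple, or Amazon actually implement verification: the intent names, the three-number “voice embeddings,” and the 0.8 threshold are all hypothetical, and a real system would derive embeddings from audio with a trained speaker model.

```python
import math

# Hypothetical set of intents that should require speaker verification.
SENSITIVE_INTENTS = {"place_order", "unlock_door"}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def authorize(intent, speaker_embedding, enrolled_embedding, threshold=0.8):
    """Allow harmless intents for anyone; require a voice match for sensitive ones.

    The threshold is an assumed value chosen for illustration only.
    """
    if intent not in SENSITIVE_INTENTS:
        return True
    return cosine_similarity(speaker_embedding, enrolled_embedding) >= threshold


# Toy embeddings (assumed values): the owner's enrolled voice, a new sample
# from the owner, and a sample from someone else.
owner = [0.9, 0.1, 0.4]
same_voice = [0.88, 0.12, 0.41]
stranger = [0.1, 0.9, 0.2]

print(authorize("play_music", stranger, owner))    # anyone may play music
print(authorize("place_order", same_voice, owner)) # owner may place an order
print(authorize("place_order", stranger, owner))   # stranger is refused
```

The design choice worth noting is that verification is applied per intent rather than per device: low-risk requests stay frictionless for guests, while purchases and door locks are tied to an enrolled speaker.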
It is never too late to mend. Past technological innovation has seen similar scenes: the creators of a novel technology product often assume that everyone is well-intentioned and forget, or neglect, to build security mechanisms that prevent malicious use. In the early days of the Internet, online resources were mostly open to the whole community, and no complicated precautionary mechanisms were deployed as widely as they are today. But as the community grew, what happens in the real world began to happen in cyberspace. The same is true of the conversational AI field. When a screen-less artifact enters the market and sells millions of units, unexpected scenarios and use cases emerge. Then the creators start to build elaborate systems to keep things from derailing.
Another concern is privacy. To make the conversational AI agent work, the device has to listen to its environment all the time with its omnidirectional microphone array. To use an exaggerated metaphor, this is like installing an audio-only surveillance camera in one’s private room that is linked to someone else’s office. By design, not all audio collected by the microphone array is transferred to the server; only the speech that follows the “wake word” is sent. But there has been at least one exception: a malfunctioning Google Home Mini recorded at all times and sent the recordings to Google’s servers. This was unintentional misbehavior, but what if the recorded data were stolen or leaked? In some cases, the spread of a private voice recording is an offense against one’s privacy; in other cases, it might have legal or even political consequences. Just think of the Watergate scandal. In 2015, a murder in Arkansas led the authorities to issue a warrant to Amazon to retrieve the suspect’s Alexa history. Amazon initially declined to cooperate with law enforcement, but released the data to investigators after receiving the suspect’s consent. In another case, Mattel planned to build a voice assistant aimed at children but abandoned the plan after privacy-related complaints. Experts in this field predict that IoT and AI will give rise to device-based privacy regulation or industry self-regulation.
A rarely noticed and rarely discussed concern about conversational AI products is cultural bias: an inadvertent bias against members of minority ethnic groups and dialect speakers. For example, the spoken languages of the Americas and Europe are largely standardized and computation-friendly, which means NLP algorithms can work very well once tuned. A Spanish NLP algorithm built for Madrid works in Mexico City, an English NLP algorithm built in Alabama works well in Massachusetts, and a Mandarin NLP algorithm built in Beijing works for anyone in the world who speaks standard Mandarin Chinese. In other words, a computational linguistic product favors strong, standardized cultures. However, countries with a long and complicated history and a vast territory, like China, contain many regions, and these regions have their own dialects.
As of 2014, 30% of China’s 1.3 billion people still could not speak Mandarin (Putonghua, the standard spoken Chinese language), and a significant portion of the people who “can” speak Mandarin rarely use it in daily life. They speak Chinese in their own dialect. Unlike dialects in America, Chinese dialects can differ in pronunciation by more than Spanish differs from Italian. Because the dialects are so different, people from one region may hardly understand the spoken language of another region, even when the two are just 100 km apart. Although Mandarin is the official spoken language of China, it dominates only in China’s 300 cities; for the majority of the population, who live in the 2,850 county-level areas, the everyday spoken language is the local dialect. People who can speak Mandarin but use a dialect in daily life feel embarrassed and awkward talking to a machine in standard Mandarin. But if they try to talk to the machine in the way they are comfortable with, they are frustrated to find that the machine cannot understand them at all. Because each dialect group is relatively small, it is not commercially viable to develop dialect-compatible NLP models with current approaches. This situation will keep people in the vast non-Mandarin regions from experiencing conversational AI. The technological “bias” against these large and highly diverse dialect-speaking populations around the globe will continue to exist, and new approaches to the problem are needed.
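The wake-word behavior described above, where only speech after the wake word leaves the device, can be illustrated with a small sketch. It is a simplification under stated assumptions: text tokens stand in for audio frames, and the `<silence>` end marker and the function name are hypothetical rather than part of any real Alexa or Google interface.

```python
def frames_to_upload(frames, wake_word="alexa", end_marker="<silence>"):
    """Return only the frames heard after the wake word, up to the next silence.

    Frames are plain strings here (an assumption for illustration); a real
    device would run wake-word detection on raw audio locally.
    """
    uploaded = []
    listening = False
    for frame in frames:
        token = frame.lower()
        if not listening:
            # Everything before the wake word stays on the device.
            if token == wake_word:
                listening = True
            continue
        if token == end_marker:
            # The utterance is over; stop forwarding until the next wake word.
            listening = False
            continue
        uploaded.append(frame)
    return uploaded


stream = ["private", "chat", "Alexa", "what's", "the", "weather",
          "<silence>", "more", "private", "chat"]
print(frames_to_upload(stream))  # ["what's", 'the', 'weather']
```

In this sketch, the always-recording malfunction mentioned above would correspond to `listening` being stuck on, so that every frame is forwarded regardless of the wake word.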
3.2. Application Possibilities
3.2.1. Personal Scenarios
Twenty years ago, when Bill Gates built Xanadu 2.0 as his private mansion in Medina, Washington, people were fascinated by its reported high-tech features, such as a tailored living experience and interactive digital assistants. Today, thanks to advances in networking, machine learning algorithms, sensors, and IoT device protocols, an average family can build its own “smart home” without breaking the bank. And because conversational AI is ubiquitous, natural, and identifiable, it is the ideal hub for smart home systems, outperforming tablets and smartphones. Alexa-compatible home appliances now range from light bulbs to fridges. It can be expected that once the connectivity of home appliances reaches a tipping point, new everyday scenarios will emerge, such as automated home maintenance and real-time wellness monitoring.
Another experience brought by Alexa is shopping by voice. After setting up account information in the Alexa app, users can simply name the item they want to buy. Alexa adds the item to the shopping cart and, if the user confirms, places the order and pays the bill. Customers are happy with this elegant shopping experience; however, not all suppliers are pleased. Unlike shopping in a brick-and-mortar store or in a web browser, “browsing” more than two stock keeping units (SKUs) in one category by voice is arduous and makes for a poor user experience, so Alexa typically offers no more than two SKUs per category. But which one or two of the hundreds or thousands of candidate SKUs should be set as the default or “recommended” choice? Big firms with famous brands might be happy with this situation, since weaker competitors’ goods might never get the chance to be considered in a voice interaction. Will this make Amazon and Alexa accomplices to oligopoly? The retail world has never seen a situation like the one the voice assistant has brought about. Amazon is said to be working on this problem.
3.2.2. Commercial Scenarios
Aside from offering a fascinating experience for consumers, voice assistants like Alexa and chatbots from Facebook and other developers are believed to have great potential in commercial scenarios. For example, if you observe the work pattern in a McDonald’s kitchen, all the workers operate under the command of a director. If automation continues to progress and automated kitchen appliances can one day interpret the director’s voice commands, the director’s productivity will be significantly unleashed and the efficiency of service will greatly improve. Another use case is customer service, where answering customers’ calls has long been a high-cost business. With the development of conversational AI, companies like Amtrak and Comcast are adopting the technology to provide 24/7 service. To serve the enterprise conversational AI market, Microsoft unveiled chatbots and Conversation as a Platform (CaaP), and Facebook has offered its Bot Engine to developers for building tailor-made chatbots. In some advertisements, the Echo Dot was marketed as a tool for office communication and coordination, which was also a trial in the commercial market. In addition, entertainment and tourism can benefit from voice assistants like Alexa, for example through games in which players communicate with characters by speaking to them, or through virtual tour guides.
In conclusion, after more than ten years of touchscreen-based human-computer interaction and the boom in mobile apps, people’s visual interaction space is almost full. The channel for interacting with computers by voice has yet to be fully developed. Voice will follow the touchscreen to become a new mainstream channel of human-computer interaction. Companies like Amazon and Google, which have both research and development strength and large amounts of user data, will perform better.
Voice interaction is a novel area for the security field. So far, the privacy of user data has relied on companies’ self-regulation. For the long-term healthy development of the industry, a standard or protocol for data governance is necessary. In the meantime, the inadvertent bias against speakers of non-standard languages is a cultural bias, which raises the question: how can the technology be made more inclusive?
Voice assistants like Alexa are the optimal interaction portal for smart homes, and they may encourage supplier oligopoly in various sectors. The application of voice assistants like Alexa in enterprise use cases is just beginning to show its potential to lower costs and boost performance.
Originally published on Ryan’s blog—Catalium: https://www.catalium.net/brief_history_of_amazon_alexa_and_beyond/
* Unlike the physical Echo product, Alexa is the virtual assistant behind the shell. It can be installed on non-Amazon products.
† NIPS, a top-tier academic conference on machine learning and computational neuroscience.
- James Vlahos. Inside the Alexa Prize. Wired.com (2018) https://www.wired.com/story/inside-amazon-alexa-prize/
- James Vlahos. A Son’s Race to Give His Dying Father Artificial Immortality. Wired.com (2017) https://www.wired.com/story/a-sons-race-to-give-his-dying-father-artificial-immortality/
- Joshua Brustein. The Real Story of How Amazon Built the Echo. Bloomberg.com (2016) https://www.bloomberg.com/features/2016-amazon-echo/
- Matthew B. Hoy (2018) Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants, Medical Reference Services Quarterly, 37:1, 81–88, DOI: 10.1080/02763869.2018.1404391
- Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, Eric King, Kate Bland, Amanda Wartick, Yi Pan, Han Song, Sk Jayadevan, Gene Hwang, Art Pettigrue (2018). Conversational AI: The Science Behind the Alexa Prize. arXiv:1801.03604 [cs.AI]
- Ram Menon. The Rise of Conversational AI. Forbes.com (2017) https://www.forbes.com/sites/forbestechcouncil/2017/12/04/the-rise-of-conversational-ai/
- Jeff Cotrupe (2016). Conversational A.I.: It’s A Bot Time for a New Conversation on Customer Engagement. Stratecast Perspectives & Insight for Executives (SPIE), Volume 16, Number 15
- Amazon Press Release. Amazon Celebrates Biggest Holiday; More Than Four Million People Trialed Prime In One Week Alone This Season (2017) http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-newsArticle&ID=2324045
- Liam Tung. Google Home Mini Flaw Left Smart Speakers Recording Everything. ZDNet.com (2017) http://www.zdnet.com/article/google-home-mini-flaw-left-smart-speaker-recording-everything/
- Adi Robertson. LG Put webOS and Amazon Alexa on a Fridge. The Verge (2017) https://www.theverge.com/ces/2017/1/4/14166240/lg-webos-amazon-alexa-fridge-announce-ces-2017
- People’s Daily. China Still Has 30% of Its Population Cannot Speak Standard Mandarin (2014) http://finance.people.com.cn/n/2014/0924/c1004-25720746.html