

DSP and Language Libraries are the Keys to Smart Voice

By Vincent Wang
Published: Dec 06, 2018

The success of the Amazon Echo has reshaped the smart voice application market. When people discuss smart voice today, it is Alexa rather than Siri that comes to mind. In terms of application scenarios, the household now takes priority over mobile apps.

In the American market, smart speakers now hold an increasingly stable position. According to a PwC survey of 1,000 American adults, 65% of consumers use standalone smart speakers while cooking, whereas only 37% use a mobile device for voice assistance. Smart speakers are also the primary choice when multitasking and watching television.

Household appliances are the main market for smart voice technologies

This study reveals that household applications are undoubtedly the main market for smart voice technologies. More importantly, the vast majority of consumers (93%) feel satisfied with these voice assistants; however, they are least satisfied with voice assistants on mobile phones.

Therefore, the share of smart voice assistants in digital homes is set to increase rapidly. IDC’s newest report points out that global shipments of smart household devices, which include smart speakers, digital media adapters, lighting, and thermostats, are expected to reach 5.495 million units in 2018, a 26.8% increase from last year. The two most popular types of these devices are smart speakers and media products, which are expected to account for 71% of the smart household product market in 2018.

Another international research organization, Juniper Research, noted that in the next five years the Amazon Alexa and Google Assistant smart home assistant devices are expected to experience 1000% growth. It also predicts that the number of voice assistant users will climb from 25 million in 2018 to 275 million in 2023, and that the main force behind this growth will be a massive increase in the number of smart home solutions.
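The Juniper figures can be checked with a short calculation. The sketch below uses only the two user counts quoted above; the per-year rate is an illustrative derivation that assumes even compounding over five years, which the forecast itself does not state.

```python
def percent_growth(start, end):
    """Total percentage increase from start to end."""
    return (end - start) / start * 100.0


def compound_annual_growth_rate(start, end, years):
    """Constant yearly rate that turns start into end after the given years."""
    return (end / start) ** (1.0 / years) - 1.0


users_2018 = 25_000_000   # figure quoted in the article
users_2023 = 275_000_000  # figure quoted in the article

growth = percent_growth(users_2018, users_2023)
# Assumption: growth compounds evenly across the five years.
cagr = compound_annual_growth_rate(users_2018, users_2023, 5)

print(f"Total growth: {growth:.0f}%")
print(f"Implied annual growth: {cagr:.1%}")
```

Going from 25 million to 275 million is an elevenfold rise, which corresponds to the 1000% increase quoted, or roughly 60% per year if the growth were spread evenly.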

As smart voice household applications take off, the related technology supply chains will usher in a new golden era, with chip and module providers as the main beneficiaries in terms of profits. However, in contrast to the traditional independent and fragmented voice design solutions, the new generation of smart voice will tend towards comprehensive solutions that integrate chips, software, and cloud storage while also supporting AI and machine learning technologies.

Outstanding voice quality: high performance and low power consumption are the core rationale for DSP

Realtek is a well-established Taiwanese audio chip provider whose chips are already used extensively in a wide variety of consumer devices and PC platforms. As it enters the smart voice era, the focus of Realtek’s DSP is shifting from high computing power, large capacity, and energy conservation to neural network learning and video algorithms, and the 4-8 MACs of the past have given way to 300-1000 MACs in the present day. Moreover, Realtek has reduced the size of its DSPs in line with the trend towards lightweight consumer products.
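The article does not define "MACs". Assuming the term refers to multiply-accumulate operations, the basic unit of DSP arithmetic, the following minimal sketch shows why the count matters: a direct-form FIR filter with N taps needs up to N MACs for every output sample. The function name and the sample values are illustrative and are not taken from Realtek.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter; returns (outputs, number of MACs performed)."""
    outputs = []
    mac_count = 0
    for n in range(len(samples)):
        acc = 0.0
        for k, coeff in enumerate(taps):
            if n - k >= 0:  # skip taps that reach before the first sample
                acc += coeff * samples[n - k]  # one multiply-accumulate
                mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


# Illustrative 4-tap moving-average filter applied to a constant signal.
taps = [0.25, 0.25, 0.25, 0.25]
samples = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
outputs, macs = fir_filter(samples, taps)
print(outputs, macs)
```

Neural network layers are built from the same operation in far larger numbers, which is one plausible reason the MAC capability of a voice DSP would need to grow.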

“The most natural voice can be stored on every kind of device,” said Realtek Computer Peripheral Business Unit Vice President Chuting Su.

Realtek divides voice interface technologies into two major categories, software and hardware, and further distinguishes communication between humans from instructional communication between humans and machines, which rely on different technologies. Before voice recognition there is a pre-processing stage, which determines voice quality and is the biggest challenge on a technical level.

“I believe that the greatest challenge is figuring out how to create high-quality voice processing,” said Chuting Su.

Figure 1 : Realtek Computer Peripheral Business Unit Vice President Chuting Su

He gave the example of a conversation in a noisy coffee shop. People can still clearly understand what their companion is saying; for machines, however, this is not the case. “We have a kind of excessive expectation of being able to stand in distant or noisy environments, and machines will still be able to understand our speech.” Simply stated, users expect human-machine voice interfaces to pick up speech as clearly and reliably as another person would.

“In the hardware portion, we specialize in Codec and DSP,” said Chuting Su.

He stated that in the past Realtek let end users select the environment they were in and then applied a corresponding strategy. The current trend is for the software to detect the usage environment by itself and to distinguish precisely whether it is a coffee shop, a restaurant, a household, and so on. Realtek has already been engaged in this endeavor for twenty years.

Chuting Su also noted that the hardware challenge of voice interfaces lies not in the codecs themselves but in the Digital Signal Processors (DSPs). If the SoC is called upon at every turn in order to increase the recognition rate, it will consume a considerable amount of power.
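One common way to avoid waking a power-hungry processor for every frame of audio is to put a cheap detector in front of it. The sketch below is only an illustration of that general idea, not Realtek's design: a simple energy threshold, with a hypothetical value of -30 dB, decides which frames would be forwarded to the more expensive recognition stage.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one audio frame in dB relative to full scale 1.0."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # small offset avoids log(0)


def gate_frames(frames, threshold_db=-30.0):
    """Return indices of frames loud enough to forward to the recognizer."""
    return [i for i, f in enumerate(frames) if frame_energy_db(f) > threshold_db]


# Synthetic input: near-silent frames surrounding two frames of a 440 Hz tone.
sample_rate = 16000
frame_len = 160  # 10 ms at 16 kHz
silence = [0.001] * frame_len
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(frame_len)]
frames = [silence] * 4 + [tone] * 2 + [silence] * 4

active = gate_frames(frames)
print(active)  # [4, 5]
print(f"{len(active)} of {len(frames)} frames forwarded")
```

In this toy input only two of the ten frames pass the gate, so the expensive stage would run on a fraction of the audio. Real products use far more robust detectors than a fixed energy threshold.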

Although Realtek manufactures neither microphones nor speakers, Chuting Su still stressed the importance of these two hardware components, noting that they directly influence the quality of sound reception.

Communications Network Business GroupZ Director Shen Jia-qing introduced the company's flagship Wi-Fi SoC, Amoeba, named after the eukaryotic organism that can change its shape and adapt. Like its namesake, the Amoeba product can be used in almost every kind of IoT application, and the chip integrates Wi-Fi and an MCU with abundant I/O interfaces.

Figure 2 : Communications Network Business GroupZ Director Shen Jia-qing

Deep learning technologies solve semantic problems of human natural language

In contrast to Realtek’s long-term investment in audio technologies, VIA Technologies, Inc. started on the processor side. Building on its strengths in computing, it recently shifted to developing Artificial Intelligence (AI) technologies and is applying them to smart voice applications.

OLAMI is a smart voice assistant solution developed independently by VIA, and in the future it will be integrated into smart electronic billboard, video wall, and Internet of Things (IoT) applications. OLAMI is based on deep learning technologies for voice recognition and computer vision, and it is also equipped with speech detection, echo cancellation, noise suppression, and speech recognition. With natural language understanding, it provides a one-stop solution that extends to dialogue management and speech synthesis.

VIA Embedded Business Department Director Wu Yi-pan stated that building voice databases is the prerequisite for developing AI. The product positioning has to be clearly understood, and the size of the market must be considered. In addition, the databases have to be built, the location settled, and the usage scenarios defined before everything finally converges on the application end.

Figure 3 : VIA Embedded Business Department Director Wu Yi-pan (left) and Embedded-Smart Cities Product Marketer Guo Yu-fan.

He pointed out that 70% of human sensory perception relies on the eyes. However, the sense of hearing must also be integrated; as a trio, machine vision, intelligent speech, and natural language processing are therefore mutually indispensable and closely related. The key technologies here are developed with the algorithms and logic of graphics chips, and they reach various industries through B2B channels.

“The technological core is the same with the customer’s requirements stacked out in different permutations,” said Wu Yi-pan.

“Customers have a movie-like imagination of AI,” said VIA’s Embedded-Smart Cities Product Marketer Guo Yu-fan. He believes that the first step is to understand the client’s final objectives and then to work through the intermediate structure in order to draw a clear roadmap. In terms of promotion, bringing expectations in line with reality is the biggest challenge.

He pointed out that the biggest limitation lies in Natural Language Processing (NLP), because human language involves too many complex models.

VIA’s OLAMI natural language human-machine interaction solutions cover general semantic scenarios in a number of vertical areas. Massive knowledge bases, support for complex semantic spatial modeling of billions of lexicons, self-defined syntax analysis, patented in-depth semantic analysis technology, and the OSL syntax description language enable developers and enterprises to rapidly build applications in accordance with their own requirements and to reduce development expenditures.

(TR/ Phil Sweeney)
