Qi Chao, Triangle Beast: How to eliminate machines’ misunderstanding of humans

[51CTO.com original article] On July 21-22, 2017, the WOTI2017 Global Innovation Technology Summit, hosted by 51CTO with artificial intelligence as its theme, was held at the Beijing Renaissance Hotel. Over the two days, more than 30 AI figures gave dozens of speeches and took part in roundtable forums on artificial intelligence. Outside the main venue there were also hands-on laboratories and technology experience areas built specifically for AI enthusiasts.

On the afternoon of July 21, at the main venue of WOTI2017, Qi Chao, co-founder and CTO of Triangle Beast, gave a speech entitled "How to Eliminate Machines' Misunderstanding of Humans". The following is a transcript of the speech.

Qi Chao, Co-founder and CTO of Triangle Beast


Hello everyone, I am Qi Chao. The name Triangle Beast may sound strange. It comes from the three of us founders, who form a fairly complementary team: one is a colleague from Baidu, and another, who is responsible for brand promotion, previously worked at Amway China and Ogilvy PR. The problem we hope to solve is semantics, that is, the problems faced by human-computer interaction systems. Today I want to share from the technical side, because I think the main responsibility of a CTO is to explain technical issues, set the direction of the technology, and so on.

Human-computer interaction is not a particularly new topic. We hope that people can talk to machines freely, and that machines can understand human language, carry out the tasks people give them, and communicate more naturally. Historically, the AI industry has gone through several peaks and troughs, especially in recent years. Breakthroughs in voice technology made speech recognition and synthesis much better than before, which led to a wave of voice assistant products. Then these products gradually disappeared from the market and the field entered a period of decline. With the developments of the last few years, more products have emerged and the market is trending upward again. In particular, the state recently released a strategic goal that is pushing AI technology, or at least the concept, upward, and many products have appeared, such as Microsoft XiaoIce and Baidu DuMi.

Such a system mainly consists of several parts. The first is free dialogue with people. If every utterance that cannot be understood and captured simply falls back to a search result, the dialogue is interrupted. Moreover, natural communication inevitably includes some purposeless, non-task-driven chat. So one necessary capability is open-domain chat that flows as smoothly as talking with a person. The second is that a robot, as a tool in a certain field, also needs to provide services or information, including question answering and service functions such as ordering meals or booking air tickets. The third is proactive behavior, because intelligence shows not only in responding but also in taking the initiative: understanding you and pushing the information or services you want.

The entire human-computer interaction system involves many technical modules. It can be regarded as a large integration of natural language processing and related technologies, serving as the outlet for the technology as a whole. This picture shows the gradual integration from the bottom up. The technologies involved at the bottom level include deep learning, reinforcement learning, and basic natural language processing. As the basic modules, they must form a very solid foundation for the system. The bottom level also involves semantics and information retrieval. For machines to understand human language, they must be taught content or knowledge. That knowledge comes from data, so we have to do a lot of data mining work.

Moving up, the second layer combines technical modules, including natural language understanding, decision-making, recommendation, the knowledge base, planning and inference, classification and clustering, and sentiment analysis, all of which are indispensable for conversational robots. The light green part is integration into subsystems, including the open-domain chat system, the retrieval-based question-answering system, the retrieval system based on structured data, and the task-driven conversation system. To understand the user's state more clearly, we also need user information and user models so that responses reflect the individual. Finally, there is topic recommendation. The top layer is integration at the system API level, which exposes the subsystems as external services and presents them on different hardware or in different product forms. Next I will emphasize several aspects and share my understanding of the main directions of the Triangle Beast human-computer interaction system.

The first is human-computer chat. We hope the machine behaves more like a human and can hold open-ended dialogue, like a friend who chats with you about topics you are both interested in. Many things are involved here, including purposeless chat and task-driven systems. A control module is needed to integrate them into the system and make the dispatching decisions. Let me give an example of such a decision. When a user says the word "apple", it has many possible meanings. It may be a movie starring Fan Bingbing; it may be a fruit, a product bought through an e-commerce channel; or it may be a company, and the user wants news about Apple. Which service should handle it? The control module has to combine the chat, the context of the conversation, and the user's interests, and finally decide which response to give. This is also where open-domain chat comes in. When the user has no task in mind, it keeps the whole conversation flowing. When the system cannot decide which service to route to and you do not want the conversation to become awkward, open-domain chat smooths the dialogue and draws out further information and service needs. It serves as that kind of foundation.
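The "apple" decision described above can be illustrated with a minimal sketch. It is not the Triangle Beast control module: the service names, keyword hints, weights, and threshold are all hypothetical, and simple keyword overlap stands in for whatever models a real system would use. It only shows the idea of scoring candidate services from conversational context and user interests, and falling back to open-domain chat when no service is clearly indicated.

```python
from dataclasses import dataclass, field

# Hypothetical candidate services for the ambiguous query "apple".
# Each service lists keywords that support it when seen in context or interests.
SERVICE_HINTS = {
    "movie": {"movie", "film", "watch", "actor"},
    "shopping": {"buy", "fruit", "price", "order"},
    "news": {"company", "news", "stock", "iphone"},
}


@dataclass
class DialogueContext:
    recent_utterances: list = field(default_factory=list)  # earlier user turns
    user_interests: set = field(default_factory=set)       # from a user profile


def score_services(ctx, context_weight=2.0, interest_weight=1.0):
    """Score every service by keyword overlap with context and interests."""
    context_words = set()
    for utterance in ctx.recent_utterances:
        context_words.update(utterance.lower().split())
    return {
        service: context_weight * len(hints & context_words)
        + interest_weight * len(hints & ctx.user_interests)
        for service, hints in SERVICE_HINTS.items()
    }


def route(ctx, min_score=1.0):
    """Pick the best-supported service, or open chat if nothing stands out."""
    scores = score_services(ctx)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "open_chat"


if __name__ == "__main__":
    ctx = DialogueContext(recent_utterances=["I want to watch a movie tonight"])
    print(route(ctx))                 # context points to the movie service
    print(route(DialogueContext()))   # no evidence, so fall back to open chat
```

Falling back to open chat when the best score is below the threshold mirrors the point in the talk: when the system cannot decide, chatting keeps the conversation going and gives the user a chance to reveal what they actually want.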

When we were working on products like DuMi and XiaoIce, we saw data showing that more than 70% of users' page views (PVs) were open chat. The ability to converse freely and openly is therefore a mark of a human-computer dialogue system done well. There are many ways to implement it. The one that works best in products is retrieval-based, and it has two parts. The first part, at the bottom of the slide, is offline. We need to obtain a large amount of text from public information on the Internet, from everyday conversations between people, and use deep learning and machine learning to model those conversations. Human-to-human conversation data is indispensable here, and we put a lot of effort into mining it. We have now accumulated a corpus on the scale of 50 billion from communities and forums. We need to clean these corpora, wash out the private data in the conversations, and turn them into data the system can use directly. The second part is online, where we use various models and algorithms to fit this data. When a user sends a message to the robot, we find something very similar that someone said in the past, take the reply the other party gave then as a candidate, and turn it into a reply the robot can use.
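The online half of the retrieval approach can be sketched in a few lines. This is an illustration only: the three-pair corpus is invented, and bag-of-words cosine similarity stands in for the "various models and algorithms" the talk mentions without naming. The function and parameter names are assumptions.

```python
import math
from collections import Counter

# Toy (utterance, reply) pairs standing in for a cleaned conversation corpus.
CORPUS = [
    ("have you eaten yet", "not yet, what about you"),
    ("what is the weather like today", "it looks sunny outside"),
    ("i am so tired", "get some rest early tonight"),
]


def bag_of_words(text):
    """Lowercase, whitespace-tokenized word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_reply(query, corpus=CORPUS, threshold=0.3):
    """Return the reply paired with the most similar past utterance, if any."""
    query_vec = bag_of_words(query)
    best_score, best_reply = 0.0, None
    for utterance, reply in corpus:
        score = cosine(query_vec, bag_of_words(utterance))
        if score > best_score:
            best_score, best_reply = score, reply
    # Below the threshold, nothing in the corpus is close enough to reuse.
    return best_reply if best_score >= threshold else None


if __name__ == "__main__":
    print(retrieve_reply("have you eaten"))
    print(retrieve_reply("quantum physics"))
```

At production scale the linear scan would be replaced by a search index, which is why the talk describes this approach as needing a large retrieval system.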

Another method, more popular in academia, is generation-based. The difference is that the previous method requires a large search engine or corpus retrieval system to find similar sentences in conversations between people, whereas this one uses the same kind of corpus to train a probability model offline. Generally, two of the more important deep learning models are used to fit the data. No online retrieval step is needed; instead, the response is generated word by word.
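Word-by-word generation can be illustrated with a toy greedy decoder. Everything here is an assumption made for illustration: the talk does not name its models, so a hand-written probability table (TOY_MODEL) stands in for a trained neural network, and next_word_probs is a hypothetical interface for "probability of the next word given the user's message and the previous word".

```python
# Hand-written probabilities standing in for a trained model (hypothetical).
# "<s>" marks the start of a reply and "</s>" marks its end.
TOY_MODEL = {
    "how are you": {
        "<s>": {"i": 0.7, "fine": 0.3},
        "i": {"am": 0.9, "</s>": 0.1},
        "am": {"fine": 0.8, "tired": 0.2},
        "fine": {"</s>": 1.0},
        "tired": {"</s>": 1.0},
    },
}


def next_word_probs(query, previous_word):
    """Distribution over the next word; unknown inputs end the reply."""
    return TOY_MODEL.get(query, {}).get(previous_word, {"</s>": 1.0})


def generate_reply(query, max_words=10):
    """Greedy decoding: repeatedly append the most probable next word."""
    words, previous = [], "<s>"
    for _ in range(max_words):
        probs = next_word_probs(query, previous)
        word = max(probs, key=probs.get)
        if word == "</s>":
            break
        words.append(word)
        previous = word
    return " ".join(words)


if __name__ == "__main__":
    print(generate_reply("how are you"))
```

No corpus is searched at reply time; all knowledge lives in the model's probabilities, which is the contrast with the retrieval approach drawn above.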

What we have just discussed is conversation without a specific purpose, like the chats between you and your friends. There is also purposeful conversation, such as booking tickets or robots that take food orders. That kind of conversation is driven by the goal of the task. For open chat, we hope the conversation lasts as long as possible and that robot and person communicate seamlessly. Here are a few product examples. In the best chat system, one boy exchanged more than 20,000 sentences with it in two months. From those 20,000 sentences we can mine a great deal of data and personal characteristics of the user that many tools never get to see. Robots will ultimately serve people, though, and we hope that part keeps getting better. For example, when I order food at a restaurant, I do not want a multi-round conversation with the waiter. I want it to be as fast as possible, as long as the order is completed. Here we want the shortest path, which calls for a different architecture, divided into roughly four parts.

The first is understanding what the other person is saying, which has two parts. One is the intention of the sentence, such as booking a ticket or looking up a piece of information. Once the intention is identified, we need to parse the details under that intention, such as "a flight ticket from Beijing to Shanghai". The second is state tracking. To complete the task we need to collect different pieces of information. For example, to book a flight we also need the time the user wants to fly. So we maintain a state, a collection of what has been gathered, and check whether it is enough to complete the task. The third is the policy: if the state is not sufficient, the system asks for clarification; otherwise it presents a result directly. The fourth is the robot's reply. The replies of task-driven robots are generally designed by product managers and generated from templates.
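The four parts described above (understanding, state tracking, policy, template reply) can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions, not the speaker's system: the intent name, slot names, regular expressions, and templates are all hypothetical, and hand-written rules stand in for trained understanding models.

```python
import re

# Slots that must be filled before each (hypothetical) task can be executed.
REQUIRED_SLOTS = {"book_flight": ["origin", "destination", "date"]}

# Reply templates of the kind a product manager might write (hypothetical).
TEMPLATES = {
    "clarify_intent": "What would you like to do?",
    "ask_slot": "Could you tell me the {slot}?",
    "show_result": "Booking a flight from {origin} to {destination} for {date}.",
}


def understand(utterance):
    """Part 1: rule-based intent and slot extraction."""
    text = utterance.lower()
    intent = "book_flight" if ("flight" in text or "ticket" in text) else None
    slots = {}
    match = re.search(r"from (\w+) to (\w+)", text)
    if match:
        slots["origin"] = match.group(1).title()
        slots["destination"] = match.group(2).title()
    match = re.search(r"\b(today|tomorrow)\b", text)
    if match:
        slots["date"] = match.group(1)
    return intent, slots


def update_state(state, intent, slots):
    """Part 2: merge what this turn contributed into the dialogue state."""
    if intent:
        state["intent"] = intent
    state.setdefault("slots", {}).update(slots)
    return state


def decide(state):
    """Part 3: ask for the first missing slot, or show the result."""
    intent = state.get("intent")
    if intent is None:
        return "clarify_intent", None
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in state["slots"]:
            return "ask_slot", slot
    return "show_result", None


def reply(state):
    """Part 4: fill the template chosen by the policy."""
    action, slot = decide(state)
    if action == "ask_slot":
        return TEMPLATES[action].format(slot=slot)
    if action == "show_result":
        return TEMPLATES[action].format(**state["slots"])
    return TEMPLATES[action]


def take_turn(state, utterance):
    """Run one user turn through all four parts."""
    intent, slots = understand(utterance)
    update_state(state, intent, slots)
    return reply(state)


if __name__ == "__main__":
    state = {}
    print(take_turn(state, "Book a flight from Beijing to Shanghai"))
    print(take_turn(state, "tomorrow"))
```

Because the state persists across turns, the second message only needs to supply the missing date; the policy then sees a complete state and moves straight to the result.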

Those are the two types of robots, task-driven and non-task-driven, and how conversational robots are implemented. A more abstract question is how we judge whether a robot is good or bad. Here are a few examples comparing the different types of robots mentioned above.

First, open-domain chatbots. We want a robot's replies to be relevant. If you ask "have you eaten?" and the robot tells you it is not sleepy, the conversation cannot continue, so relevance needs to be measured. On top of that comes how interesting the replies are. Even with a person, a dull conversation is hard to continue. If the robot only ever laughs or only ever tells you to drink hot water, a normal person will not keep the conversation going. So interest is also a quality criterion and a factor in sustaining the conversation. The second is satisfaction. If every sentence is relevant and interesting but the robot as a whole feels erratic, jumping from topic to topic, the experience is poor. So we also need to measure naturalness and smoothness. The third is user activity, measured in two ways. One is the average number of rounds a person talks with the robot; the other is how often the same user comes back, that is, the average number of conversations per user. Together these are used to judge the quality of the robot.
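The two user-activity measures can be computed from a turn log. The sketch below assumes a hypothetical log format, one (user_id, session_id) record per user turn; the talk does not describe how the data is actually stored.

```python
from collections import defaultdict


def activity_metrics(turn_log):
    """Compute average rounds per conversation and conversations per user.

    turn_log: iterable of (user_id, session_id), one record per user turn.
    """
    turns_per_session = defaultdict(int)
    sessions_per_user = defaultdict(set)
    for user_id, session_id in turn_log:
        turns_per_session[(user_id, session_id)] += 1
        sessions_per_user[user_id].add(session_id)
    if not turns_per_session:
        return {"avg_rounds_per_session": 0.0, "avg_sessions_per_user": 0.0}
    n_sessions = len(turns_per_session)
    return {
        "avg_rounds_per_session": sum(turns_per_session.values()) / n_sessions,
        "avg_sessions_per_user": n_sessions / len(sessions_per_user),
    }


if __name__ == "__main__":
    log = [("u1", "s1")] * 3 + [("u1", "s2")] + [("u2", "s1")] * 2
    print(activity_metrics(log))
```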

This slide shows examples of relevant and irrelevant, interesting and uninteresting replies. Manual scoring of chatbots is subjective, and different people have different opinions. Such evaluations therefore usually have several people score the same data and take the average. Smoothness and naturalness are also scored subjectively. In these examples, the higher the score the better. When a robot speaks fluently but nothing it says is something you like, your impression of it will certainly be poor, so there is a subjective score here as well. This is an example of objective data.
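Averaging several annotators' scores for the same reply is simple to express. The sketch below assumes a hypothetical layout in which each reply ID maps to the list of scores it received; the scoring scale is not given in the talk, so the numbers in the example are invented.

```python
from statistics import mean


def average_scores(ratings):
    """Map each reply ID to the mean of its annotators' scores.

    ratings: {reply_id: [score from annotator 1, score from annotator 2, ...]}
    Replies with no scores are skipped.
    """
    return {reply_id: mean(scores) for reply_id, scores in ratings.items() if scores}


if __name__ == "__main__":
    print(average_scores({"reply_1": [2, 1, 0], "reply_2": [2, 2, 2]}))
```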

Task-driven dialogue, on the other hand, is evaluated along four dimensions. The first is intent recognition: whether it is accurate. The second is recall and precision at the slot level. The third is policy and task completion. A system that needs many rounds to complete a task will certainly score lower, and whether it ultimately helps you reach the goal of the dialogue is a very important factor. The fourth is natural language generation, judged by how comprehensible and natural it is. For intent recognition, "play a song by Andy Lau" is a positive example and "do you like Andy Lau?" is a negative example. Both sentences mention the singer's name, but the first asks to find a song, so recognizing it as a music request is correct, while the second is an open question. For slot-level recall and precision, take "play Andy Lau's Today". The intention to find a song is clear, but if "today" is moved to the front, as in "today, play a song by Andy Lau", the meaning is completely different. This is the kind of case we want to evaluate. To play the song correctly, we need to parse out that the singer is Andy Lau and that "Today" is the song title. If both are parsed out, the result is correct.
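Intent accuracy and slot-level precision and recall can be computed as follows. This is a sketch under assumptions: the intent labels and slot names are hypothetical, and a predicted slot counts as correct only when both its name and its value match the reference.

```python
def intent_accuracy(gold_intents, predicted_intents):
    """Fraction of utterances whose predicted intent equals the reference."""
    if not gold_intents:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_intents, predicted_intents))
    return correct / len(gold_intents)


def slot_precision_recall(gold_slots, predicted_slots):
    """Slot-level precision and recall over a set of utterances.

    Each argument is a list with one {slot_name: value} dict per utterance.
    """
    true_positive = n_predicted = n_gold = 0
    for gold, predicted in zip(gold_slots, predicted_slots):
        gold_items, predicted_items = set(gold.items()), set(predicted.items())
        true_positive += len(gold_items & predicted_items)
        n_predicted += len(predicted_items)
        n_gold += len(gold_items)
    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # The parser found the singer but mistook the song title for a date.
    gold = [{"singer": "Andy Lau", "song": "Today"}]
    predicted = [{"singer": "Andy Lau", "date": "today"}]
    print(slot_precision_recall(gold, predicted))
    print(intent_accuracy(["play_music", "chat"], ["play_music", "play_music"]))
```

The example reproduces the "Today" confusion from the talk: one of two predicted slots is right, and one of two reference slots was found, so both precision and recall are one half.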

This example is again about finding a song. We hope the interaction completes smoothly. In the first example the task counts as completed: the system plays the song, which is what we want. In the dialogue below it, the system still fails to understand after many attempts, the user finally gives up, and the interaction ends in failure. This is an example of an unfinished task. On the right is the judgment of how efficiently a task completes. One round is what we most want to see. If the system and the user go back and forth for many rounds, the user will basically lose patience. We have seen many smart speakers and in-car systems whose interaction is awkward. You say something several times, it still fails to parse it, you lose patience, and the chance that you use it again is very small.
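Completion rate and the number of rounds needed can be summarized from dialogue records. The record format below, a (rounds, completed) pair per dialogue, and the metric names are assumptions made for illustration.

```python
def completion_stats(dialogues):
    """Summarize task completion over a list of (rounds, completed) records."""
    if not dialogues:
        return {"completion_rate": 0.0, "one_round_rate": 0.0,
                "avg_rounds_when_completed": None}
    completed_rounds = [rounds for rounds, completed in dialogues if completed]
    return {
        # Share of dialogues in which the user's goal was reached.
        "completion_rate": len(completed_rounds) / len(dialogues),
        # Share of dialogues finished in a single round, the ideal case.
        "one_round_rate": sum(1 for r in completed_rounds if r == 1) / len(dialogues),
        # Average cost, in rounds, of the dialogues that did succeed.
        "avg_rounds_when_completed": (
            sum(completed_rounds) / len(completed_rounds) if completed_rounds else None
        ),
    }


if __name__ == "__main__":
    records = [(1, True), (4, False), (3, True), (1, True)]
    print(completion_stats(records))
```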

The above is our analysis of the field of human-computer dialogue, including some technical implementation methods and examples. It may take a long time to elaborate on this, but today I just want to start a discussion with you.

Triangle Beast has been working on putting solutions in this area into practice. First, on the to-B side, our partners are listed on the slide, although the list is not complete. We work in several major areas. In hardware, besides mobile phones, we also have solutions for smart speakers and Xiaomi TV. We also cooperate with companies such as shopping malls and media to build robots together. In addition, we have good cooperation with the BAT companies, and we now have chat dialogue systems with Alibaba, Tencent, and Baidu.

You may know Triangle Beast from Luo Yonghao's press conference, where we were mistakenly called Unicorn and Luo Yonghao corrected it on the spot. We did not put our overall dialogue solution into that phone; we embedded the word segmentation module from our dialogue system. This has been hyped again recently. What we do there is semantic fragment analysis, and many of the segments go beyond the scope of single words. It is an on-device solution for mobile phones that tries to reach the needed accuracy with very few phone resources. In March of this year, our semantic interaction system made its debut at the Xiaomi TV press conference. It solves the problem of users finding the TV content they want to watch by voice.

As an Internet TV, Xiaomi TV has far more content. It is not like when we were young, when a remote control was enough for a few dozen channels. Now there are millions of items, and voice interaction is a very good experience for that. We do not want to just build a remote control; that is a problem voice vendors can solve. What we need to solve is that when users look for movies and TV series, they often cannot remember the name, or cannot remember the full name. For "The Shawshank Redemption", most people just search for "Shawshank". Others misremember the name: many people call the British drama "Sherlock" "Detective Charlotte", and without semantic understanding that may turn into "Charlotte's Troubles". So we solve two problems in the Xiaomi solution. First, when you cannot remember the name of a movie, or not accurately and completely, we help with semantic correction. Second, just as men and women approach shopping in a mall differently, some people head straight for a target while many just browse. For browsing, we use data mining to label movies and then add semantic understanding. Take a request like "I want an Oscar movie with a very handsome lead actor". We draw on user reviews and fan discussions on forums to add rich text representations to the movies, do semantic matching and understanding online, find the range of movies the user may want, and then let the user choose further.
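The title-correction problem (partial names such as "Shawshank", misremembered names such as "Detective Charlotte") can be illustrated with a small resolver. This is not the Xiaomi TV implementation: the catalog and alias table are invented, and substring matching plus the standard library's difflib stand in for real semantic matching. The alias table is assumed to come from somewhere like mined query logs.

```python
import difflib

# Tiny illustrative catalog and a hypothetical alias table.
CATALOG = ["The Shawshank Redemption", "Sherlock", "Charlotte's Troubles"]
ALIASES = {"detective charlotte": "Sherlock"}


def resolve_title(query, catalog=CATALOG, aliases=ALIASES, cutoff=0.6):
    """Map a partial, aliased, or misspelled title to a catalog title."""
    q = query.strip().lower()
    if not q:
        return None
    # 1. Known nicknames or misremembered names.
    if q in aliases:
        return aliases[q]
    by_lower = {title.lower(): title for title in catalog}
    # 2. Exact match, ignoring case.
    if q in by_lower:
        return by_lower[q]
    # 3. Partial name, accepted only when it is unambiguous.
    partial = [title for lower, title in by_lower.items() if q in lower]
    if len(partial) == 1:
        return partial[0]
    # 4. Misspellings, via character-level similarity.
    close = difflib.get_close_matches(q, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


if __name__ == "__main__":
    for spoken in ["Shawshank", "Detective Charlotte", "Sherlok"]:
        print(spoken, "->", resolve_title(spoken))
```

Checking the alias table before any string similarity is what keeps "Detective Charlotte" from drifting to "Charlotte's Troubles", the failure the talk warns about when there is no semantic understanding.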

This is another cooperation, with the Hong Kong New World Group. The scenario we want to create is turning the information desk on the first floor of the mall into a mobile one. Every time I go to a mall I get lost. I often cannot find the shop I am looking for and have to ask staff, even to find the toilet. At such times I would like a portable information desk that answers questions about the mall and even guides me according to my interests. From left to right, the slide shows robots for different kinds of conversation. The first lets you chat casually, like XiaoIce, and keeps the conversation smooth. The second is a consultation dialogue, covering parking, business hours, and other questions about the mall. The third focuses on two scenarios: dining recommendations and retail shopping guidance. When you want to eat but do not know what restaurants the mall has, it recommends restaurants for you. Robots like these, added through WeChat public accounts, can bring a new shopping experience. That is all I have to share today. Thank you!

51CTO reporters will continue to bring you exciting reports from the WOTI2017 Global Innovation Technology Summit, so stay tuned!

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
