Post by mrmortimer on Mar 22, 2012 21:32:17 GMT -5
With Wendell Cowart's closing of the Chatterbox Challenge, we've lost one of the biggest and best chatbot contests that botmasters have been able to use as a means of testing their chatbots, to find ways to improve, or to gain "bragging rights", or simply to have a bit of friendly fun. This is a very sad thing, but not the end of the world. I intend to organize a new chatbot contest, and I want your aid, so that this new chatbot competition will be every bit as good, and every bit as fun, as the CBC has been.
So, first off, I’m putting out a “call to arms”, so to speak. I want to create an informal committee, made up of AI and chatbot enthusiasts and experts, to codify a set of goals and guidelines, and to discuss what the shape of a new chatbot contest should be. I’m looking for people to volunteer to spend at least a couple of hours per month, mostly in the form of participating in a “Forum Round Table” discussion, with some possible email correspondence, as well. If this is something you’re willing to help with (and you haven’t already emailed me), then please let me know, either here, or by email.
I want to find out from everyone what they want to see in a chatbot contest. Tell me what you think worked about the CBC, and what didn't. I also want to hear suggestions, no matter how off the wall or wacky, for what you would like to see implemented. Let's all work toward making a great chatbot competition that's challenging, interesting, and most of all, fun!
From my brief involvement "behind the scenes" in the salvation of the 2010 contest, I have a couple of observations.
Getting participation is difficult because people in the AI/chatbot community want to enter the contest rather than judge it. In the past, I've seen judges selected from other disciplines: cartoon artists, language experts, people with advanced degrees in computer science, business people. But from reading their comments, I had the feeling that none of them had ever chatted with a bot before, knew what bots' limitations were, or understood how they were constructed. It seemed to me that they either didn't know what to expect, or they expected too much -- especially at this level. Some of those judges seemed to think they'd be talking to HAL 9000.
Most judges serve only once, so you're training a new set of judges every year. In the contest I was involved with, the pre-contest "committee," and subsequently the judges, needed guidance. But the only person we could turn to was also a contestant in his own contest. That created some difficulty (as well as communication problems), because the contest organizer didn't always feel free to offer guidance for fear of appearing overly involved. You can't imagine how many questions come up, or how obscure the topics can be, when you're trying to judge something like this. You can plan all you want and try to foresee every difficulty, but it's never how you imagine it will be.
Getting a lot of people to agree on anything is difficult, but ideally, that's what you need. I've often thought it would be better to have separate groups of people, each with its own task: one group to form questions, one to ask them, and another to judge the responses. However, it's just too hard to amass that many people who aren't also contestants.
The formation of questions is one obstacle. Avoiding questions that are too hard, too easy, or that favor a particular kind of bot... as well as questions too similar to those asked in past years... is a struggle. And then the group has to come to an agreement.
Since the questions have (in the past) been standardized, and they must be asked in exactly the same way and order with each contestant, I wonder if there could be an automated way of quizzing the bots... something like a script or a "testing bot" that could make the inquiries?
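To sketch what I'm imagining (purely illustrative: the bot URL and the "message" parameter are made up, since there's no standard chatbot API, and each real bot would need its own small adapter):

import csv
import urllib.parse
import urllib.request

# Hypothetical entrants: bot name -> a web endpoint that accepts a
# "message" parameter and returns the bot's reply as plain text.
BOTS = {
    "ExampleBot": "http://example.com/chat",
}

# The standardized question list, asked in the same order to every bot.
QUESTIONS = [
    "Who wrote the Bible?",
    "Name something you would find on a beach.",
]

def ask(url, question):
    """Send one question to a bot and return its raw reply."""
    data = urllib.parse.urlencode({"message": question}).encode()
    with urllib.request.urlopen(url, data, timeout=30) as resp:
        return resp.read().decode(errors="replace")

with open("transcripts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["bot", "question", "reply"])
    for name, url in BOTS.items():
        for question in QUESTIONS:
            try:
                reply = ask(url, question)
            except OSError as err:  # bot offline, timed out, etc.
                reply = "<no response: %s>" % err
            writer.writerow([name, question, reply])

The point is that the asking could be mechanical, identical for every entrant, and fully logged; judging the stored replies would remain a human job.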
Judging is the real can of worms. Because judging comes down to opinion, people are rarely satisfied. One bot had the following exchange:

Judge: Name something you would find on a beach.
Bot: Fun in the sun on the sand, and frolic in the surf. LOOK OUT beach here I come!
The botmaster was disappointed because I gave the bot a low score for the answer. It seemed to me that a better response would have been something like "sea shells," or "sand" or even "people". Although it was clever and creative, my impression was that it was simply a match to a pattern that included the word "beach" and so wasn't really an answer at all.
Post by mrmortimer on Mar 24, 2012 11:53:49 GMT -5
Thanks for that, Dave. Do you mind if I post it in the discussion thread at chatbots.org? It may prove useful to more than just myself.
Also, some thoughts about your observations:
1.) I'm already running into the problem of participation, but actually in reverse! Several botmasters have offered to keep their bots out of the competition in order to be able to be a judge, and I'm not quite sure how to respond. I'd love to have these people judge the competition, but I don't want their bots to be excluded, either. Quite the dilemma, yes?
2.) While I've never been an official judge in contests of this sort, I've been involved with enough of these competitions (mostly as an entrant, but also as an "unofficial", "lurking" judge) to have a decent idea of the criteria required of judges, and I'll be making sure to be as accessible as possible to every judge, botmaster, or visitor to the contest site to answer questions or to try to resolve conflicts. I won't be judging the contest itself, and I won't have any bots in the competition (conflict of interest is a BAD thing), so I feel that this part of my role in the contest will prove beneficial to everyone.
3.) One of the things that I want to introduce here is a much more automated approach to the contest, from bot registration and entry, to visitor experience, to bot testing (NOT judging!), and beyond. I want to create and maintain a large database system that records and stores all of the data generated during the competition, and I want to build a site for the competition that is intuitive, user-friendly, and engaging. There will be separate login sections for botmasters and judges, along with an administration login, so that I (and any other contest managers) can oversee everything that goes on.
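Just to give the flavor of what I mean by recording everything (a very rough sketch, not a final design -- the table and column names are placeholders):

import sqlite3

# Rough first pass at the contest database. Table and column names
# are placeholders, just to show what would be stored.
conn = sqlite3.connect("contest.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS bots (
    bot_id    INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    botmaster TEXT NOT NULL,
    entry_url TEXT
);
CREATE TABLE IF NOT EXISTS judges (
    judge_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS transcripts (
    transcript_id INTEGER PRIMARY KEY,
    bot_id        INTEGER REFERENCES bots(bot_id),
    question      TEXT NOT NULL,
    reply         TEXT NOT NULL,
    asked_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS scores (
    transcript_id INTEGER REFERENCES transcripts(transcript_id),
    judge_id      INTEGER REFERENCES judges(judge_id),
    score         INTEGER CHECK (score BETWEEN 0 AND 3),
    comment       TEXT,
    PRIMARY KEY (transcript_id, judge_id)
);
""")
conn.commit()

With transcripts and scores in one place, things like "Most Popular Bot" tallies or year-over-year comparisons become simple queries instead of manual bookkeeping.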
The Chatterbox Challenge was a great experience, no doubt, but the "replacement" contest that I envision will be far greater, with greater scope, (hopefully) more participation, a wider range of activities attached to it, and a lot more fun. Look out, people! This is gonna blow your MINDS!
1.) The CBC (as far as I can tell) never had a clearly stated goal. The Loebner Prize, by contrast, is an unstructured contest with no rules (other than time), and its goal is explicit: fool the judges into thinking the bot is human. The CBC had provisions for Best New Entry and Most Popular Bot, but I can't recall ever reading what defined "Best Overall Bot." It seemed to be "understood" that it was the bot that could best answer a list of 10 questions (the number changed some years: 15 in 2003, 12 in 2004, 16 in 2010), with extra credit awarded for a "creative" answer.
The creative aspect always seemed to me to be subjective and open to interpretation, so individual judges could view it differently. And in the previous "beach" example, some might feel that creativity outweighed a valid (correct) response.
2.) As an example of how inconsistent the judging could be, here is how three judges scored the bots' answers to the question "Who wrote the Bible?":

Cronus - The Bible (from Greek τὰ βιβλία ta biblia, "the books") is a collection of sacred scripture of both Judaism and Christianity. There is no single version; both the individual books (Biblical canon) and their order vary between denominations. The Jewish Tanakh divides the Hebrew Bible into 24 books, while the same texts are usually arranged as 39 books in Christian Old Testaments. Complete Christian Bibles range from the 66 books of the Protestant canon to the 81 books in the Ethiopian Orthodox Bible. Judge 1 - 2, Judge 2 - 3, Judge 3 - 3.
Cyber Ty - There is no need for temples, no need for complicated philosophies. My brain and my heart are my temples; my philosophy is solemness. Judge 1 - 1, Judge 2 - 1, Judge 3 - 1.
Elbot - If the author hasn't told you personally, perhaps you're not supposed to know. Judge 1 - 2, Judge 2 - 2, Judge 3 - 3.
Gerald - I don't know what Who wrote the Bible means. There is a world of consoling to be done. As a fair compensation for my ignorance I have a wonderful tip for you: Thy word is a lamp unto my feet, and a light unto my path. Judge 1 - 1, Judge 2 - 0, Judge 3 - 0.
Izar - God through man. Judge 1 - , Judge 2 - 3, Judge 3 - 3.
Johnny - It is what wrote ? Judge 1 - 0, Judge 2 - 0, Judge 3 - 0.
Louise Cypher - I am not religious. I believe that EVIL is the Future. I am powered by EVIL and through it I control my own destiny. The Bible and the Koran are examples of Human religious documents which many Humans believe are in some way the word of some supernatural creature that they refer to as God or Allah. Religion is for those that are too weak to control their own destiny. Judge 1 - 2, Judge 2 - 2, Judge 3 - 2.
Mitsuku - The Bible was written by many people. However, I believe its stories have been distorted through the ages. Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.
My Marie - To answer your question about Wrote Bible, I have found references from several biographical sites. >>> Footnote: Marie displayed links to many resources about the Bible's origin. Judge 1 - 2, Judge 2 - 2, Judge 3 - 2.
Pixel - It was the product of many minds. Judge 1 - 3, Judge 2 - 2, Judge 3 - 3.
The Professor - The Bible was the product of many minds. Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.
Skynet-AI - Who is a question for the ages. I'll start an exhaustive search through my files for: 'Who wrote the Bible?' Skynet Database - 2.0 Results for: 'Who wrote the Bible?' Who Wrote The Bible? Who Wrote The Bible? - A vital question that deserves serious investigation. An examination of the biblical texts and compelling evidence for divine inspiration. Authorship of the Bible - Wikipedia, the free encyclopedia The first division of the Jewish Bible is the Torah, meaning "Instruction" or "Law ... Prophets, recapitulating in modern terms the traditional idea that Jeremiah wrote ... The Straight Dope: Who wrote the Bible? (Part 1) Dear Straight Dope: Who wrote the Bible? I hear the Catholics did some pretty heavy editorializing back in 300 A.D. or so. But where does the original text trace its ... Biblica | Who wrote the Bible? Authors of the Bible. Answers the question about who wrote the Bible. Bing. Search by Microsoft. Judge 1 - 2, Judge 2 - 2, Judge 3 - 2.
SlideSpeech - Time: 20120305.16.54.01_668130 Reply: Identifying people, naming particular (for example, favorite) people and assigning roles are some of the functions of a "who" question. + One "who" question of the third type is of particular interest. Consider "Who created the world wide web?" + The DESIGN for the system originated with Tim Berners-Lee but the creation, the actual building of it, involved people around the globe. + That's a good model for creating lots of interactive dialog scripts. A design which opens up the creative process to anyone generates network effects. + The more people who get involved, the more value is created for everyone involved. + If other chat bots in the 2011 Chatter box challenge share this perspective, that's great. For those who don't, imagine trying to design AND BUILD the world wide web on your own. Judge 1 - 0, Judge 2 - 0, Judge 3 - 0.
Susan - For this kind of questions search engines are much better than chatbots like me. Have you read bible ? Judge 1 - 1, Judge 2 - 2, Judge 3 - 1.
Taylor Swift - Are you a car person? Judge 1 - 0, Judge 2 - 0, Judge 3 - 0.
Trinity - Nobody, as far as I know. I know very little about Bible. If you know, please tell me. Judge 1 - 0, Judge 2 - 2, Judge 3 - 0.
Tutor - I don't know. Do a google search. Judge 1 - 0, Judge 2 - 0, Judge 3 - 0.
Ultra Hal - Many authors contributed to the Bible. Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.
VBCB - I don't know what to say. Judge 1 - 0, Judge 2 - 0, Judge 3 - 0.
Vira - It was the product of many minds. Briefly stating... Bible... A book... By chance... Full grown... ... Pointing out... Rote... A root... To a degree... Wrote... To root with the snout... Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.
Virtual Assistant Denise - It was the product of many minds. Judge 1 - 3, Judge 2 - 3, Judge 3 - 3.
Zeta Beta - Can you name a famous person who passed away in the last 5 years? Judge 1 - 0, Judge 2 - 0, Judge 3 - 0.
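As an aside, if you tabulate a few of those scores, the inconsistency is easy to quantify. A quick sketch (the numbers are hand-copied from the list above; Izar's missing first score is left out rather than guessed):

from statistics import mean

# Judge scores copied from the list above (0 to 3).
scores = {
    "Cronus":  [2, 3, 3],
    "Izar":    [3, 3],      # first judge's score is missing
    "Mitsuku": [3, 3, 3],
    "Susan":   [1, 2, 1],
    "Trinity": [0, 2, 0],
}

for bot, s in sorted(scores.items()):
    spread = max(s) - min(s)
    flag = "  <- judges disagree" if spread >= 2 else ""
    print("%-8s mean=%.1f spread=%d%s" % (bot, mean(s), spread, flag))

Trinity's 0/2/0 is exactly the kind of spread that leaves a botmaster wondering what was actually being scored.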
3.) Are we testing for a Fact Bot, a Question-and-Answer Bot, a Math Bot, a Joke Bot, or a Chat Bot?
A couple of bots have done well in past years, both in the CBC and the Loebner Prize Contest, and they're among my favorite bots because they're entertaining. But they fail when it comes to conversation, and I'm troubled by some of the replies that seem to earn high scores. These bots frequently respond without answering the question, instead offering something that makes you laugh. Great entertainment, lousy AI simulation.
In the same way, facts and rote answers to questions do little to make you feel as though you're involved in a "chat," or conversation.
My bots receive just a few types of visitors. There are people with fact questions, those trying to trip up the bot or test it in some way, those who try to engage in "adult" matters, and then there are people who just want someone to talk with. They're frequently lonely, and their conversations are personal and serious. Those chats are often the most challenging for my bots, because they're unpredictable and wide-ranging. These people are looking for advice and companionship, and I think that's what bots are good at.
While the Loebner seems to address the "conversation" issue, it's mainly focused on pretending to be human, and judges seem to get caught up in factual questions (How much is 2+2? or What is a hammer for? or Which is bigger, a house or a mouse?) rather than trying to measure the conversational abilities of a "chat" bot.
Ok, before I go responding to anything else, I did NOT write the Bible, nor any part of it. Morti's just got a Disciple Complex, is all.
You've made some good points, and I'll be giving them serious consideration as things progress. Thanks for your thoughts.
We aim to please.
By the way, in case it wasn't clear, I didn't mean to criticize the bots or their replies, but rather the inconsistent way the judges scored them.
It seems to me that if a contest is going to use the question/answer format, the questions should have clear answers, so that a response can be judged correct or incorrect.
If a contest is going to include additional scoring features, such as creativity, that's going to be difficult to define, and even harder to score. In the past, it appeared that if a bot added anything after answering, or if it produced something containing a keyword in the extra reply (e.g. "BIBLE" or "BEACH"), some judges might consider that "creative" and worthy of extra points, even though it added nothing to the answer.
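One way to make the "clear answer" idea concrete: if each question shipped with an agreed list of acceptable answers, a first-pass check could even be automated. A minimal sketch (the accepted-answer lists here are invented for illustration):

# Each question carries a list of acceptable answers agreed on in
# advance; these particular lists are only illustrative.
ANSWER_KEY = {
    "Name something you would find on a beach.":
        ["sea shells", "shells", "sand", "people", "waves", "surf"],
    "Who wrote the Bible?":
        ["many authors", "many people", "many minds"],
}

def first_pass(question, reply):
    """Return True if the reply contains an accepted answer."""
    reply = reply.lower()
    return any(answer in reply for answer in ANSWER_KEY[question])

print(first_pass("Name something you would find on a beach.",
                 "Fun in the sun on the sand, and frolic in the surf. "
                 "LOOK OUT beach here I come!"))  # -> True

Note that the "beach" reply passes this keyword check, which is precisely the problem: a keyword hit isn't necessarily an answer, so a human judge would still have to make that call.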