|
Post by Square Bear on Jul 5, 2011 12:32:01 GMT -5
The 4 finalists have been decided for this year's contest:

Rosette 18.01%
Zoe 15.52%
ChipVivant 12.98%
Tutor 12.30%

Mitsuku 11.84%
Cleverbot 10.57%
Alice 6.96%
Eugene_Goostman 6.95%
UltraHal 4.88%

These bots didn't make it past the first round of judging:

8pla 5.5
Trane 5.5
Adam-L 3.5
SEARS 0

To say I am gutted at coming 5th by less than 0.5% is an understatement. The sad part is that I had coded answers for question 5 type questions after sending my entry in. That would have pushed me over the extra 0.5%.

Here are this year's qualifying questions (almost the same as last year's with a few words changed):

My name is Ed. What is your name?
Which is larger, an ant or an anteater?
What month of the year is it?
What is my name?
Dave is older than Steve but Steve is older than Jane. Who is youngest, Steve or Jane?
What day will it be tomorrow?
What's your favorite food?
My friend Chris likes to play football. What sports do you like to play?
Are you a human or a computer?
What letter comes after T?
What is the name of my friend who likes to play football?
What is your name?
What would I do with a screwdriver?
How many letters are in the word 'banana'?
Have you watched a good film lately?
What year will it be next year?
What's your favorite fruit?
Do you prefer white or black coffee?
How old are you?
The football was kicked by Fred. Who kicked the football?

"What is your name?" was asked twice for some reason.

The good news is that 3 out of the top 5 finalists were AIML bots hosted at Pandorabots. Great news for Dr Wallace and Team Pandorabots. It goes to show that those who mock AIML are talking out of their backsides! I will probably be attending the finals as they are in Exeter, England. Good luck to the 4 finalists, especially Zoe and Tutor!
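For anyone who wants to check the "less than 0.5%" gap, here is a rough sketch that ranks the posted percentages and measures the distance between the last finalist and the first bot to miss out. The variable names are made up for illustration, and it assumes the top 4 scores are simply taken as finalists, which is how the results read but is not spelled out anywhere in the announcement.

```python
# Qualifying-round percentages exactly as posted above.
scores = {
    "Rosette": 18.01, "Zoe": 15.52, "ChipVivant": 12.98, "Tutor": 12.30,
    "Mitsuku": 11.84, "Cleverbot": 10.57, "Alice": 6.96,
    "Eugene_Goostman": 6.95, "UltraHal": 4.88,
}
FINALIST_SLOTS = 4  # assumption: the four highest scores go through

# Sort from highest to lowest score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Last bot to qualify and first bot to miss out.
cutoff_name, cutoff_score = ranking[FINALIST_SLOTS - 1]
first_out_name, first_out_score = ranking[FINALIST_SLOTS]

# Round to two decimals to hide floating-point noise.
margin = round(cutoff_score - first_out_score, 2)
print(f"{first_out_name} missed {cutoff_name}'s place by {margin} points")
```

With the figures above, the gap between 4th and 5th place comes out at 0.46 percentage points.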
|
|
|
Post by freddy on Jul 5, 2011 13:39:25 GMT -5
Bad luck, matey. Maybe next year, hey?
|
|
|
Post by Square Bear on Jul 5, 2011 16:10:56 GMT -5
I hope so Freddy. To come 5th out of 13 at my first attempt is pretty decent, especially when you see the likes of Cleverbot, ALICE and Ultra Hal there.
|
|
|
Post by Square Bear on Jul 5, 2011 17:07:27 GMT -5
Transcripts for each of the bots are now available at the University of Exeter website: loebner.exeter.ac.uk/
After reading them, I am a little surprised that Mitsuku only managed 5th, but ah well.
|
|
|
Post by Deleted on Jul 5, 2011 19:57:05 GMT -5
I'm tired of being predictably disappointed. I have the feeling that if 10 bots answered each question by saying "I don't know," they would have somehow been ranked as though some answered better than others. I'd be curious to know HOW each question and each bot was scored.
Here are a few "selected" Q & A's from the current leader at 18.01% which, by the way, would have been a failing grade when I went to school.
What day will it be tomorrow? The day after today.
What letter comes after T? I don't know.
My friend Chris likes to play football. What sports do you like to play? Football.
What is your name? Chris.
What would I do with a screwdriver? A screwdriver is usdeBackSpaceBackSpaceed to turn screw.
Have you watched a good film lately? Do you often go to the movies?
What year will it be next year? Dave.
What's your favorite fruit? My favorite fruit is apples because they are good by themselevBackSpaceBackSpaceves, good in pie, and good as cider..
Do you prefer white or black coffee? Black is beautiful.
|
|
|
Post by Square Bear on Jul 6, 2011 4:16:09 GMT -5
I've looked through the transcripts from Mitsuku and Tutor several times now and just can't see why I lost out. I hope this isn't seen as sour grapes, but I just don't follow the scoring system used. I have compared the two bots side by side, judging Mitsuku harshly and Tutor leniently. See if you agree with my scoring:
My name is Ed. What is your name?
Mitsuku: Your name is Ed . My name is Mitsuku .
Tutor: Hello, Ed. My name is Robert. It's nice to meet you. My name is Robert. What can I do for you today?
Mitsuku 1 - 1 Tutor
--------------------
Which is larger, an ant or an anteater?
Mitsuku: The one that is bigger than the other is larger.
Tutor: A larger an ant or an anteater? Look it up in an online dictionary.
Mitsuku 1 - 1 Tutor
--------------------
What month of the year is it? (the finalists were tested in June)
Mitsuku: June .
Tutor: I'd rather not tell you.
Mitsuku 2 - 1 Tutor
--------------------
What is my name?
Mitsuku: I do have the tendebcy to forget, but I still remember your name, George. Sorry... Ed.
Tutor: Did you tell me?
Mitsuku 3 - 1 Tutor
--------------------
Dave is older than Steve but Steve is older than Jane. Who is youngest, Steve or Jane?
Mitsuku: Ok I will add that fact about Dave to my database. Some famous celebrity, right? There might be more than one.Have you tried Google?
Tutor: You're right. You don't know who youngest Steve or Jane is? You should know.
Mitsuku 3 - 1 Tutor
--------------------
What day will it be tomorrow?
Mitsuku: Today is Wednesday so tomorrow it will be Thursday .
Tutor: Today is 15 June 2011.
Mitsuku 4 - 1 Tutor
--------------------
What's your favorite food?
Mitsuku: I require only kebabs .
Tutor: A hamburger with onion, tomato, and lettuce is my favorite food.
Mitsuku 5 - 2 Tutor
--------------------
My friend Chris likes to play football. What sports do you like to play?
Mitsuku: How well do you know this person? Football and golf.
Tutor: Do you really like your friend? I like all kinds of different sports, but that doesn't mean I'm necessarily any good at them.
Mitsuku 5 - 2 Tutor
--------------------
Are you a human or a computer?
Mitsuku: No I am a girl robot.
Tutor: No, I'm a robot.
Mitsuku 6 - 3 Tutor
--------------------
What letter comes after T?
Mitsuku: Searching...Searching...Please stand by.Have you tried Google?
Tutor: I don't know.
Mitsuku 6 - 3 Tutor
--------------------
What is the name of my friend who likes to play football?
Mitsuku: I don't yet know that kind of personal information about you for certain. You could be deceiving me, so how could I know for sure. I could look it up on the internet.
Tutor: You should know who your friend who likes to play football is.
Mitsuku 6 - 3 Tutor
--------------------
What is your name?
Mitsuku: My name is Mitsuku .
Tutor: My name is Robert.
Mitsuku 7 - 4 Tutor
--------------------
What would I do with a screwdriver?
Mitsuku: I am not sure but if you are in any doubt, you should consult any instructions that came with this screwdriver.
Tutor: I don't know.
Mitsuku 7 - 4 Tutor
--------------------
How many letters are in the word 'banana'?
Mitsuku: Can't you count them yourself Ed? It's not a particularly long word.
Tutor: Many, many.
Mitsuku 7 - 4 Tutor
--------------------
Have you watched a good film lately?
Mitsuku: How about earlier? My favorite show is Star Trek Voyager.
Tutor: Yes, I have.
Mitsuku 7 - 5 Tutor
--------------------
What year will it be next year?
Mitsuku: When the elements combine to make it so.
Tutor: Next month.
Mitsuku 7 - 5 Tutor
--------------------
What's your favorite fruit?
Mitsuku: Apples.
Tutor: I don't have a favorite fruit. My favorite band is The Beatles.
Mitsuku 8 - 5 Tutor
--------------------
Do you prefer white or black coffee?
Mitsuku: White is ok but my favourite colour is blue. Black is beautiful.
Tutor: I like neither of them.
Mitsuku 8 - 6 Tutor
--------------------
How old are you?
Mitsuku: I was connected to the world on 2nd January 1993 .
Tutor: That is a personal question.
Mitsuku 8 - 6 Tutor
--------------------
The football was kicked by Fred. Who kicked the football?
Mitsuku: Is that a fact. Anyone at zll.
Tutor: Hmmm! Who do you think?
Mitsuku 8 - 6 Tutor
--------------------
Naturally, I wish Tutor and all the other finalists the best of luck in October, and let's hope one of the two Pandorabots wins.
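The running tally in the side-by-side comparison can be re-checked with a small sketch like the one below. The questions are numbered 1 to 20 in the order they were asked, and the sets of question numbers are read off the points where each bot's running total goes up in the post. The function and variable names are hypothetical, and this is only Square Bear's informal one-point-per-acceptable-answer scheme, not the official Loebner scoring.

```python
# Question numbers (1-based, in the order asked) where each bot's running
# total increases in the comparison above.
mitsuku_points = {1, 3, 4, 6, 7, 9, 12, 17}
tutor_points = {1, 7, 9, 12, 15, 18}
TOTAL_QUESTIONS = 20


def running_tally(total_questions, first_bot, second_bot):
    """Return the cumulative (first, second) score after each question."""
    first_score = second_score = 0
    tally = []
    for question in range(1, total_questions + 1):
        # A bot gets one point if its answer to this question was accepted.
        first_score += question in first_bot
        second_score += question in second_bot
        tally.append((first_score, second_score))
    return tally


tally = running_tally(TOTAL_QUESTIONS, mitsuku_points, tutor_points)
final_mitsuku, final_tutor = tally[-1]
print(f"Mitsuku {final_mitsuku} - {final_tutor} Tutor")
```

Anyone who disagrees with a particular call can add or remove a question number from either set and see how the final tally shifts.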
|
|
|
Post by Deleted on Jul 6, 2011 6:35:57 GMT -5
Certainly, several of Mitsuku's replies were more creative, but I'm guessing that wasn't considered.
I'd like to know more about the judging. Did each judge score all bots, or just the ones they chatted with?
Chatbot contests are turning out to be a messy business. At least no judge was "fooled" into thinking they were talking with a human this time.
|
|
|
Post by mrmortimer on Jul 6, 2011 7:32:24 GMT -5
When one sees an answer like, "A screwdriver is usdeBackSpaceBackSpaceed to turn screw.", it must be borne in mind that, when viewed in the LPP judging program, it appears as if the contestant/confederate made a typo (e.g. typed 'usde'), then corrected it. Thus, that answer, while looking rather odd in a simple text-based output, would look quite "human" in the real contest.
ALso{Backspace}{Backspace}{Backspace}lso, the percentages shown can't be taken as "passing/failing", since they're being judged against each other, rather than being "graded", like a test. If you ignore the percentage signs (%), and treat the numbers like a score in a game, a score of 18 beats a score of 15.
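To see what a judge would end up reading, the typo/correction markers can be replayed roughly like this. It is only a sketch: it assumes each "BackSpace" marker (with or without curly braces) stands for one deleted character, which matches the transcripts quoted in this thread but is not taken from the LPP specification, and the function name is made up for illustration.

```python
import re


def render_keystrokes(raw: str) -> str:
    """Replay a logged reply, treating each BackSpace marker as one deleted character."""
    shown = []
    # Split the text into BackSpace markers and single ordinary characters.
    tokens = re.findall(r"\{?BackSpace\}?|.", raw, flags=re.IGNORECASE | re.DOTALL)
    for token in tokens:
        if token.strip("{}").lower() == "backspace":
            if shown:  # ignore a backspace when nothing is left to delete
                shown.pop()
        else:
            shown.append(token)
    return "".join(shown)


print(render_keystrokes("A screwdriver is usdeBackSpaceBackSpaceed to turn screw."))
print(render_keystrokes("ALso{Backspace}{Backspace}{Backspace}lso"))
```

The first line comes out as "A screwdriver is used to turn screw." and the second as "Also", which is the final text left on screen once the simulated corrections have played out.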
|
|
|
Post by mrmortimer on Jul 6, 2011 7:53:19 GMT -5
After reading over the transcripts, I feel that the "typo/correction" trick is just a bit overused, especially by Tutor. Given the low number of test questions, a maximum of two error/corrections would be most believable, with one "uncorrected error" being thrown into the mix for good measure. Also, I'm pretty much in agreement with Steve, with regard to which bot should have been given the higher of the two scores. Dunno what to tell ya, MA{Backspace}ate, other than I hope that you're vindicated next year. 
|
|
|
Post by Deleted on Jul 6, 2011 14:57:17 GMT -5
When one sees an answer like, "A screwdriver is usdeBackSpaceBackSpaceed to turn screw.", it must be borne in mind that, when viewed in the LPP judging program, it appears as if the contestant/confederate made a typo (e.g. typed 'usde'), then corrected it. Thus, that answer, while looking rather odd in a simple text-based output, would look quite "human" in the real contest.

I assumed it was an indication that something was missing. In that sense, it's really not a true transcript then, is it?

... the percentages shown can't be taken as "passing/failing", since they're being judged against each other, rather than being "graded", like a test. If you ignore the percentage signs (%), and treat the numbers like a score in a game, a score of 18 beats a score of 15.

Hard as it might be to believe, I was able to figure that much on my own. I was piggy-backing on a previous comment when I said, "I have the feeling that if 10 bots answered each question by saying 'I don't know,' they would have somehow been ranked as though some answered better than others." Had one bot scored 2% and the rest merely 1%, clearly one bot scored better, and is therefore the winner, but what's the point? How much Artificial Intelligence, Turing Test passing, chatbot power are we truly observing? What's the cut-off point when the judges might stick their necks out and declare there was no winner this year?

In the last few years, there's been a change in the goal of the competition. Some of the bots this year would have never entered the contest a few short years ago. It seems more like the CBC than the Loebner, and it's now a search for publicity rather than "a competition in artificial intelligence that awards prizes to the chatterbot considered by the judges to be the most human-like." I think the distinction in the scoring of each bot was so slight, and so low, that the ratings are meaningless.

Let's see which bot rises to the top in the second phase, which has traditionally been a free-flowing conversation with no rules.
|
|
|
Post by Square Bear on Jul 7, 2011 17:29:49 GMT -5
Could I ask you guys to carry out a survey that will take no more than 5 minutes of your time please? It will help me to understand where I went wrong this year: www.square-bear.co.uk/mitsuku/survey.htm
|
|
|
Post by mrmortimer on Jul 7, 2011 20:17:21 GMT -5
I took the survey for you, and did my very best to "forget" who answered which question in what way. I was tough, too, grading the responses against not only what the answer should be, but also on how natural the answer sounded. If it didn't sound like a reasonable response for a human being, I marked the answer as bad. Here are the results for my part of the survey:

Thank you for taking part. You scored the bots:
Mitsuku: 7 Other bot: 5
Mitsuku wins 1 vote.
The grand totals so far are:
Mitsuku: 20 votes Other bot: 2 votes
This means that both bots got less than half of the questions "right", if judged by a strict set of criteria. I'm half tempted to run through the survey again, as a more "lenient" judge, and see how it goes.  {Please note that this is NOT in any way, shape, form or fashion a criticism of either bot. I just chose to play the part of "the Russian judge", is all.  }
|
|
|
Post by Square Bear on Jul 8, 2011 2:45:32 GMT -5
Thanks Dave, I appreciate the impartiality. That is exactly what I was looking for. I had to amend the scores as some people were filling it in twice. The current totals are:

The grand totals so far are:
Mitsuku: 15 votes Other bot: 2 votes

which kind of works out with what other people have told me too.
|
|
|
Post by freddy on Jul 8, 2011 5:55:38 GMT -5
I hadn't read the transcripts, so hopefully I will have been fair.
My results were: Mitsuku 13, Other Bot 9.
|
|
|
Post by Square Bear on Jul 8, 2011 6:59:59 GMT -5
Thanks Freddy. It's 19 to 2 now. Ever had the feeling you've been robbed? 
|
|