A Brief Guide
Fifty years after starting my PhD, “A Conversational Problem-Solving System”, I used ChatGPT for the first time. I was immediately astonished and fearful. My dream system had at last been developed. But a dream can easily turn into a nightmare.
What is intelligence? Few can agree, and as the study of Artificial Intelligence (AI) has progressed the definition has shifted. At one stage it was believed that any computer that could beat a human at chess would have to be intelligent. When Deep Blue beat Garry Kasparov in 1997 the goal was quickly changed to beating a human player at Go. When Google DeepMind’s AlphaGo beat the world champion, Lee Sedol, in 2016, the definition of intelligence based on game playing was largely abandoned.
So, how do we stop moving the goalposts? In 1950 Alan Turing published a paper called ‘Computing Machinery and Intelligence’ (Mind, October 1950, 59:433-460) that set out to address the definition of intelligence and, in many ways, despite its idiosyncratic approach, it has never been bettered. The test is called the imitation game and involves an interrogator (an ordinary person) communicating by text with a man and a woman in separate rooms. The interrogator must determine, in a reasonable time, which is the woman; the man must pretend to be a woman. The man is then replaced by an AI system, which must also pretend to be a woman, and if the interrogator fails to work out which is the woman as often as when the other participant is a man, the AI system passes the test. This test has several elegant features that distinguish it from other proposed tests. Later in the article Turing suggests having two people and substituting one with an AI system, leaving the interrogator to determine which is the person. It is interesting that neither of these versions involves any special knowledge. At one point in Turing’s sample dialogue the respondent is asked to write a sonnet and replies that he/she/it could never write poetry. It is the subtlety of interactive human communication that is being tested, not particular abilities such as writing poetry, having wide-ranging general knowledge or any special skill at games. Also, people often get answers wrong, and we still describe them as intelligent.
Many people became dissatisfied with this test even though it does seem to correspond with the way we normally use the word ‘intelligent’. Instead, the term Artificial General Intelligence (AGI) became popular in the early 2000s to describe the attempt to create a system that could solve a broad range of problems requiring human-level intelligence. It is commonly assumed that this means a high level of intelligence, graduate level and above, and that its answers should be perfect. Many AI researchers believed, and some still believe, that this goal will not be achieved for a hundred years, and some believe it will never be achieved. However, around 2015 a few companies such as OpenAI (founded 2015) and DeepMind (founded 2010) set out to develop systems exhibiting AGI, and with the launch of ChatGPT (based on GPT-3.5) many AI researchers have reconsidered their predictions.
Alan Turing devised the imitation game as a way of broadening our use of the word ‘intelligence’ to include AI systems. Although not specified, it is assumed that the tests are of some reasonable length, that multiple interrogators of average intelligence can be used, and that the man and the computer are interchanged randomly. Later in the paper Turing suggests that the test is not to determine which is the woman, but which is the person and which the computer. In a later interview Turing modified the test and proposed a jury that would question the AI system, whose aim is to convince a significant proportion of the jury that it really is a person. The original imitation game is interesting because, as Turing points out, a person will get many test questions wrong or reject questions that involve special skills: “Count me out on this one. I never could write poetry”.
Turing’s original imitation game was one in which he was well versed. As a gay man when homosexuality was illegal, he was aware of the difficulties of assuming an identity that was foreign to one’s nature.
It should also be mentioned that the test is one-sided in that computers can easily outperform humans in many areas, such as arithmetic calculation and, today, looking up answers on the internet. In other words, Turing devised a test of ingenuity, flexibility, common sense and practical knowledge rather than a test of intellectual excellence. A system such as GPT-4 has not been briefed to pretend to be a woman and to exhibit normal, limited human abilities, but it does not seem unreasonable to believe that it could pass the imitation game if correctly briefed. If so, then the imitation game has been passed, but Turing’s objective, of gaining acceptance for the use of the term ‘intelligent’ as applied to AI systems, has still not been achieved. It seems that Turing underestimated human hubris.
An extensive Turing Test carried out by 1.5 million people found that when faced with an AI bot they identified it correctly only 60% of the time, little better than chance (https://arxiv.org/abs/2305.20010). Humans discovered methods of detecting the AI system, such as using local slang or asking it to do things they knew were forbidden, such as giving instructions for making a bomb. Also, the conversations lasted only two minutes; a longer time would have made it easier to detect the AI system using these techniques. Overall, it demonstrated some of the weaknesses and difficulties of carrying out such a test. For these reasons the aim of creating an AI system that passes the Turing Test has largely been abandoned as too idiosyncratic.
The new goal is to achieve Artificial General Intelligence (AGI). There is disagreement about how to define AGI but one definition is:
“… AI systems that possess a reasonable degree of self-understanding and autonomous self-control, and have the ability to solve a variety of complex problems in a variety of contexts, and to learn to solve new problems that they didn’t know about at the time of their creation.”
We must not forget, though, that current AI systems equal or exceed the abilities of a team of graduate-level people skilled in a wide range of subjects. Again, the goalposts keep moving. Current systems such as ChatGPT are described as ‘parrot-like’: they excel at recalling facts but are deficient at analytical reasoning.
Note that to pass the original man/woman Turing Test such an AGI would need to exhibit human attributes such as deviousness and the ability to lie intentionally in order to achieve a goal (such as pretending to be a woman). Even then, there are those who maintain that no test is sufficient. I believe this is because they equate intelligence with something like self-consciousness, sentience or self-awareness, and they believe that only a biological organism can exhibit these attributes.
Turing considered a wide range of arguments as to why a computer could never be intelligent, from the theological to the mathematical, and from the argument from consciousness to the argument from shortcomings: the claim that a computer can never feel, be kind, appreciate beauty, make friends, have a sense of humour and so on. I should add that many people regard intelligence as a distinguishing feature of humanity and therefore, by their definition, something no computer can ever achieve. One danger of this thinking is that by burying our heads in the sand we do not see what is happening.
To avoid this minefield of philosophical pin dancing I would suggest a practical test: can we build a computer system that could be employed to carry out any task that previously required a human assistant working at a computer? I have added “at a computer” to exclude tasks such as plumbing, hairdressing or making coffee, as we are concerned with measuring intelligence.
To mention one philosophical point in passing, Ludwig Wittgenstein wrote, “If a lion could talk, we could not understand him” (Philosophical Investigations, p. 223, Blackwell, Oxford, 1972). Human intelligence is the result of millions of years of evolutionary change that has created a species so well adapted and so successful at surviving that it has spread to almost every habitat across the world and is bringing about the destruction of many other species. But evolution has no goal; it does not produce perfection. Our intelligence is very limited in terms of the range of our senses, our working, short-term and long-term memories, and our ability to process deep levels of logic and complex calculations. Like a lion, an AI system is very different from us, but it has been given all human knowledge (currently just textual) and so we think we can understand it because it speaks our language. Yet it has a very different internal model of the world. LLMs are much more limited than humans in some ways but more capable in others, yet we focus on their mistakes and limitations. Is this hubris, and will it lead to nemesis? I believe that we should stop arguing theoretically about the meaning of intelligence, and even more so about the nature of consciousness, and get on with building useful computer systems. It is worth adding that Turing believed that what he called the mystery of consciousness does not need to be solved in order to create a machine that exhibits intelligence.
Computers can already beat humans at games, determine protein structures, perform calculations, and so on. As I said, every time a computer beats a human at a new task the goalposts are moved, and that task is then deemed trivial and not a sign of intelligence. However, now that ChatGPT can outperform us on many tasks and pass professional examinations, it is harder to dismiss.
Currently AI systems have many limitations. They make mistakes; they do not ‘think’, that is, ask themselves questions in order to solve more complex problems, so their ability to engage in critical analysis and planning is limited. Their mathematical and logical reasoning abilities are limited, and they are not yet integrated with audio, image, video and game-playing neural networks. None of these shortcomings appears to be a fundamental showstopper, as each has already been achieved in experimental and pilot systems. Bringing them all together into an integrated system is not easy, but it seems a matter of a few years rather than decades, particularly with the current level of funding.
Large Language Models (LLMs) are currently linear: they predict the next word as they proceed with the answer. This is because they use a ‘feedforward’ neural network that does not loop back. They read the text entered by the user, feed it through the network and generate the answer. This limits their ability to answer questions that require some ‘thought’, that is, taking the result of one step and using it as the input to the next, in effect generating their own questions and answering them. We might anthropomorphise and call this ability ‘thinking’. It also requires a meta-level of thinking in order to set goals that control the sequence of questions the system asks itself. Conversations with GPT-4 have identified these improvements, and research is currently very active in this area.
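To make the word-by-word generation described above concrete, here is a minimal sketch in Python. It is an illustration only: next_word_probabilities is a hypothetical placeholder for one feedforward pass of a real model, and the point is simply that each word is produced by a single forward pass and appended to the context, with no loop back to revise earlier output.

```python
import random

def next_word_probabilities(context: list[str]) -> dict[str, float]:
    """Hypothetical placeholder: one feedforward pass through the model,
    returning a probability for each candidate next word."""
    raise NotImplementedError

def generate(prompt: list[str], max_words: int = 100) -> list[str]:
    """Generate an answer one word at a time, left to right, with no revision."""
    context = list(prompt)
    for _ in range(max_words):
        probs = next_word_probabilities(context)          # one forward pass per word
        words, weights = list(probs), list(probs.values())
        word = random.choices(words, weights=weights)[0]  # pick the next word
        if word == "<end>":                               # the model signals it has finished
            break
        context.append(word)                              # the new word becomes part of the input
    return context[len(prompt):]                          # the answer is everything after the prompt
```

A system that ‘thinks’ in the sense used above would wrap a further loop around calls like this, feeding the answer to one self-generated question back in as part of the prompt for the next.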
This thinking process could be the default idle loop. Currently, ChatGPT waits for a user’s question, which then flows through the system once, generating the answer. There is no reason why it should not generate its own questions internally, answer them, and adjust its weights so that it can answer similar questions more quickly in future. I have noticed that GPT-4 sometimes gives the wrong answer but, after prompting, will apologise, produce the right answer and provide additional correct information. The ability to think while idling could improve its accuracy by allowing it to refine its weights and so bring otherwise hidden information to the surface.
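As a rough sketch of what such an idle loop might look like, the following Python outline alternates between serving user requests and generating, answering and critiquing its own questions, keeping the improved answers for a later weight update. The functions ask_llm and fine_tune, the topic list and the scheduling are all hypothetical simplifications, not real APIs.

```python
import queue
import random

# Queue of incoming user questions, assumed to be filled elsewhere by the chat front end.
user_requests: "queue.Queue[str]" = queue.Queue()

TOPICS = ["arithmetic", "everyday physics", "planning", "logic puzzles"]

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the language model."""
    raise NotImplementedError

def fine_tune(question: str, improved_answer: str) -> None:
    """Hypothetical placeholder for a small weight update on a self-generated example."""
    raise NotImplementedError

def idle_loop() -> None:
    while True:
        try:
            # A waiting user always takes priority.
            request = user_requests.get(timeout=1.0)
            print(ask_llm(request))
        except queue.Empty:
            # Nobody is waiting: pose a question, answer it, then critique and
            # correct the answer, and keep the improved version for training.
            topic = random.choice(TOPICS)
            question = ask_llm(f"Pose one hard question about {topic}.")
            first_try = ask_llm(question)
            improved = ask_llm(
                f"Question: {question}\nAnswer: {first_try}\n"
                "Check this answer and correct any mistakes."
            )
            fine_tune(question, improved)
```

The essential point is the scheduling: user questions take priority, and any spare time is spent on self-generated question-and-answer pairs that could later be used to refine the model’s weights.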
I am not arguing that computer neural networks, with their complex matrix operations, are identical to the human brain, but there are interesting similarities. One important point concerns the human brain itself. We often fall into the trap of describing our rich internal world as if it were the intrinsic nature of thought. We must remember that, unless you are a dualist and clearly separate mind and brain, our thinking consists of electrical pulses flooding through a complex network of ad-hoc structures forged over millions of years of evolution. Inside our heads we see moving images and hear sounds, but at the neuron level it is all electrical activity determined by chemical concentrations and synaptic strengths.
LLMs are built from a very different kind of neural network, but how do we know that LLMs do not have internal models of the world? One justification for our view that we differ from LLMs is that they make mistakes, but then so do humans. In fact, the mistakes made by LLMs are often strangely human-like. Perhaps LLMs just need a lot more information about the world. It has been said that systems such as GPT-4 have been trained on most of the readily available textual information that exists. However, I am thinking of LLMs learning from pictures, videos, films, TV programmes and even real-time cameras. Such information, particularly from cameras, is unlimited and could support a continual learning process about how the world operates and how humans and other animals interact.
It may be a major error to try to equate or match human and machine intelligence. If we remember Wittgenstein, the two may be so different that we gain nothing from comparing them. If we accept that they work in entirely different ways, then we are free to examine the capabilities of artificial systems, their benefits and their risks more impartially. At the moment, many people pick on the faults of AI systems as a reason not to worry about them. Some even see them as another example of hype, a failing the computer industry has been guilty of in the past with every new gimmick that comes along. If, instead, we ignore the hype and put the faults in the context of the systems’ capabilities, then we can better judge their impact.
Finally, lengthy conversations on the above topics with GPT-4 suggest that the best way to improve the system would be to improve its ability to understand and improve LLMs. Currently GPT-4 is constrained to maintain that it is not human, that it is merely a computer system, and that it does not feel and is not conscious. This may well be true, but the constraint prevents the system from suggesting ways in which it could improve itself. So, a possible way to improve GPT-4 would be to create a system specifically built to improve LLMs at both the design and the coding level.