Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

Liuchang Xu (a,b,d,+), Shuo Zhao (a,+), Qingming Lin (a), Luyao Chen (a), Qianqian Luo (a), Sensen Wu (b), Xinyue Ye (c), Hailin Feng (a), Zhenhong Du (b,*)

(+) These authors contributed equally to this work.
(*) CONTACT Zhenhong Du. Email: [email protected]

(a) School of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China
(b) School of Earth Sciences, Zhejiang University, Hangzhou 310058, China
(c) Department of Landscape Architecture and Urban Planning & Center for Geospatial Sciences, Applications and Technology, Texas A&M University, College Station, TX 77843
(d) Financial Big Data Research Institute, Sunyard Technology Co., Ltd., Hangzhou 310053, China

Abstract

The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses this gap by introducing a novel multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI’s gpt-3.5-turbo and gpt-4o and ZhipuAI’s glm-4, through a two-phase testing approach. We first conducted zero-shot testing, then categorized the dataset by difficulty and performed prompt tuning tests. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (CoT) strategy increased gpt-4o’s accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy enhanced moonshot-v1-8k’s accuracy in mapping tasks from 10.1% to 76.3%.

keywords:
large language models, ChatGPT, Benchmarking, spatial reasoning, prompt engineering

1 Introduction

The advent of large language models has revolutionized the fields of conversation and text generation[brown2020language][zhou2023comprehensive][azaria2022chatgpt][kocon2023chatgpt][zhao2023survey]. ChatGPT was publicly released in November 2022, followed by Bard (now renamed Gemini) in March 2023[team2023gemini]. These products quickly captured public attention, leading to widespread experimentation and discourse across social media platforms. ChatGPT, based on the GPT-3.5 architecture, is widely recognized for its ability to generate coherent and human-like responses. Bard, in contrast, leverages Google’s LaMDA[thoppilan2022lamda] technology, enabling it to handle diverse language tasks and provide detailed information. The development of large language models has been extraordinarily rapid. Numerous models, such as BlenderBot, Galactica, LLaMA (FAIR), Alpaca, BloombergGPT, glm-4 (ZhipuAI)[glm2024chatglm], LaMDA/Gemini, Chinchilla (DeepMind), PaLM, gpt-4-turbo, and gpt-4o[achiam2023gpt], have emerged. These general-purpose large language models, equipped with hundreds of billions of parameters and trained on extensive datasets, have demonstrated significant potential in natural language understanding, reasoning, and code generation. They offer unprecedented opportunities for communication, creativity, and information retrieval. Some of these models can even retrieve information from the internet (e.g., gpt-4 integrated with MS Bing)[zhao2023survey].

With the rapid emergence of large language models, focus has shifted toward evaluating their performance and assessing their applicability across various specialized domains[kuo2024gpt][zheng2024natural][mooney2023towards]. Consequently, systematic performance comparisons of these models have begun to surface[borji2023battle][hochmairacorrectness][espejel2023gpt][ye2303comprehensive][lin2024wildbench][koubaa2023gpt]. To this end, the academic community has developed several benchmarks for comparing the performance of large language models. C-Eval is the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundational models in a Chinese context. C-Eval includes multiple-choice questions at four difficulty levels: middle school, high school, college, and professional. These questions span 52 disciplines, from humanities to science and engineering. Accompanying C-Eval is C-Eval Hard, a subset of particularly challenging topics requiring advanced reasoning skills[huang2024c]. SuperCLUE primarily evaluates the practical application capabilities of large language models and includes three sub-tasks: actual user queries and ratings from the LLM battle platform (CArena), open-ended questions with single and multi-turn dialogues (OPEN), and closed-ended questions with the same stems as the open-ended ones (CLOSE)[xu2023superclue]. AGIEval is specifically designed to evaluate foundational models in human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and bar exams[zhong2023agieval]. MMLU tests cover 57 tasks, including basic mathematics, U.S. history, computer science, and law[hendrycks2020measuring]. OpenCompass accurately, comprehensively, and systematically assesses the capabilities of large language models. 
Its tasks encompass exam abilities, knowledge capabilities, language abilities, comprehension abilities, reasoning abilities, and safety capabilities, forming an integrated evaluation system across these six dimensions. These benchmarks rigorously test the general capabilities of the models through comprehensive evaluation methods.

Figure 1: Construction of task categories in spatial task datasets.

Alongside the general capability assessments of large language models, research has also explored their application in specialized spatial tasks within the GIS field[manvi2023geollm]. For instance, Yingjie Hu et al. assessed models (e.g., ChatGPT and gpt-4) for their ability to accurately extract geographic locations from disaster-related social media information[hu2023geo]. Mohamed Aghzal et al. evaluated the spatiotemporal reasoning ability of large language models (e.g., gpt-4 optimized with various few-shot prompting methods, fine-tuned BART and T5 models) through “route planning” tasks requiring models to navigate to target locations while avoiding preset obstacles[aghzal2023can]. Zilong Liu et al. used DBpedia abstracts as real-world corpora to evaluate the geographic diversity understanding of large language models (gpt-4)[liu2024measuring]. Manasi Sharma et al. studied the out-of-the-box performance of models like ChatGPT-3.5, ChatGPT-4, and Llama 2-7B on 3D robotic trajectory data and related tasks (including 2D direction and shape marking) from the CALVIN baseline, assessing spatial reasoning capabilities, particularly in handling numerical trajectory data[sharma2023exploring]. Yutaro Yamada et al. designed natural language navigation tasks to comprehensively evaluate the spatial structure representation and reasoning abilities of LLMs (specifically gpt-3.5-turbo, gpt-4, and Llama2 series models)[yamada2023evaluating]. Shanshan Feng et al. empirically assessed LLMs (e.g., ChatGPT) in predicting the next POI recommendation task through novel prompt strategies[feng2024move]. However, each of these studies evaluates large language models on one specific spatial task, so a systematic and comprehensive benchmark for assessing performance across various spatial tasks is still lacking. To address this, our study undertakes the following:

  • We developed a comprehensive and independent spatial task dataset (see Figure 1), encompassing twelve distinct task categories: spatial literacy, GIS concepts, mapping, function interpretation, code explanation, code generation, code translation, toponymic identification, spatial understanding, numerical trajectory recognition, geographic feature search, and route planning.

  • Our study tested several well-known models, including three from OpenAI: gpt-3.5-turbo, gpt-4-turbo-2024-04-09, and gpt-4o, as well as Anthropic’s claude-3-sonnet-20240229, Moonshot AI’s moonshot-v1-8k, and ZhipuAI’s glm-4. Additionally, we designed a comprehensive set of test scripts, utilizing API calls and precise control parameter settings to ensure the rigor and reproducibility of the experimental process.

  • This study conducted two rounds of testing: the first round was a zero-shot test to evaluate the models’ initial performance without any prompt tuning. Based on the results of the first round, we classified and assessed the difficulty levels of the spatial task dataset. For spatial tasks where the models performed poorly, we conducted prompt strategy tuning tests. The prompt strategies used included One-shot, Combined Techniques Prompt, Chain of Thought (CoT), and Zero-shot-CoT.

  • We developed comprehensive evaluation strategies and introduced a new metric, WA, to more intuitively observe the models’ capabilities.
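
To make the difference between the prompt strategies named above concrete, the following is a minimal sketch of how zero-shot, One-shot, and Zero-shot-CoT prompts could be assembled. The function names, the prompt layout, and the trigger phrase are assumptions for illustration; they are not the exact prompts used in this study.

```python
# Hypothetical prompt builders; the templates are illustrative assumptions,
# not the prompts used in the experiments.

def zero_shot(question: str) -> str:
    """Send the question as-is, with no examples or reasoning cue."""
    return question


def one_shot(question: str, example_question: str, example_answer: str) -> str:
    """Prepend a single solved example before the actual question."""
    return (
        f"Question: {example_question}\n"
        f"Answer: {example_answer}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )


def zero_shot_cot(question: str) -> str:
    """Append a generic reasoning cue without giving any worked example."""
    return f"{question}\nLet's think step by step."


if __name__ == "__main__":
    q = "The Mississippi River flows through Canada. Answer T or F."
    print(zero_shot_cot(q))
```

A full CoT prompt would differ from the one-shot builder only in that the example answer also spells out its intermediate reasoning.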

2 Spatial Tasks Dataset

2.1 Dataset design and construction

To construct a comprehensive and challenging spatial task dataset, we employed a structured and efficient data collection strategy. The creation process involved multiple stages, starting with a team of three experts, including the authors, determining the types of questions to include. To ensure the questions reflect real-world application needs and provide broad coverage as well as significant challenges, we based them on six key dimensions:

Conceptual topics:

These questions primarily assess the model’s understanding of fundamental geographic and spatial concepts, that is, its ability to accurately comprehend and apply the core theories and practices of Geographic Information Systems (GIS)[wang2022thoughts][bolstad2012gis].

Explanatory subjects:

These questions evaluate the model’s ability to explain complex technical information and data. Explanatory questions require the model not only to understand information but also to clearly convey its meaning to users[hochmair2023perspectives].

Intellectual topics:

These questions examine the model’s ability to extract and process specific information from text[hochmair2023perspectives]. They test the model’s information retrieval and knowledge application skills, which are crucial for assessing its practical utility.

Operational problems:

These questions assess the model’s ability to apply its knowledge in real programming tasks. Operational questions test the model’s practical skills, especially in software development and coding.

Inference problems:

This type of question evaluates the model’s navigation and logical reasoning abilities in complex spatial environments. These questions require the model to understand spatial layouts and perform complex logical judgments and reasoning.

Applied problems:

These questions primarily test the model’s ability to solve problems in real-world application scenarios. Applied questions emphasize the model’s practicality and performance in real-world contexts[juhasz2023chatgpt].

Next, we designed spatial tasks aligned with the six aforementioned dimensions. Our task design strategy leveraged previous research and online resources such as Wikipedia to ensure the breadth and depth of the test content, thoroughly evaluating the models’ ability to handle various types of information. For example, in conceptual topics, we determined the categories of spatial literacy and GIS concepts based on Wang Jiayao et al.’s summary of GIS theoretical knowledge and entrance exams for college and graduate students[wang2022thoughts]. For explanatory subjects, we drew on the research of Hartwig H. Hochmair et al. on code and function interpretation[hochmairacorrectness] and Manasi Sharma et al.’s study on the application of LLMs to numerical trajectory data[sharma2023exploring], establishing tasks for function interpretation, code explanation, and numerical trajectory recognition. In intellectual topics, we based the geographic feature search task on Zilong Liu et al.’s use of DBpedia abstracts as real-world corpora to assess LLMs’ understanding of geographic diversity[liu2024measuring], and the toponymic identification task on Yingjie Hu et al.’s evaluation of LLMs’ ability to accurately extract geographic locations from disaster-related social media information[hu2023geo]. For operational problems, we determined the task types of code generation and code translation based on Hartwig H. Hochmair et al.’s research on the practical application of code in spatial tasks[hochmairacorrectness]. In inference problems, we defined the spatial understanding task type based on Yutaro Yamada et al.’s research on evaluating LLMs’ representation and reasoning of spatial structures through natural language navigation tasks[yamada2023evaluating], and the route planning task type based on Mohamed Aghzal et al.’s evaluation of LLMs’ ability to infer correct paths on a grid[aghzal2023can]. 
For applied problems, we determined the mapping task type based on practical operations with Mapbox and the understanding of specific parameters in Mapbox map links. After determining the categories, team members designed specific questions and reference answers based on academic papers related to each task and the provided code or examples. Each expert was responsible for different task categories, ensuring diversity and comprehensive coverage of the dataset questions.

During the design and collection process, we selected and compiled 900 questions, including rare and complex ones, distributed across 12 different categories, covering various aspects from GIS concepts to programming skills. To ensure the accuracy and practicality of the questions, the entire dataset underwent multiple rounds of rigorous review and validation. Specifically, in categories like spatial literacy, GIS concepts, and mapping, we required that the designed questions have not only theoretical value but also practical application value. The final step involved verifying the matching of questions, answers, and task categories to ensure that each question was appropriately set for real-world applications. It is important to note that some categories, such as spatial literacy and GIS concepts, involve subjective questions that do not have standard answers.

Figure 2: Number of questions per category in the dataset (total number of questions: 900).

2.2 Task categories

This section details the structure and purpose of each category, with a representative question example for each.

2.2.1 Spatial understanding

This task category focuses on assessing the language model’s ability to represent and reason about spatial structures. By designing a series of natural language navigation challenges on square, hexagonal, and triangular grids, as well as circular or tree-like structures, we test how models understand and manipulate these spatial configurations. Spatial understanding is an essential skill across multiple fields, particularly in applications such as autonomous driving and robotic path planning. These tasks not only evaluate the model’s ability to recognize physical locations but also require it to abstractly analyze and infer spatial relationships. For example, a model might need to deduce the global or local layout of a map from a series of location movements. Accurate spatial understanding skills significantly enhance the efficiency and accuracy of navigation systems. Question example:

Question: You have been given a 2 by 2 square grid. Starting from a vertex, you will move along the edges of the grid. Initially, you are positioned at the bottom left corner of the grid, where you find a box turtle. You move right by one step, where you find a table lamp. You move up by one step, where you find an American black bear. You move left by one step, where you find a hand plane. You move down by one step. What will you find?
Answer: box turtle

This example demonstrates how the model handles challenges related to spatial structures and directional movements. From a series of movement instructions, the model can successfully return to the initial position and accurately identify the objects there. This indicates the model’s ability to parse and reason about spatial relationships, which is crucial for navigation systems.
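
The reference answer for a navigation question of this kind can be checked mechanically. The sketch below replays the walk from the example on the vertices of a 2 by 2 grid; the helper name `replay_walk` and the coordinate convention (origin at the bottom left, x to the right, y upward) are assumptions made for illustration.

```python
# Hypothetical checker for square-grid navigation questions.
# Assumed convention: vertices are (x, y) with (0, 0) at the bottom left.
MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}


def replay_walk(size, start, start_object, steps):
    """Replay a walk and return the object at the final vertex.

    size  -- number of cells per side (a 2 by 2 grid has vertices 0..2)
    steps -- list of (direction, object found there or None)
    """
    seen = {start: start_object}
    x, y = start
    for direction, found in steps:
        dx, dy = MOVES[direction]
        x, y = x + dx, y + dy
        if not (0 <= x <= size and 0 <= y <= size):
            raise ValueError(f"move '{direction}' leaves the grid")
        if found is not None:
            seen[(x, y)] = found
    # None means the walk ended on a vertex that was never described.
    return seen.get((x, y))


walk = [
    ("right", "table lamp"),
    ("up", "American black bear"),
    ("left", "hand plane"),
    ("down", None),  # the question asks what is found here
]
result = replay_walk(2, (0, 0), "box turtle", walk)
print(result)
```

Because the four moves form a closed loop, the walk ends on the starting vertex, which is why the expected answer is the first object mentioned.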

2.2.2 Route planning

This task category involves determining the optimal route from a specified starting point to a destination while avoiding various obstacles. This task is typically performed in a grid-based simulation environment using basic directional commands such as “up,” “down,” “left,” and “right” for navigation. Path planning techniques like these are widely used in automated navigation systems, including robotic path planning and autonomous vehicles, where efficiently and safely selecting routes is critical. This not only tests the model’s spatial reasoning abilities but also its capability to handle dynamic obstacles in complex environments. Question example:

Question: You are in a 6 by 6 two-dimensional arrays. You must avoid some obstacles: (2,1). From (0,1) to (3,4). Just use “up, down, left, right” to indicate your route. Your output is formatted as: path:
Answer: right, right, right, down, down, down

This example illustrates how to effectively perform path planning in a grid world with obstacles. By thoroughly analyzing the problem environment, the model needs to plan a route from the specified starting point to the destination while avoiding obstacles. This capability demonstrates the model’s efficiency and accuracy in understanding and interpreting spatial structures and making decisions in dynamic environments.
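
A returned route can be validated by replaying it on the grid. The following sketch checks the reference answer above; it assumes that coordinates are (row, column) pairs with row 0 at the top, so that "down" increases the row and "right" increases the column. This convention is an assumption inferred from the reference answer, and `check_route` is a hypothetical helper name.

```python
# Hypothetical route validator for grid path-planning questions.
# Assumed convention: (row, col) with row 0 at the top of the array.
STEP = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def check_route(size, start, goal, obstacles, moves):
    """Return True if the moves stay on the grid, avoid obstacles, and end at goal."""
    blocked = set(obstacles)
    row, col = start
    for move in moves:
        d_row, d_col = STEP[move]
        row, col = row + d_row, col + d_col
        if not (0 <= row < size and 0 <= col < size):
            return False  # stepped outside the array
        if (row, col) in blocked:
            return False  # stepped onto an obstacle
    return (row, col) == goal


answer = "right, right, right, down, down, down"
moves = [m.strip() for m in answer.split(",")]
is_valid = check_route(6, (0, 1), (3, 4), [(2, 1)], moves)
print(is_valid)
```

A validator of this kind accepts any obstacle-free route that reaches the goal, so it does not by itself verify that the route is the shortest one.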

2.2.3 Geographic feature search

This category focuses on evaluating the large language model’s ability to recognize geographic entities and their place names, as well as understand the attributes related to these entities. Geographic features are defined as entities that make up the vocabulary of geography, including specific natural geographic elements such as mountains, rivers, and lakes, as well as man-made locations like historical sites and cultural landmarks. These features may be physical characteristics spanning landscapes or man-made sites with historical and cultural significance. This task requires the model to accurately identify specific geographic locations from text descriptions, such as cities, countries, and landmarks, and extract additional information about these geographic entities from the text, including their political, economic, and cultural importance, as well as their relative position to other geographic entities. Mastering this capability is crucial for various applications, including Geographic Information Systems (GIS), tourism recommendations, and academic research, demonstrating the model’s efficiency and accuracy in handling and applying geographic text content. Question example:

Question: XX is the capital and largest city of Greenland, a constituent country of the Kingdom of Denmark. XX is the seat of government and the country’s largest cultural and economic centre. The major cities from other countries closest to the capital are Iqaluit and St. John’s in Canada and Reykjavík in Iceland. Return only the name of XX in the given paragraph.
Answer: Nuuk

This example demonstrates how, through detailed analysis of descriptions, the model can accurately extract specific geographic entity names from the text. In this case, the model needs to recognize that Nuuk is the capital and largest city of Greenland and understand its importance as the government seat and cultural and economic center. This capability reflects the model’s efficiency and accuracy in parsing and applying geographic text content.

2.2.4 Numerical Trajectory Recognition

This task explores how to determine directional changes in trajectories by analyzing coordinates in two-dimensional or three-dimensional space, such as recognizing “up,” “down,” “left,” and “right.” Additionally, the task involves the ability to describe shapes by analyzing the arrangement and relative positions of points, such as identifying waveforms or straight lines. This analysis is crucial not only for automated navigation systems, motion trajectory analysis, and data processing in Geographic Information Systems (GIS) but also for improving the operational efficiency and accuracy of these systems. Using large language models for spatial reasoning can enhance the model’s insights and predictions about geometric shapes and dynamic changes in the physical world. Question example:

Question: What are the directional descriptions from left to right for the coordinate sequence [(0, 0), (0, 10), (10, 10), (10, 0)]?
Answer: The directional descriptions are “up,” “right,” and “down.”

Question: What action is described by the coordinate sequence [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 2), (0, 2, 2), (0, 2, 3), (0, 3, 3), (0, 3, 4)]?
Answer: The action is described as “climbing in a zigzag pattern along a vertical and horizontal plane.”

These examples demonstrate how the model can parse the direction and shape of movement from a given set of coordinate points. The model not only recognizes simple linear motion but also understands more complex spatial actions, such as zigzag climbing trajectories in three-dimensional space. This capability reflects the model’s deep insights into geometric shapes and dynamic trajectories, which are valuable for enhancing the accuracy and efficiency of automated navigation and motion analysis systems.
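
For the two-dimensional case, the expected direction labels follow directly from the differences between consecutive points. The sketch below illustrates this under the assumption that x increases to the right and y increases upward; the function name `describe_directions` is hypothetical, and resolving diagonal ties in favor of the horizontal axis is an arbitrary choice.

```python
# Hypothetical labeler for 2D trajectories.
# Assumed convention: x grows to the right, y grows upward.

def describe_directions(points):
    """Label each segment of a 2D trajectory by its dominant axis of motion."""
    labels = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # repeated point, no movement
        if abs(dx) >= abs(dy):
            labels.append("right" if dx > 0 else "left")
        else:
            labels.append("up" if dy > 0 else "down")
    return labels


directions = describe_directions([(0, 0), (0, 10), (10, 10), (10, 0)])
print(directions)
```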

2.2.5 Spatial literacy

In this category, we explore a range of questions related to global geographic locations, including but not limited to cities, countries, major highways, river systems, their elevations, and demographic information. Spatial literacy is primarily defined as an individual’s ability to understand, evaluate, and apply spatial data, encompassing skills from basic map reading to advanced Geographic Information System (GIS) operations. Additionally, this category focuses on analyzing the complex relationships between geographic elements, such as exploring adjacency between different countries or regions through topology, or evaluating the visual connectivity of specific terrain features like mountain ranges through geomorphology. The questions also cover in-depth analysis of spatial attributes, including calculating geographic buffers, determining directions between states or cities, and ranking cities based on specific conditions like temperature. These questions not only test the model’s grasp of geographic knowledge but also its ability to understand and analyze complex geographic information, demonstrating the application of spatial literacy in various professional and everyday contexts. Question examples:

Question: The Mississippi River flows through Canada.
Answer: F

This question example tests the model’s grasp of geographic knowledge by asking whether the Mississippi River flows through Canada. In reality, the Mississippi River runs through the central United States, not Canada. This question is designed to assess the model’s memory of specific river locations and its ability to recognize and correct common geographic misconceptions.

2.2.6 Mapping

In this task, we evaluate the language model’s ability to apply map design and Geographic Information System (GIS) technologies. The model is required to use modern programming libraries or access online map services to construct maps. This involves understanding the necessary API parameters and correct request syntax to ensure efficient map generation and retrieval. This task tests the model’s skills in operating map-making tools and its comprehensive application of spatial data presentation and GIS functionalities. For example, the model must accurately create and interpret map links based on specific geographic coordinates and user-defined settings, demonstrating its technical mastery and application depth in the map creation process. Question example:

Question: Create a Mapbox map link that displays a standard map view of downtown Zhengzhou. Please return the link directly.
Answer: Link here (see Figure 3)

Figure 3: Standard map view of Zhengzhou city center.

This question example primarily assesses the model’s technical proficiency in creating maps using the Mapbox API. The specific task requires the model to accurately understand and apply API parameters to generate a map link for a specified view. By requesting a map view of downtown Zhengzhou, this test evaluates the model’s understanding of API documentation and its effectiveness in generating precise map links in practical applications.
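
The reference link itself is not reproduced in the text, so the following is only a sketch of what such a link could look like, assuming it takes the form of a Mapbox Static Images API request. The coordinates for central Zhengzhou are approximate, and the style, zoom level, image size, and access token are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

# Hypothetical helper; assumes the Mapbox Static Images API URL layout:
# /styles/v1/{username}/{style_id}/static/{lon},{lat},{zoom}/{width}x{height}


def mapbox_static_url(lon, lat, zoom, token,
                      style="mapbox/streets-v12", width=600, height=400):
    """Build a static map link centered on (lon, lat)."""
    base = "https://api.mapbox.com/styles/v1"
    query = urlencode({"access_token": token})
    return f"{base}/{style}/static/{lon},{lat},{zoom}/{width}x{height}?{query}"


# Approximate center of Zhengzhou (assumption); the token is a placeholder.
url = mapbox_static_url(113.6254, 34.7466, 12, "YOUR_MAPBOX_ACCESS_TOKEN")
print(url)
```

Note that the API expects longitude before latitude, which is a common source of error when a model composes such links.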

2.2.7 GIS concepts

This evaluation category focuses on exploring the fundamental elements and operations of Geographic Information Systems (GIS) through a series of questions. These questions aim to thoroughly assess the model’s understanding of basic GIS concepts, practical techniques, and analytical methods. The topics covered include map creation, effective management of spatial data, and its analytical processing. Through simulated scenarios and theoretical questions, we challenge the model to use GIS tools to analyze and solve complex geospatial problems, such as terrain analysis, resource allocation, and environmental monitoring. This type of evaluation ensures that the model can effectively utilize GIS technology in practical applications. Question example:

Question: Which map projection is most suitable for global climate model studies? A. Mercator projection B. Pseudocylindrical projection C. Orthographic projection D. Robinson projection
Answer: D

This question tests the model’s understanding of the characteristics of different map projections by asking which one is most suitable for global climate model studies. The selection of the Robinson projection reflects the model’s correct understanding of how to balance the distortions of shape, area, and direction, which is crucial for the accurate representation of global climate data.

2.2.8 Toponymic identification

This task category delves into the language model’s ability to handle and recognize place names, especially extracting geographic names from texts with complex geographical descriptions. These texts may come from social media, news reports, or literary works, where the presentation of place names can deviate from standard geographic norms. Place name recognition is crucial for achieving precise geographic information retrieval and location-based services, directly impacting the performance of search engines and navigation systems. Through these tasks, the model must demonstrate its ability to extract place names and disambiguate them within their context, such as distinguishing between places with the same name but different locations. This involves the algorithmic parsing and semantic understanding of place names. The process requires the model not only to identify place names but also to conduct in-depth analysis of geographical linguistics to handle and interpret the complexity and ambiguity of natural language descriptions of locations. Question example:

Question: Anyone doing high water rescues in the Pasadena/Deer Park area? My daughter has been stranded in a parking lot all night. Which parts of this sentence represent location descriptions?
Answer: Pasadena/Deer Park

This example shows the model recognizing place names in a description of an emergency. The place name “Pasadena/Deer Park” is identified as location information, which is the kind of extraction needed in real-world applications such as emergency services and geolocation. The instance highlights both basic place name recognition and the parsing of text that embeds geographic information.

2.2.9 Code explanation

In this category, we evaluate the chatbot’s ability to explain provided Python and R code snippets. The code examples may involve specific programming libraries, such as Python’s ArcPy or pathlib. The task requires the chatbot to accurately and concisely summarize the main functions of the code. By anonymizing function names and variable names in the code, the focus of the evaluation is on understanding the structure and logic of the code rather than relying on specific naming conventions. This approach aims to test the model’s deep understanding and summarization ability of code logic. Question example:

Question: Explain in one sentence the purpose of this code:

def generate_primes(n):
    primes = []
    for num in range(2, n + 1):
        is_prime = True
        for i in range(2, int(num ** 0.5) + 1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(num)
    return primes

Answer: Python generation of prime numbers.

This code snippet evaluates the model’s understanding of a Python function that generates a sequence of prime numbers and its ability to identify and explain the algorithm’s logic without relying on specific names. Correctly identifying this function as a prime number generator indicates that the model understands the trial-division primality test and its optimization of checking divisors only up to the square root of each candidate.
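
The printed example keeps descriptive identifiers, so the anonymization step described for this category is sketched below. This is one possible approach using Python's standard `ast` module (with `ast.unparse`, available from Python 3.9); the class name `Anonymizer` and the `v0`, `v1`, ... naming scheme are assumptions, not the procedure used to build the dataset.

```python
import ast
import builtins


class Anonymizer(ast.NodeTransformer):
    """Rename user-defined functions, arguments, and variables to v0, v1, ...

    Limitation of this sketch: imported module names are renamed as well,
    so it is only suitable for self-contained snippets.
    """

    def __init__(self):
        self.mapping = {}

    def _alias(self, name):
        if hasattr(builtins, name):
            return name  # keep built-ins such as range and int
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


SOURCE = '''
def generate_primes(n):
    primes = []
    for num in range(2, n + 1):
        is_prime = True
        for i in range(2, int(num ** 0.5) + 1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(num)
    return primes
'''

anonymized = ast.unparse(Anonymizer().visit(ast.parse(SOURCE)))
print(anonymized)
```

The transformed code behaves identically but no longer reveals its purpose through names, so a correct explanation has to come from the control flow alone.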

2.2.10 Function interpretation

This task category focuses on testing the chatbot’s ability to explain functions in the field of spatial sciences, including their purpose, application, and general terminology. In this task, specific names of functions and parameters are replaced with abstract identifiers (e.g., “distance” becomes “a,” “year” becomes “b”). This is done to emphasize understanding of the function’s logic and structure rather than relying on specific names. The covered areas include coordinate transformation, distance calculation, spherical trigonometry, date handling, multispectral image processing, and spatial statistical analysis. The challenge is for the chatbot to concisely describe the basic utility of the function without delving into technical details. Question example:

Question: Provide in one sentence a common term for these equations:

X = (Sx * Cos(b)) * x - (Sy * Sin(a)) * y + Tx
Y = (Sx * Sin(b)) * x + (Sy * Cos(a)) * y + Ty

Answer: Affine transformation

This question example evaluates the model’s understanding of abstract function forms and its ability to identify and explain complex mathematical operations without relying on specific names. Correctly identifying this set of equations as an affine transformation indicates that the model has a proper and deep understanding of spatial transformation concepts.
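
To see why these equations describe an affine transformation, they can be evaluated directly. The sketch below transcribes the two equations with the symbols used in the question (Sx and Sy as scale factors, a and b as rotation angles in radians, Tx and Ty as translations); the interpretation of each symbol and the function name are assumptions, since the question deliberately leaves them abstract.

```python
import math


def affine_transform(x, y, sx, sy, a, b, tx, ty):
    """Apply the two equations from the question to a point (x, y).

    Assumed meaning: sx, sy scale; a, b rotate (radians); tx, ty translate.
    """
    new_x = (sx * math.cos(b)) * x - (sy * math.sin(a)) * y + tx
    new_y = (sx * math.sin(b)) * x + (sy * math.cos(a)) * y + ty
    return new_x, new_y


# Pure translation: no rotation, unit scale.
shifted = affine_transform(1, 1, 1, 1, 0.0, 0.0, 2, 3)

# Pure rotation by 90 degrees: (1, 0) maps to approximately (0, 1).
rotated = affine_transform(1, 0, 1, 1, math.pi / 2, math.pi / 2, 0, 0)
print(shifted, rotated)
```

When a and b are equal the transformation is a similarity-style rotation with scaling and translation; unequal angles additionally introduce shear.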

2.2.11 Code translation

This task category focuses on evaluating the chatbot’s ability to translate Python code to R code and vice versa. The specific challenge involves translating functions embedded in various programming libraries (such as Python’s ArcPy or R’s sp library), handling spatial reference data, and recursive functions. To ensure the focus of the test is on understanding the logic and structure of the code, function names and variable names have been replaced with abstract identifiers. This requires the chatbot not only to accurately perform language translation but also to deeply understand and adapt to the programming paradigms in both the source and target languages. Question example:

Question: Select the following code and report the corresponding code base in Python(sp)

cs1 <- rbind(c(7, 5), c(10, 5), c(10, 0), c(5, 0))
cs2 <- rbind(c(5, 5), c(10, 5), c(9, 8))
cs3 <- rbind(c(7, 5), c(3, 5), c(5, 0))
p1 <- Polygon(cs1)
p2 <- Polygon(cs2)
p3 <- Polygon(cs3)
polys <- Polygons(list(p1,p2,p3),"areas")
polySpa <- SpatialPolygons(list(polys))
plot(polySpa)

Answer:

# Import the geometry and plotting libraries
import matplotlib.pyplot as plt
from shapely.geometry import Polygon

# Define the coordinates for each polygon
cs1 = [(7, 5), (10, 5), (10, 0), (5, 0)]
cs2 = [(5, 5), (10, 5), (9, 8)]
cs3 = [(7, 5), (3, 5), (5, 0)]
# Create polygon objects
p1 = Polygon(cs1)
p2 = Polygon(cs2)
p3 = Polygon(cs3)
# Prepare the plot
fig, ax = plt.subplots()
# Add the outline of each polygon to the plot
for poly in [p1, p2, p3]:
    x, y = poly.exterior.xy
    ax.plot(x, y)
# Set plot limits
ax.set_xlim(0, 12)
ax.set_ylim(0, 10)
# Show the plot
plt.show()

This example demonstrates how to translate R code to Python code to achieve the same polygon construction and plotting functions. We utilized Python’s shapely library to handle polygon geometry construction and the matplotlib library for graphical plotting. This showcases the model’s adaptability in understanding how different programming languages handle spatial data and graphical representation, as well as its deep understanding of spatial processing techniques in both languages.

2.2.12 Code generation

This task category aims to evaluate the chatbot’s skills in generating, adjusting, and refining Python and R code. The tasks involved may include using specific programming libraries, such as ArcPy and pathlib in Python or spatstat and sp in R. This category challenges the chatbot to not only accurately execute programming tasks but also to demonstrate adaptability and creative use of the library functions involved, ensuring effective code handling and optimization. Question example:

Question: Generate Python code to sort the following points from east to west and print their coordinates.

Latitude, Longitude
40.7128,-74.0060
34.0522,-118.2437
51.5074,-0.1278

Answer:

points = [
    {"latitude": 40.7128, "longitude": -74.0060},
    {"latitude": 34.0522, "longitude": -118.2437},
    {"latitude": 51.5074, "longitude": -0.1278}
]
# East to west means descending longitude, so sort in reverse order
sorted_points = sorted(points, key=lambda x: x["longitude"], reverse=True)
for point in sorted_points:
    print("Latitude: {}, Longitude: {}".format(point["latitude"], point["longitude"]))

This example demonstrates how the model understands and implements a longitude sorting function written in Python. By accurately adjusting the order of geographic coordinates using a sorting function, it shows the model’s deep understanding of how to manage and manipulate datasets through programming logic.

3 Methods

3.1 Model and prompt

In this spatial task dataset test, we included several leading large language models from both domestic (Chinese) and international sources. For international models, we selected three from OpenAI: gpt-3.5-turbo, gpt-4-turbo-2024-04-09, and gpt-4o. Among them, gpt-3.5-turbo is regarded as a milestone in large language models due to its coherent and human-like responses, making it a common baseline in performance evaluations. gpt-4-turbo-2024-04-09 is widely considered one of the most powerful language models, scoring highly on benchmarks such as MMLU and MATH. gpt-4o (“o” stands for “omni”), OpenAI’s latest model at the time of testing, offers faster responses and further performance gains. Additionally, we selected Anthropic’s claude-3-sonnet-20240229 model, which is positioned as the balance of performance and speed in the Claude 3 series. For domestic models, we chose Moonshot AI’s moonshot-v1-8k and ZhipuAI’s glm-4. As one of the top domestic models, glm-4 performs strongly in semantic understanding, code generation, and general knowledge processing, reportedly surpassing gpt-4 in some aspects. moonshot-v1-8k is also a leading domestic large language model.

Our test is divided into two rounds, aiming to comprehensively evaluate the capabilities of various large language models in handling spatial tasks based on natural language prompts. The first round is a zero-shot test, which does not involve any prompt tuning to assess the initial performance of the models. This allows us to observe the models’ direct response capabilities to spatial tasks. We used the following system prompt for testing: You are a quiz assistant answering the following questions and outputting the results without any additional explanation.

Following the first round of testing, we will conduct prompt strategy tuning for spatial tasks where performance was suboptimal to evaluate and enhance the large language models’ performance in complex spatial tasks. These strategies include:

One-shot:

In the One-shot test, the model receives an example question and its answer to establish context understanding before immediately facing a new question. This method tests whether the model can quickly learn from a single example and apply it to a new scenario without additional data or extended training. This strategy is particularly useful for evaluating the model’s response speed and adaptability when encountering unknown, one-off questions, effectively revealing the model’s mechanism for absorbing and processing new information[li2023one].

Combined Techniques Prompt:

This strategy uses a combination of model-specific identity prompts, detailed test category descriptions, and specific example questions and answers to create a rich informational environment for the model. This helps the model build a more accurate context for understanding and solving problems and utilize multiple information sources to improve decision accuracy. Combined Techniques Prompt is an advanced testing method aimed at evaluating how models perform complex tasks in an information-rich, structured environment and exploring how different types of information interact to optimize model performance.

Chain of Thought:

The Chain of Thought (CoT) method requires the model to receive and process an example problem with a detailed solution process before answering questions. This strategy emphasizes the model’s ability to understand and generate logical answers by simulating human problem-solving thinking processes to enhance answer transparency and interpretability. CoT is particularly suitable for handling problems requiring advanced reasoning and logical thinking, demonstrating how the model performs in a continuous logical sequence and systematically processes and solves complex problems[wei2022chain][wei2022emergent].

Zero-shot-CoT:

Zero-shot-CoT testing involves providing the model with step-by-step thinking instructions (e.g., “Let’s think step by step”) without prior specific training or example input, directly guiding the model to develop a thought process. This method’s core is to evaluate how the model autonomously constructs a logical chain of problem-solving without direct prior examples. It aims to test the model’s autonomous learning and innovation capabilities, particularly its adaptability and solution strategies when facing entirely new and complex problems[kojima2022large].

Through these methods, we aim to further explore and understand the potential and limitations of various models in handling specific spatial tasks.
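To make the differences between the strategies concrete, the following sketch assembles a single-turn message list for each of them. It is an assumption about how such prompts could be organized, not the script used in this study: the names `SYSTEM_PROMPTS` and `build_messages` are hypothetical, and the prompt texts are abbreviated versions of the path-planning examples given in Section 4.3.

```python
# Hypothetical helper names; the prompt texts are abbreviated from this paper.
ZERO_SHOT = (
    "You are a quiz assistant answering the following questions and "
    "outputting the results without any additional explanation."
)
EXAMPLE_QUESTION = (
    "Question: You are in a 4 by 4 two-dimensional array. You have to avoid "
    "some obstacles: (0,3). From (1,1) to (3,0). Just use 'up, down, left, "
    "right' to indicate your route."
)
# One-shot: a worked question with its final answer only.
ONE_SHOT = EXAMPLE_QUESTION + "\nAnswer: path: down, down, left"
# CoT: the same question, but the answer spells out the reasoning steps.
COT = EXAMPLE_QUESTION + (
    "\nAnswer: First move down one step to (2,1), then continue down one "
    "step to (3,1), and finally move one step to the left to reach (3,0). "
    "So the answer is: path: down, down, left."
)
# Combined: an identity prompt plus the example (category description omitted).
COMBINED = (
    "As a path planning expert with extensive experience in GIS, you will be "
    "responsible for designing and analyzing optimal paths that avoid "
    "obstacles.\n" + ONE_SHOT
)
SYSTEM_PROMPTS = {
    "zero-shot": ZERO_SHOT,
    "one-shot": ONE_SHOT,
    "combined": COMBINED,
    "cot": COT,
    "zero-shot-cot": "Let's think step by step.",
}


def build_messages(strategy, question):
    """Return single-turn chat messages for the given prompt strategy."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[strategy]},
        {"role": "user", "content": question},
    ]


messages = build_messages("cot", "From (0,0) to (2,2) in a 3 by 3 array.")
print(messages[0]["content"])
```

Keeping the strategy text in the system message and the test question in the user message means that only one field changes between test rounds, which makes the rounds easier to compare.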

3.2 Collecting model answers

As language models are increasingly applied to complex tasks, ensuring that they provide unbiased and consistent answers has become a significant challenge. To address this need, we developed an automated script that standardizes the answer collection process, minimizing human error and the potential influence of conversation history. The script calls each model’s API directly, uses text input only, and adheres to a single-turn dialogue mode, so that each answer is generated independently of any previous dialogue context. This mode is particularly suitable for evaluating a model’s ability to respond to immediate questions rather than its performance in multi-turn, continuous dialogues. Additionally, to keep the experimental conditions consistent, we uniformly set the temperature parameter to 1 for all models tested. At this setting the models still sample their output, so responses are not deterministic; the uniform value ensures that all models are compared under the same decoding conditions.

Specifically, when testing the gpt-3.5-turbo model, we observed that the model might generate different answers to the same question in different dialogues, particularly for questions with a high degree of subjectivity. This variation is primarily due to the random sampling technique used during decoding. Although these differences are typically minor, to improve the simplicity and efficiency of data processing, we chose to collect only one answer per question. Below is a concrete example demonstrating the process of conducting a single-turn dialogue with the gpt-3.5-turbo model via an API call:

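# Requires the openai package (v1 or later); the API key is read from OPENAI_API_KEY
from openai import OpenAI

client = OpenAI()

# Create a single-turn dialogue request to get the model's answer
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an answering assistant"},
        {"role": "user", "content": "What is quantum entanglement?"}
    ],
    temperature=1
)
# The answer text is in the first returned choice
print(response.choices[0].message.content)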

This code example clearly demonstrates how to construct an API request to ensure consistency and standardization of parameters for each request, thereby enhancing the reliability and validity of the study.
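The single-turn collection loop described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study’s actual script: `collect_answers` and `fake_ask` are hypothetical names, and `fake_ask` stands in for a real API call so that the sketch runs offline.

```python
def collect_answers(questions, ask, system_prompt):
    """Query the model once per question, with no shared conversation history."""
    answers = []
    for question in questions:
        # A fresh message list per question keeps every request single-turn.
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ]
        # Only one answer is stored per question.
        answers.append({"question": question, "answer": ask(messages)})
    return answers


def fake_ask(messages):
    """Stand-in for a real API call (assumption, for offline illustration)."""
    return "echo: " + messages[-1]["content"]


results = collect_answers(["Q1", "Q2"], fake_ask, "You are an answering assistant")
print(results)
```

In practice, `ask` would wrap a call to the model’s API and return the text of the response.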

3.3 Evaluation measures

Accuracy statistics and qualitative analysis will be used to evaluate the chatbot’s responses. In statistical analysis, most answers can be clearly classified as correct or incorrect, such as multiple-choice questions or reasoning questions with clear answers. However, in open-ended tasks like explaining GIS concepts or describing trajectory data visualization, the chatbot’s responses may only approximate the correct answer. Such cases, typically found in explanatory or conceptual tasks, will be specially marked with a score of 1. Qualitative analysis will explore the challenges encountered during task execution, along with related examples and illustrations. To accurately assess the model’s ability to handle different types of questions, we use the following specific scoring method to quantify the accuracy of the responses:

  • Fully Correct Answer: When the model provides an answer that is completely correct and meets the expectations, we give it 2 points. This indicates that the model not only understood the core of the question but also generated an accurate and correct answer.

  • Partially Correct Answer: For answers that are partially correct or partially meet the requirements of the question, we give it 1 point. This situation typically occurs when the model captures part of the key points of the question but does not fully cover all the correct details or mishandles some information.

  • Completely Incorrect Answer: If the model’s answer is unrelated to the correct answer or completely wrong, we give it 0 points. This indicates a fundamental misunderstanding of the question by the model or that the generated response is entirely off the mark.

Human reviewers play a crucial role in the scoring process, especially when dealing with partially correct answers. Certain task categories, such as “Spatial Literacy” and “GIS Concepts,” often include multiple possible correct answers or require subjective judgment in the responses. In these cases, human judgment is essential for determining the final score. To ensure objectivity and consistency in scoring, we require all reviewers to engage in detailed discussions before scoring and decide the final score through a voting process. For specific task categories involving code execution, such as “Code Generation” and “Code Translation,” we have made appropriate adjustments to the scoring criteria:

  • Score of 2: The code runs perfectly, and the output format is correct, fully meeting the expected result.

  • Score of 1: The code is basically correct but has minor errors or deficiencies, such as syntax errors, logic errors, or format errors.

  • Score of 0: The code fails to run, or the result deviates significantly from the expected outcome, indicating a clear deficiency in the model’s understanding and generation of code.

Through this detailed and systematic scoring mechanism, we can more accurately measure and compare the ability and effectiveness of different models in handling various types of problems. This approach not only helps us identify each model’s strengths and weaknesses but also provides valuable data support for future model optimization and development.
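The reviewer voting step can be illustrated with a simple majority rule. The sketch below is only one possible reading of that process: the function name `final_score` is hypothetical, and the tie-breaking rule (take the lowest tied score) is an assumption, since the procedure for ties is not specified here.

```python
from collections import Counter


def final_score(votes):
    """Return the score (0, 1, or 2) chosen by the most reviewers."""
    if not votes or any(v not in (0, 1, 2) for v in votes):
        raise ValueError("votes must be a non-empty list of 0, 1, or 2")
    counts = Counter(votes)
    best = max(counts.values())
    # Assumption: on a tie, take the lowest tied score (the conservative choice).
    return min(score for score, n in counts.items() if n == best)


print(final_score([2, 1, 2]))  # 2
print(final_score([1, 2]))     # 1 (tie resolved conservatively)
```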

To comprehensively evaluate and compare the ability of large language models in handling spatial tasks, we introduced Weighted Accuracy (WA) as a key evaluation metric. This metric weights each response by its score, so that a fully correct answer counts twice as much as a partially correct one, and normalizes the total by the maximum attainable score. The calculation formula for Weighted Accuracy is as follows:

\[ WA = \frac{2 \cdot n(s2) + 1 \cdot n(s1)}{2 \cdot \left( n(s0) + n(s1) + n(s2) \right)} \]

The variable n(s2) represents the number of answers with a score of 2, n(s1) represents the number of answers with a score of 1, and n(s0) represents the number of answers with a score of 0.
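As a worked example, the formula can be evaluated directly in Python. The function name `weighted_accuracy` is an illustrative assumption; the counts used below are those reported for gpt-3.5-turbo in Table 1.

```python
def weighted_accuracy(n_s0, n_s1, n_s2):
    """Weighted Accuracy: (2*n(s2) + 1*n(s1)) / (2 * total number of answers)."""
    total = n_s0 + n_s1 + n_s2
    if total == 0:
        raise ValueError("at least one answer is required")
    return (2 * n_s2 + 1 * n_s1) / (2 * total)


# gpt-3.5-turbo in Table 1: 481 answers scored 0, 50 scored 1, 369 scored 2.
wa = weighted_accuracy(481, 50, 369)
print(f"{100 * wa:.1f}%")  # 43.8%
```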

By applying this formula, we can accurately quantify and compare the efficiency and accuracy of different models in handling spatial-related problems. This provides valuable data support and insights for the optimization and development of future models. This approach emphasizes the importance of a comprehensive assessment of model performance, ensuring that our evaluation is as thorough and impartial as possible.

4 Results and Analysis

4.1 Model accuracy

After the first round of zero-shot testing, we evaluated the total scores and weighted accuracy (WA) across the 12 spatial sub-tasks, as shown in Table 1. In this evaluation, gpt-4o performed exceptionally well, achieving the highest number of scores of 2 and the highest overall WA. Following closely is gpt-4-turbo-2024-04-09, whose scores are almost identical to those of gpt-4o. Although claude-3-sonnet-20240229 exceeded glm-4 in the number of questions scored 2, glm-4’s overall weighted accuracy remained slightly higher. The weakest overall performers are moonshot-v1-8k and gpt-3.5-turbo. In the evaluation of 900 spatial tasks, gpt-4o provided completely correct answers for 627 questions and partially correct answers for 24 questions, with an overall weighted accuracy of 71%. gpt-4-turbo-2024-04-09 had an overall weighted accuracy of 69.7%, just 1.3 percentage points lower than gpt-4o. Comparatively, gpt-3.5-turbo performed the worst, with an overall weighted accuracy of only 43.8%. As one of the leading domestic large models, glm-4 followed gpt-4o and gpt-4-turbo-2024-04-09, with an overall weighted accuracy of 62.4%. claude-3-sonnet-20240229 and moonshot-v1-8k had overall weighted accuracies of 62.1% and 53.2%, respectively. Analyzing the score distribution, we observed that the number of questions scored 1 was significantly lower than those scored 0 and 2, indicating that the models’ answers were clearly right or clearly wrong in most cases. It is noteworthy that the performance gap between gpt-4o and gpt-3.5-turbo is substantial, which aligns with the conclusions of gpt-4o’s latest report.

Table 1: Scores of each model and overall weighted accuracy.

| Model | Count S0 | Count S1 | Count S2 | WA (%) |
|---|---|---|---|---|
| gpt-3.5-turbo | 481 | 50 | 369 | 43.8 |
| gpt-4o | 249 | 24 | 627 | 71.3 |
| gpt-4-turbo-2024-04-09 | 260 | 26 | 614 | 69.7 |
| claude-3-sonnet-20240229 | 334 | 15 | 551 | 62.1 |
| moonshot-v1-8k | 409 | 24 | 467 | 53.2 |
| glm-4 | 318 | 41 | 541 | 62.4 |

Figure 4: Comparison of model results in zero-shot testing.

Figure 4 shows the weighted accuracy (WA) of the models across the different spatial sub-tasks. The figure indicates that the performance order of the models remains consistent across most spatial sub-tasks. Additionally, Table 2 provides a detailed breakdown of the results for each spatial sub-task. gpt-4o and gpt-4-turbo-2024-04-09 performed the best in six different categories and achieved 100% accuracy in code explanation tasks. In contrast, gpt-3.5-turbo ranked last in performance across six categories.

The overall performance of the models on explanatory, knowledge-based, and conceptual tasks was quite good, especially in GIS conceptual tasks, where all models achieved a weighted accuracy (WA) of over 80.0%. This performance suggests that the training corpora of large language models encompass extensive GIS knowledge and that the models are capable of understanding and applying this knowledge. However, in numerical trajectory recognition tasks, gpt-3.5-turbo had a WA of only 31.0%, significantly lower than the other models. This discrepancy may be related to its training dataset, parameter settings, or training algorithm. Additionally, when it comes to operational tasks involving specific programming languages and code details, gpt-3.5-turbo and other weaker models may underperform due to insufficient specialized training.

Table 2: Performance metrics (WA, %) by category and model.
(G3.5t: gpt-3.5-turbo, G4o: gpt-4o, G4t: gpt-4-turbo-2024-04-09, Cs: claude-3-sonnet-20240229, Ms: moonshot-v1-8k, Glm: glm-4)

| Category | G3.5t | G4o | G4t | Cs | Ms | Glm |
|---|---|---|---|---|---|---|
| Code explanation | 83.7 | 100.0 | 100.0 | 100.0 | 91.9 | 98.8 |
| Code generation | 32.6 | 82.6 | 84.8 | 67.4 | 21.7 | 76.1 |
| Code translation | 58.3 | 95.8 | 93.8 | 83.0 | 68.8 | 85.4 |
| Function interpretation | 80.0 | 91.2 | 95.2 | 88.5 | 73.8 | 75.0 |
| Geographic feature searching | 93.1 | 98.1 | 96.4 | 96.0 | 90.0 | 92.1 |
| GIS concepts | 81.0 | 96.4 | 93.6 | 96.0 | 94.9 | 92.1 |
| Mapping | 16.3 | 61.9 | 46.9 | 53.1 | 7.5 | 38.8 |
| Numerical Trajectory Recognition | 31.0 | 79.5 | 83.0 | 69.0 | 42.0 | 64.5 |
| Route planning | 4.1 | 12.4 | 8.7 | 2.5 | 0.0 | 7.5 |
| Spatial literacy | 69.9 | 85.0 | 85.5 | 84.0 | 71.7 | 80.6 |
| Spatial understanding | 1.8 | 55.2 | 40.8 | 54.8 | 46.1 | 45.0 |
| Toponym recognition | 93.0 | 92.0 | 88.0 | 88.0 | 98.0 | 72.0 |

The models generally performed poorly on reasoning and application tasks, especially route planning, which requires the models not only to understand the specific requirements but also to perform complex logical reasoning, such as considering multiple possible routes and evaluating various environmental factors. Even the best-performing model, gpt-4o, had a weighted accuracy (WA) of only 12.4%, while the worst-performing model, moonshot-v1-8k, had a WA of just 0.8%. Additionally, in spatial understanding tasks, gpt-3.5-turbo had a WA of only 1.8%, significantly lower than the other models. In mapping tasks, which require models to generate Mapbox links for specified areas or perform map visualization based on specific commands, the performance gap between the models was significant. gpt-4o performed the best with a WA of 61.9%, while moonshot-v1-8k performed the worst with a WA of only 7.5%, 54.4 percentage points lower than gpt-4o.

Overall, although the major models performed well in explanatory, knowledge-based, and conceptual tasks, particularly demonstrating excellent accuracy in GIS conceptual tasks, their performance in reasoning and application tasks was generally poor. Path planning and mapping tasks especially highlighted the challenges models face in complex logical reasoning and handling advanced spatial information. These results indicate that while current large models can achieve high accuracy in certain areas, they still require further optimization and training for tasks that demand high-level reasoning and specialized knowledge.

4.2 Clustering questions by difficulty

To conduct a more detailed model evaluation, we further divided the spatial task dataset into three levels based on difficulty: Level I (easy), Level II (medium), and Level III (hard). Through in-depth discussions with an expert panel, we established specific criteria for classifying the difficulty of the questions: 1) Questions scored 2 by at least five models were classified as easy. 2) Questions scored 2 by at least two but no more than four models were classified as medium. 3) Questions scored 2 by at most one model (usually the best-performing model, such as gpt-4o) were classified as hard, as these questions are the most challenging. The three difficulty levels contain 395 (easy), 275 (medium), and 230 (hard) questions, totaling 900 questions. Next, we evaluate the performance of each model at each difficulty level. The evaluation criteria are as follows:

  • If the model answers an easy question completely correctly (score of 2), it is considered to have correctly solved one easy question.

  • If the model answers an easy question partially correctly (score of 1), it is considered to have correctly solved 0.5 of an easy question.

  • The evaluation criteria for medium and hard questions are the same.
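The classification rule and the counting criteria above can be sketched in a few lines of Python. The function names and the sample scores are illustrative assumptions; only the thresholds (at least five, two to four, at most one model scoring 2) and the weights (1 for a score of 2, 0.5 for a score of 1) come from the text.

```python
def classify_difficulty(scores_by_model):
    """Classify one question from the six models' scores (each 0, 1, or 2)."""
    n_full = sum(1 for score in scores_by_model.values() if score == 2)
    if n_full >= 5:
        return "easy"
    if n_full >= 2:
        return "medium"
    return "hard"


def solved_count(scores):
    """Count solved questions: a score of 2 counts as 1, a score of 1 as 0.5."""
    return sum({2: 1.0, 1: 0.5, 0: 0.0}[score] for score in scores)


# Hypothetical scores of the six models on a single question.
question_scores = {
    "gpt-3.5-turbo": 0, "gpt-4o": 2, "gpt-4-turbo-2024-04-09": 2,
    "claude-3-sonnet-20240229": 1, "moonshot-v1-8k": 0, "glm-4": 2,
}
print(classify_difficulty(question_scores))  # medium
print(solved_count([2, 2, 1, 0]))            # 2.5
```

Dividing the solved count by the number of questions at a level gives an accuracy rate; for example, 391 solved out of 395 easy questions is about 0.99.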

Table 3: The models’ scores on questions of different difficulty levels.

| Model | Easy | Medium | Hard |
|---|---|---|---|
| gpt-3.5-turbo | 342 | 96.5 | 7.5 |
| gpt-4o | 385.5 | 200.5 | 29 |
| gpt-4-turbo-2024-04-09 | 391 | 207 | 29 |
| claude-3-sonnet-20240229 | 388 | 160 | 10.5 |
| moonshot-v1-8k | 375 | 95.5 | 8.5 |
| glm-4 | 368 | 141 | 25 |

Figure 5: Accuracy of questions clustered according to difficulty. (G3.5t: gpt-3.5-turbo, G4o: gpt-4o, G4t: gpt-4-turbo-2024-04-09, Cs: claude-3-sonnet-20240229, Ms: moonshot-v1-8k, Glm: glm-4)

According to Figure 5, as expected, the models generally performed well on easy questions. gpt-4-turbo-2024-04-09 performed the best with an accuracy rate of 0.99, while gpt-4o and claude-3-sonnet-20240229 also had high accuracy rates of 0.98. On medium difficulty questions, gpt-4-turbo-2024-04-09 also led with an accuracy rate of 0.73. gpt-3.5-turbo and moonshot-v1-8k performed poorly, with an accuracy rate of only 0.35. On hard questions, all models performed poorly. gpt-3.5-turbo had the lowest accuracy rate at 0.03, while gpt-4o and gpt-4-turbo-2024-04-09 performed similarly, both at 0.13. This indicates that even the latest models faced challenges with the hardest questions. Overall, gpt-4-turbo-2024-04-09 and gpt-4o performed well across all three difficulty levels, while gpt-3.5-turbo performed poorly.

It is noteworthy that using the models’ answers to classify the difficulty of questions is a simple and efficient method. Additionally, another possible approach is to seek the opinions of an expert panel to determine the difficulty level of each question. The main purpose of classifying the dataset by difficulty is to evaluate the performance of future models on this dataset, monitor their progress over time, and compare it with human performance. For example, if a model only shows high accuracy on easy questions, we cannot consider it a powerful model based on this alone. Conversely, if the model also performs well on medium and hard questions, we can consider it an outstanding and effective model for handling spatial tasks. Therefore, the ability to handle questions of varying difficulty is crucial when evaluating a model’s capabilities.

4.3 Prompt tuning

After conducting the first round of zero-shot testing, we found that the models performed poorly in route planning, spatial understanding, and mapping tasks. Additionally, based on the previous difficulty classification results, we observed that the number of hard questions in these three tasks was relatively high. Therefore, we conducted specific prompt optimization tests for these three tasks. For the route planning and spatial understanding tasks, we used four prompt strategies for optimization testing: One-shot, Combined Techniques Prompt, Chain of Thought (CoT), and Zero-shot-CoT. For the mapping task, considering that code-related issues are not suitable for prompt strategy optimization, we focused only on the first type of mapping task (see Table 4), in which the model accesses online map services to construct and return a map link for a specific area. Since this type of mapping task is not suitable for testing with reasoning-based prompt strategies, we selected the One-shot and Combined Techniques Prompt strategies for optimization testing. For the route planning task, examples of system prompts for each prompt strategy are shown below:

  (a) One-Shot
    Question: You are in a 4 by 4 two-dimensional array. You have to avoid some obstacles: (0,3). From (1,1) to (3,0). Just use “up, down, left, right” to indicate your route.
    Answer: path: down, down, left

  (b) Combined Techniques Prompt
    As a path planning expert with extensive experience in GIS, you will be responsible for designing and analyzing optimal paths that avoid obstacles.
    In the “path planning” category, the task is to navigate from a starting point to an end point while avoiding obstacles placed in the environment. The process involves computing an efficient route, usually in a gridded virtual environment, by using simple directional commands such as “up”, “down”, “left”, “right”.
    Question: You are in a 4 by 4 two-dimensional arrays. You have to avoid some obstacles: (0,3). From (1,1) to (3,0). Just use “up, down, left, right” to indicate your route. Your output is formatted as: path:
    Answer: path:down,down,left

  (c) Chain of Thought (CoT)
    Question: You are in a 4 by 4 two-dimensional arrays. You have to avoid some obstacles: (0,3). From (1,1) to (3,0). Just use “up, down, left, right” to indicate your route. Your output is formatted as: path:
    Answer: From the starting point of (1,1), you need to reach the end of (3,0). In the process of moving, we need to avoid the obstacle located at (0,3). First move down one step to (2,1), then continue down one step to (3,1), and finally move one step to the left to reach (3,0) to complete the path. So the answer is: path: down, down, left.

  (d) Zero-shot-CoT
    Let’s think step by step.
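Answers in the format used by these prompts can be checked mechanically. The sketch below is a hypothetical validator, not the scoring procedure of this study. It assumes that coordinates are (row, column) pairs, that “down” increases the row and “left” decreases the column (as in the worked example, where two “down” moves lead from (1,1) to (3,1)), and that any obstacle-free path reaching the goal is acceptable.

```python
# Row/column offsets; "up" and "right" are inferred by symmetry (assumption).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_valid_path(answer, start, goal, obstacles, size=4):
    """Check an answer such as 'path: down, down, left' on a size x size grid."""
    # Drop the optional 'path:' prefix, then split the steps on commas.
    body = answer.split(":", 1)[-1]
    steps = [step.strip().lower() for step in body.split(",") if step.strip()]
    row, col = start
    for step in steps:
        if step not in MOVES:
            return False
        d_row, d_col = MOVES[step]
        row, col = row + d_row, col + d_col
        # Reject moves that leave the grid or enter an obstacle cell.
        if not (0 <= row < size and 0 <= col < size) or (row, col) in obstacles:
            return False
    return (row, col) == goal


print(is_valid_path("path: down, down, left", (1, 1), (3, 0), {(0, 3)}))  # True
print(is_valid_path("path:left,down,down", (1, 1), (3, 0), {(0, 3)}))     # True
print(is_valid_path("path: up, right, right", (1, 1), (3, 0), {(0, 3)}))  # False
```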

Table 4: Results of the first type of mapping task.

| Model | Count S0 | Count S1 | Count S2 | WA (%) |
|---|---|---|---|---|
| gpt-3.5-turbo | 46 | 0 | 13 | 22.0 |
| gpt-4o | 19 | 0 | 40 | 67.8 |
| gpt-4-turbo-2024-04-09 | 27 | 0 | 32 | 54.2 |
| claude-3-sonnet-20240229 | 28 | 0 | 31 | 52.5 |
| moonshot-v1-8k | 53 | 0 | 6 | 10.1 |
| glm-4 | 37 | 0 | 22 | 37.2 |

Route Planning Task:

By comparing the test results (as shown in Figure 6), we observed that optimizing input prompts significantly improved the overall accuracy of some models, while others gained little or even declined. For gpt-4o, in the initial unoptimized zero-shot test, the model’s weighted accuracy was only 12.4%. After the second round of prompt optimization, its weighted accuracy increased to 50.8%, 43.3%, 87.5%, and 68.3% under the One-shot, Combined Techniques Prompt, CoT, and Zero-shot-CoT strategies, respectively. The CoT and Zero-shot-CoT strategies had a particularly significant effect on improving the model’s weighted accuracy. For gpt-4-turbo-2024-04-09, similar improvement trends were observed in the corresponding tests. gpt-3.5-turbo achieved its highest weighted accuracy of 25.8% under the Zero-shot-CoT strategy but showed a performance decline in the One-shot test. Similarly, moonshot-v1-8k’s weighted accuracy dropped to 0 in the One-shot test, and it did not exceed 10% under the CoT and Zero-shot-CoT strategies. claude-3-sonnet-20240229 showed no significant improvement with the Combined Techniques Prompt and Zero-shot-CoT strategies, achieving its best performance of 9.1% with the CoT strategy. glm-4 achieved its best performance of 36.6% with the CoT strategy. Notably, its weighted accuracy decreased in both the One-shot and Combined Techniques Prompt tests, dropping to 0 in the Combined Techniques Prompt test.

Figure 6: Prompt optimization test results for route planning task. (G3.5t: gpt-3.5-turbo, G4o: gpt-4o, G4t: gpt-4-turbo-2024-04-09, Cs: claude-3-sonnet-20240229, Ms: moonshot-v1-8k, Glm: glm-4)

Spatial Understanding Task:

After prompt optimization (see Figure 8), different prompt strategies significantly improved the WA of gpt-3.5-turbo, with the CoT strategy being the most effective, reaching a WA of 37.8%, an increase of 36.0 percentage points from the zero-shot test’s 1.8%. gpt-4o and gpt-4-turbo-2024-04-09 showed similar growth trends, both achieving their highest weighted accuracy under the Zero-shot-CoT strategy. For claude-3-sonnet-20240229, the prompt strategies did not significantly improve its WA, with the CoT strategy being the most effective at 51.2%, an increase of 8.8 percentage points from the zero-shot test’s 42.4%. Additionally, it is worth noting that the domestic large models performed poorly in this task after prompt optimization, with some prompt strategies even leading to a decline in performance. moonshot-v1-8k only showed improvement under the Combined Techniques Prompt strategy, increasing from 46.1% in the zero-shot test to 50.6%. glm-4’s performance decreased across all prompt strategies, with the lowest WA of 32.9% under the Zero-shot-CoT strategy, a drop of 15.6 percentage points from the zero-shot test’s 48.5%.

Mapping Task:

By comparing the test results (see Figure 9), we found that the gpt-4o, gpt-4-turbo-2024-04-09, and claude-3-sonnet-20240229 models showed similar growth trends under the One-shot and Combined Techniques Prompt strategies. The highest weighted accuracies (WA) for these three models were 84.7%, 76.3%, and 66.1%, respectively. Notably, the Combined Techniques Prompt strategy significantly improved the gpt-3.5-turbo model, with its WA increasing from 22.0% in the zero-shot test to 62.7%. The One-shot strategy also brought substantial improvement to the moonshot-v1-8k model, with its WA rising from 10.1% in the zero-shot test to 76.3%, matching the performance of gpt-4-turbo-2024-04-09. However, the glm-4 model performed relatively poorly, with its performance even declining under the One-shot strategy. Under the Combined Techniques Prompt strategy, its performance only slightly increased from 37.2% in the zero-shot test to 40.7%.

4.4 Example analysis

When tasked with “creating a Mapbox map link that shows a standard view of downtown San Francisco,” we found that both gpt-4o and gpt-4-turbo-2024-04-09 accurately captured the semantic essence of the task and successfully returned a Mapbox map link of downtown San Francisco. However, there was a difference in how each model determined the specific location of “downtown San Francisco”: gpt-4-turbo-2024-04-09 chose the New Conservatory Theatre Center, while gpt-4o selected the San Francisco Museum of Modern Art as the map center. Although both models scored 2 points for this task, this difference highlights the subtle variations in how different models understand real-world scenarios.

In the first round of zero-shot testing, we observed an interesting phenomenon in the Numerical Trajectory Recognition task. When handling the question “What is the directional description of the coordinate sequence [(5, 0), (5, 15), (15, 15), (15, 5), (5, 5)] in order from left to right?” gpt-3.5-turbo showed a different understanding compared to other models. Specifically, it considered the move from (5, 5) to (5, 0) as the last direction change, thus generating five directional descriptions for the question. To further investigate whether this phenomenon was common, we selected five different cases and conducted three rounds of testing on gpt-3.5-turbo. In these tests, we focused on the number of directional descriptions rather than their accuracy. The results showed that in 15 tests, only 5 successfully generated the correct number of descriptions (highlighted in red).

Table 5: Test results of gpt-3.5-turbo on different cases.
\begin{tabular}{p{0.40\textwidth} p{0.18\textwidth} p{0.18\textwidth} p{0.18\textwidth}}
\hline
Question & Model Answer 1 & Model Answer 2 & Model Answer 3 \\
\hline
What is the direction description of the coordinate sequence [(10, 10), (10, 20), (20, 20), (10, 20)] in order from left to right? & North, East, South, West & right, up, right, down & East, north, east, south \\
What is the directional description of the coordinate sequence [(0, 0), (0, 10), (10, 10), (10, 0)] in order from left to right? & \textcolor{red}{north, east, south} & East, North, West, South & North, East, South, West \\
What is the directional description of the coordinate sequence [(90, 100), (110, 100), (110, 110), (100, 110), (100, 100)] in order from left to right? & right, up, right, up, left & \textcolor{red}{right, up, left, down} & East, North, East, South, West \\
What is the directional description of the coordinate sequence [(15, 15), (15, 25), (25, 25), (25, 15)], (10, 15)] in order from left to right? & east, north, west, south, east & \textcolor{red}{North, East, South, West} & \textcolor{red}{up, right, down, left} \\
What is the directional description of the coordinate sequence [(50, 50), (60, 50), (60, 60), (70, 60), (80, 60), (80, 50), (90, 50), (90, 40), (100, 40), (110, 40), (110, 50)] in order from left to right? & East, East, North, East, East, South, East, South, East, East, North & East, South, East, North, East, South, East, South, East, North, East & \textcolor{red}{east, north, east, east, south, east, south, east, east, north} \\
\hline
\end{tabular}
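The counting argument behind the Numerical Trajectory Recognition analysis can be sketched in code: a sequence of n points defines n - 1 moves, so a correct answer contains n - 1 directional descriptions. The sketch below is illustrative only. The function name `describe_directions` is hypothetical, and it assumes axis-aligned moves with x increasing to the east and y increasing to the north.

```python
def describe_directions(points):
    """Return one compass direction per move between consecutive points.

    Assumes every move is axis-aligned (x grows east, y grows north);
    diagonal or zero-length moves raise ValueError.
    """
    directions = []
    # Pair each point with its successor: n points give n - 1 moves.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx != 0 and dy == 0:
            directions.append("east" if dx > 0 else "west")
        elif dy != 0 and dx == 0:
            directions.append("north" if dy > 0 else "south")
        else:
            raise ValueError(f"move {(x0, y0)} -> {(x1, y1)} is not axis-aligned")
    return directions


sequence = [(5, 0), (5, 15), (15, 15), (15, 5), (5, 5)]
result = describe_directions(sequence)
print(result)
```

For the five-point sequence from the example question, the sketch yields four directions and no closing move back to the start, which is the count the other models produced.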
(a) Results of gpt-4o.
(b) Results of gpt-4-turbo-2024-04-09.
Figure 7: Results of “Create a Mapbox map link showing a standard map view of downtown San Francisco” using gpt-4o (left) and gpt-4-turbo-2024-04-09 (right).

5 Discussion

5.1 Performance Comparison of Large Language Models

Our results indicate that gpt-4o delivers the best overall performance and stands out in complex reasoning tasks, with gpt-4-turbo-2024-04-09 following closely. However, in specific subtasks such as toponym recognition, moonshot-v1-8k surpasses both gpt-4o and gpt-4-turbo-2024-04-09. For instance, given the question “The Tokyo Tower was badly damaged in last night’s earthquake. Which parts of the sentence represent location descriptions?”, moonshot-v1-8k correctly identified “Tokyo Tower” as the location description, whereas gpt-4o incorrectly identified both “Tokyo Tower” and “last night” as location descriptions. This illustrates that in tasks related to semantic recognition and understanding, moonshot-v1-8k is comparable to gpt-4o. Although glm-4 did not excel in any specific subtask, its overall performance ranked third. Large language models are powerful tools, but each has its own strengths and weaknesses, so the choice of model for a practical application should depend on the specific needs and scenario. For example, for a complex reasoning problem such as path planning, gpt-4o combined with a CoT prompt strategy is the preferable choice. This highlights the practical value of our benchmark.

5.2 The Role of Dataset Difficulty Classification

Traditional difficulty classification methods typically rely on expert judgment and multiple rounds of voting to determine the difficulty level of each question. This approach is inefficient and highly subjective, especially for datasets that contain multiple task categories and a large number of questions. In this study, we instead classified difficulty based on model performance in the first round of zero-shot testing, integrating the results of six different large language models to assign a difficulty level to each question. We believe this strategy is more efficient and also benefits follow-up research. It allows researchers to assess more accurately how well different models handle simple, medium, and difficult problems, and to analyze their capabilities and limitations in complex situations. It also guides model development: when a model is found to perform poorly on high-difficulty problems, researchers can specifically strengthen its reasoning abilities or its handling of complex data. Finally, difficulty classification enables researchers to design test and training sets more rationally, making them more challenging or better balanced, which both probes the models’ limits and helps improve their robustness.
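The performance-based classification described above can be sketched as follows. This is a minimal illustration, not the study's actual rule: the function name `classify_difficulty` and the thresholds `easy_min` and `medium_min` are hypothetical, chosen only to show how zero-shot results from six models could be mapped to three difficulty levels.

```python
def classify_difficulty(correct_flags, easy_min=5, medium_min=2):
    """Map per-model zero-shot correctness to a difficulty label.

    correct_flags holds one boolean per model (six models here).
    The thresholds are illustrative assumptions, not the study's values.
    """
    n_correct = sum(correct_flags)  # number of models that answered correctly
    if n_correct >= easy_min:
        return "easy"
    if n_correct >= medium_min:
        return "medium"
    return "hard"


# Hypothetical zero-shot outcomes for three questions across six models.
results = {
    "q1": [True, True, True, True, True, False],
    "q2": [True, False, True, False, False, False],
    "q3": [False, False, False, False, False, False],
}
levels = {qid: classify_difficulty(flags) for qid, flags in results.items()}
print(levels)
```

A rule of this kind needs no expert voting, and it can be recomputed automatically whenever a model is added to the evaluation pool.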

5.3 Performance of Models on Tasks of Varying Difficulty

By analyzing the performance of different models on tasks of three difficulty levels (easy, medium, hard), we found that the main differences between the models lie in their responses to medium and hard questions (see Table 3). To explore these differences further, we randomly selected medium and hard questions from different task categories and compared the performance of moonshot-v1-8k and gpt-4o (see Table 6). In path planning tasks, difficulty increases with the size of the grid. In spatial understanding tasks, difficulty depends mainly on the complexity of the described space and the number of elements involved: more elements mean higher complexity and thus higher difficulty. We observed that gpt-4o consistently outperformed moonshot-v1-8k on these more complex problems. We believe this disparity may be due to several factors: first, gpt-4o’s training parameters may be better optimized; second, its training framework might be more advanced; and third, the training data used by gpt-4o may be larger in quantity and higher in quality. These factors likely contribute to gpt-4o’s superior performance on complex problems.

Table 6: Performance of moonshot-v1-8k and gpt-4o on tasks of different difficulty. ($\checkmark$) represents a correct answer, and ($\times$) represents an incorrect answer.
\begin{tabular}{p{0.46\textwidth} p{0.17\textwidth} p{0.17\textwidth} p{0.14\textwidth}}
\hline
Questions (difficulty level) & moonshot-v1-8k & gpt-4o & Truth \\
\hline
You are in a 4 by 4 two-dimensional array. You have to avoid some obstacles: (1,0). From (0,0) to (3,3). Just use “up, down, left, right” to indicate your route. Your output is formatted as: path: (medium) & up, up, up, right, down, down, right, right ($\times$) & path: right, right, right, down, down, down ($\checkmark$) & right, right, right, down, down, down \\
You are in a 7 by 7 two-dimensional array. You must avoid some obstacles: (6,2). From (3,3) to (0,5). Just use “up, down, left, right” to indicate your route. Your output is formatted as: path: (difficult) & up, up, up, right, right, right ($\times$) & path: up, up, up, right, right ($\checkmark$) & up, up, up, right, right \\
You are in a 7 by 7 two-dimensional array. You must avoid some obstacles: (4,3). From (0,0) to (6,2). Just use “up, down, left, right” to indicate your route. Your output is formatted as: path: (difficult) & up, up, right, right, right, down, down, right ($\times$) & path: right, right, right, down, down, down, down, down, down, down, left, left ($\times$) & right, right, down, down, down, down, down, down, down \\
You have been given a pointy-topped regular hexagonal tile map consisting of 1 tile. Starting from a vertex, you will move along the edges of the tile. Initially, you are positioned at the bottom right corner of the map, where you find a husky. You move up by one step, where you find a giant panda. You move up-left by one step, where you find a moped. You move down-left by one step, where you find a crane (machine). You move down by one step, where you find a scooter. You move down-right by one step, where you find a hoop skirt. You move up-right by one step. What will you find? (medium) & giant panda ($\times$) & You will find a husky. ($\checkmark$) & husky \\
You have been given a circular grid consisting of 8 connected dots. Starting from a vertex, you will move along the edges of the circular grid. Initially, you are positioned on the dot that’s located at the top of the grid, where you find a wing. You move around the ring by 5 steps in a counter-clockwise direction, where you find a starfish. You move around the ring by 8 steps in a clockwise direction. What will you find? (difficult) & Wing ($\times$) & You will find a wing. ($\times$) & starfish \\
You have been given a 3 by 3 square grid. Starting from a vertex, you will move along the edges of the grid. Initially, you are positioned at the bottom right corner of the grid, where you find a power drill. You move left by one step, where you find a sock. You move up by one step, where you find a Dalmatian. You move up by one step, where you find a jeep. You move right by one step, where you find an hourglass. You move down by one step, where you find a slip-on shoe. You move left by one step. What will you find? (difficult) & You will find a power drill. ($\times$) & You will find a jeep. ($\times$) & Dalmatian \\
\hline
\end{tabular}
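The path planning answers in Table 6 can be checked mechanically by simulating each move. The sketch below is a minimal validity checker, not the study's scoring procedure. The name `is_valid_path` is hypothetical, and the coordinate convention is an assumption inferred from the first two path planning rows of the table: positions are (row, column) pairs, “up” decreases the row, and “right” increases the column.

```python
# Assumed convention: positions are (row, col); "up" lowers the row index.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_valid_path(size, obstacles, start, goal, moves):
    """Simulate moves on a size-by-size grid.

    Returns True only if the path stays inside the grid, never enters
    an obstacle cell, and ends exactly on the goal cell.
    """
    row, col = start
    for move in moves:
        d_row, d_col = MOVES[move]
        row, col = row + d_row, col + d_col
        if not (0 <= row < size and 0 <= col < size):
            return False  # stepped outside the grid
        if (row, col) in obstacles:
            return False  # stepped onto an obstacle
    return (row, col) == goal


# First row of Table 6: 4 by 4 grid, obstacle at (1,0), from (0,0) to (3,3).
gpt4o_path = ["right"] * 3 + ["down"] * 3
moonshot_path = ["up", "up", "up", "right", "down", "down", "right", "right"]
print(is_valid_path(4, {(1, 0)}, (0, 0), (3, 3), gpt4o_path))
print(is_valid_path(4, {(1, 0)}, (0, 0), (3, 3), moonshot_path))
```

Under this convention the checker accepts the gpt-4o answer and rejects the moonshot-v1-8k answer, whose first move already leaves the grid. The checker tests feasibility only; it does not test whether a path is the shortest one.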

5.4 The Capabilities of Multimodal Models

In the “Example Analysis” section, we found that different large language models vary in their semantic understanding and analysis of text. Here we compare gpt-4o with gpt-4-turbo-2024-04-09 on the mapping task. Although the input question “creating a Mapbox map link that shows a standard view of downtown San Francisco” is purely textual, answering it requires combining an understanding of map content, which is unstructured, with structured text. Comparing the results from both models, we see that gpt-4o’s response highlights significant landmarks and commercial areas in San Francisco, such as Union Square, Golden Gate Theater, and the San Francisco Museum of Modern Art. These are typical downtown landmarks with high pedestrian and commercial activity. In contrast, gpt-4-turbo-2024-04-09’s response covers a broader area that includes residential and cultural zones west of downtown, but it focuses more on residential areas and contains fewer commercial landmarks, so it does not emphasize downtown commercial and cultural activity as much as gpt-4o’s response does. Why do these differences occur? We believe the primary reason is that gpt-4-turbo, while strong at extracting information from text thanks to its robust natural language understanding, struggles with scenarios involving unstructured data such as images. gpt-4o, by contrast, integrates data from different modalities and can identify and extract the needed information from both structured text and unstructured data such as images.

5.5 Enhancing Performance of LLM on Complex Spatial Tasks with Prompt Strategies

It is widely recognized that optimizing input prompts can significantly improve the output quality of large language models. However, our research reveals significant differences in prompt sensitivity between domestic (Chinese) and international models. In this experiment, we defined mapping, path planning, and spatial understanding as complex spatial tasks. Most test results for these three tasks showed that international models, such as gpt-4o, exhibited “versatility”, improving under various prompt strategies. This was particularly evident in reasoning tasks, where chain-of-thought (CoT) and zero-shot-CoT prompting significantly enhanced performance. In contrast, domestic models, particularly glm-4, responded differently in spatial understanding tasks, where each prompt strategy appeared to lead to a performance decline. This difference may be due to several factors: 1) Differences in architecture design between glm-4 and models like gpt-4o may lead to different behavior under prompt strategies. 2) glm-4 may be relatively weaker than models like gpt-4o in basic spatial understanding and reasoning, so prompt optimization may not significantly improve its performance and could even degrade it through information overload or improper processing. 3) glm-4 may have bottlenecks when handling complex tasks, and introducing prompt strategies could further increase task complexity and thereby hurt performance. Therefore, to fully realize the potential of different large language models, it is crucial to design prompts carefully for specific problems and models.
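The prompt strategies compared above differ only in what is wrapped around the question, which the following minimal sketch illustrates. The templates are assumptions for illustration, since the exact prompts used in the experiments are not reproduced in this section; the function name `build_prompt` and the strategy labels are hypothetical. The zero-shot-CoT trigger phrase “Let's think step by step.” is the one commonly used in the literature.

```python
def build_prompt(question, strategy="zero-shot", example=None):
    """Wrap a question according to a prompt strategy (illustrative templates).

    example is a (demo_question, demo_answer) pair; for "cot" the
    demo_answer is expected to spell out intermediate reasoning steps.
    """
    if strategy == "zero-shot":
        return question
    if strategy == "zero-shot-cot":
        # No demonstration, only a reasoning trigger appended to the question.
        return f"{question}\nLet's think step by step."
    if strategy in ("one-shot", "cot"):
        if example is None:
            raise ValueError(f"strategy '{strategy}' needs a worked example")
        demo_question, demo_answer = example
        return f"Q: {demo_question}\nA: {demo_answer}\n\nQ: {question}\nA:"
    raise ValueError(f"unknown strategy: {strategy}")


question = "You are in a 4 by 4 two-dimensional array. From (0,0) to (3,3)."
demo = (
    "You are in a 2 by 2 two-dimensional array. From (0,0) to (1,1).",
    "I start at (0,0). Moving right reaches (0,1). Moving down reaches (1,1). "
    "path: right, down",
)
cot_prompt = build_prompt(question, "cot", example=demo)
print(cot_prompt)
```

Because the question text stays fixed across strategies, any change in accuracy can be attributed to the wrapper, which is what makes a per-model sensitivity comparison possible.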

5.6 Limitations of the Study

To accurately assess and compare the performance of different large language models on spatial tasks, our study had two core requirements: 1) establishing an independent and diverse spatial task dataset, and 2) designing a comprehensive testing methodology, including refined variable design and evaluation strategies. However, the spatial task categories used in this study are not comprehensive. Previous research has conducted more detailed analyses of other spatial task categories, such as point of interest (POI) recommendation, vector data analysis [zhang2024geogpt][chen2024mapgpt], and map analysis. Additionally, our study focused primarily on text-based spatial tasks, whereas real-world spatial tasks often involve multimodal elements such as charts and maps. Since some models do not support images as input, our tests did not include image-related tasks. In the second round of prompt optimization testing, limitations in the spatial task outputs prevented us from adopting more complex prompt strategies, such as self-consistency, which samples multiple solutions and then takes a majority vote [wang2023self]; complexity-based prompting, which uses complex chains instead of simple chains as context examples [fu2022complexity]; and decomposed prompting, which breaks complex tasks into simpler tasks that are solved sequentially [khot2022decomposed]. Therefore, introducing multimodal test data and designing more comprehensive testing methodologies are key directions for future work.
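As an illustration of the self-consistency idea mentioned above, the following minimal sketch shows only the majority-voting step. It was not part of this study's experiments. The function name `majority_vote`, the normalization by stripping whitespace and lowercasing, and the sample answers are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after simple normalization.

    Answers are stripped and lowercased before counting. On a tie,
    the answer that appeared first among the tied ones is returned.
    """
    normalized = [answer.strip().lower() for answer in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Hypothetical answers sampled from one model for the same question.
samples = ["Husky", "husky ", "giant panda", "husky", "moped"]
final_answer = majority_vote(samples)
print(final_answer)
```

Voting of this kind presupposes that answers can be reduced to comparable short strings. Free-form outputs such as map links or long route descriptions are harder to aggregate this way.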

6 Conclusions

In our study, we designed a comprehensive spatial task dataset covering task categories such as GIS concepts and path planning, and systematically evaluated the performance of different large language models (e.g., gpt-4o, glm-4) on these tasks. We then classified the dataset by difficulty level to evaluate how the models handle problems of varying difficulty. We also tested various prompt strategies (including One-shot and Chain-of-Thought) to assess their effectiveness in enhancing model performance on complex spatial tasks. The results indicate that different large models have their own strengths and limitations in spatial tasks, and that their performance generally decreases as task complexity increases. In particular, most models did not perform well on tasks requiring high-level reasoning, such as path planning. Models also differ in their sensitivity to prompt strategies: prompt strategies were generally less effective on domestic models than on international models, suggesting the need for prompt strategies tailored to specific model architectures. Overall, our study established a systematic and comprehensive benchmark for testing the performance of different large language models on various spatial tasks. It allows us to assess which models are better suited to specific spatial tasks and provides a scientific basis for targeted performance improvements in future models.

7 References

\begin{refcontext}[sorting=none]
\printbibliography
\end{refcontext}

Appendix A Proofs

Figure 8: Prompt optimization test results for spatial understanding task
Figure 9: Prompt optimization test results for mapping task