Comparing the performance of ChatGPT and state-of-the-art climate NLP models on climate-related text classification tasks

. Recently, there has been a surge in general-purpose language models, with ChatGPT being the most advanced model to date. These models are primarily used for generating text in response to user prompts on various topics. It needs to be validated how accurate and relevant the generated text from ChatGPT is on the specific topics, as it is designed for general conversation and not for context-specific purposes. This study explores how ChatGPT, as a general-purpose model, performs in the context of a real-world challenge such as climate change compared to ClimateBert, a state-of-the-art language model specifically trained on climate-related data from various sources, including texts, news, and papers. ClimateBert is fine-tuned on five different NLP classification tasks, making it a valuable benchmark for comparison with the ChatGPT on various NLP tasks. The main results show that for climate-specific NLP tasks, ClimateBert outperforms ChatGPT.


Introduction
The recent advancements in artificial intelligence, specifically in Natural Language Processing (NLP), have given rise to groundbreaking large language models that are capable of understanding natural human language [1].These large language models (LLMs) are trained on diverse datasets, including books, articles, web pages, and more, enabling them to understand and generate humanlike responses to a wide range of prompts [2].
The availability of publicly accessible language models has facilitated their integration into our daily lives, empowering individuals to automate repetitive tasks, improve writing skills, generate high-quality content, and even assist in information search and validation, surpassing traditional search engines.One prominent example is ChatGPT (Generative Pre-trained Transformer), developed by OpenAI, which has gained widespread adoption and found utility in various realworld applications such as chatbots, content generation for blogs and articles, code correction, language translation, and even medical diagnosis and treatment [3].
One significant advantage of ChatGPT is its potential to simplify and facilitate complex problem-solving tasks, impacting various real-world scenarios.Several studies have examined the challenges and opportunities of ChatGPT concerning environmental and climate-related issues.The authors of [4] present a comprehensive analysis of ChatGPT's application in environmental research.They provide ten real-world examples that highlight its versatility in explaining concepts related to PFAS, microplastics, life cycle assessment, and the circular economy.The authors also discuss how ChatGPT can offer personalized programming assistance, including code generation, syntax error identification, and clarification of complex syntax.While cautioning about the challenges of misinformation and biases in AIgenerated responses, they advocate for responsible integration of AI in decision-making processes.They stress the importance of human judgment and discourage excessive reliance on AI, particularly in addressing public environmental concerns.The authors' balanced perspective encourages the responsible use of AI tools in research and beyond.
In [5], the authors present insights into the utilization of ChatGPT in biology and environmental science.They demonstrate how ChatGPT can simplify complex tasks in these fields by analyzing the answers to 100 important questions in biology and 100 important questions in environmental science.The authors acknowledge the potential benefits of ChatGPT for academic activities while highlighting the need to address potential risks and harms.Despite these limitations, they express optimism in fully harnessing recent technological advancements to push the boundaries of biology and environmental science.
In [6], the authors investigate ChatGPT's potential application in climate change research, highlighting its value in data analysis, interpretation, communication, outreach, decision-making support, and climate scenario generation.These tasks demonstrate ChatGPT's ability to simplify complex climate data and aid in policy decisions.However, also in this paper, the authors balance optimism by acknowledging ChatGPT's limitations.The authors highlight how the quality and volume of training data impact the model's output quality, leading to diverse utility across various research questions.The authors also address significant challenges, including the model's potential difficulties in understanding complex scientific concepts, lack of contextual awareness, and possible biases in the training data.
The literature review indicates the potential and limitations of using ChatGPT for climate-related tasks.To provide quantitative and qualitative measures of ChatGPT's usefulness, we conducted a study comparing its performance with the state-of-the-art domain-specific language model, ClimateBERT [7] [8].ClimateBERT is trained on climate-related research paper abstracts, news, and company reports and fine-tuned for five distinct downstream NLP tasks: (1) climate detection classification, (2) climate sentiment analysis, (3) climaterelated commitments and actions classification, (4) climate change specificity classification, and (5) climate change disclosure category classification.For the comparison, we selected all five downstream tasks and used the test dataset provided by ClimateBERT for these tasks.Employing prompt engineering best practices, we created multiple prompt variants to classify the text into relevant categories based on the selected downstream task.To compare the result of the ChatGPT, we compare its performance with the results obtained by the finetuned ClimateBERT models.The experiments that were conducted can be found in the GitHub 1 repository, along with detailed instructions on how to replicate the results.
The paper is structured into three primary sections: Methodology, Results and Discussion, and Conclusion.In the Methodology section, we detail the strategies and tools we utilized in our research, focusing specifically on how ChatGPT was utilized.In the Results and Discussion section, we present the outcomes of our study, and in the Conclusion, we summarize our research findings, consider their potential impact, and suggest possible future directions.

Methodology
The evaluation methodology consists of three main steps: (1) loading the necessary data from the datasets, (2) feeding the data into the models to obtain predicted labels for each paragraph, and (3) comparing the predicted labels with the actual labels to assess model accuracy and performance.
Although the ClimateBert and ChatGPT models have different operating mechanisms, the process of label prediction varied slightly.The ClimateBert model executed the task fine-tuned on the provided data without user interaction, while ChatGPT required a user prompt to generate a response.
In the following text, first, we will describe the used datasets, and then the details about ClimateBERT and ChatGPT models will be given.

Datasets
To test the ChatGPT performance as accurately as possible, we selected all five ClimateBERT downstream tasks.These tasks encompass several variations of the text classification, aiming to capture distinct aspects of climate-related content.

•
Task 5: Climate change disclosure category classification.This is a multiclass classification task, which involves categorizing paragraphs into one of the four recommended categories outlined by the Task Force on Climate-related Financial Disclosures (TCFD).These categories include governance, strategy, risk management, and metrics and targets [9].This classification allows for a comprehensive examination of how climate-related information is organized and reported within the financial sector, shedding light on areas of emphasis and potential gaps in reporting practices.
The datasets used for testing the tasks share a consistent structure, including the paragraph and a categorical label indicating the corresponding class.

ClimateBERT Baseline Models
The base ClimateBert model is pretrained on a corpus comprising climate-related and non-climate-related paragraphs from various sources such as news articles, Wikipedia articles, research papers, and climate reports.It is primarily fine-tuned for climate detection in paragraphs.Multiple variations of the ClimateBert model are available, each fine-tuned for one of the five tasks.These models can be accessed through the HuggingFace Library 2 [10].Additionally, the datasets are uploaded to the HuggingFace Dataset Hub3 for easy loading and utilization in the testing scripts.
For each of the five tasks, the corresponding model and dataset were loaded, and the results were evaluated.

ChatGPT Models
Given that ChatGPT is a universal language model, it requires carefully curated prompts to understand both the task at hand and the desired format of the output.Our assessment of ChatGPT was two-fold: we leveraged our own prompts, adhering to best practices for prompt construction, and alternatively, utilized an existing library.

Prompt engineering-based approach
In our first approach, we examined how effectively ChatGPT could perform text classification tasks using manually designed prompts.To complete this task, we called the OpenAI API in batches and then collected the corresponding responses.Each request to ChatGPT comprised a specific prompt that began by stating that ChatGPT is an expert in the respective field, followed by the paragraph that needed to be classified, and finally, guidelines regarding the expected format of the response.After data collection, each response was mapped to a categorical label based on the output instructions provided in the prompt.Responses that deviated from the anticipated output format, which couldn't be automatically mapped, were manually labeled.This was achieved by examining the content of the response and associating it with the relevant class.After labeling, we compared these assigned labels against the ground truth to determine the results.
Additional efforts were made to make sure that the prompts that we send to ChatGPT are as descriptive as possible and follow patterns defined in [11].The prompt patterns from the catalog that were implemented in our prompts include "The Persona Pattern", which enables the model to take a certain point of view or role, in our case, a climate, sustainability, and environmental expert; "The Fact Check List Pattern", which instructs the model to output the most important points of a text and then use those points as the input in a follow-up prompt and the "Reflection Pattern" in which the model is asked to explain the reasoning behind its response.In all techniques for enhancing prompt engineering, a coherent chain of thought processes comprising a series of intermediate reasoning steps [12] is followed.The utilization of CoT (Chain-of-thought) enables models to generate more comprehensive reasoning processes.However, due to its emphasis on intermediate reasoning steps, there is a potential risk of introducing hallucinations and accumulated errors, thereby constraining the models' effectiveness in addressing complex reasoning tasks.Motivated by the natural carefulness and deductive reasoning processes of humans when completing complex tasks, the authors of [13] aim to empower language models to perform explicit and rigorous deductive reasoning and ensure the reliability of their reasoning process through self-verification.

Classification using off-the-shelf library
The alternative approach involved the use of an off-theshelf library, Scikit-LLM 4 , which allows using ChatGPT as any other conventional Classifier that would be found in the Scikit-Learn library.Scikit-learn 5 library is a machine learning package that includes a large list of machine learning methods like Classifiers and Regressors that can be conventionally used for a selection of data science tasks [14].The Scikit-LLM library already has its prompts set up in the Classifiers, and all that is needed is to provide the labels for the classes and optionally give a few samples to train ChatGPT in the context of the data.
We have used the ZeroShotClassifier from the library in our evaluation of ChatGPT in this approach which is a classifier that only requires the labels for the classes in order to be able to classify the provided text.The responses are returned as predictions in the format of the labels provided to the Classifier as input.In the same manner, as with the other approach, the responses were compared against the actual labels to get the results.

Results and Discussion
This section provides a detailed examination of the results.We utilize two metrics, accuracy [15] and f1-score [16], to evaluate the generalization ability of classifiers and consider class balance.These metrics are presented in Table 1.

ClimateBERT Performance
The results show that the individual fine-tuned models, based on ClimateBert, generally achieve satisfactory performance across all five tasks.
The base ClimateBert model is primarily fine-tuned for text classification to determine whether paragraphs are climate-related.As expected, the model performs exceptionally well on task 1, achieving high accuracy (0.97) and an f1-score of 0.9572.While the other tasks also involve classification, their performances are acceptable but less outstanding than the climate detection task.The sentiment analysis task (task 2) and the classification of paragraphs as commitments and actions (task 3) exhibit similar accuracy and f1-scores, with slightly better results observed for the commitments and actions classification.The specificity classification (task 4) shows lower accuracy and f1-scores, though still satisfactory to some extent.The last task for classifying paragraphs into the TCFD recommended categories (task 5) is the least performing task, which can be attributed to its complexity and multiple classes.

Simple single-stage prompts
In the single-stage approach, we utilize manually designed prompts using a simple Persona Pattern [11].In this method, we asked ChatGPT to assume the role of an expert in climate, sustainability, and the environment.We then submitted the paragraphs to be assessed, explicitly outlining the task and the required output format.The prompts provided to ChatGPT for each task were as follows: • The results follow a similar pattern to the distribution of performance of the ClimateBert models, with the bestperforming task being climate detection, with climate commitments and actions following along with the sentiment analysis what is interesting here is that the classification of disclosure categories (task 5) is performing better than the specificity classification (task 4) even though it is a more complex task and involves multiple classes.The reason for this is most likely that the multiple classes are more separable, and ChatGPT can distinguish between them, leading to better classification scores.

Simple single-stage prompts with ChatGPT 4
As a second experiment we wanted to assess the impact of different versions of ChatGPT on performance.This investigation involved a basic, single-stage experiment with ChatGPT 4, with comparative analysis against ChatGPT 3.5.
The findings revealed that the performances were quite similar.ChatGPT 3.5 exhibited slightly better results for tasks 1 through 4, while ChatGPT 4 demonstrated better performance in task 5. Notably, task 5 is the most complex one, and ChatGPT 4's results in this task were the best among all the experiments with the different versions of ChatGPT.Furthermore, it is important to highlight the cost differences in conducting these experiments.The total expense for running all tests with ChatGPT 4 was around $30.In contrast, the cost with ChatGPT 3.5 was around 3 USD.This cost variation should be considered in light of the observed differences in performance between the two versions.

Classification using off-the-shelf Slikit-LLM library
In the second approach, we used the Scikit-LLM library that provides the ability to use ChatGPT as a Zero Shot Classifier.This approach was easier to evaluate since we did not have to write our own prompts.We just need to set the base model (gpt-3.5-turbo in our case) and the labels to ZeroShotGPTClassifier class.Then we simply call the model to predict the test data, and the model will take care of sending the data to ChatGPT and receiving the response in the correct format of the predicted label.
The results using this approach, compared to the approach where we provide our own prompts, varied across tasks.The model performed better in climaterelated text detection (task 1) and classification into disclosure categories (task 5), but it yielded poorer results in other tasks, such as specificity (task 4), commitments and actions (task 3), and sentiment analysis (task 2).

Complex multiple-stage prompts.
In addition to the approach described in 3.2.1,we explored the use of multiple prompt engineering techniques to assess the extent of performance improvement compared to the simple single-stage prompts.We used the gpt-3.5-turbomodel for this experiment.
The first pattern used is the same one that was used in the approach described in 3.2.1, the Persona Pattern [11].In addition to this pattern, we have used the Fact Check List Pattern [11] that instructs ChatGPT to extract the most important points from the text and then use these generated points as the input in the follow-up prompt instead of the original text.The model was also instructed to provide explanations and reasoning behind its answers, implementing the Reflection Pattern [11].We have noticed that when ChatGPT explains the reasoning behind its response, the information is more correct, and the response corresponds to the actual situation.By using these patterns, we have created a simple three-step chain prompt pipeline in which we first introduce ChatGPT to the topic by having it take the role of an expert in the field, then we instruct him to extract the most important points from the text provided, and then using those points as the input to the actual classification task.This approach was successful in achieving a better performance in most of the tasks but was not better in all tasks; specifically, this approach has the best accuracy in tasks 1, 2, 3, and 5 and the best f1-score for tasks 4 and 5.For all other tasks, it has slightly worse results compared to the other two approaches.
The prompts in which the patterns were applied were the following:

Conclusion
Our study indicates that context-specific models, such as ClimateBert, which have been fine-tuned for specific tasks, tend to outperform general-purpose language models, like ChatGPT, for classification tasks.Particularly for tasks involving climate-related texts, ClimateBert emerged as a more efficient alternative.This suggests that its application could be highly beneficial in tackling real-world climate issues.
Our research uncovered intriguing findings when tackling the most complex tasks, specifically the classification of climate change disclosure categories (Task 5).Here, the performance of ChatGPT closely matched that of the ClimateBert model.This suggests that with well-structured labels, a Zero-shot approach utilizing ChatGPT could yield noteworthy results.
It's important not to underestimate the versatility and adaptability of general-purpose models.These models demonstrated above-average performance even when handling tasks they were not specifically fine-tuned for or data they were encountering for the first time.This contrasts with conventional classifier models, which require pre-training and fine-tuning with appropriate data to discern and perform the tasks assigned.
Our findings also highlight the benefits of prompt engineering.Simple adjustments to the prompts could significantly enhance the performance of ChatGPT without necessitating further training.Given that these models are conversationally trained with diverse userprovided data, their capacity for learning and adaptation to new information is continually improving.
Despite current limitations, we anticipate that the strengths and potential of general-purpose models will continue to grow over time.Their integration into everyday life is likely to extend beyond general tasks to more complex and scientific endeavors, including analyses, diagnoses, and treatments.
For future research, it would be interesting to explore other strategies for enhancing performance, similar to our approach with prompt engineering.This could potentially include task-specific context considerations and developing automated pipelines that leverage the strengths of both context-specific and general-purpose models.
This task focuses on classifying climate-related paragraphs based on whether they pertain to commitments and actions or not.It helps analyze the translation of climate-related discussions into tangible commitments and actionable measures.This task assesses the specificity of climate-related paragraphs, determining whether they are specific or not.It provides insights into the level of detail and granularity in climate-related discussions, aiding in understanding the depth of information available for targeted climate action.

Table 1 .
ClimateBert vs. ChatGPT-based Models performance comparison on five climate-related NLP tasks.