Dynamic Sentiment Analysis with Local Large Language Models using Majority Voting:
A Study on Factors Affecting Restaurant Evaluation

Junichiro Niimi [email protected] Meijo University RIKEN AIP

Abstract

User-generated contents (UGCs) on online platforms allow marketing researchers to understand consumer preferences for products and services. With the advance of large language models (LLMs), some studies have utilized the models for annotation and sentiment analysis. However, the relationship between the accuracy and the hyper-parameters of LLMs is yet to be thoroughly examined. In addition, the variability and reproducibility of results across individual LLM trials have rarely been considered in the existing literature. Since actual human annotation uses majority voting to resolve disagreements among annotators, this study introduces a majority voting mechanism to a sentiment analysis model using local LLMs. Through a series of three analyses of online restaurant reviews, we demonstrate that majority voting with multiple attempts using a medium-sized model produces more robust results than a single attempt using a large model. Furthermore, we conduct an additional analysis to investigate the effect of each aspect on the overall evaluation.

Keywords Marketing $\cdot$ Natural Language Processing $\cdot$ Sentiment Analysis $\cdot$ Large Language Model $\cdot$ Quantization

1 Introduction

Nowadays, as consumers spontaneously post their opinions regarding products and services on online platforms, such as social media and review apps, user-generated contents (UGCs) are widely available in various business fields. Consequently, the analysis of textual data is crucial for market research in terms of product development, service improvement, and other activities. For example, textual data collected from online platforms and social networks have been utilized for corporate decision-making, such as evaluating [1] and extracting [2] product features, constructing a recommendation system [3], and assessing the effect on purchase intention [4]. However, even in this situation, text data are yet to be fully utilized relative to the amount of accumulated data [5].

To extract, utilize, and understand consumer preferences from textual data, pre-processing the data through assigning labels, classifying the text, and evaluating the sentiment is essential. However, these are labor-intensive tasks for humans, whereas automated analysis using natural language processing (NLP) technology requires domain knowledge such as text mining and machine learning. Alternatively, companies can use crowdsourcing services such as Amazon Mechanical Turk (MTurk), which have become widely used in recent years but are costly, particularly when dealing with large-scale data. Moreover, the quality of crowdsourced data has been a serious concern in academia [6, 7]. Thus, as enormous amounts of data accumulate, handling big data becomes more challenging and often impractical.

Large language models (LLMs) have become readily available through multiple cloud services, such as ChatGPT by OpenAI (https://chatgpt.com), Gemini by Google (https://gemini.google.com), and Claude by Anthropic (https://claude.ai). These LLMs are used for a wide range of tasks, leading to the continuous development of new services and applications. In addition, several studies have proposed automated annotation models using LLMs [8, 9, 10]. LLMs offer advantages such as high processing speed, low cost, and reproducibility compared to human annotators [9]. Moreover, the initial costs for computational resources and training data preparation are lower than those of machine learning. For instance, ChatGPT has been shown to operate at one-thirtieth the cost of MTurk [10].

However, from a practical perspective, the confidentiality of in-house data is an important issue for companies. The utilization of cloud-based LLMs in business activities may also pose significant security risks, such as information leakage or data falsification. In fact, some studies [11, 12] have reported that the protection of user data is an obstacle to introducing cloud services. Additionally, there are serious concerns about the network, compliance, and information security of the cloud environment for commercial use. In particular, companies may adopt a more conservative approach to using AI due to concerns regarding unauthorized learning of their data within the AI sector. Therefore, LLMs that operate on local computers have attracted considerable attention. However, according to the existing literature [13], their performance varies significantly depending on hyper-parameters, such as the precision of quantization, and the relationships between these factors have not been fully explored, particularly for marketing research.

Therefore, in this study, we propose a model for sentiment analysis that can be executed on local computing resources using commercially available open-source LLMs. To analyze multiple aspects of the opinion dynamically, the model is built for aspect-based sentiment analysis (AbSA) [14] with arbitrary aspects set by the authors. Furthermore, this study is not limited to the mere extraction of information by the proposed model but demonstrates that the obtained data can be utilized for further statistical analysis. The remainder of this paper is organized as follows. In Section 2, we review related studies to contextualize this research. In Section 3, we introduce our proposed model. An overview and the results of the analysis are presented in Section 4. Finally, in Section 5, we discuss the implications and challenges of this study.

2 Related Works

2.1 Sentiment Analysis

Previous studies on sentiment analysis mainly focused on electronic word-of-mouth (eWOM) and social network posts [15]. These methods can be broadly classified into four groups: rule-based models, machine learning, deep neural networks (DNNs), and LLMs.

2.1.1 Rule-based Models

Several well-known models, such as valence aware dictionary for sentiment reasoning (VADER) [16], semantic orientation calculator (SO-CAL) [17], and TextBlob [18], have been proposed for rule-based sentiment analysis using lexicons. These models have been widely adopted for sentiment analysis [15, 19, 20, 21, 22]. They have several advantages. First, they do not require a training process since all the evaluations are based on pre-determined rules and lexicons. Second, they offer high interpretability of the obtained results, whereas machine learning in general is considered a black-box model [16, 23]. In addition, their inference time is significantly shorter than that of the other methods.

However, several challenges remain. For example, a rule-based model assumes that all words used are included in the dictionary, making it difficult to deal with unknown words. Moreover, dynamic capturing of sentiments is difficult because the applicability of negation in a sentence is assessed based on a rule [17]. The major problem is that these models can only adjudicate polarity across the entire text; hence, they cannot measure individual perspectives as AbSA does. Considering these characteristics, while rule-based models certainly have significant advantages in being easy, fast, and inexpensive to implement, they struggle to perform sentiment analysis flexibly from an individual perspective. Therefore, a more advanced approach is required.

2.1.2 Machine Learning

Another approach in sentiment analysis is the use of machine-learning techniques [24, 25]. Major machine learning techniques, such as k-nearest neighbor (kNN) [26], Naïve Bayes (NB), and linear support vector machine (linear SVM) [27], can perform sentiment analysis of textual data as part of the classification task. To handle text using these models, the sentences must first be converted into numerical vectors such as word embeddings. Well-known approaches include term frequency-inverse document frequency (TF-IDF) [28], word2vec [29], and FastText [30, 31]. A previous study [32] that implemented sentiment analysis on review data of movies and hotels compared the accuracies of kNN and NB. Another study [33], which classified consumers based on online reviews according to the extent of their satisfaction with hotels, adopted NB. A further study [34], which conducted sentiment analysis on multiple datasets, including Amazon’s online reviews, employed a combination of TF-IDF and linear SVMs.

These machine-learning techniques can estimate the polarity for individual dimensions, in addition to that over the whole text, as long as correct labels are available. However, correct labels and training processes are necessary for this purpose. As mentioned previously, preparing a sufficient dataset for training is labor-intensive or costly. Furthermore, other challenges exist in acquiring the features. For example, embedding methods, even word2vec, do not capture the dynamic meaning of words based on the relationship between several sentences. That is, the same word always receives the same representation, even if it has different meanings, so word polysemy cannot be handled. In this situation, it is difficult to comprehend the ever-changing interests of consumers and respond flexibly to unknown perspectives that may emerge in the future, which is an important objective in the use of texts for planning marketing strategies.

2.1.3 Deep Neural Networks

With the advancement of neural networks, various DNN models have been proposed for sentiment analysis [35, 36]. Typically, DNN models also adopt embedding techniques to obtain distributed representations. First, convolutional neural network (CNN)-based methods leverage their strength in capturing local patterns. A study [37] that constructed sentiment classification models on several text datasets adopted a combination of FastText and CNN. The model alternates between convolutional layers and max-pooling layers applied to the 2D feature representations, followed by fully connected layers with ReLU activation, and finally, classification with a softmax function. Second, the combination of the attention mechanism [38] and recurrent layers, including recurrent neural network (RNN) [39], gated recurrent unit (GRU) [40], and long short-term memory (LSTM) [41], has been explored in some studies [42, 43, 44] to develop sentiment analysis models that are sensitive to prediction-relevant information. Moreover, bidirectional encoder representations from transformers (BERT) [45] and its advanced variants [46, 47] have made substantial contributions to textual analysis. They can obtain deep contextualized word representations [48] in which the meaning of a word changes dynamically based on its interaction with other words in the sentence. One study [49] established a sentiment analysis model using BERT to predict sentiments of user reviews on online platforms. For AbSA, a study [3] simultaneously predicted multiple aspects of hotel evaluation, such as overall, service, and location ratings, from user reviews on online platforms. The authors [3] not only proposed the AbSA model but also constructed a recommender based on the predicted sentiments. That is, an easy-to-implement AbSA would lead to a deeper understanding of consumer preferences and contribute to the construction of personalized services.

In addition, multimodal deep learning [50] has received increasing attention. It combines multiple data streams and considers their relationships to construct a robust predictor [51]. In multimodal sentiment analysis [52, 53], a model is extended to handle modalities other than textual data, such as numeric values, images, and audio, and can utilize non-verbal information absent in text to construct relationships among the modalities. A study [54] that focused on user ratings for restaurants constructed a multimodal model that simultaneously integrates the textual data of review texts and the tabular data of user and restaurant information using BERT and cross-attention. (The study [54] uses the same dataset as this study. However, because of different data extraction conditions, a direct comparison is not possible.)

DNN models have a strong advantage in terms of their high prediction accuracy because of their multilayer structure and nonlinear modeling [55]. For AbSA, the models can predict multiple dimensions simultaneously with an appropriate loss function. However, using deep learning has drawbacks of high computational costs in terms of both time and resources required for training the model. In general, large-scale computational resources using GPU and large amounts of data are required.

2.1.4 Large Language Models

Recently, a few studies have utilized LLMs for the annotation of unstructured data and sentiment analysis [56]. In sentiment analysis, as in the general usage of LLMs, an analyst creates a prompt that includes the instruction and the review text and passes it to the LLM. The sentiment values are then extracted from the response. For example, a study [8] employed GPT-3, one of the model variants provided in ChatGPT, to compare several annotation approaches. Another study [57] used GPT-3.5 for multiple analyses, including sentiment prediction. In terms of comparison with human annotation, a previous study [10] reported that the zero-shot model outperformed MTurk’s crowdworkers by an average of 25% in terms of accuracy on the four annotation tasks. (In the zero-shot model, a task can be performed with sufficient accuracy without providing any examples of answers in the prompt. Similarly, the few-shot model can be executed with only a small number of examples.) Another study [58] used GPT-4 to predict the sentiments of social media posts and reported that the predictions substantially matched the human rating values. Moreover, a recent study [9] also used the GPT-4 model to annotate multiple datasets; however, the results presented in that study are limited to summary statistics, and it is unclear what accuracy was achieved in specific tasks, making it difficult to assess its usefulness.

To summarize, the first and most significant advantage of sentiment analysis using LLMs is their ease of use. These methods can be operated in natural language and therefore require less expertise than any of the models described thus far. While research in this field is still relatively limited and many studies are in preprint stages, the existing body of work supports the efficacy of LLMs in text classification tasks, including sentiment analysis, indicating their promising potential. In addition, analyses using LLMs tend to be more accurate than the traditional models, which may be because LLMs in general are pre-trained on a large-scale text corpus. The pretraining process makes them zero- or few-shot learners [59]. In other words, they have high potential for use in marketing research since they can provide sufficient accuracy in annotating unknown data without additional learning (i.e., fine-tuning).

However, most of these studies do not attempt multiple annotation trials and thus rarely consider the variability and consistency of LLM outputs. For business applications, the reproducibility and consistency of results are crucial. Even in actual human annotation, tasks are generally conducted by multiple workers. If there is a disagreement among workers, solutions such as discussions and majority voting are adopted to determine the final evaluation. One study [9] that examined the change in accuracy when LLM annotations are repeated multiple times showed that the higher the consistency across multiple annotations, the higher the final accuracy, and recommended three or more trials for the annotation of one dataset. In other words, by incorporating a majority voting mechanism into multiple LLM attempts, the performance on the task is expected to increase. Therefore, in this study, we develop an LLM-based sentiment analysis model that generates multiple workers inside the model, each of which votes on the sentiment, to produce more robust results.

2.2 Local LLMs and Quantization

With the rise of online AI chat services, running LLMs on local devices, such as laptop computers and smartphones, has been explored [60]. LLMs are generally assumed to be executed in resource-abundant environments; thus, the major challenge on local devices is the limited computing resources. One approach for addressing this issue is to apply quantization [61, 62, 63, 64, 65]. Quantization is defined as "to map floating-point numbers into low-bit integers" [66]; it discretizes the continuous values of the LLM parameters to compress the model size, reduce memory usage, and accelerate execution [67].

Quantization can be divided into two approaches: quantization-aware training (QAT) and post-training quantization (PTQ). QAT designs the learning process on the assumption of quantization. The inference accuracy tends to increase since the model can adapt to quantization from the early stages of training. However, because quantization must be considered throughout the training process, QAT requires a large amount of resources and expertise compared to PTQ [68]. In contrast, PTQ applies quantization after training. It requires only a small amount of data for calibrating the model, and its implementation is more efficient than that of QAT [66]. In this study, we focus on PTQ, a widely used quantization technique owing to its low training costs.

As previously mentioned, quantization generally limits the parameter $w\in\mathbb{R}$ to an integer. For example, $b$-bit quantization ($b\in\mathbb{N}$) uses the map function $\phi:(\mathbb{R},\mathbb{N})\to[0,2^{b})$ to obtain the quantized parameter $\hat{w}$ [67]. The actual PTQ algorithm has variants such as LLM.int8() [61], GPTQ [62], SpQR [69], AWQ [64], and GGUF [65]. Whichever technique is used, the precision of quantization, specifically the number of bits, should be chosen carefully because lower precision generally degrades the performance of the quantized model. In other words, there is a tradeoff between the performance of the model and the precision of the quantization [13, 70].
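To make the map $\phi$ concrete, the following is a minimal sketch of uniform (min-max) $b$-bit quantization. It is an illustrative assumption only and does not reproduce GGUF or any of the algorithms cited above; the function names `quantize`, `dequantize`, and `max_abs_error` are hypothetical.

```python
def quantize(weights, bits):
    """Map real-valued weights to integer codes in [0, 2**bits) by min-max scaling."""
    levels = 2 ** bits - 1                      # largest integer code
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate real values from the integer codes."""
    return [c * scale + lo for c in codes]


def max_abs_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    codes, scale, lo = quantize(weights, bits)
    restored = dequantize(codes, scale, lo)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Toy weights (made up for illustration).
weights = [-0.82, -0.31, 0.0, 0.07, 0.45, 0.93]
for bits in (3, 4, 5):
    print(bits, max_abs_error(weights, bits))
```

In this simplified scheme, the rounding error of each weight is bounded by half the step size, and the step size shrinks roughly by half with each additional bit, which is one way to view the tradeoff between precision and model quality.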

In summary, although methods to execute LLMs on resource-constrained devices have been explored, points such as the relationship between precision and accuracy have not been sufficiently verified. Additionally, it is difficult for industries to comprehensively evaluate an optimal model with a suitable balance between execution speed and accuracy in marketing analysis. Therefore, to understand this relationship, we construct AbSA models using pre-trained LLMs that have different numbers of parameters and are quantized with different precisions. In selecting the pretrained LLMs, we mainly focus on 4-bit quantization since several studies [68, 71] have shown that 4-bit quantization can perform close to the nonquantized model. It should be noted, however, that the performance of LLMs is highly dependent on the task [66].

3 Proposed Model

3.1 Pretrained Models

In terms of pretrained models, we employ instruction-tuned models that require no additional training. In instruction tuning [72], the model is trained in advance using a combination of various instructions and their expected responses to adapt to a wide range of tasks. In this study, we adopt Llama provided by Meta [73], an open-source LLM whose license permits both commercial and academic use provided that several conditions are fulfilled. According to the model card [74], the latest Llama 3 is trained with more than 15 trillion tokens from publicly available data sources. Furthermore, the fine-tuning process of the model includes more than 10,000 manually annotated examples in addition to instructional data, as well as reinforcement learning from human feedback (RLHF) [75]. Considering these learning processes, sufficient accuracy is expected for sentiment analysis of eWOM without additional training.

We focused on three factors: model scale (i.e., number of parameters), precision for quantization, and model architecture. First, in terms of model scale, the latest Llama 3 has two variants: 8 billion (8B) and 70 billion (70B). Second, in terms of precision, we primarily used 4-bit models with additional 3-bit and 5-bit models. Finally, the Llama 2 model was used to validate the impact of the model architecture and pretraining process. Table 1 lists the employed models. We compared the performance and duration of each model in practical marketing research tasks. For quantization, the GGUF format [65] is employed, which has been widely adopted within the LLM community because of its high practicality.

Table 1: Employed Pre-trained LLMs

Model Name	Llama	Scale	Precision	PTQ
Meta-Llama-3-70B-Instruct.Q4_K_M	3	70B	4-bit	GGUF
Meta-Llama-3-8B-Instruct.Q5_K_M	3	8B	5-bit	GGUF
Meta-Llama-3-8B-Instruct.Q4_K_M	3	8B	4-bit	GGUF
Meta-Llama-3-8B-Instruct.Q3_K_M	3	8B	3-bit	GGUF
Meta-Llama-2-7B-Chat.Q4_K_M	2	7B	4-bit	GGUF

In addition, instead of fine-tuning, we employ one-shot learning: a single annotated example is included in the prompt to help the model generate an accurate response. The samples used for the instructions were randomly extracted from the dataset, annotated by the authors, and excluded from the test data. Since Llama 3 has a large context window of 8192 tokens, it is sufficient to hold the entire input, including the prompt, one-shot example, and review text. To extract the polarity values as structured data, the model is explicitly required to output text in the JSON format. Once obtained, the text is parsed into tabular data.
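As an illustration of the one-shot prompt and the JSON parsing step, a minimal sketch is given below. The prompt wording, the shortened aspect list `ASPECTS`, and the helper names `build_prompt` and `parse_response` are all hypothetical; the actual prompt used in this study is not reproduced here. The parser assumes that unmentioned aspects are returned as `null` and treats any value outside the 1-5 scale as missing.

```python
import json

# Shortened, hypothetical subset of the aspects, for illustration only.
ASPECTS = ["overall", "price", "dishes"]


def build_prompt(example_review, example_answer, review):
    """Assemble the instruction, one annotated example, and the target review."""
    return (
        "Rate each aspect of the restaurant review from 1 to 5. "
        "Use null for aspects that are not mentioned. "
        "Answer with a JSON object only.\n"
        f"Aspects: {', '.join(ASPECTS)}\n\n"
        f"Review: {example_review}\n"
        f"Answer: {json.dumps(example_answer)}\n\n"
        f"Review: {review}\n"
        "Answer:"
    )


def parse_response(text):
    """Parse the span from the first '{' to the last '}' and keep valid ratings."""
    missing = {aspect: None for aspect in ASPECTS}
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return missing
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return missing
    row = {}
    for aspect in ASPECTS:
        value = data.get(aspect)
        # Accept only plain integers on the 1-5 scale; anything else is missing.
        row[aspect] = value if type(value) is int and 1 <= value <= 5 else None
    return row
```

Each parsed dictionary corresponds to one row of the tabular data. Returning missing values instead of raising an error keeps a malformed response from interrupting a long annotation run.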

3.2 Majority Voting Mechanism

In the field of machine learning, ensemble learning is often employed to reduce the error rate and to generalize results. It combines the predictions of multiple models and has been shown to be robust to noise and outliers [76]. The utility of ensemble learning has been widely confirmed (cf. review articles and books [76, 77, 78]), particularly when the models have different types of prediction errors. In other words, the proposed model is expected to be effective when the responses exhibit moderate diversity. In ensemble learning, majority voting is utilized particularly for multinomial classification, in which each model votes for one class, and the class with the most votes is taken as the final prediction [77]. While ordinary majority voting often uses the average evaluation value, LLMs may occasionally produce unexpected values such as outliers or missing values. Therefore, in this study, we adopt the median as the metric for majority voting.

To implement this mechanism, the reproducibility parameters of the model are used. In most machine learning methods, specifying the initial value of the random number generator (i.e., the seed value) ensures reproducibility in training and inference. In other words, by repeating annotations with different seed values, the model can generate the results obtained by multiple fictitious workers. Based on the responses of each virtual worker, we create two variables: 1) $m^{k}_{w}$, an indicator of whether dimension $k$ is mentioned in the response of worker $w$, and 2) $v^{k}_{w}$, the sentiment value that worker $w$ assigns to dimension $k$ if it is mentioned (where dimension $k\in\{1,2,\dots,14\}$ and worker $w\in\{1,2,\dots,5\}$).

First, for each worker $w$ ,

$$m^{k}_{w}=\begin{cases}1&\text{(if }k\text{ is mentioned)}\\ 0&\text{(otherwise)}\end{cases}\tag{1}$$
$$v^{k}_{w}=\begin{cases}v&\text{(if }k\text{ is mentioned)}\\ 0&\text{(otherwise)}\end{cases}\tag{2}$$

are obtained (sentiment $v\in\{1,2,\dots,5\}$). Second, voting is conducted using the median as follows:

$$m^{k}=\operatorname{median}(m^{k}_{w}\mid w=1,2,3,4,5)\tag{3}$$
$$v^{k}=\begin{cases}\operatorname{median}(v^{k}_{w}\mid v^{k}_{w}\neq 0)&\text{(if }\exists\,w\text{ s.t. }v^{k}_{w}\neq 0\text{)}\\ 0&\text{(otherwise)}\end{cases}\tag{4}$$

This implies that $v^{k}$, the median of the sentiment values for the dimension $k$, is calculated only among nonzero $v^{k}_{w}$. Finally, the voted evaluation $s^{k}$ is obtained as follows:

$$s^{k}=m^{k}v^{k}\tag{5}$$

Thus, this mechanism employs two-stage majority voting to assess the level of sentiment on the dimension $k$ : 1) determining whether the dimension is mentioned or not, and 2) evaluating the sentiment polarity, utilizing the median for both stages.
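The two-stage voting in Eqs. (1)-(5) can be sketched for a single dimension $k$ as follows. This is a minimal illustration rather than the exact implementation used in this study; the function name `vote` is hypothetical, and representing an unmentioned dimension or an unusable response as `None` is an assumption.

```python
from statistics import median


def vote(ratings):
    """Two-stage median voting for one dimension k.

    `ratings` holds one entry per worker: an integer from 1 to 5 if the
    dimension was mentioned, or None otherwise (assumed convention).
    """
    v_w = [r if r is not None else 0 for r in ratings]   # Eq. (2)
    m_w = [1 if v != 0 else 0 for v in v_w]              # Eq. (1)
    m = median(m_w)                                      # Eq. (3)
    nonzero = [v for v in v_w if v != 0]
    v = median(nonzero) if nonzero else 0                # Eq. (4)
    return m * v                                         # Eq. (5)


# Four of five workers mention the dimension, so the vote is kept.
print(vote([4, 5, 4, None, 4]))
# Only two of five workers mention it, so the vote is discarded.
print(vote([None, None, 3, None, 5]))
```

Because the number of workers is odd, the median of the binary indicators in the first stage coincides with a simple majority. In the second stage, an even number of nonzero ratings can yield a half-point median under this definition.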

In addition, the randomness of LLM responses is controlled by a temperature parameter. One study [9] suggests a temperature of 0.2 or above, while another study [79] pointed out that results become highly random and inconsistent at higher temperatures. Therefore, we use a temperature value of 0.2 for moderate randomness.
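The idea of treating each seed as a virtual worker can be sketched as below. The function `stub_annotate` is a hypothetical stand-in for a call to the local LLM: it only mimics seed-dependent variability with a seeded random generator and is not how a language model uses the temperature. A real implementation would instead pass `seed` and `temperature` to the LLM runtime.

```python
import random


def stub_annotate(review, seed, temperature):
    """Hypothetical stand-in for one LLM call; returns a rating from 1 to 5."""
    rng = random.Random(seed)
    base = 4  # pretend the review reads as fairly positive
    # With probability `temperature`, drift by one point in either direction.
    drift = rng.choice([-1, 1]) if rng.random() < temperature else 0
    return min(5, max(1, base + drift))


def run_workers(review, seeds, temperature=0.2, annotate=stub_annotate):
    """Collect one annotation per seed; each seed acts as a virtual worker."""
    return [annotate(review, seed, temperature) for seed in seeds]


first = run_workers("The pasta was great.", seeds=[0, 1, 2, 3, 4])
second = run_workers("The pasta was great.", seeds=[0, 1, 2, 3, 4])
print(first == second)  # identical seeds reproduce the same five responses
```

Fixing the list of seeds makes the whole set of five responses reproducible, while different seeds within the list provide the diversity that voting relies on.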

4 Analysis

To validate the proposed model, this study consists of three consecutive analyses (Studies 1–3). An overview of the analyses is presented in Fig. 1. First, in Study 1, we conduct a sentiment analysis of restaurant reviews posted on an online platform. In the analysis, we construct multiple models with different hyper-parameters and examine how the results change depending on the settings. We do not use the majority voting mechanism here because Study 1 is an exhaustive analysis using many variants, including large-scale models. In Study 2, we integrate the majority voting mechanism into the model chosen in Study 1. We iterate the inference multiple times on a single task and validate the consistency among the trials and the change in accuracy through voting. Finally, in Study 3, two linear regression models are established for further analysis using the obtained aspect-based sentiments. One model predicts the actual rating of the restaurant and the other predicts the rating estimated by the proposed model, both using individual aspects of the evaluation, such as the price of the restaurant and the taste of the dishes. We compare the estimated parameters of the two models and demonstrate that the annotation results of the proposed model do not differ from the ground truth.

Figure 1: Overview of the Analyses

In Studies 1 and 2, the performance of each model is evaluated using three metrics: concordance rate (Acc.), Pearson’s correlation coefficient (Corr.), and root mean square error (RMSE). In general, classification tasks are assessed using metrics such as precision, recall, and F1 score, which assess how well the predictions align with the actual labels. However, because our target variable is ordinal, the magnitude of the prediction errors is important. If a prediction is incorrect, the extent of the error (whether it is a large deviation or a neighboring value) matters. Therefore, we evaluate the extent of the discrepancy between the predicted and actual values.
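The three metrics can be sketched in plain Python as follows. The toy vectors `near` and `far` are made up for illustration: both have the same concordance rate, but their error magnitudes differ, which is the reason RMSE and correlation are reported alongside accuracy.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Concordance rate: share of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted ratings."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(y_true, y_pred):
    """Pearson's correlation coefficient (undefined for constant inputs)."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


y_true = [5, 4, 1, 3, 5]
near = [4, 5, 2, 3, 5]   # three errors, each off by one point
far = [1, 1, 5, 3, 5]    # three errors, each off by three or four points

for name, pred in (("near", near), ("far", far)):
    print(name, accuracy(y_true, pred), rmse(y_true, pred), pearson(y_true, pred))
```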

4.1 Data Overview

In this study, we use the Yelp Open Dataset [80], an open dataset publicly available for academic use. Yelp is an online platform on which users post evaluations and reviews about various facilities, including restaurants, stores, and public institutions. Some studies [54, 81, 82, 83] have utilized it to analyze user reviews. The data contain the ratings $nStars_{ij}$ and review texts $review_{ij}$ that user $i$ posts on facility $j$ (where $i=1,2,\dots,n$ ; $j=1,2,\dots,m$ ). One post was randomly extracted from each user. If a user posted a review for the same restaurant multiple times, only the most recent review was considered for sampling. Therefore, the sample size matches the number of target users extracted.

In addition, each establishment has category tags, such as Restaurant, Coffee & Tea, and Bar; thus, we can extract the target instances by choosing the tags. In this study, only restaurants with a physical store at a fixed address are included in the analysis; therefore, we extract only those tagged with Restaurant and exclude those tagged with Fast Food, Food Trucks, Nightlife, or Bar. We sample 1000 instances from the data, and Table 2 lists the summary statistics. The number of tokens is counted using the Tiktoken tokenizer [84], which is also used in Llama 3.

Table 2: Summary Statistics of the Data

		Mean	Std	Min.	Max.
Textual Data
	#Characters	392.062	302.190	61	2425
	#Tokens	88.164	67.973	13	570
Evaluation Data
	#Stars	3.933	1.371	1	5
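The extraction rules described in this subsection can be sketched as follows. This is an assumption-laden illustration: the field names `user_id`, `business_id`, and `date`, the comma-separated format of the category string, and the helper names `is_target` and `one_review_per_user` are assumed here, and the tag strings are copied from the text above, so their exact spelling in the dataset may differ.

```python
import random

# Tags copied from the text; the spelling in the actual data may differ.
EXCLUDED = {"Fast Food", "Food Trucks", "Nightlife", "Bar"}


def is_target(categories):
    """Keep businesses tagged Restaurant and carrying none of the excluded tags."""
    tags = {t.strip() for t in (categories or "").split(",")}
    return "Restaurant" in tags and not (tags & EXCLUDED)


def one_review_per_user(reviews, rng):
    """Keep the latest review per (user, business), then draw one per user."""
    latest = {}
    for r in reviews:
        key = (r["user_id"], r["business_id"])
        # ISO-formatted date strings compare correctly as plain strings.
        if key not in latest or r["date"] > latest[key]["date"]:
            latest[key] = r
    by_user = {}
    for r in latest.values():
        by_user.setdefault(r["user_id"], []).append(r)
    return [rng.choice(candidates) for candidates in by_user.values()]


# Made-up records for illustration.
reviews = [
    {"user_id": "u1", "business_id": "b1", "date": "2019-01-01", "text": "old"},
    {"user_id": "u1", "business_id": "b1", "date": "2020-01-01", "text": "new"},
    {"user_id": "u2", "business_id": "b1", "date": "2018-05-05", "text": "only"},
]
sample = one_review_per_user(reviews, random.Random(0))
print(len(sample))
```

Passing a seeded `random.Random` instance keeps the sample reproducible, so that the number of sampled reviews equals the number of distinct users.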

4.2 Study 1: Effects of Model Scale and Precision on the Performance

First, in Study 1, we predict the overall rating (i.e., the number of stars given) of restaurants. By comparing the accuracy and processing speed of predictions among multiple pretrained LLMs and reference models, including machine learning methods, we explore the best-balanced model in terms of both prediction accuracy and processing speed. In the analysis, annotation is performed using a 5-point Likert scale, in accordance with the fact that the overall rating by actual users ranges from one to five stars. In addition to the overall sentiment, 14 dimensions are predicted simultaneously for use in Study 2 (cf. Table 3).

Table 3: Aspects of the Sentiment

Dimensions	Explanations
$overall$	overall rating on the restaurant
$price$	price of the restaurant
$menu$	variety of menu
$dishes$	taste of dishes
$dessert$	taste of desserts
$cleanliness$	cleanliness of the restaurant
$atmosphere$	atmosphere of the restaurant
$congestion$	congestion of the restaurant
$noise$	noise in the restaurant
$attitude$	attitude of other customers
$enjoyment$	other entertainment services, such as live music, DJs, and cafe seminars
$child$	child-friendliness
$couple$	suitability for couples
$access$	ease of access

Eight reference models were established to evaluate the model performance. First, we construct three DNN models: a feed-forward neural network (FFNN), a bidirectional LSTM (Bi-LSTM), and a convolutional neural network (CNN). The FFNN model uses BERT (pre-trained model: bert-base-uncased [45]) for text vectorization; the 712-dimensional pooled output of the [CLS] token from BERT is passed through three fully connected layers and classified using a softmax function. The Bi-LSTM model is constructed in accordance with the previous study [85]. We use word2vec for the vectorization and combine Bi-LSTM with multi-head attention, which addresses the long-term dependencies of the context and adds weighted importance to the relevant information. (The previous study [85] itself grouped the target variable into positive, neutral, and negative.) The CNN model is established in accordance with the previous study [37], which used FastText to obtain word embeddings. The model alternates between convolutional layers and max-pooling layers applied to the 2D feature representations, followed by fully connected layers with ReLU activation, and finally, classification with a softmax function. Second, we use three rule-based methods: VADER, SO-CAL, and TextBlob. These methods are based on pre-determined dictionaries and rules, so there are no adjustable parameters. Our predictions are generated directly from their polarity scores without any additional training. (The obtained continuous values were adjusted to the range $[1,5]$.) Third, as a machine learning method, we use linear SVM. Since machine learning methods cannot directly handle the textual modality, we vectorize the text using TF-IDF in accordance with the previous study [34]. To create sufficiently sized features for prediction with TF-IDF, we set the dimensionality to 4000.

Finally, in addition to these well-known models, a benchmark at chance level is created. Instead of generating completely random predictions, we calculate the proportion of sentiment labels in the training data and use it as weights to generate predictions for the test data. Regarding the additional dataset for training and validation, we randomly sampled 5000 observations for the training set and 1000 for the validation set, ensuring no duplication among the datasets.
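The chance-level benchmark and the adjustment of rule-based polarity scores can be sketched as follows. The function names are hypothetical, and the linear rescaling is an assumption: the text only states that the continuous values were adjusted to $[1,5]$, and the score range of $[-1,1]$ assumed here does not necessarily hold for every rule-based method.

```python
import random
from collections import Counter


def chance_level_predictions(train_labels, n_test, rng):
    """Sample predictions with probabilities equal to the training label shares."""
    counts = Counter(train_labels)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_test)


def rescale_polarity(score, lo=-1.0, hi=1.0):
    """Linearly map a polarity score in [lo, hi] to the range [1, 5] (assumed rule)."""
    return 1 + 4 * (score - lo) / (hi - lo)


# Made-up training labels, skewed toward 5 stars for illustration.
train_labels = [5, 5, 5, 4, 4, 1, 3, 5, 2, 5]
preds = chance_level_predictions(train_labels, n_test=8, rng=random.Random(0))
print(preds)
print(rescale_polarity(0.0))
```

Weighting by the empirical label distribution yields a stronger baseline than uniform guessing when the ratings are imbalanced, since frequent ratings are predicted more often.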

The results are shown in Table 4. All the proposed models except Llama 2 (Model 5), that is, Models 1–4, outperformed the reference models. Notably, Llama 3 with 70B parameters (Model 1) demonstrated the best performance across all metrics. First, in terms of the number of parameters, the 70B model in 4-bit (Model 1) is superior to the 8B model in 4-bit (Model 3), indicating that a larger-scale model achieves better performance at the same 4-bit precision. In other words, the scale of the model generally contributes to the prediction accuracy, as previous studies have shown [86, 87]. Second, in terms of quantization precision for the 8B models (Models 2–4), contrary to expectations, the correlation and RMSE improved with lower precision. Third, as all Llama 3 models (Models 1–4) outperformed the Llama 2 model (Model 5), it can be concluded that the improved model structure and pre-training process contributed to the predictions.

Regarding the reference models, performance was in the following order: DNN models (Models 6–8), rule-based models (Models 9–11), the machine learning model (Model 12), and chance level. Although the DNN models achieved high accuracy as expected, it is noteworthy that most of them did not surpass Llama 2. Additionally, among the DNN models, the highest accuracy was obtained when BERT was used for acquiring word embeddings (Model 6). Although Model 7 employs a more complex architecture combining LSTM and multi-head attention, Model 6 with the FFNN was superior in prediction. This confirms the significant improvement of BERT over word2vec and FastText. Second, all rule-based models (Models 9–11) showed better performance than the machine-learning model (Model 12). This result can be attributed to the relatively short length of the review texts and their low text complexity. Alternatively, it could be due to the poor generalization performance of machine-learning methods on imbalanced data, as restaurant ratings tend to gather at extreme values, such as 1 or 5. However, all the models scored above the chance level.

Finally, a comparison of the processing times among the LLMs showed that only Model 1, with 70B parameters, required a significantly longer processing time. Compared with Model 2, even when analyzing only 1,000 samples, the total durations differ by more than 16 hours for a 2.4% improvement in accuracy (0.779 versus 0.755). From a practical standpoint, this increase in the processing time cannot be considered a reasonable trade-off for the improvement in accuracy. It is true that Model 2 has a lower accuracy than Model 1; however, its RMSE of 0.551 indicates that the model still effectively predicts sentiment. It accurately estimates higher values as high and lower values as low, with the fastest processing speed among the proposed LLMs. Therefore, this study employs Llama 3 with 8B parameters quantized in 3-bit (Model 2) for Studies 2 and 3.
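The trade-off above follows directly from the per-review times in Table 4. The short calculation below is a sketch that assumes the 1,000 reviews are processed sequentially and that the mean time per review is representative; the variable names are illustrative.

```python
# Mean per-review processing times in seconds, taken from Table 4
TIME_MODEL_1 = 64.879  # llama-3-70b-instruct_Q4_K_M
TIME_MODEL_2 = 4.560   # llama-3-8b-instruct_Q3_K_M
N_REVIEWS = 1000


def total_hours(seconds_per_review, n_reviews):
    """Total duration in hours, assuming sequential processing."""
    return seconds_per_review * n_reviews / 3600


extra_hours = total_hours(TIME_MODEL_1, N_REVIEWS) - total_hours(TIME_MODEL_2, N_REVIEWS)
slowdown = TIME_MODEL_1 / TIME_MODEL_2
accuracy_gain = 0.779 - 0.755  # Acc. of Model 1 minus Acc. of Model 2

print(f"additional time: {extra_hours:.2f} h")
print(f"slowdown factor: {slowdown:.1f}x")
print(f"accuracy gain:   {accuracy_gain:.3f}")
```

Under these assumptions, the additional time is between 16 and 17 hours and the slowdown is roughly fourteen-fold for a gain of 0.024 in accuracy.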

Table 4: Comparison of the accuracy and the processing time (Study 1, ascending in RMSE)

Model Name		Llama	#Params	#Bits	Corr.	RMSE	Acc.	Time ( $s$ )
LLM models
	Model 01: llama-3-70b-instruct_Q4_K_M	3	70 billion	4	0.929	0.521	0.779	$64.879\pm 6.114$
	Model 02: llama-3-8b-instruct_Q3_K_M	3	8 billion	3	0.913	0.551	0.755	$4.560\pm 1.768$
	Model 03: llama-3-8b-instruct_Q4_K_M	3	8 billion	4	0.909	0.562	0.749	$5.443\pm 2.222$
	Model 04: llama-3-8b-instruct_Q5_K_M	3	8 billion	5	0.892	0.617	0.756	$5.230\pm 2.020$
	Model 05: llama-2-7b-chat_Q4_K_M	2	7 billion	4	0.791	0.860	0.721	$5.986\pm 3.850$
Reference models
	Model 06: BERT + FFNN	-	-	-	0.792	0.941	0.639	-
	Model 07: word2vec + LSTM + Multihead Attention	-	-	-	0.789	0.930	0.636	-
	Model 08: FastText + CNN	-	-	-	0.700	1.098	0.596	-
	Model 09: VADER	-	-	-	0.667	1.111	-	-
	Model 10: TextBlob	-	-	-	0.646	1.120	-	-
	Model 11: SO-CAL	-	-	-	0.661	1.374	-	-
	Model 12: TF-IDF + Linear SVM	-	-	-	0.641	1.134	0.627	-
	Model 13: Chance level	-	-	-	0.019	1.850	0.358	-

Note. Bold text indicates that the proposed model performs better than all reference models, while shading behind the indices represents the highest performance. Time represents the average duration and the standard deviation to process one review. Entries without an Acc. value indicate cases where the prediction is a continuous value and therefore the accuracy cannot be calculated.
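For reference, the three indices reported in Table 4 can be computed as in the following sketch. It assumes that Corr. is the Pearson correlation and that Acc. is the share of exact matches between integer ratings; neither definition is spelled out in this section, and the toy ratings are hypothetical.

```python
import math


def pearson_corr(y_true, y_pred):
    """Pearson correlation between actual and predicted ratings."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def accuracy(y_true, y_pred):
    """Share of exact matches (only meaningful for discrete predictions)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ratings (hypothetical)
y_true = [5, 4, 1, 3, 5]
y_pred = [5, 4, 2, 3, 4]
corr = pearson_corr(y_true, y_pred)
err = rmse(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(corr, err, acc)
```

The rule-based methods return continuous scores, so only the first two functions apply to them, which is why their Acc. cells are empty in Table 4.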

4.3 Study 2: Integration of the Majority Voting Mechanism

In Study 1, we identified the best model for sentiment analysis of overall ratings. However, since evaluating the detailed aspects is significantly more difficult than evaluating the overall rating, ambiguity in the evaluation occurs during annotation. The fact that such fluctuations are resolved by voting in actual human annotation is also an important clue for this study. Therefore, in Study 2, we incorporated majority voting by virtually generating five annotators in one LLM and examined whether the robustness of the results increased with the introduction of voting. As explained, the mechanism operates in two stages with $m^{k}$ (the presence or absence of the mention of dimension $k$ ), $v^{k}$ (the ratings of the dimension $k$ ), and $s^{k}$ (the final evaluation for dimension $k$ ), based on the median of the five workers (see Section 3.2 for details).

Table 5 presents an example of the aggregated evaluations of the scores of five virtual workers for one review. First, as shown in the table, consistency among the workers was confirmed for most dimensions. Second, some of the evaluations were divided. For example, in $menu$ , one worker rated it as 2, while the remaining four workers rated it as 3; hence, the voted sentiment was 3. In $congestion$ , most of the workers evaluated that the dimension was not mentioned, whereas worker 4 evaluated it as 1. Thus, the final sentiment was set to 0, indicating that this aspect was not mentioned.

Table 5: Results of Majority Voting (Study 2)

			rating	price	menu	dishes	dessert	clean	atmosphere	congestion	noise	enjoyment	attitude	child	couple	access
Each Worker
	Worker 1		4	0	2	3	0	0	4	0	0	0	5	0	0	1
	Worker 2		4	0	3	3	0	0	4	0	0	4	5	2	3	1
	Worker 3		4	0	3	3	0	0	4	0	0	4	5	2	3	1
	Worker 4		4	0	3	3	0	0	4	1	1	4	5	2	4	5
	Worker 5		4	0	3	3	0	0	4	0	0	4	5	0	0	1
Majority Voting
	$m^{k}$	: $k$ is mentioned	1	0	1	1	0	0	1	0	0	1	1	1	1	1
	$v^{k}$	: nonzero median	4	0	3	3	0	0	4	1	1	4	5	2	3	1
	$s^{k}$	: sentiment	4	0	3	3	0	0	4	0	0	4	5	2	3	1
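The aggregation in Table 5 can be reproduced with the following sketch. It assumes that $m^{k}=1$ when more than half of the workers give a nonzero rating, that $v^{k}$ is the median of the nonzero ratings, and that $s^{k}=m^{k}\cdot v^{k}$; the exact rules are defined in Section 3.2, and the use of the lower median for an even number of nonzero ratings is an assumption made here to keep integer scores.

```python
from statistics import median_low


def majority_vote(worker_ratings):
    """Aggregate one dimension's ratings from several virtual workers.

    A rating of 0 means "not mentioned"; 1-5 are sentiment scores.
    Returns the tuple (m, v, s).
    """
    nonzero = [r for r in worker_ratings if r != 0]
    # Stage 1: the dimension counts as mentioned if a majority says so (assumed rule)
    m = 1 if len(nonzero) > len(worker_ratings) / 2 else 0
    # Stage 2: median of the nonzero ratings; median_low is an assumption for ties
    v = median_low(nonzero) if nonzero else 0
    # Final sentiment: the median is kept only when the dimension is mentioned
    s = m * v
    return m, v, s


# Selected columns of Table 5 (ratings of Workers 1-5)
table5 = {
    "menu": [2, 3, 3, 3, 3],
    "congestion": [0, 0, 0, 1, 0],
    "child": [0, 2, 2, 2, 0],
    "couple": [0, 3, 3, 4, 0],
    "access": [1, 1, 1, 5, 1],
}
results = {name: majority_vote(ratings) for name, ratings in table5.items()}
for name, (m, v, s) in results.items():
    print(f"{name}: m={m}, v={v}, s={s}")
```

For these columns the sketch returns the same $m^{k}$, $v^{k}$, and $s^{k}$ as Table 5; for instance, the single rating of 1 for $congestion$ yields $v^{k}=1$ but is suppressed because $m^{k}=0$.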

Based on the above, Table 6 shows the lift in performance obtained by incorporating the mechanism. The results indicate that incorporating our proposed mechanism improved performance across all indices, even exceeding the in-seed average of the individual runs (Seeds 1–5). This tendency is similar to ensemble learning, which uses multiple machine-learning models to perform inference tasks and aggregates the results through statistics such as the mean and median.

Table 6: A Lift in Accuracy Depending on the Majority Voting Mechanism (Study 2)

	Lift (ratio to baseline)
Majority Voting	Corr.	RMSE	Acc.
- is employed	1.098	1.327	1.039
- in-Seed average (Seed 1–5)	1.041	1.109	1.017
- is not-employed (baseline)	1.000	1.000	1.000

Note. Lift represents the improvement from the baseline.

Furthermore, the difference in accuracy improvements between Studies 1 and 2 is noteworthy. In Study 1, a fourteen-fold increase in the processing time resulted in an improvement of 2.4%. In contrast, in Study 2, a five-fold increase in the processing time led to an improvement of 3.9% using majority voting. This indicates that a medium-sized model with iterative inferences using the majority voting mechanism is significantly more efficient in terms of both processing time and prediction accuracy than the larger model with a single inference.

4.4 Study 3: Regression Analysis of the Factors Affecting the Evaluation

Thus far, we have validated the sentiment analysis using LLMs (Study 1) and the robustness of the results using the majority voting mechanism (Study 2). In other words, we can freely and accurately extract any aspect of sentiment from review texts with medium-scale LLMs using the majority voting mechanism. Finally, in Study 3, to demonstrate that further analyses are possible with the annotated data, we use regression analysis to examine how each aspect affects the overall evaluation. Since the dataset does not contain the sentiment values of each aspect, we assess the generalizability of the results by comparison with the marketing literature. A similar analysis was conducted in a previous study [22], which examined the relationship between the emotion of posts on a social network and the total amount of monthly donations to a university by constructing linear regression models based on predicted sentiments.

We use a generalized linear model (GLM) to predict the overall rating of restaurants based on aspect-based sentiments (Table 3) as explanatory variables. In this process, we construct two GLMs with different target variables: one using the actual user rating and the other using the rating predicted by the LLM. By comparing the estimated parameters and other values of the two models, we verify that they have similar structures. Since there is a risk of multicollinearity if all the dimensions considered in this study are simultaneously used in the model, explanatory variables are selected based on Bayesian information criterion (BIC) to explore the best model.
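As a consistency check on the information criteria used for variable selection, the standard definitions $\mathrm{AIC}=2k-2\ln L$ and $\mathrm{BIC}=k\ln n-2\ln L$ imply $\mathrm{BIC}-\mathrm{AIC}=k(\ln n-2)$ for the same fitted model. The sketch below compares this gap with the values reported in Table 7, assuming $k=8$ estimated parameters (seven selected dimensions plus the intercept); this value of $k$ is an assumption and is not stated in the text.

```python
import math


def bic_minus_aic(k, n):
    """Gap between BIC and AIC for one fitted model: k*ln(n) - 2k."""
    return k * (math.log(n) - 2)


n = 1000  # sample size reported in Table 7
k = 8     # assumed: intercept + seven selected dimensions

expected_gap = bic_minus_aic(k, n)
reported_gap_y1 = 2597.406 - 2558.144  # BIC - AIC for Y1 (actual rating)
reported_gap_y2 = 2323.539 - 2284.276  # BIC - AIC for Y2 (predicted rating)

print(f"expected gap: {expected_gap:.3f}")
print(f"reported gaps: {reported_gap_y1:.3f} (Y1), {reported_gap_y2:.3f} (Y2)")
```

Under this assumption, the expected gap agrees with both reported gaps up to rounding, which is consistent with both models sharing the same eight parameters.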

Table 7: A Comparison of the Regression Results (Study 3)

$n=1000$	$Y1$ : Actual evaluation				$Y2$ : Predicted evaluation				Diff.
	coef.	(SE)	$z$ -value		coef.	(SE)	$z$ -value		$t$ -value
intercept	1.653	(0.074)	22.244	^†	1.477	(0.064)	22.797	^†	1.781
menu	0.144	(0.020)	7.154	^†	0.144	(0.017)	8.229	^†	-0.016
dishes	0.173	(0.019)	9.029	^†	0.216	(0.016)	12.915	^†	-1.682
dessert	0.054	(0.017)	3.205	^†	0.067	(0.014)	4.507	^†	-0.546
atmosphere	0.087	(0.013)	6.629	^†	0.091	(0.011)	7.988	^†	-0.253
congestion	-0.155	(0.027)	-5.628	^†	-0.112	(0.024)	-4.695	^†	-1.155
enjoyment	0.316	(0.017)	18.489	^†	0.334	(0.014)	22.398	^†	-0.785
access	0.069	(0.024)	2.883	^†	0.055	(0.021)	2.621	^†	0.450
pseudo- $R^{2}$	0.768				0.894
AIC	2558.144				2284.276
BIC	2597.406				2323.539

Note. ${*}:p<0.05$ , ${\dagger}:p<0.01$ .

With variable selection, seven dimensions were adopted in addition to the intercept. Table 7 shows the parameter estimates for the actual user ratings ( $Y1$ , left) and the LLM-predicted ratings ( $Y2$ , right). First, for Y1, the results confirmed statistical significance at the 1% level for all explanatory variables. In particular, the $z$ values show that $dishes$ and $enjoyment$ have a strong positive effect and $congestion$ has a strong negative effect. These are assessed as valid results, as several studies have reported a similar tendency for the effects of factors such as "food quality" [88, 89, 90], "entertainment" [91], and "waiting time for a meal" [90, 92] on outcomes such as perceived value, customer satisfaction, and behavioral intention. In addition to Y1, statistical significance at the 1% level is also confirmed for all the variables in Y2. The signs of the estimated coefficients and the $z$ values are similar to those in Y1, and the $t$ tests for all explanatory variables using the coefficients and standard errors reveal no significant differences between the models (cf. Diff. in Table 7).
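The comparison of coefficients between the two models can be sketched as follows. The statistic $t=(b_{1}-b_{2})/\sqrt{SE_{1}^{2}+SE_{2}^{2}}$ used here is a common unpooled form and is an assumption, since the exact test is not specified in the text; because it is computed from the rounded values in Table 7, it reproduces the Diff. column only approximately. The threshold of 1.96 is the large-sample two-sided 5% critical value.

```python
import math

# (coef, SE) pairs from Table 7: Y1 = actual rating, Y2 = LLM-predicted rating
table7 = {
    "intercept": ((1.653, 0.074), (1.477, 0.064)),
    "menu": ((0.144, 0.020), (0.144, 0.017)),
    "dishes": ((0.173, 0.019), (0.216, 0.016)),
    "dessert": ((0.054, 0.017), (0.067, 0.014)),
    "atmosphere": ((0.087, 0.013), (0.091, 0.011)),
    "congestion": ((-0.155, 0.027), (-0.112, 0.024)),
    "enjoyment": ((0.316, 0.017), (0.334, 0.014)),
    "access": ((0.069, 0.024), (0.055, 0.021)),
}


def coef_diff_t(b1, se1, b2, se2):
    """Unpooled statistic for the difference between two coefficients (assumed form)."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)


t_values = {name: coef_diff_t(*y1, *y2) for name, (y1, y2) in table7.items()}
for name, t in t_values.items():
    flag = "significant" if abs(t) > 1.96 else "not significant"
    print(f"{name}: t = {t:.3f} ({flag})")
```

With these inputs, every statistic stays below 1.96 in absolute value, in line with the conclusion that the two models do not differ significantly.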

In summary, Study 3 shows that there is no significant discrepancy between the predicted and actual values of the overall evaluation and that further analysis can be implemented using aspect-based sentiments. This indicates that the proposed model can accurately assess the impact of any dimension on the overall evaluation. In other words, points that have not been fully evaluated in previous studies can be further analyzed by extracting more detailed perspectives from a large amount of review data. For example, the previous study [91] investigated the effect of the entertainment factor on the evaluation in the form of the free provision of board games at a café. By using the proposed model, we can freely define and quantify the entertainment factor and then analyze its relationship to the overall rating.

5 Conclusion

In this study, we demonstrated the utility of LLMs in AbSA through a series of three investigations: 1) predicting the sentiments of multiple aspects from online review texts posted by consumers using existing pre-trained LLMs and comparing the accuracy with multiple reference models, 2) incorporating a majority voting mechanism similar to human annotation and examining its impact, and 3) constructing linear regression models between the ratings and aspects and investigating whether a discrepancy exists between the predicted and actual sentiment.

More specifically, in Study 1, by comparing the prediction accuracy and processing time of multiple pre-trained LLMs, we demonstrated that it is not necessary to use the largest-scale model in terms of the number of model parameters and the precision of quantization. Moreover, the results showed that classification is possible with higher accuracy than well-known existing methods, such as DNNs, without any training. Thus, the proposed methodology is highly useful for practical applications because it does not require large-scale computational resources. In Study 2, we integrated the majority voting mechanism into the model from Study 1 and validated the change in robustness. The results showed that voting improved all the indices. When only a single model is used, LLMs sometimes fail to generate a response, which results in missing values. However, by introducing the majority voting mechanism, the missing values and prediction errors of one worker are compensated for by the others, which improves the accuracy. Thus, as in previous studies [9], annotation errors were reduced. In Study 3, we compared two regression models predicting the actual and assessed ratings of restaurants and showed that the results are sufficiently general and that no significant difference was confirmed between the models.

In summary, first, in terms of the selection of the pre-trained models, multiple inferences with a medium-sized model combined with majority voting of the results are much faster and more accurate than a single inference on a large model with a large number of parameters. Contrary to the initial expectations, a tendency for accuracy to improve with lower precision was confirmed, and the lower precision speeds up inference, which reduces the time burden of iterating multiple inferences. Second, using the proposed model and arbitrary dimensions, we can dynamically annotate free-form texts and use the obtained structural data for advanced analysis. In particular, the most significant difference from the existing annotation methods is that even if the dimension used for annotation is minor or has never been used before, as long as it can be linguistically explained, we can utilize it as a dimension in the model. Thus, the proposed model has high applicability for opinion mining in questionnaires and knowledge extraction from documents across various industries, including marketing research, policymaking in administration, and patent document analysis.

In terms of practical implications, the proposed model provides several advantages. First, since it is constructed in a local environment, organizations can address security concerns regarding the data leakage and falsification that can occur in cloud services. In particular, while companies are sometimes caught between the utilization and protection of data, this model enables them to leverage data while minimizing security risks. Second, unlike cloud services, a single purchase of a computer allows businesses to construct the model. That is, they can easily estimate the introduction and running costs, which means that companies can strategically manage their expenses for AI utilization and prevent unexpected expenditures. Third, since no additional cost occurs even if inferences are repeated with different dimensions, marketing researchers can freely explore aspects that are useful for business.

Finally, this study has several challenges. First, while the proposed model has annotated arbitrary dimensions, these dimensions may not necessarily provide a comprehensive understanding of the restaurant evaluation. Thus, further investigations are needed to annotate free-form texts using existing survey scales, such as SERVQUAL [93] and DINESERV [94]. Second, since it has been pointed out that the performance of LLMs varies depending on the instructions, it is necessary to verify the changes when other examples are used, as well as the differences among zero-shot, one-shot, and few-shot settings. Third, using the obtained structural data, various other analyses can also be implemented, such as factor analysis, correspondence analysis, clustering of users, and the development of a recommendation system. Therefore, further investigations are required to confirm whether these advanced analyses can be applied to the proposed model.

Acknowledgements

The dataset used in this study consisted of open data for academic use. As no additional information was collected, the author did not obtain any information that could lead to the identification of individuals. The large language models used to construct the analysis model were licensed for commercial and academic use. Both the dataset and the models were managed and used in an appropriate environment that complies with the terms of use of the companies from which they were made available.

This study is supported by JSPS KAKENHI (Grant Number: 24K16472).

References

[1] Erick Kauffmann, Jesús Peral, David Gil, Antonio Ferrández, Ricardo Sellers, and Higinio Mora. Managing marketing decision-making with sentiment analysis: An evaluation of the main product features using text data mining. Sustainability, 11(15):4235, 2019.
[2] Zhiyi Luo, Shanshan Huang, Frank F Xu, Bill Yuchen Lin, Hanyuan Shi, and Kenny Zhu. Extra: Extracting prominent review aspects from customer feedback. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3477–3486, 2018.
[3] Yuanyuan Zhuang and Jaekyeong Kim. A BERT-Based Multi-Criteria Recommender System for Hotel Promotion Management. Sustainability, 13(14):8039, 2021.
[4] Affifa Sardar, Amir Manzoor, Khurram Adeel Shaikh, and Liaqat Ali. An empirical examination of the impact of ewom information on young consumers’ online purchase intention: Mediating role of ewom information adoption. Sage Open, 11(4):21582440211052547, 2021.
[5] Yin Kang and Lina Zhou. Rube: Rule-based methods for extracting product features from online consumer reviews. Information & Management, 54(2):166–176, 2017.
[6] Michael Chmielewski and Sarah C Kucker. An mturk crisis? shifts in data quality and the impact on study results. Social Psychological and Personality Science, 11(4):464–473, 2020.
[7] Benjamin D Douglas, Patrick J Ewell, and Markus Brauer. Data quality in online human-subjects research: Comparisons between mturk, prolific, cloudresearch, qualtrics, and sona. Plos one, 18(3):e0279720, 2023.
[8] Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Shafiq Joty, Boyang Li, and Lidong Bing. Is gpt-3 a good data annotator? arXiv preprint arXiv:2212.10450, 2022.
[9] Nicholas Pangakis, Samuel Wolken, and Neil Fasching. Automated annotation with generative ai requires validation. arXiv preprint arXiv:2306.00176, 2023.
[10] Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023.
[11] Bader Alouffi, Muhammad Hasnain, Abdullah Alharbi, Wael Alosaimi, Hashem Alyami, and Muhammad Ayaz. A systematic literature review on cloud computing security: threats and mitigation strategies. IEEE Access, 9:57792–57807, 2021.
[12] Mazhar Ali, Samee U Khan, and Athanasios V Vasilakos. Security in cloud computing: Opportunities and challenges. Information sciences, 305:357–383, 2015.
[13] Sher Badshah and Hassan Sajjad. Quantifying the capabilities of llms across scale and precision. arXiv preprint arXiv:2405.03146, 2024.
[14] Ambreen Nazir, Yuan Rao, Lianwei Wu, and Ling Sun. Issues and Challenges of Aspect-based Sentiment Analysis: A Comprehensive Survey. IEEE Transactions on Affective Computing, 13(2):845–863, 2022.
[15] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4):1093–1113, 2014.
[16] Clayton Hutto and Eric Gilbert. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pages 216–225, 2014.
[17] Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. Lexicon-based methods for sentiment analysis. Computational linguistics, 37(2):267–307, 2011.
[18] Steven Loria. TextBlob: Simplified Text Processing. (https://github.com/sloria/TextBlob, accessed June. 11th, 2024), 2024.
[19] Christopher SG Khoo and Sathik Basha Johnkhan. Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science, 44(4):491–511, 2018.
[20] Anton Borg and Martin Boldt. Using vader sentiment and svm for predicting customer response sentiment. Expert Systems with Applications, 162:113746, 2020.
[21] Sameh Al-Natour and Ozgur Turetken. A comparative assessment of sentiment analysis and star ratings for consumer reviews. International Journal of Information Management, 54:102132, 2020.
[22] Sanghyub John Lee, Leo Paas, and Ho Seok Ahn. The power of specific emotion analysis in predicting donations: A comparative empirical study between sentiment and specific emotion analysis in social media. International Journal of Market Research, page 14707853241261248, 2024.
[23] Wouter Van Atteveldt, Mariken ACG Van der Velden, and Mark Boukes. The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms. Communication Methods and Measures, 15(2):121–140, 2021.
[24] Susan AM Vermeer, Theo Araujo, Stefan F Bernritter, and Guda van Noort. Seeing the wood for the trees: How machine learning can help firms in identifying relevant electronic word-of-mouth in social media. International Journal of Research in Marketing, 36(3):492–508, 2019.
[25] Qing Sun, Jianwei Niu, Zhong Yao, and Hao Yan. Exploring ewom in online customer reviews: Sentiment analysis at a fine-grained level. Engineering Applications of Artificial Intelligence, 81:68–78, 2019.
[26] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.
[27] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
[28] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972.
[29] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[30] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the association for computational linguistics, 5:135–146, 2017.
[31] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[32] Lopamudra Dey, Sanjay Chakraborty, Anuraag Biswas, Beepa Bose, and Sweta Tiwari. Sentiment analysis of review datasets using naive bayes and k-nn classifier. arXiv preprint arXiv:1610.09982, 2016.
[33] Manuel J Sánchez-Franco, Antonio Navarro-García, and Francisco Javier Rondán-Cataluña. A naive bayes strategy for classifying customer satisfaction: A study based on online reviews of hospitality services. Journal of Business Research, 101:499–506, 2019.
[34] Bijoyan Das and Sarit Chakraborty. An improved text sentiment classification model using tf-idf and next word negation. arXiv preprint arXiv:1806.06407, 2018.
[35] Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253, 2018.
[36] Hai Ha Do, PWC Prasad, Angelika Maag, and Abeer Alsadoon. Deep Learning for Aspect-Based Sentiment Analysis: A Comparative Review. Expert Systems with Applications, 118:272–299, 2019.
[37] Muhammad Umer, Zainab Imtiaz, Muhammad Ahmad, Michele Nappi, Carlo Medaglia, Gyu Sang Choi, and Arif Mehmood. Impact of convolutional neural network and fasttext embedding on text classification. Multimedia Tools and Applications, 82(4):5569–5585, 2023.
[38] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[39] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
[40] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[41] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[42] Mohammad Ehsan Basiri, Shahla Nemati, Moloud Abdar, Erik Cambria, and U Rajendra Acharya. Abcdm: An attention-based bidirectional cnn-rnn deep model for sentiment analysis. Future Generation Computer Systems, 115:279–294, 2021.
[43] Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 conference on empirical methods in natural language processing, pages 452–461, 2017.
[44] Qiongxia Huang, Riqing Chen, Xianghan Zheng, and Zhenxing Dong. Deep sentiment representation based on cnn and lstm. In 2017 international conference on green informatics (ICGI), pages 30–33. IEEE, 2017.
[45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[46] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[47] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[48] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, 2018. Association for Computational Linguistics.
[49] Zeynep Hilal Kilimci. Prediction of user loyalty in mobile applications using deep contextualized word representations. Journal of Information and Telecommunication, 6(1):43–62, 2022.
[50] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.
[51] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE signal processing magazine, 34(6):96–108, 2017.
[52] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. A survey of multimodal sentiment analysis. Image and Vision Computing, 65:3–14, 2017.
[53] Ankita Gandhi, Kinjal Adhvaryu, Soujanya Poria, Erik Cambria, and Amir Hussain. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91:424–444, 2023.
[54] Junichiro Niimi. Multimodal deep learning of word-of-mouth text and demographics to predict customer rating: Handling consumer heterogeneity in marketing. arXiv preprint arXiv:2401.11888, 2024.
[55] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[56] Jan Ole Krugmann and Jochen Hartmann. Sentiment analysis in the age of generative ai. Customer Needs and Solutions, 11(1):3, 2024.
[57] Zengzhi Wang, Qiming Xie, Yi Feng, Zixiang Ding, Zinong Yang, and Rui Xia. Is chatgpt a good sentiment analyzer? a preliminary study. arXiv preprint arXiv:2304.04339, 2023.
[58] Lany Laguna Maceda, Jennifer Laraya Llovido, Miles Biago Artiaga, and Mideth Balawiswis Abisado. Classifying sentiments on social media texts: A gpt-4 preliminary study. In Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval, pages 19–24, 2023.
[59] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[60] Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, and Hamed Haddadi. Melting point: Mobile evaluation of language transformers. arXiv preprint arXiv:2403.12844, 2024.
[61] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.
[62] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
[63] Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, and Yuxiong He. Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation. arXiv preprint arXiv:2303.08302, 2023.
[64] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
[65] G. Gerganov. llama.cpp: LLM inference in C/C++. (https://github.com/ggerganov/llama.cpp, accessed May. 26th, 2024), 2022.
[66] Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. Do emergent abilities exist in quantized large language models: An empirical study. arXiv preprint arXiv:2307.08072, 2023.
[67] Tommaso Pegolotti, Elias Frantar, Dan Alistarh, and Markus Püschel. Qigen: Generating efficient kernels for quantized inference on large language models. arXiv preprint arXiv:2307.03738, 2023.
[68] Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. A comprehensive evaluation of quantization strategies for large language models. arXiv preprint arXiv:2402.16775, 2024.
[69] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078, 2023.
[70] Zhuocheng Gong, Jiahao Liu, Jingang Wang, Xunliang Cai, Dongyan Zhao, and Rui Yan. What makes quantization for large language model hard? an empirical study from the lens of perturbation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18082–18089, 2024.
[71] Somnath Roy. Understanding the impact of post-training quantization on large language models. arXiv preprint arXiv:2309.05210, 2023.
[72] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[73] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[74] AI@Meta. Llama 3 model card. 2024.
[75] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
[76] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
[77] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
[78] Xibin Dong, Zhiwen Yu, Wenming Cao, Yifan Shi, and Qianli Ma. A survey on ensemble learning. Frontiers of Computer Science, 14:241–258, 2020.
[79] Michael V Reiss. Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv preprint arXiv:2304.11085, 2023.
[80] Yelp. Yelp Open Dataset, An all-purpose dataset for learning. (https://www.yelp.com/dataset, accessed Nov. 20th, 2023), 2022.
[81] Eman Saeed Alamoudi and Norah Saleh Alghamdi. Sentiment classification and aspect-based sentiment analysis on Yelp reviews using deep learning and word embeddings. Journal of Decision Systems, 30(2-3):259–281, 2021.
[82] Nabiha Asghar. Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362, 2016.
[83] Zefang Liu. Yelp review rating prediction: Machine learning and deep learning models. arXiv preprint arXiv:2012.06690, 2020.
[84] OpenAI. tiktoken: a fast BPE tokeniser for use with OpenAI’s models. (https://github.com/openai/tiktoken, accessed May. 26th, 2024), 2023.
[85] Fei Long, Kai Zhou, and Weihua Ou. Sentiment analysis of text based on bidirectional LSTM with multi-head attention. IEEE Access, 7:141960–141969, 2019.
[86] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[87] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. CodeGen2: Lessons for training LLMs on programming and natural languages. arXiv preprint arXiv:2305.02309, 2023.
[88] G Qin and Victor R Prybutok. Determinants of customer-perceived service quality in fast-food restaurants and their relationship to customer satisfaction and behavioral intentions. Quality Management Journal, 15(2):35–50, 2008.
[89] Hong Qin, Victor R Prybutok, and Qilan Zhao. Perceived service quality in fast-food restaurants: Empirical evidence from China. International Journal of Quality & Reliability Management, 27(4):424–437, 2010.
[90] Kisang Ryu, Heesup Han, and Tae-Hee Kim. The relationships among overall quick-casual restaurant image, perceived value, customer satisfaction, and behavioral intentions. International journal of hospitality management, 27(3):459–469, 2008.
[91] Puti Ara Zena and Aswin Dewanto Hadisumarto. The study of relationship among experiential marketing, service quality, customer satisfaction, and customer loyalty. ASEAN marketing journal, 4(1):4, 2012.
[92] Gerard Prendergast and Ho Wai Man. The influence of store image on store loyalty in Hong Kong’s quick service restaurant industry. Journal of Foodservice Business Research, 5(1):45–59, 2002.
[93] Ananthanarayanan Parasuraman, Valarie A Zeithaml, and Leonard L Berry. SERVQUAL: A multiple-item scale for measuring consumer perceptions of service quality. Journal of retailing, 64(1):12, 1988.
[94] Pete Stevens, Bonnie Knutson, and Mark Patton. DINESERV: A tool for measuring service quality in restaurants. Cornell hotel and restaurant administration quarterly, 36(2):56–60, 1995.
