¹ Rutgers University, New Jersey, USA. Email: {kalliopi.basioti, vladimir}@rutgers.edu
² Samsung AI Centre - Toronto, Toronto, Canada. Email: {m.abdelsalam, a.fazly}@samsung.com
³ Solventum. Email: [email protected]

CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation

Kalliopi Basioti*¹², Mohamed A. Abdelsalam², Federico Fancellu†³, Vladimir Pavlovic†¹, Afsaneh Fazly²

Abstract

* Work done during an internship at Samsung AI Centre - Toronto. † Work done while at Samsung AI Centre - Toronto.

Controllable Image Captioning (CIC) aims at generating natural language descriptions for an image, conditioned on information provided by end users, e.g., regions, entities or events of interest. However, available image–language datasets mainly contain captions that describe the entirety of an image, making them ineffective for training CIC models that can potentially attend to any subset of regions or relationships. To tackle this challenge, we propose a novel, fully automatic method to sample additional focused and visually grounded captions using a unified structured semantic representation built on top of the existing set of captions associated with an image. We leverage Abstract Meaning Representation (AMR), a cross-lingual graph-based semantic formalism, to encode all possible spatio-semantic relations between entities, beyond the typical spatial-relations-only focus of current methods. We use this Structured Semantic Augmentation (SSA) framework to augment existing image–caption datasets with the grounded controlled captions, increasing their spatial and semantic diversity and focal coverage. We then develop a new model, CIC-BART-SSA, specifically tailored for the CIC task, that sources its control signals from SSA-diversified datasets. We empirically show that, compared to SOTA CIC models, CIC-BART-SSA generates captions that are superior in diversity and text quality, are competitive in controllability, and, importantly, minimize the gap between broad and highly focused controlled captioning performance by efficiently generalizing to the challenging highly focused scenarios. Code is available at https://github.com/SamsungLabs/CIC-BART-SSA.

Keywords: Controllable Image Captioning · Vision Language Model · Abstract Meaning Representation

Figure 1: Existing captioning datasets contain captions that describe the entirety of an image. This is reflected in the narrow distributions of the entities that appear in those captions and the caption lengths (the red-colored histograms). CIC aims to generate diverse descriptions by controllably re-focusing on different spatiosemantic aspects of an image, such as the semantically coherent subsets of image objects. Our proposed CIC-BART-SSA is designed to produce diverse, controlled captions ranging from brief and concise to detailed and comprehensive. Sentences 1-15 are example outputs of our approach where the highlighted text indicates the focus of a controllable caption. The histograms demonstrate that our approach generates high-quality descriptions for a wider range of scene focus (number of visual entities) and caption length compared to the original captions. Image is licensed under CC BY-SA 2.0.

1 Introduction

Image captioning refers to the task of providing an AI system with an input image, and asking the system to describe the visual content in natural language. This process requires the captioning system to understand what objects are present, in what context (e.g., event or scene), and how they relate. Recent deep learning approaches to this task [44, 55, 32, 15, 27, 54, 43, 41, 34, 62] surpass human performance in standard image captioning metrics. However, these models tend to generate general captions that describe the entirety of an image, and are often of limited diversity; see Original Captions in Fig. 1.

Controllable image captioning (CIC) overcomes these challenges by generating different descriptions for the same image in a user-controlled fashion. That is, a CIC model receives as input an image paired with a user-specified control signal (e.g., entities or regions of interest), and generates a caption conditioned on the control signal. CIC models are thus capable of generating a diverse set of captions by varying the control signal for the same image; see CIC generated captions 1-15 in Fig. 1.

In realistic applications, the easiest way for the user to control the generation of captions is to limit the focus of the desired captions by selecting different entities (objects) using their bounding boxes, as shown in Figs. 1 and 2. Most previous work focuses on such spatial control signals [24, 19, 52, 30, 61, 60]. To improve performance, more recent studies supplement this spatial signal with additional information on the desired length, style, or syntactic and semantic structure of the generated text [14, 13], increasing the richness and complexity of control signals. However, for the CIC approach to succeed, the CIC models need to be trained on equally rich datasets that incorporate, explicitly or implicitly, those control signals. Unfortunately, most image captioning datasets today, such as Flickr30k [40] or MS-COCO [19], lack this necessary diversity of controls and corresponding captions.

Our goal is to achieve SOTA performance in CIC without the need for new, increasingly rich, yet also costly, and impractical-to-collect datasets, where human workers would face the burden of having to provide multitudes of control signals and corresponding descriptive captions. To achieve this goal, we propose a novel Structured Semantic Augmentation (SSA) method, which automatically generates an augmented set of captions and the corresponding control signals with diverse spatiosemantic focus starting from only the core set of “original” uncontrolled captions. The method takes advantage of a detailed visual-linguistic semantic graph (illustrated in Fig. 2) constructed from the original captions and their image groundings. To build these semantic graphs, we use Abstract Meaning Representation (AMR) [6], a semantic formalism that can capture fine-grained linguistic relations beyond the exclusively spatial relationships present in the common scene graphs [25]. The availability of robust AMR parsers [5, 10] allows us to generate semantic graphs for individual captions automatically, which we then merge into a rich meta-AMR graph for the joint image–language pair. From this meta-graph, we sample diverse connected subgraphs that represent semantically coherent combinations of image-anchored entities, events, and their relations, which we then turn into controlled captions automatically via existing AMR-to-text models [10]. Fig. 2 depicts an example of our meta-graph inferred from the original uncontrolled captions associated with an image. Filled nodes in the meta-graph indicate image entity groundings. Five semantically coherent subgraphs (a)–(e) of variable complexity are then sampled from the meta-graph, which are subsequently used to generate novel captions, shown below each subgraph. These new captions augment the original caption set by providing both image focus, through node groundings, and increased semantic diversity induced by the sampled subgraphs.

Figure 2: An example of our structured semantic augmentation approach. We start by using visually-grounded captions (1)-(5) to create a meta-vgAMR graph, which includes all available image information in one representation. We then sample sub-graphs from the meta-vgAMR to generate a new and diverse set of captions (such as sentences (a)-(e)). Our approach takes advantage of both linguistic and spatial diversity, with the latter creating descriptions for new combinations of visual entities. For instance, caption (a) focuses only on the ‘boat’, and captions (c) and (d) focus on the ‘dock’ and ‘house’, combinations that are not explored in the original captions. Image is licensed under CC BY-SA 2.0.

Building upon SSA, we introduce a new CIC model, CIC-BART, suitable for generating focused controlled captions. Alongside the regions of interest, CIC-BART also uses the length of the desired caption as a control signal, which serves as a proxy for the verbosity of the caption. CIC-BART can be trained on SSA-augmented versions of standard VL datasets such as MS-COCO or Flickr30k to accommodate the CIC task. Our experiments show that, compared to several SOTA models, the captions generated by our model have superior text quality and diversity, while being comparable in terms of faithfulness to control signals.

In summary, our contributions are:
1. We propose a novel data augmentation technique, SSA, that draws on a structured semantic formalism (AMR) to automatically generate focused captions suitable for training of CIC models. We empirically show that our SSA technique enables CIC models to generate captions with high controllability, diversity, and text quality.
2. We propose CIC-BART, a model designed for CIC, that does not require overly descriptive and complex control signals that SOTA models often require to achieve high performance. We show a superior overall performance, compared to SOTA, while relying on simple control signals (i.e., regions of interest and preferred caption length).
3. We present an extensive evaluation of our model, compared with existing SOTA. Specifically, we report results on different aspects of generated captions, including controllability (faithfulness to control signal), diversity, and text quality (linguistic well-formedness). To account for the trade-off among these metrics, we propose an overall performance score based on their harmonic mean. This metric helps us identify models that perform well in all these aspects.

2 Related Work

Controllable Image Captioning (CIC).

Various types of control have been used for CIC, including visual entities, a type of region-based control [24, 19, 52, 30, 61, 60], where generated captions should learn to focus on the regions of interest. Others draw on complex control signals where additional knowledge about the generated caption structure is provided. For example, some recent work provides the complete skeleton of the desired sentence in the form of a number of objects or attributes or object-relation-object templates [14, 13]. Additional control signals that CIC draws on include different caption styles, e.g., positive, negative, humorous, or romantic tone [52, 35, 59, 21, 22, 36, 50, 58], user personality [18, 45], or the length of the generated captions [20, 52, 23, 50, 53]. The use of complex control signals aims at improving the diversity of captions and the quality of the text in CIC models. However, it requires the users to provide a detailed description of the control signal, which is not realistic in practical settings where such models are to be deployed (e.g., a self-driving car or personal assistant). We instead draw on two simple control signals (regions of interest and desired caption length) and show that we can achieve competitive performance on CIC, while keeping the control signals simple and practical.

Recent SOTA models that draw on spatial control include the SCT model [19], which uses the Faster R-CNN feature vectors and object tags (corresponding GloVe vectors [39]) of the entities of interest, as well as models that include skeleton-based control, namely ASG2Caption [14] and VSR [13]. ASG2Caption uses an abstract scene graph (ASG) to express the desired structure of a caption. ASG contains three types of unlabeled abstract nodes (object, attribute, relationship) that are grounded in the image by extracting features from the corresponding bounding boxes (for objects and attributes) or from the union of bounding box pairs (for a relationship node). ASG2Caption shows improved controllability (by conditioning on ASGs), and diversity (by automatically sampling diverse ASGs as control signals). The VSR model [13] draws on GloVe embeddings of Faster R-CNN object tags for visual entities (as in SCT). It also uses a skeleton control signal (like ASG2Caption), but one that includes more detailed information and richer semantics. Specifically, the VSR control signal follows the form of a fine-grained PropBank entry (https://propbank.github.io), i.e., it specifies the exact verb(s) expressing the action(s) depicted in the image, and their visually grounded arguments (e.g., subject, object, location, manner). Thus, VSR uses the most descriptive control signal among the SOTA models. Refer to Sec. C.2 for an illustrative example of the control signal used for each method.

Compared to ASG2Caption and VSR, our control signal is kept minimal and only specifies the bounding boxes and desired caption lengths. To improve the diversity of captions, we draw on a structured semantic graph (AMR) that expresses the semantics of a sentence based on PropBank semantics. Notably, we do not use these rich graphs to express detailed and overly descriptive control signals (as in VSR), but we use these semantic structures to augment our training data with richer and diverse captions, which will result in the model learning to generate more diverse captions. Additionally, we include a length control signal to further increase diversity without needing to specify detailed information about the structure of the output (e.g., number of attributes per object, etc.). This way, we can generate a variety of captions for a fixed image sub-region by simply controlling the desired length of the output.

Abstract Meaning Representation (AMR).

AMR [7] is a rich semantic formalism for expressing the meaning of natural language sentences as a formal graph. AMR draws on PropBank, which is a rich lexical semantic resource encoding predicates expressing an action or state, as well as the number and nature of the participating entities (arguments and other semantic roles, such as location, manner, etc.). AMR is a widely researched semantic formalism for which highly accurate automatic Text-to-AMR and AMR-to-Text models have been developed [5, 10]. We rely on these models to augment original image–caption datasets with newly generated captions (as explained in Sect. 4).

AMRs vs. Scene Graphs.

Recent studies [57, 1, 16, 17] have shown that AMRs capture the semantic relations of an image better than scene graphs [12]. Existing scene graph annotations mainly capture geometric or possessive relations, which account for more than 90% of the relations captured, whereas more than $1/3$ of the captured entities refer to clothing, object, or body parts information [57, 1]. This difference is crucial for high-quality image captioning, as we use higher-level semantic relations in our everyday language rather than geometric ones. For instance, during a soccer game, we would probably describe a goal save as ‘the player kicks the ball away from the goal’ or ‘the goalkeeper defends his team by saving a goal’ and not by using mainly geometric and possessive relations like ‘a person wearing a white shirt, standing with his right leg lifted, close to a ball which is above the ground’. In Sec. D.7.2, we provide a detailed comparison of AMR and scene graph representations, particularly focusing on their applications in data augmentation.

3 Model

We propose CIC-BART, a model designed to generate controlled image captions. In particular, it can generate descriptions of particular areas within a scene with a desired level of detail. Our model, based on VL-BART [15], utilizes a transformer-based encoder-decoder architecture, as shown in Fig. 3. CIC-BART extends the VL-BART encoder to the CIC task by modifying the encoder input to include: a) a global image embedding that provides the context of the full image to the model; b) the visual control signal, including the visual embedding of the regions that contain the entities of interest; c) the text control signal, containing length control (indicating the desired length range of the output caption) and an optional verb signal that indicates the action we want the generated caption to concentrate on.

The visual embeddings of the regions are position-aware embeddings from a Faster R-CNN model [42] trained for visual object and attribute classification [3] on Visual Genome. The global image feature vector is also extracted from Faster R-CNN. For the length control signal, we add to our vocabulary $L$ tokens for the $L$ different caption length levels; for instance, level one represents sentences between one and nine words, and level two, ten to nineteen. These tokens describe our coarse levels. For finer control over sentence length, we accompany the tokens with the desired number of words. This choice gives our model the capacity to generate diverse captions for a particular length level. Finally, the decoder generates the desired controlled image caption.
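As a rough illustration of the two-level length control described above, the sketch below maps a desired word count to a coarse level token plus the exact count. The token format `<len_k>`, the fixed bin width of ten words, the default of four levels, and the function name are all assumptions made for illustration; the text only states that level one covers one to nine words and level two ten to nineteen.

```python
def length_control(num_words: int, num_levels: int = 4, bin_width: int = 10) -> list[str]:
    """Return [coarse level token, fine word count] for a desired caption length.

    Hypothetical encoding: level 1 covers 1-9 words, level 2 covers 10-19 words,
    and so on. Lengths beyond the last level are clamped to that level.
    """
    if num_words < 1:
        raise ValueError("caption length must be at least one word")
    level = min(num_words // bin_width + 1, num_levels)
    # The coarse token selects the length range; the count refines it.
    return [f"<len_{level}>", str(num_words)]


if __name__ == "__main__":
    for n in (5, 9, 10, 19, 45):
        print(n, length_control(n))
```

Keeping the coarse token and the exact count separate lets the same level be paired with several different word counts, which is one way to read the statement that the model can produce diverse captions within a level.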

4 Structured Semantic Augmentation (SSA)

The goal of our SSA method is to augment existing image captioning datasets with new focused captions along with their control signals (i.e., regions corresponding to entities). We rely on datasets where visual entities in the captions are annotated with their corresponding regions (see Sect. 5 for details on the datasets). The SSA process consists of four main steps, as described below. For more details, refer to Appendix B, which includes a step-by-step example of our SSA methodology.

Step 1: Image-level AMR graph generation.

Our objective in this stage is to enclose all the information available from the visually grounded captions into a single representation. To accomplish this, we create a visually grounded AMR graph (vgAMR) for each caption of an image and then merge them into a single image-level graph, the meta-vgAMR. To create the vgAMRs of an image, we first convert each of its $N$ captions to their AMR representation, using the Neural transition-based Text-to-AMR parser [5] which also aligns words in a caption with their respective nodes in the AMR graph. We utilize the alignment information of 1) caption words and AMR nodes (from Text-to-AMR parser) and 2) caption words to image bounding boxes (from existing dataset annotations) to visually ground the AMR nodes. After this step, we get the collection of nodes referring to visual entities, where each grounded meta-AMR node is linked with a non-empty set of bounding boxes. This extended representation, ‘AMR + visual grounded nodes’, is our vgAMR.
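The grounding step composes two alignments: caption words to AMR nodes, and caption words to bounding boxes. A minimal sketch of that composition follows. The dictionary-based data layout, the `Box` tuple format, and the function name are assumptions for illustration and do not reflect the parser's actual output format.

```python
from collections import defaultdict

Box = tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates


def ground_amr_nodes(word_to_node: dict[int, str],
                     word_to_boxes: dict[int, set[Box]]) -> dict[str, set[Box]]:
    """Link AMR nodes to bounding boxes through the caption words they share.

    word_to_node:  caption word index -> AMR node id (from the Text-to-AMR alignment)
    word_to_boxes: caption word index -> boxes (from the dataset's phrase groundings)
    Only nodes that end up with a non-empty box set are returned.
    """
    grounded: dict[str, set[Box]] = defaultdict(set)
    for word_idx, node_id in word_to_node.items():
        boxes = word_to_boxes.get(word_idx)
        if boxes:
            grounded[node_id] |= boxes
    return dict(grounded)


if __name__ == "__main__":
    # Toy caption: "A boat sits near the dock"
    word_to_node = {1: "z1", 2: "z0", 5: "z2"}          # boat, sit-01, dock
    word_to_boxes = {1: {(10, 20, 110, 220)}, 5: {(0, 150, 300, 240)}}
    print(ground_amr_nodes(word_to_node, word_to_boxes))
```

In this toy example the predicate node `z0` stays ungrounded because its word has no box annotation, which matches the idea that only nodes referring to visual entities carry groundings.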

Our next step is to combine the $N$ vgAMRs to form a single meta-vgAMR. To achieve this, we employ a pairwise strategy to merge the most similar vgAMRs first (we measure similarity with Smatch score [11]). We use the UPGMA hierarchical clustering algorithm [33, 37] to find the optimal merge ordering starting from the most similar graphs. UPGMA creates a hierarchy where the bottom level consists of the $N$ individual vgAMRs. By merging all vgAMRs using the UPGMA ordering, we obtain a single structure called meta-vgAMR.
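To make the merge ordering concrete, the following sketch runs UPGMA (average-linkage agglomerative clustering) on a pairwise similarity matrix and returns the sequence of merges. It assumes the Smatch scores have already been computed and lie in [0, 1], and that distance is taken as one minus similarity; both the conversion and the function name are illustrative assumptions.

```python
Cluster = tuple[int, ...]


def upgma_merge_order(similarity: list[list[float]]) -> list[tuple[Cluster, Cluster]]:
    """Return the UPGMA merge order for graphs indexed 0..N-1.

    similarity[i][j] is an assumed precomputed Smatch score between graphs i and j.
    Each step merges the two clusters with the smallest average pairwise distance.
    """
    clusters: list[Cluster] = [(i,) for i in range(len(similarity))]

    def distance(a: Cluster, b: Cluster) -> float:
        # UPGMA: unweighted mean of all member-to-member distances.
        total = sum(1.0 - similarity[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    order: list[tuple[Cluster, Cluster]] = []
    while len(clusters) > 1:
        candidates = [(distance(a, b), a, b)
                      for idx, a in enumerate(clusters)
                      for b in clusters[idx + 1:]]
        _, a, b = min(candidates)
        order.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return order


if __name__ == "__main__":
    smatch = [[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.5],
              [0.3, 0.5, 1.0]]
    for left, right in upgma_merge_order(smatch):
        print("merge", left, "with", right)
```

For larger collections, SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` implements the same average-linkage scheme on a condensed distance matrix.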

When merging two vgAMRs, the main challenge is identifying which nodes correspond to the same concepts, such as entities, attributes, actions, and relations. We use three node properties to accomplish this: a) visual grounding information, b) semantic similarity of node labels, and c) node neighborhood semantic similarity. We derive two node-merging criteria from these properties: 1) visually grounded entity nodes are merged if they point to the same image bounding boxes. When 1) does not hold, we check the second criterion: 2) for the remaining non-grounded nodes, including AMR-specific nodes, predicates, adjectives, and adverbs, we use a combination of node label semantic similarity (cosine similarity of the labels using their GloVe embeddings) and neighborhood similarity. Neighborhood similarity examines the similarity of parents for adjective/adverb nodes and children for predicate nodes, along with the similarity of connecting edge roles. When two nodes satisfy criterion 1) or 2), we merge them into a single node. Moreover, if they have different labels, we maintain both names by keeping a list of synonyms to increase representation diversity. In Appendix B (Fig. 8), we have included the flow diagram depicting the process of merging two nodes corresponding to the same concepts.
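The two merging criteria can be sketched as a small decision function. This is a simplified reading under several assumptions: "same bounding boxes" is treated as exact set equality, the neighborhood is summarized by a single precomputed vector, both similarities must pass a threshold, and the threshold value of 0.7 is a placeholder. None of these specifics are given in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    labels: list                    # one or more synonym labels
    boxes: frozenset = frozenset()  # grounded bounding boxes (empty if ungrounded)
    label_vec: tuple = ()           # label embedding, e.g. a GloVe vector
    neighbor_vec: tuple = ()        # hypothetical summary embedding of the neighborhood


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_merge(a: Node, b: Node, label_thr: float = 0.7, neighbor_thr: float = 0.7) -> bool:
    if a.boxes or b.boxes:
        # Criterion 1: grounded nodes merge only when they point to the same boxes.
        return bool(a.boxes) and a.boxes == b.boxes
    # Criterion 2: ungrounded nodes need similar labels and similar neighborhoods.
    return (cosine(a.label_vec, b.label_vec) >= label_thr
            and cosine(a.neighbor_vec, b.neighbor_vec) >= neighbor_thr)


def merge(a: Node, b: Node) -> Node:
    # Keep every distinct label as a synonym of the merged node.
    return Node(sorted(set(a.labels) | set(b.labels)), a.boxes | b.boxes,
                a.label_vec, a.neighbor_vec)


if __name__ == "__main__":
    box = frozenset({(10, 20, 110, 220)})
    male, person = Node(["male"], box), Node(["person"], box)
    if should_merge(male, person):
        print(merge(male, person).labels)
```

The example mirrors the case where two captions name the same region differently: the nodes collapse into one, and both labels survive as synonyms.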

In the special case when the two vgAMRs describe two totally different concepts, and hence they have no common nodes, we add an AMR-specific node called ‘multi-sentence’ as the root with the two independent vgAMRs as its children. The final graph, meta-vgAMR, includes all non-redundant elements of the original $N$ captions while preserving the visual grounding between the meta nodes and their respective image regions. (On non-redundancy: a node may have different names for the same bounding box in different meta-vgAMRs, such as ‘A male’ and ‘A person’. According to criterion 1), we merge the corresponding AMR nodes and keep both ‘male’ and ‘person’ in the names list to avoid redundancy. Therefore, criteria 1) and 2) ensure that multiple nodes don’t describe the same concept in the meta-vgAMR.)

Remark: Meta-vgAMR efficiently compresses all available image information into a single structure. Following our approach, we can easily scale when new scene information becomes available by applying our pairwise merge procedure.

Step 2: Event-based graph sampling from image-level AMRs.

We start from the predicate nodes, which mainly correspond to verbs, to sample subgraphs from the meta-vgAMR graphs. Predicate nodes are identified by their label and the edges connected to them. The label of a predicate node typically follows the format ‘predicate_name-xx’, where ‘xx’ identifies the sense in which the word is used. Predicate nodes have outgoing ARGn edges, where ‘n’ takes values from 0 to 5, connecting them to their arguments; each edge defines a particular semantic role (e.g., ARG0 points to the agent, ARG1 to the patient, etc.). We sample subgraphs from these nodes by following the outgoing argument edges. Finally, we add one more subgraph containing the remaining children branches of other non-ARG optional predicate edges (e.g., ‘location’, ‘time’). We repeat this process until the leaves of the graphs are reached. During sampling, we randomly select one of the synonyms if a node carries a list of synonym labels, as mentioned in the previous step. The output of this step is a set of event-focused subgraphs. In Fig. 2, we can see some instances of our event-based sampling (SSA samples), where the predicate nodes include z0/sit-01, z7/dock-01, z13/sit-01, and so on. (Note that in Fig. 2, the node z17/green-03 is also categorized as a predicate. This may seem like an error because we usually think of ‘green’ as an attribute node rather than a predicate. However, in AMRs, when ‘green’ is paired with its argument, in this case z16/lawn, it encapsulates a predicate/verb that can be expressed in natural language as ‘the lawn is green’.) Although we cannot show all the sampled event-based subgraphs in the figure, we included five of them and used colored roots and edges for visualization purposes.
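One simplified reading of the event-based sampling is sketched below: predicate nodes are detected by the `name-xx` label pattern, and a subgraph is collected from each predicate by following its core ARG edges, optionally including the non-ARG edges as well. The adjacency-list layout, the toy graph, and the choice to follow every edge below the root are assumptions for illustration, not the exact procedure.

```python
import re

PREDICATE = re.compile(r".+-\d{2}$")      # e.g. sit-01, dock-01
CORE_ROLE = re.compile(r":?ARG[0-5]$")    # ARG0 ... ARG5

Edge = tuple[str, str, str]  # (parent id, role, child id)


def is_predicate(label: str) -> bool:
    return bool(PREDICATE.match(label))


def event_subgraph(graph: dict[str, list[tuple[str, str]]], root: str,
                   include_optional: bool = False) -> list[Edge]:
    """Collect the edges reachable from a predicate root.

    At the root, only core ARG edges are followed unless include_optional is set;
    below the root, every edge is followed until the leaves are reached.
    """
    edges: list[Edge] = []
    stack, seen = [root], {root}
    while stack:
        node = stack.pop()
        for role, child in graph.get(node, []):
            if node == root and not include_optional and not CORE_ROLE.match(role):
                continue
            edges.append((node, role, child))
            if child not in seen:  # guard against re-entrant nodes
                seen.add(child)
                stack.append(child)
    return edges


if __name__ == "__main__":
    # Toy graph, not the one in Fig. 2.
    labels = {"s": "sit-01", "b": "boat", "d": "dock", "g": "green-03", "l": "lawn"}
    graph = {"s": [("ARG1", "b"), ("location", "d")], "g": [("ARG1", "l")]}
    for node_id, label in labels.items():
        if is_predicate(label):
            print(label, event_subgraph(graph, node_id))
```

Calling `event_subgraph(graph, "s", include_optional=True)` additionally returns the `location` branch, corresponding to the extra subgraph built from the optional edges.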

Step 3: New caption generation from sampled AMRs

We use the SPRING AMR-to-Text model [9] to generate new event-focused captions from the sampled vgAMR subgraphs. Because both the vgAMR merging and sampling steps introduce noise, the output captions are not always of good quality. We automatically filter low-quality captions by using a linguistic well-formedness measure, GRUEN [63], which is a reference-free metric based on BERT contextual embeddings. In Sec. D.7, we provide examples of original dataset captions and their SSA augmentations, along with their GRUEN scores.

Step 4: Control signal generation.

The last step is to create the control signal for the generated captions. The spatial control signal for a specific caption is extracted from the corresponding sampled vgAMR by collecting the bounding boxes of the AMR nodes that are linked to visual entities.
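A minimal sketch of this extraction, assuming the sampled subgraph is given as a collection of node ids and the grounding as a mapping from node id to a set of boxes (both layouts and the function name are hypothetical):

```python
Box = tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates


def spatial_control(subgraph_nodes: list[str],
                    grounding: dict[str, set[Box]]) -> list[Box]:
    """Gather the boxes of all grounded nodes in a sampled subgraph.

    Ungrounded nodes (predicates, attributes, and so on) contribute nothing.
    The result is sorted and de-duplicated for reproducibility.
    """
    boxes: set[Box] = set()
    for node_id in subgraph_nodes:
        boxes |= grounding.get(node_id, set())
    return sorted(boxes)


if __name__ == "__main__":
    grounding = {"z1": {(10, 20, 110, 220)}, "z2": {(0, 150, 300, 240)}}
    print(spatial_control(["z0", "z1"], grounding))
```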

4.1 Mixing Strategies of Original and SSA Data

To analyze the impact of our SSA data, we explore various strategies for mixing it with the original training set. Assume ${\mathcal{D}}$ represents the training control-caption pairs in the original dataset, containing $N_{\mathcal{D}}$ samples, and $SSA$ represents our SSA samples, containing $N_{SSA}$ instances. The augmented dataset ${\mathcal{D}}_{SSA}$ is defined by combining ${\mathcal{D}}$ and $SSA$: ${\mathcal{D}}_{SSA}=\operatorname{sam}_{\mathcal{D}}(\tau_{\mathcal{D}},p_{\mathcal{D}})\cup\operatorname{sam}_{SSA}(\tau_{SSA},p_{SSA})$, where the function $\operatorname{sam}_{\mathcal{D}}$ samples a subset of the original dataset, and $\operatorname{sam}_{SSA}$ samples a subset of our SSA data. Each function takes a sampling strategy $\tau$ and a strategy-specific parameter $p$. Since we are interested in the effect of our SSA, we assume that $\operatorname{sam}_{\mathcal{D}}(\tau_{\mathcal{D}},p_{\mathcal{D}})=\mathcal{D}$, with $\tau_{\mathcal{D}}=\text{`Random Sampling Strategy'}$ and $p_{\mathcal{D}}=100\%$, meaning that all original data are included in the mixed dataset. Depending on the $\operatorname{sam}_{SSA}$ parameter $\tau_{SSA}$, we distinguish the following cases:

Random Sampling Strategy.

In this case, we randomly select a pre-specified number of examples from $SSA$. The parameter $p_{SSA}$ expresses the percentage of SSA samples included in ${\mathcal{D}}_{SSA}$. The boundary cases are $p_{SSA}=100\%$ (all $N_{SSA}$ samples are included) and $p_{SSA}=0\%$ (no SSA data are added).

Uniform-Coverage Sampling Strategy.

To mitigate the original dataset’s bias (having mainly samples describing the entire image), we aim to create a new focus-unbiased dataset. We model the focus of a control signal as the percentage of the image area covered by its bounding boxes, and split the original data into $B$ coverage bins. We then randomly add SSA samples to each bin, aiming to create a new uniform, coverage-unbiased ${\mathcal{D}}_{SSA}$ dataset. Here, $p_{SSA}$ contains the range of each bin for the coverage histogram. For example, in the case where we choose ten uniform coverage bins, we have $p_{SSA}=\{[0\%,10\%),[10\%,20\%),\dotsc,[90\%,100\%]\}$.
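The uniform-coverage strategy can be sketched in two parts: computing the fraction of the image covered by the union of the control boxes, and topping up under-populated coverage bins with SSA samples. The fill target (the size of the largest original bin), the handling of bins with too few SSA candidates, and all function names are assumptions, since the text does not specify them.

```python
import random

Box = tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) pixel coordinates


def coverage(boxes: list[Box], width: int, height: int) -> float:
    """Fraction of the image covered by the union of the boxes."""
    xs = sorted({x for x1, _, x2, _ in boxes for x in (x1, x2)})
    ys = sorted({y for _, y1, _, y2 in boxes for y in (y1, y2)})
    area = 0
    # Coordinate compression: each grid cell is fully inside or outside every box.
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            if any(x1 <= xa and xb <= x2 and y1 <= ya and yb <= y2
                   for x1, y1, x2, y2 in boxes):
                area += (xb - xa) * (yb - ya)
    return area / (width * height)


def bin_index(cov: float, num_bins: int = 10) -> int:
    # The last bin is closed on the right, so 100% coverage falls into it.
    return min(int(cov * num_bins), num_bins - 1)


def uniform_coverage_mix(original_cov: list[float], ssa_cov: list[float],
                         num_bins: int = 10, seed: int = 0) -> list[int]:
    """Return indices of SSA samples to add so bin counts move toward uniform."""
    rng = random.Random(seed)
    counts = [0] * num_bins
    for cov in original_cov:
        counts[bin_index(cov, num_bins)] += 1
    target = max(counts)  # assumed target: match the fullest original bin
    pools: list[list[int]] = [[] for _ in range(num_bins)]
    for idx, cov in enumerate(ssa_cov):
        pools[bin_index(cov, num_bins)].append(idx)
    chosen: list[int] = []
    for b in range(num_bins):
        need = min(target - counts[b], len(pools[b]))
        chosen += rng.sample(pools[b], need)
    return sorted(chosen)


if __name__ == "__main__":
    print(coverage([(0, 0, 50, 100), (25, 0, 75, 100)], 100, 100))
    print(uniform_coverage_mix([0.95, 0.92, 0.97, 0.15], [0.12, 0.18, 0.11, 0.55, 0.93]))
```

Using the union area rather than the sum of box areas avoids counting overlapping regions twice, so coverage never exceeds 100%.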

We present results from the Random Sampling Strategy for $p_{SSA}=0\%$ and $p_{SSA}=100\%$ in the main paper. Results from other scenarios can be found in Sec. D.2.

5 Experimental Setup

5.1 Data

We use Flickr30k Entities (Flickr-Ent) [40] and MS-COCO Entities (COCO-Ent) [19] for training and evaluation. Flickr-Ent augments the original captions of Flickr30k [56] with manually-annotated region–phrase groundings. Flickr-Ent contains the original $31K$ images annotated with five captions each. COCO-Ent augments the original MS-COCO [29] ($120K$ images, each annotated with around five captions) with semi-automatically collected grounding annotations; see [19] for details on the annotation process. For both datasets, we follow previous work and use the training and test splits by Karpathy et al. [26]. We apply our SSA algorithm on the aforementioned datasets to create their augmented variations, COCO-Ent-SSA and Flickr-Ent-SSA, containing about $800K$ and $250K$ training captions, respectively, of which $33\%$ and $37\%$ are generated by our SSA algorithm.

For all four training sets, we automatically generate image–control–caption triplets to train our model on. For spatial control, we extract from the grounded captions the bounding boxes of the entities of interest using the annotations from COCO-Ent and Flickr-Ent. Note that since these datasets do not contain the text control signal, we use each ground-truth caption as a proxy for a controlled caption, from which we first generate the coarse- and fine-length control levels and then extract their verbs using part-of-speech tagging for the optional action control.

5.2 Models and Evaluation Metrics

We compare two variations of our model (with and without SSA augmentations) against the following SOTA baselines: Show Control & Tell (SCT) [19], which uses region-based control (bounding boxes of visual entities of interest); ASG2Caption (ASG) [14], which draws on visually grounded abstract scene graphs as the control signal; VSR [13], which uses overly descriptive control signals that express the verb(s) and fine-grained verb-specific semantic roles of the desired captions; ComPro [53], which learns a mapping from the bounding boxes of the entities of interest and the caption length level to GPT-2 Large prompts, aiming to retrieve controlled captions; and LaBERT [20], a length-control-only model.

We report the performance of our models and baselines using a comprehensive set of metrics that evaluate different aspects of caption controllability, diversity, and quality. We also propose and report an overall performance metric that summarizes these different aspects in a meaningful way. In particular, for diversity, we measure n-gram diversity D-1, D-2 [4] and self-CIDEr (sC) [51] metrics. For content controllability, we have developed an extended version of the IoU [19] and further analyzed its performance by introducing the Hallucinating Nouns (Hal) metric. Both metrics are thoroughly discussed in Sec. C.1. For length controllability, we measure mean absolute error (L) and length precision (LP) [20]. We assess the generated text quality using the GRUEN (G) [63] metric. We determine the overall performance using the harmonic mean of the IoU, G, and sC metrics. A higher score indicates better performance. The harmonic mean ($H$) helps us determine the model with the best overall performance since it prioritizes models that perform well across all metrics while penalizing those with poor performance, even in one metric. Sec. C.1 provides details for each evaluation metric. Further, in Sec. D.5 we include comparisons of the CIC models on standard captioning metrics (like CIDEr [49] and SPICE [2]). Note that standard captioning metrics are not sufficient for evaluating CIC, as they compare a generated (controlled) caption with a reference ground-truth caption, ignoring the desired effect of the control signal. We report them for completeness, but we believe that our well-formedness metric (GRUEN) is more suited for evaluating the quality of the controlled captions.
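The overall score can be illustrated with Python's standard `statistics.harmonic_mean`. The sketch below is only a worked example of the definition given above; the function name `overall_score` is ours, and the inputs are the COCO-Ent values reported for SCT and VSR in Tab. 1.

```python
from statistics import harmonic_mean, mean


def overall_score(iou: float, gruen: float, self_cider: float) -> float:
    """Harmonic mean H of content controllability, text quality, and diversity."""
    return harmonic_mean([iou, gruen, self_cider])


if __name__ == "__main__":
    # COCO-Ent values from Tab. 1: (IoU, G, sC)
    sct = (67.3, 64.4, 42.8)
    vsr = (77.6, 39.0, 67.4)
    print(f"SCT: H = {overall_score(*sct):.1f}, arithmetic mean = {mean(sct):.1f}")
    print(f"VSR: H = {overall_score(*vsr):.1f}, arithmetic mean = {mean(vsr):.1f}")
```

For VSR, the arithmetic mean of the three metrics is about 61.3, while the harmonic mean is 56.2: the low text-quality score pulls $H$ down more strongly than it would pull a plain average, which is the penalizing behavior described above.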

Table 1: Performance of CIC models based on content controllability (IoU), text quality (G), and diversity (sC, D-1, D-2), and their harmonic mean (H). For our models, we also report length controllability (L); baseline models (SCT, ASG, VSR) do not include this type of control. All models are evaluated on the original Flickr-Ent and COCO-Ent test sets. * The ASG-type dataset is not released for Flickr-Ent; therefore, we could not reproduce the ASG results on that dataset.

	COCO-Ent							Flickr-Ent
Model	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	D-1 $\uparrow$	D-2 $\uparrow$	L $\downarrow$	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	D-1 $\uparrow$	D-2 $\uparrow$	L $\downarrow$
SCT [19]	55.8	67.3	64.4	42.8	27.0	35.5	-	54.6	50.7	79.8	44.0	29.3	36.5	-
ASG* [14]	74.2	72.6	72.0	78.3	37.8	56.6	-	-	-	-	-	-	-	-
VSR [13]	56.2	77.6	39.0	67.4	30.0	42.2	-	62.5	60.2	54.0	77.9	33.3	49.3	-
CIC-BART	75.9	76.2	73.0	78.7	38.0	56.2	0.49	69.8	54.0	85.0	78.6	43.6	58.2	1.24
CIC-BART-SSA	78.3	77.2	74.8	82.5	44.6	63.2	0.11	71.3	55.0	86.0	81.7	47.0	62.6	1.05

Table 2: Performance of CIC models based on content controllability (IoU), text quality (G), and diversity (sC, D-1, D-2), and their harmonic mean (H). All models are evaluated on the Flickr-Ent and COCO-Ent SSA-only test set samples.

	COCO-Ent (SSA only)						Flickr-Ent (SSA only)
Model	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	D-1 $\uparrow$	D-2 $\uparrow$	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	D-1 $\uparrow$	D-2 $\uparrow$
SCT [19]	51.7	62.1	64.8	37.8	23.7	31.0	43.9	29.9	77.3	45.7	31.0	36.7
CIC-BART	69.2	61.4	73.9	74.0	44.2	57.0	68.5	53.0	80.5	79.8	52.9	62.9
CIC-BART-SSA	75.6	65.2	80.7	83.7	53.8	67.8	72.0	55.6	82.9	86.1	56.5	69.3

6 Results

6.1 Overall Performance

We first compare the overall performance of our models with the SCT, ASG, and VSR baselines with respect to controllable captioning metrics. We do not include ComPro in this comparison due to the unavailability of the codebase. We also exclude LaBERT since this model solely focuses on length controllability.

Tab. 1 presents results for content and length controllability (IoU and L), text quality (G), and diversity (sC, D-1, D-2), as well as the harmonic mean (H). For both datasets, CIC-BART-SSA has the best performance in all metrics except IoU, where it is the second best. Specifically, CIC-BART-SSA is superior to all other models with respect to diversity (sC, D-1, D-2) and text quality (G), and comparable to VSR in terms of content controllability (IoU). The length controllability (L) scores show that our SSA augmentation helps the model learn to generate high-quality output at the desired length (compare CIC-BART and CIC-BART-SSA). This is due to the increased diversity in caption length provided by our SSA augmentations.

Importantly, model performance can vary depending on the metric. For example, whereas VSR has the highest IoU, it falls behind in text quality and diversity; our qualitative analysis confirms the poor quality of the captions generated by VSR. The best-performing model should therefore be identified based on the $H$ score, which summarizes content controllability, text quality, and diversity in a single number. Based on this score, CIC-BART-SSA is better than the SCT and VSR baselines by a large margin, and notably better than ASG. Moreover, ASG requires complex control signals in the form of scene graphs, in contrast to the simple control signal requirements of CIC-BART.

Next, we further evaluate CIC performance on our SSA samples from the test set images of COCO-Ent and Flickr-Ent. We present the results in Tab. 2 for our models, CIC-BART and CIC-BART-SSA, and for SCT. ASG and VSR are excluded since they need complex control signals (grounded abstract scene graphs for ASG and grounded verb semantic roles for VSR), which are available only for the original COCO-Ent and Flickr-Ent data and not for our SSA samples. We observe a significant improvement in overall performance for our CIC-BART-SSA model. The models trained on the original datasets (SCT and CIC-BART), whose captions describe the entire image, had difficulties generalizing to cases where they had to focus on a specific sub-region of an image. In contrast, our CIC-BART-SSA model was able to generate focused and diverse descriptions for the challenging, highly focused examples present in our SSA data.

Table 3: Length Precision (LP) for CIC models on COCO-Ent and Flickr-Ent original test sets.

Model	LP $\uparrow$ (COCO-Ent)	LP $\uparrow$ (Flickr-Ent)
ComPro[53]	94.7	81.4
LaBERT[20]	99.7	98.4
CIC-BART	99.9	88.0
CIC-BART-SSA	99.9	91.3

We compare the length precision of our model with the baselines that use length control in Tab. 3. LaBERT uses only length-control signals without spatial control, while our model employs both spatial and length-control signals to generate focused captions. This makes the LaBERT task much easier, since it only has to generate descriptions of a specific length for an image. Our model, on the other hand, has to generate captions that describe only a specific sub-region of the scene while maintaining a desired description length level. Although our task is more challenging than LaBERT's, we achieve competitive length precision. Lastly, we emphasize that the improvement in length controllability (L) and length precision (LP) from CIC-BART to CIC-BART-SSA stems from the increased length diversity found in our SSA augmentations, which enriches the original COCO-Ent and Flickr-Ent datasets. In Sec. 0.D.1, we provide an analysis of the caption length statistics in the original datasets and our SSA-derived captions.

6.2 Effect of SSA on Content Controllability

To analyze the impact of our SSA augmentations, we measure the content controllability (IoU) performance of CIC-BART at different levels of focus of the control signals and report it in Fig. 4. We use coverage, defined as the area of the image enclosed by the bounding boxes of the entities of interest in the control signal, to quantify that focus. For example, highly focused control signals cover a small area, yielding low coverage, while broader signals cover a larger area and have high coverage. We break down the IoU performance into 10 coverage bands and report the average IoU over the control signals in each band. In addition, the ‘Samples’ curve shows the distribution of test captions over the same bands. The results in Fig. 4 indicate that training with SSA (blue bars) significantly improves spatial controllability in the low-coverage regime, where the control signals are highly focused. Interestingly, these are also the most underrepresented (data-deprived) parts of the original Flickr-Ent dataset. Therefore, SSA, which enriches the original datasets with highly focused examples (refer to Sec. 0.D.1 for the percentage of samples per coverage band in the training sets), is effective in improving generalization performance in CIC.
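The coverage-band breakdown can be sketched in a few lines of Python. This is only an illustration under our own assumptions: coverage is taken as the union area of the control-signal boxes divided by the image area, the 10 bands are assumed to be of equal width, and all function names are hypothetical.

```python
from collections import defaultdict


def coverage(boxes, image_width, image_height):
    """Fraction of the image covered by the union of boxes given as (x0, y0, x1, y1)."""
    xs = sorted({x for x0, _, x1, _ in boxes for x in (x0, x1)})
    ys = sorted({y for _, y0, _, y1 in boxes for y in (y0, y1)})
    covered = 0.0
    # Coordinate compression: each grid cell is either fully inside or fully outside the union.
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            cx, cy = (xa + xb) / 2, (ya + yb) / 2
            if any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in boxes):
                covered += (xb - xa) * (yb - ya)
    return covered / (image_width * image_height)


def coverage_band(cov, num_bands=10):
    """Index of the equal-width band that a coverage value in [0, 1] falls into."""
    return min(int(cov * num_bands), num_bands - 1)


def mean_iou_per_band(samples, num_bands=10):
    """Average IoU per band for an iterable of (coverage, iou) pairs."""
    per_band = defaultdict(list)
    for cov, iou in samples:
        per_band[coverage_band(cov, num_bands)].append(iou)
    return {band: sum(v) / len(v) for band, v in sorted(per_band.items())}


# Two overlapping 50x50 boxes in a 100x100 image (made-up values).
cov = coverage([(0, 0, 50, 50), (25, 25, 75, 75)], 100, 100)
print(cov, coverage_band(cov))
```

Using the union rather than the sum of box areas avoids counting overlapping entities twice, which matters for control signals whose entities are close together.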

6.3 Qualitative Analysis

In Fig. 5, we present qualitative examples from the original test sets, and in Fig. 6 examples from our SSA (only) test set control signals. In the two figures, each highlighted word found in the generated controlled captions corresponds to the control entity of the same color. This shows the match between the captions produced and the control signal. We also strike through the parts where the model hallucinates or generates redundant references to the entities of interest.

We observe that our models outperform the previous state-of-the-art models by substantially enhancing the quality of the generated controlled captions. This behavior is expected from our quantitative analysis, which showed that our models have significantly higher text quality (G). More importantly, our CIC-BART-SSA model generates captions that are faithful to the control signal and that better capture the relationships connecting the entities of interest. We include additional qualitative samples in Sec. 0.D.6.

7 Conclusions

We address two main challenges faced by controllable image captioning (CIC) models. First, standard image–caption datasets lack the controllability and diversity needed for proper training and evaluation of CIC. Second, most recent SOTA models require complex and overly descriptive control signals as input (including, e.g., the main action/verb to appear in the generated caption). To address the first challenge, we propose a novel technique that draws on a structured semantic augmentation (SSA) formalism to generate focused captions and the corresponding control signals for images. For the second challenge, we propose a transformer-based vision-language model attuned to the CIC task. We show that this model performs competitively with SOTA models without requiring complex and explicit control signals. Importantly, when combined with our SSA approach, our model generates highly diverse captions and significantly reduces the content controllability performance gap between the different levels of focus of the generated controlled captions. Finally, when provided with the commonly used verb guidance of other SOTA approaches, our model shows a substantial improvement in performance.

References

[1] Abdelsalam, M.A., Shi, Z., Fancellu, F., Basioti, K., Bhatt, D., Pavlovic, V., Fazly, A.: Visual semantic parsing: From images to abstract meaning representation. In: Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL). pp. 282–300 (2022)
[2] Anderson, P., Fernando, B., Johnson, M., Gould, S.: Spice: Semantic propositional image caption evaluation. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. pp. 382–398. Springer (2016)
[3] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
[4] Aneja, J., Agrawal, H., Batra, D., Schwing, A.: Sequential latent spaces for modeling the intention during diverse image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4261–4270 (2019)
[5] Astudillo, R.F., Ballesteros, M., Naseem, T., Blodgett, A., Florian, R.: Transition-based parsing with stack-transformers. arXiv preprint arXiv:2010.10669 (2020)
[6] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., Schneider, N.: Abstract meaning representation (amr) 1.0 specification. In: Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle: ACL. pp. 1533–1544 (2012)
[7] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M., Schneider, N.: Abstract meaning representation for sembanking. In: Proceedings of the 7th linguistic annotation workshop and interoperability with discourse. pp. 178–186 (2013)
[8] Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)
[9] Bevilacqua, M., Blloshmi, R., Navigli, R.: One spring to rule them both: Symmetric amr semantic parsing and generation without a complex pipeline. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 12564–12573 (2021)
[10] Blloshmi, R., Bevilacqua, M., Fabiano, E., Caruso, V., Navigli, R.: Spring goes online: end-to-end AMR parsing and generation. In: Proceedings of the 2021 conference on empirical methods in natural language processing: system demonstrations. pp. 134–142 (2021)
[11] Cai, S., Knight, K.: Smatch: an evaluation metric for semantic feature structures. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 748–752 (2013)
[12] Chang, X., Ren, P., Xu, P., Li, Z., Chen, X., Hauptmann, A.: A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), 1–26 (2021)
[13] Chen, L., Jiang, Z., Xiao, J., Liu, W.: Human-like controllable image captioning with verb-specific semantic roles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16846–16856 (2021)
[14] Chen, S., Jin, Q., Wang, P., Wu, Q.: Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9962–9971 (2020)
[15] Cho, J., Lei, J., Tan, H., Bansal, M.: Unifying vision-and-language tasks via text generation. In: International Conference on Machine Learning. pp. 1931–1942. PMLR (2021)
[16] Choi, W.S., Heo, Y.J., Punithan, D., Zhang, B.T.: Scene graph parsing via abstract meaning representation in pre-trained language models. In: NAACL 2022 Workshop on Deep Learning on Graphs for Natural Language Processing (2022)
[17] Choi, W.S., Heo, Y.J., Zhang, B.T.: Sgram: Improving scene graph parsing via abstract meaning representation. arXiv preprint arXiv:2210.08675 (2022)
[18] Chunseong Park, C., Kim, B., Kim, G.: Attend to you: Personalized image captioning with context sequence memory networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 895–903 (2017)
[19] Cornia, M., Baraldi, L., Cucchiara, R.: Show, control and tell: A framework for generating controllable and grounded captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8307–8316 (2019)
[20] Deng, C., Ding, N., Tan, M., Wu, Q.: Length-controllable image captioning. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16. pp. 712–729. Springer (2020)
[21] Gan, C., Gan, Z., He, X., Gao, J., Deng, L.: Stylenet: Generating attractive visual captions with styles. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3137–3146 (2017)
[22] Guo, L., Liu, J., Yao, P., Li, J., Lu, H.: Mscap: Multi-style image captioning with unpaired stylized text. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4199–4208 (2019)
[23] Hirsch, E., Tal, A.: Clid: Controlled-length image descriptions with limited data. arXiv preprint arXiv:2211.14835 (2022)
[24] Johnson, J., Karpathy, A., Fei-Fei, L.: Densecap: Fully convolutional localization networks for dense captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4565–4574 (2016)
[25] Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3668–3678 (2015)
[26] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
[27] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. pp. 121–137. Springer (2020)
[28] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
[29] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
[30] Lindh, A., Ross, R.J., Kelleher, J.D.: Language-driven region pointer advancement for controllable image captioning. arXiv preprint arXiv:2011.14901 (2020)
[31] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265 (2019)
[32] Lu, J., Yang, J., Batra, D., Parikh, D.: Neural baby talk. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7219–7228 (2018)
[33] Lukasová, A.: Hierarchical agglomerative clustering procedure. Pattern Recognition 11(5-6), 365–381 (1979)
[34] Luo, J., Li, Y., Pan, Y., Yao, T., Feng, J., Chao, H., Mei, T.: Semantic-conditional diffusion networks for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23359–23368 (2023)
[35] Mathews, A., Xie, L., He, X.: Senticap: Generating image descriptions with sentiments. In: Proceedings of the AAAI conference on artificial intelligence. vol. 30 (2016)
[36] Mathews, A., Xie, L., He, X.: Semstyle: Learning to generate stylised image captions using unaligned text. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8591–8600 (2018)
[37] Müllner, D.: Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378 (2011)
[38] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
[39] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 1532–1543 (2014)
[40] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)
[41] Ramos, R., Martins, B., Elliott, D., Kementchedjhieva, Y.: Smallcap: lightweight image captioning prompted with retrieval augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2840–2849 (2023)
[42] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
[43] Ren, Y., Mao, Z., Fang, S., Lu, Y., He, T., Du, H., Zhang, Y., Ouyang, W.: Crossing the gap: Domain generalization for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2871–2880 (2023)
[44] Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7008–7024 (2017)
[45] Shuster, K., Humeau, S., Hu, H., Bordes, A., Weston, J.: Engaging image captioning via personality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12516–12526 (2019)
[46] Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 human language technology conference of the north american chapter of the association for computational linguistics. pp. 252–259 (2003)
[47] Toutanova, K., Manning, C.D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora. pp. 63–70 (2000)
[48] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[49] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4566–4575 (2015)
[50] Wang, N., Xie, J., Wu, J., Jia, M., Li, L.: Controllable image captioning via prompting. In: AAAI (2023)
[51] Wang, Q., Chan, A.B.: Describing like humans: on diversity in image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4195–4203 (2019)
[52] Wang, T., Zhang, J., Fei, J., Ge, Y., Zheng, H., Tang, Y., Li, Z., Gao, M., Zhao, S., Shan, Y., et al.: Caption anything: Interactive image description with diverse multimodal controls. arXiv preprint arXiv:2305.02677 (2023)
[53] Wang, Z., Xiao, J., Chen, L., Gao, F., Shao, J., Chen, L.: Learning combinatorial prompts for universal controllable image captioning. arXiv preprint arXiv:2303.06338 (2023)
[54] Xia, Q., Huang, H., Duan, N., Zhang, D., Ji, L., Sui, Z., Cui, E., Bharti, T., Zhou, M.: Xgpt: Cross-modal generative pre-training for image captioning. In: Natural Language Processing and Chinese Computing: 10th CCF International Conference, NLPCC 2021, Qingdao, China, October 13–17, 2021, Proceedings, Part I 10. pp. 786–797. Springer (2021)
[55] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. pp. 2048–2057. PMLR (2015)
[56] Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 67–78 (2014)
[57] Zellers, R., Yatskar, M., Thomson, S., Choi, Y.: Neural motifs: Scene graph parsing with global context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5831–5840 (2018)
[58] Zeng, Z., Zhang, H., Lu, R., Wang, D., Chen, B., Wang, Z.: Conzic: Controllable zero-shot image captioning by sampling-based polishing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23465–23476 (2023)
[59] Zhao, W., Wu, X., Zhang, X.: Memcap: Memorizing style knowledge for image captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 12984–12992 (2020)
[60] Zhao, Y., Wei, J., Lin, Z., Sun, Y., Zhang, M., Zhang, M.: Visual spatial description: Controlled spatial-oriented image-to-text generation. arXiv preprint arXiv:2210.11109 (2022)
[61] Zheng, Y., Li, Y., Wang, S.: Intention oriented image captions with guiding objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8395–8404 (2019)
[62] Zhong, Y., Wang, L., Chen, J., Yu, D., Li, Y.: Comprehensive image captioning via scene graph decomposition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. pp. 211–229. Springer (2020)
[63] Zhu, W., Bhat, S.: Gruen for evaluating linguistic quality of generated text. arXiv preprint arXiv:2010.02498 (2020)

Appendix 0.A Overview

In the appendix, we provide more details on our SSA methodology in Appendix 0.B and on our experimental set-up in Appendix 0.C. Finally, in Appendix 0.D we provide extended qualitative and quantitative results of our experiments and ablations. Specifically, we focus on:

• Dataset statistics before and after SSA augmentation in Sec. 0.D.1.
• Impact of mixing original and SSA captions in Sec. 0.D.2.
• Effects of SSA on content controllability in Sec. 0.D.3.
• SSA-induced diversity in Sec. 0.D.4.
• Standard captioning performance of CIC models in Sec. 0.D.5.
• Qualitative comparisons in Sec. 0.D.6.
• Comparison of SSA and alternative augmentation strategies, with attention to LLM-based paraphrasing and scene-graph-based methods, in Sec. 0.D.7.

Algorithm 1 meta-vgAMR Graph Construction

1: Input: An image $I$ with $N$ human-generated, visually grounded captions; we denote the visually grounded entities of each caption as $\{G^{en}_{i}\}_{i=1}^{N}$;
2: Output: The meta-vgAMR graph, $\mathcal{A}^{vg}_{Meta}$, of the $N$ captions;
3: Initialize: Generate the individual AMR graphs $\{\mathcal{A}_{i}\}_{i=1}^{N}$ for each image caption using a pre-trained Text-to-AMR semantic parser with (AMR node–caption word) alignment; construct the vgAMRs, $\mathcal{A}^{vg}=\{\mathcal{A}^{vg}_{i}\}_{i=1}^{N}$, using the visual grounding annotations and (AMR node–caption word) alignment;
4: Compute $D=1-\text{SmatchScore}(\mathcal{A}^{vg})$; $\triangleright$ A symmetric $N\times N$ AMR graph distance matrix between all $\mathcal{A}^{vg}_{i},\mathcal{A}^{vg}_{j}$ pairs.
5: bottomUpHCs = UPGMA$(D)$; $\triangleright$ Bottom-up hierarchical clusters; each cluster contains two vgAMR graphs.
6: for $(\mathcal{A}^{vg}_{i},\mathcal{A}^{vg}_{j})$ in bottomUpHCs do $\triangleright$ Following the bottom-up hierarchy, pair-wise merge the vgAMRs of each cluster.
7:     $\mathcal{A}^{vg}_{i}=(\mathcal{N}_{i},\mathcal{E}_{i})$; $\mathcal{A}^{vg}_{j}=(\mathcal{N}_{j},\mathcal{E}_{j})$ $\triangleright$ The nodes and edges of each vgAMR graph.
8:     Initialize $\mathcal{A}^{vg}_{m}=(\mathcal{N}_{m},\mathcal{E}_{m})$ as a null graph;
9:     $\mathcal{N}_{\text{common}}$ = getCommonNodes$(\mathcal{A}^{vg}_{i},\mathcal{A}^{vg}_{j})$; $\triangleright$ Returns the common nodes between the two vgAMR graphs.
10:     if $\mathcal{N}_{\text{common}}$ is empty then $\triangleright$ The two vgAMRs have no overlapping information.
11:         $\mathcal{N}_{m}=\mathcal{N}_{i}\cup\mathcal{N}_{j}\cup\mathcal{N}_{\text{multi-sentence}}$; $\triangleright$ Introduce a new, AMR-specific “multi-sentence” node to be the root of the merged graph. This node will connect the two disjoint vgAMR graphs.
12:     else
13:         $\mathcal{N}^{\prime}_{i}=\mathcal{N}_{i}\setminus\mathcal{N}_{\text{common}}$; $\mathcal{N}^{\prime}_{j}=\mathcal{N}_{j}\setminus\mathcal{N}_{\text{common}}$
14:         $\mathcal{N}_{m}=\mathcal{N}_{\text{common}}\cup\mathcal{N}^{\prime}_{i}\cup\mathcal{N}^{\prime}_{j}$
15:     $\mathcal{E}_{m}=$ getConnectingEdges$(\mathcal{A}^{vg}_{i},\mathcal{A}^{vg}_{j},\mathcal{N}_{m})$
16:     $\mathcal{A}^{vg}$.remove$(\mathcal{A}^{vg}_{i},\mathcal{A}^{vg}_{j})$
17:     $\mathcal{A}^{vg}$.add$(\mathcal{A}^{vg}_{m})$
18: $\mathcal{A}^{vg}_{Meta}=\mathcal{A}^{vg}$ $\triangleright$ All $N$ vgAMRs are merged into one representation.
19: return $\mathcal{A}^{vg}_{Meta}$


Appendix 0.B Structured Semantic Augmentation (SSA)

In this section, we provide additional information on the SSA augmentation strategy introduced in the main paper. We summarize the steps for constructing the meta-vgAMR graph (as described in Step 1: Image-level AMR graph generation in the main paper) in Algorithm 1. We include a detailed example of our SSA methodology in Fig. 7. In Fig. 8, we present the flow diagram of the process that determines whether two nodes from different vgAMRs refer to the same concept and should therefore be merged.

SSA Algorithm.

To construct the hierarchical clusters, we use the UPGMA algorithm, which treats each individual vgAMR as a separate cluster at Level 0. At each level, two clusters are merged based on their distance, starting with the most similar graphs. To measure similarity, we use the Smatch score between two vgAMR graphs. Since the Smatch score ranges from 0 to 1, we use $1-\text{Smatch}$ as the distance metric for the UPGMA algorithm. In the example of Fig. 7, the AMR graphs of captions (2) and (3) are the most similar, so they are merged first to create their joint vgAMR graph at Level 1. Every graph from Levels 1 to 4 results from steps 6-17 of Algorithm 1, where we merge two graphs from lower levels according to the hierarchical clusters computed by UPGMA. The final (Level 4) graph is our meta-vgAMR graph, which contains all information from the original vgAMRs and, thus, from the available original captions. By applying our event-focused sampling approach, we can generate novel, focused, visually grounded descriptions from this new structure. Some examples can be seen at the bottom of Fig. 7, along with the resulting captions generated by pre-trained AMR-to-Text models.
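The merge order produced by UPGMA can be sketched in plain Python. This is a minimal illustration, not the released implementation: the Smatch scores below are made up for five captions, the function name `upgma_merge_order` is ours, and cluster distance is the average of all pairwise $1-\text{Smatch}$ distances (the average-linkage rule that defines UPGMA).

```python
def upgma_merge_order(dist):
    """Return the bottom-up merge sequence [(cluster_a, cluster_b, distance), ...].

    dist is a symmetric N x N matrix of pairwise distances. Leaves have ids
    0..N-1; the cluster created by the k-th merge gets id N + k.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member leaves
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # UPGMA: unweighted average over all cross-cluster leaf pairs
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Made-up Smatch scores for five captions; captions (2) and (3) are indices 1 and 2.
smatch = [
    [1.0, 0.4, 0.5, 0.2, 0.1],
    [0.4, 1.0, 0.8, 0.3, 0.2],
    [0.5, 0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 0.3, 1.0, 0.6],
    [0.1, 0.2, 0.2, 0.6, 1.0],
]
distance = [[1.0 - s for s in row] for row in smatch]
for level, (a, b, d) in enumerate(upgma_merge_order(distance), start=1):
    print(f"Level {level}: merge clusters {a} and {b}")
```

With five captions there are four merges, which corresponds to Levels 1 to 4 in the description above; each merge would trigger one pass through the loop body of Algorithm 1.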

Finding same-concept nodes in two vgAMR graphs.

As the flow diagram in Fig. 8 shows, when we merge two vgAMR graphs, vgAMR-A and vgAMR-B, we need to identify the common concept nodes between the two representations and combine them. This merging process serves two purposes: a) it allows for a more efficient and compressed representation by reducing redundancies and eliminating multiple nodes for the same concept, and b) it consolidates all available information about a particular concept found in different captions. For example, in Fig. 14 c) for the top player, one caption may describe her clothing, another her physical characteristics, and yet another her actions (for instance, practicing martial arts). Despite the differences, all these captions have a common concept: the person in the picture. Instead of having three separate nodes with partial information, we aim to create a single node (person) that consolidates all available information about the person in the image, making it easier to explore all connected nodes and access the complete information.

The process for identifying common nodes involves the steps outlined in Fig. 8. We start by checking if the two nodes are AMR-specific nodes of type AND. If these nodes are found at the root of the graph, it indicates that the corresponding sentence follows the format ‘FACT-1 AND FACT-2’. In this case, we can merge them as they represent aggregated facts about the image. If they are not root nodes, we need to be more cautious and ensure that they originate from the same concept. For this reason, we assess the similarity of their neighboring nodes. If they link to the same nodes, we merge them and combine the provided facts.

As shown in Fig. 8, if both nodes are visually grounded and refer to the same visual entity (i.e., if they have the same bounding boxes), we merge the nodes only after verifying that their names are synonyms. (We determine whether two nodes are synonyms by comparing the cosine similarity of their GloVe embeddings; if the cosine similarity is above a certain threshold, we consider them synonyms.) This additional condition is helpful in cases where a) the original dataset visually grounds phrases instead of nouns, and b) there is noise from the Text-to-AMR parser. This check ensures that entity attributes (such as ‘young’ and ‘tall’), which may be visually grounded, are not mistakenly merged with noun nodes. Finally, if the two nodes are semantically distant or do not refer to the same visual entity, or if one of them is grounded and the other is not, we conclude that the two nodes cannot be merged.

When we do not have visual cues to help us identify similar concept nodes, we rely on the names of the nodes and their surrounding information to make decisions. If two nodes are synonyms, we look at how similar their neighbors are (e.g., if the two nodes are nouns or adjectives, do they share the same parent?). If they do, we merge them as similar concept nodes.

In our final step, we have an additional procedure for predicate nodes. In our experiments, we observed that the GloVe embeddings of predicate/verb words tend to be more distant. Therefore, in the last step, if the two nodes are predicates, and their child nodes (ARG0, ARG1, and so on) are the same, and the similarity of their names is above a certain threshold (which is smaller than the thresholds used in the previous steps), then we merge the two predicate nodes. This concludes our node merge process.
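The decision procedure described in this subsection can be summarized as a small Python sketch. It is a simplified reading of the text, not the flow diagram of Fig. 8 itself: the `Node` structure, the two threshold values, the toy similarity table, and the reduction of "similar neighbors" and "same child nodes" to equality of a single `neighbors` set are all our own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Tuple

Box = Tuple[int, int, int, int]
SYNONYM_THRESHOLD = 0.7    # hypothetical value; the text only says "a certain threshold"
PREDICATE_THRESHOLD = 0.5  # hypothetical value; the text says it is smaller than the others


@dataclass(frozen=True)
class Node:
    name: str
    box: Optional[Box] = None                 # bounding box if visually grounded
    is_root: bool = False
    is_predicate: bool = False
    neighbors: FrozenSet[str] = frozenset()   # names of linked nodes (parents/children)


def should_merge(a: Node, b: Node, similarity: Callable[[str, str], float]) -> bool:
    # AMR-specific AND nodes: merge at the root, otherwise require the same links.
    if a.name == "and" and b.name == "and":
        if a.is_root and b.is_root:
            return True
        return a.neighbors == b.neighbors
    # One node grounded and the other not: never merge.
    if (a.box is None) != (b.box is None):
        return False
    score = similarity(a.name, b.name)
    # Both grounded: same visual entity and synonymous names.
    if a.box is not None:
        return a.box == b.box and score >= SYNONYM_THRESHOLD
    # No visual cues: rely on names plus surrounding nodes.
    if a.neighbors != b.neighbors:
        return False
    if score >= SYNONYM_THRESHOLD:
        return True
    # Predicates with matching arguments use the lower threshold.
    return a.is_predicate and b.is_predicate and score >= PREDICATE_THRESHOLD


# Toy stand-in for cosine similarity of GloVe embeddings (made-up numbers).
TOY_SIM = {
    frozenset(("woman", "lady")): 0.8,
    frozenset(("woman", "young")): 0.3,
    frozenset(("kick-01", "strike-01")): 0.6,
}


def toy_similarity(x: str, y: str) -> float:
    return 1.0 if x == y else TOY_SIM.get(frozenset((x, y)), 0.0)


box = (10, 20, 110, 220)
print(should_merge(Node("woman", box), Node("lady", box), toy_similarity))
print(should_merge(Node("woman", box), Node("young", box), toy_similarity))
```

In this sketch the grounded attribute `young` is kept separate from the noun `woman` even though both point to the same box, which is the case the synonym check is meant to guard against.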

Appendix 0.C Experimental Setup

0.C.1 Evaluation Metrics

Content Controllability: IoU.

To measure content controllability we design an extended version of the IoU metric of [19] that calculates the degree-of-match (faithfulness) between a control signal and the corresponding generated caption. For our control signal, we use the set of nouns $\mathcal{E}$ that represent the entities of interest, which are the names of the visual objects in the control. To extract nouns from the predicted sentences, we use the Stanford part-of-speech tagger [47, 46]. We then find the semantic intersection of the two sets using Hungarian Matching, as in [19]. Finally, we calculate the semantic intersection over union of the control nouns and the nouns extracted from the controllable caption, which gives us our content controllability IoU.

In particular, we measure content controllability using the IoU (the overlap between two sets) of the set of nouns in the control signal and the set of nouns in the generated sentence.

• For the set of control signal nouns $\mathcal{E}$ in the COCO-Ent test set, we use the existing annotations. For each caption, the head noun of a noun chunk is provided. We use the set of head nouns as our $\mathcal{E}$.
• For Flickr-Ent this information is not available. We use the object labels from Faster R-CNN to get the control signal nouns.

Our IoU metric is based on the corresponding score in [19]. We modify the following parts:

• Ground truth nouns: Instead of using the ground truth captions as a proxy, we directly extract control signal nouns from the control signal itself, as described above.
• Generated sentence nouns: For the generated controlled sentence, instead of checking whether each word is in the dictionary of nouns prepared by [19], we use part-of-speech tagging [47, 46] to extract the nouns of the sentence. We use this approach because the provided dictionary, although it contains many nouns, is not a complete list. In many cases during evaluation, nouns were discarded because they had no entry in the dictionary, which added noise to the original metric.

Our next steps are as described in [19]: we perform Hungarian matching of the two sets of nouns using the cosine similarity of the corresponding GloVe embeddings of each noun. The final IoU is the sum of cosine similarities for the aligned nouns.

The advantage of our IoU score over the one proposed in [19] is that it compares the control signal directly with the generated sentence, instead of with the dataset ground truth sentences, which are only a proxy for the entities in the control signal. This reduces the metric noise, since our IoU is not affected by annotation errors (for example, ground truth captions where not all entities are annotated/grounded to a bounding box, which leads to a noisy proxy of the control signal) or by missing entries in the noun dictionary used in [19].
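As a rough illustration of the matching step, the sketch below computes the semantic intersection and an intersection-over-union score for two noun sets. It is not the authors' implementation and rests on several assumptions: the two-dimensional vectors are toy stand-ins for GloVe embeddings, the optimal one-to-one assignment is found by brute force instead of the Hungarian algorithm (both reach the same optimum, but brute force is only practical for small sets), and normalizing by |E| + |N| minus the intersection is one assumed way to form the union. The function names are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_intersection(control, generated, emb):
    """Sum of cosine similarities under the best one-to-one noun alignment.

    Brute-force stand-in for Hungarian matching: every noun of the smaller
    set is paired with a different noun of the larger set.
    """
    small, large = sorted((list(control), list(generated)), key=len)
    return max(
        sum(cosine(emb[a], emb[b]) for a, b in zip(small, candidate))
        for candidate in permutations(large, len(small))
    )


def semantic_iou(control, generated, emb):
    """Assumed normalization: intersection / (|E| + |N| - intersection)."""
    inter = semantic_intersection(control, generated, emb)
    union = len(control) + len(generated) - inter
    return inter / union if union else 0.0


# Toy embeddings (hypothetical values, not GloVe).
emb = {"boy": (1.0, 0.0), "cake": (0.0, 1.0), "table": (0.6, 0.8)}
score = semantic_iou(["boy", "cake"], ["boy", "table"], emb)
```

In this toy case "boy" aligns with "boy" (similarity 1.0) and "cake" with "table" (similarity 0.8), so the caption is rewarded for a near-synonym but less than for an exact match.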

Content Controllability: Hallucinations.

We propose the Hallucinating Nouns (Hal) content controllability metric to quantify the hallucinations present in the generated captions. Hallucinations are nouns (visual entities) that are not part of the control signal: they may be visual objects present in the image but outside the focus of the control signal, or visual objects that are not present in the image at all. The metric is computed as:

$$\mathrm{Hal}=\frac{1}{|\mathcal{N}|}\left(|\mathcal{N}|-\text{IoU}(\mathcal{N},\mathcal{E})\right), \tag{1}$$

where $\mathcal{N}$ is the set of nouns extracted from the generated controlled caption and $\mathcal{E}$ is the set of nouns (visual entities) in the control signal.
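To make Eq. (1) concrete, here is a minimal sketch that evaluates it from the number of generated nouns and the matched similarity score. It assumes that IoU($\mathcal{N}$, $\mathcal{E}$) in Eq. (1) is the sum of cosine similarities of the aligned nouns, as stated for the final IoU above; the function name and the handling of captions without nouns are illustrative assumptions.

```python
def hal(num_generated_nouns: int, matched_score: float) -> float:
    """Eq. (1): share of generated nouns not accounted for by the control signal."""
    if num_generated_nouns == 0:
        # Assumption: a caption with no nouns has nothing to hallucinate.
        return 0.0
    return (num_generated_nouns - matched_score) / num_generated_nouns


# Four generated nouns, three of which match control nouns exactly
# (similarity 1.0 each), leave one unmatched noun out of four.
example = hal(4, 3.0)
```

With four generated nouns and a matched score of 3.0, one quarter of the caption's nouns are counted as hallucinated.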

Diversity.

To measure diversity, we compute n-gram diversity, D-$n$ for $n=1,\,2$ [4], as well as self-CIDEr-based diversity (sC) [51]. D-$n$ measures the ratio of distinct $n$-grams to the total number of words generated per set of diverse captions. sC computes the diversity of a set of captions by using their CIDEr score [49], a metric that measures sentence similarity by giving more weight to the matching of novel words. For a fair comparison of the different CIC models, we measure diversity over the five generated captions for each test image (in COCO-Ent and Flickr-Ent) and report the average. Note that not all images in COCO-Ent and Flickr-Ent have five caption–control signal pairs, especially in COCO-Ent, which is automatically annotated. We only consider the images with five available pairs for diversity evaluation: $985$ images for Flickr-Ent and $112$ images for COCO-Ent.

Best-5 Diversity.

For completeness, we compute the best-5 diversity proposed in [14]. Specifically, we randomly generate $M=10$ control signals for a given image. From the resulting $M$ captions, we form all possible sets of $5$ captions ($M$ choose $5$) and measure the ratio of distinct $n$-grams to the total number of words for each set. We report the average of the best D-$n$ scores over all images in the test set.
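The two n-gram diversity scores can be sketched as follows. This is an illustration under assumptions, not the evaluation code used in the paper: captions are lowercased and split on whitespace (the actual tokenization is not specified here), and the function names are hypothetical.

```python
from itertools import combinations


def distinct_n(captions, n):
    """D-n: ratio of distinct n-grams to the total number of words in a caption set."""
    tokenized = [caption.lower().split() for caption in captions]
    total_words = sum(len(tokens) for tokens in tokenized)
    ngrams = {
        tuple(tokens[i:i + n])
        for tokens in tokenized
        for i in range(len(tokens) - n + 1)
    }
    return len(ngrams) / total_words if total_words else 0.0


def best_k_diversity(captions, n, k=5):
    """Highest D-n over all k-caption subsets; needs at least k captions."""
    return max(distinct_n(subset, n) for subset in combinations(captions, k))


captions = ["a boy eats cake", "a boy holds cake"]
d1 = distinct_n(captions, 1)  # 5 distinct unigrams over 8 words
d2 = distinct_n(captions, 2)  # 5 distinct bigrams over 8 words
```

For $M=10$ captions, `best_k_diversity` scores 252 subsets per image, which is cheap enough to enumerate exhaustively.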

Length Controllability.

For length controllability (L), we measure the Mean Absolute Error (MAE) between the fine length control (number of words) and the size of the resulting $M=10$ controlled captions, which are generated from $M$ randomly created control signals. We also calculate the length precision (LP) [20] by determining the percentage of generated captions that match the desired coarse length level.
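A minimal sketch of the two length scores is given below. It assumes whitespace word counts and takes the mapping from word count to coarse level as a parameter (`level_of`), since the level boundaries depend on the model being evaluated; both function names are hypothetical.

```python
def length_mae(target_lengths, captions):
    """Mean absolute error between requested and produced word counts."""
    errors = [
        abs(target - len(caption.split()))
        for target, caption in zip(target_lengths, captions)
    ]
    return sum(errors) / len(errors)


def length_precision(target_levels, captions, level_of):
    """Percentage of captions whose coarse length level equals the requested one."""
    hits = sum(
        level_of(len(caption.split())) == level
        for level, caption in zip(target_levels, captions)
    )
    return 100.0 * hits / len(captions)


captions = ["a boy eats cake", "a dog"]
mae = length_mae([4, 3], captions)  # errors are 0 and 1
```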

Text Quality.

We assess the text quality of generated captions using GRUEN (G) [63], a reference-free metric based on BERT contextual embeddings that measures the syntactic and semantic well-formedness of a text segment.

Overall Performance using Harmonic Means.

Finally, we measure the overall performance of each model based on its ability to balance content controllability, diversity, and text quality. To calculate this, we use the harmonic mean of IoU, G, and sC. All of these metrics range between 0 and 1, with a higher value indicating better performance. The harmonic mean ( $H$ ) helps us determine the model with the best overall performance. It prioritizes models that perform well across all metrics while penalizing those with poor performance, even in one metric.
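The harmonic mean can be checked with a few lines of code. The sketch below uses the IoU, G, and sC values reported for the Flickr-Ent test set in Table 4 (on the percentage scale used in the tables; the harmonic mean scales linearly, so the choice of scale does not affect the ranking). The function name and the zero-score convention are assumptions.

```python
def harmonic_mean(scores):
    """Harmonic mean of positive scores; defined as 0.0 if any score is not positive."""
    if any(score <= 0 for score in scores):
        return 0.0
    return len(scores) / sum(1.0 / score for score in scores)


# IoU, G, sC for the 100% random-sampling row of Table 4.
h_full = round(harmonic_mean([55.0, 86.0, 81.7]), 1)

# A model that is strong on two metrics but weak on the third is pulled down
# much further than an arithmetic mean would suggest.
h_unbalanced = harmonic_mean([90.0, 90.0, 10.0])
```

The first call reproduces the reported $H$ of 71.3, while the unbalanced example evaluates to roughly 24.5 even though its arithmetic mean is above 63.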

Standard Captioning Metrics.

Following prior work, we also report performance with respect to standard captioning metrics, namely Bleu-4 (B4) [38], Meteor (M) [8], Rouge (R) [28], CIDEr (C) [49] and Spice (S) [2]. Specifically, Bleu measures the n-gram similarity between a generated and a reference sentence without accounting for synonymy, which is addressed by Meteor. Rouge estimates the recall of their largest common subphrase. CIDEr gives more weight to the matching of novel words, and Spice computes semantic similarity by comparing the scene graph representations of the two sentences. For all these measures, higher is better. Note that these metrics are not sufficient for evaluating CIC, as they compare a generated (controlled) caption with a reference ground-truth caption, ignoring the desired effect of the control signal. We report them for completeness, but we believe that our well-formedness metric (GRUEN) is better suited for evaluating the quality of the controlled captions.

0.C.2 CIC-BART-SSA and Baselines Setup

SSA Parameters.

For our AMR-to-Text generated sentence filtering, we set the GRUEN (G) threshold to 0.7.

Model Parameters.

We initialize CIC-BART encoder and decoder from the pre-trained weights of VL-BART [15] to benefit from transfer learning. We further train our model on data that contains image–control–caption triplets, where control consists of the above-mentioned signals. We incorporate five different length levels to control the length of our output. Each level has a specific range, with level A ranging from one to nine words, level B spanning ten to nineteen words, level C covering twenty to twenty-nine words, level D consisting of thirty to thirty-nine words, and level E including sentences with forty or more words. In our CIC-BART vocabulary, we have added five tokens to represent these five caption length levels. For optimizing the cross-entropy loss, we utilize the RAdam optimizer [31] with a learning rate of $5\cdot 10^{-5}$ and a batch size of 80. We train our models for 20 epochs and select our trained model based on the best content controllability IoU and CIDEr scores.
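The five length levels translate directly into a lookup from word count to level. The sketch below follows the ranges given above; the token strings are hypothetical placeholders, since the actual names of the five added vocabulary tokens are not stated.

```python
# Hypothetical token names; the real vocabulary entries are not specified.
LEVEL_TOKENS = ["<len_A>", "<len_B>", "<len_C>", "<len_D>", "<len_E>"]


def length_level(num_words):
    """Map a caption length to level A (1-9), B (10-19), C (20-29), D (30-39), E (40+)."""
    if num_words < 1:
        raise ValueError("a caption must contain at least one word")
    return "ABCDE"[min(num_words // 10, 4)]


def length_token(num_words):
    """Control token for the level that contains the given word count."""
    return LEVEL_TOKENS["ABCDE".index(length_level(num_words))]


token = length_token(len("a boy holds a slice of cake in a kitchen".split()))
```

Integer division by ten works because every level except the last spans one decade of word counts; the `min` clamps everything from forty words upward into level E.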

Baseline Models.

We conducted evaluations for metrics such as content controllability (IoU), text quality (G), and diversity (D-1, D-2 and sC) for SCT and VSR, using the code and pre-trained checkpoints available on their official project GitHub pages. However, for ASG, we re-trained and evaluated the ASG2Caption model for COCO-Ent using the official GitHub codebase since the pre-trained checkpoints were not available. Unfortunately, we could not train the ASG2Caption model on the Flickr-Ent dataset as the ASG dataset for Flickr-Ent has not been released. For the standard captioning metrics, best-5 diversity, and length precision of the testing sets of the datasets COCO-Ent and Flickr-Ent, we used the values presented in the corresponding papers.

We note that, in our main paper, we used the strongest ComPro model, which employs GPT-2 Large. This model has a total of 881M parameters, of which 107M belong to the mapping network and 774M to GPT-2. In contrast, our models (CIC-BART, CIC-BART-SSA, and CIC-BART +verb) use only 140M parameters, making them more than six times smaller than ComPro with GPT-2 Large.

In Fig. 9, we present an example with the control signals used by the baselines and our models for a specific instance where we need a focused caption on the boy (bbox 1) and the cake (bbox 2, bbox 3). The SCT model [19] uses bounding boxes of entities of interest and GloVe embeddings of their Faster R-CNN labels as control signals. The VSR model [13] adds ground truth caption verbs and their PropBank grounded verb semantic roles to the SCT control signal. The ASG model [14] employs abstract scene graphs as control signals that provide information about how visual entities are related or connected and how many attributes they have. Our models (CIC-BART and CIC-BART-SSA) use only the bounding boxes of interest and the desired caption length level as control signals. We have also explored the use of ground truth verbs in the control signal, as in VSR, in our CIC-BART +verb model. However, unlike VSR, we only use the ground truth verb name and not its PropBank grounded semantic roles.

Appendix 0.D Results

0.D.1 Original and SSA Augmented Datasets Analysis

In Figs. 10 and 11, we present the coverage and caption length statistics of the COCO-Ent and Flickr-Ent training sets and their derived SSA augmentations, respectively. When analyzing the scene coverage based on the control signal, it becomes apparent that the original datasets predominantly feature samples that describe the entire image (high coverage), with very few focusing on a small portion of the scene (highly focused control signals, low coverage). This is particularly evident in the COCO-Ent dataset, where examples with focused control signals are scarce. For the caption length statistics, we notice that the COCO-Ent dataset is far from diverse, with approximately 84% of the captions having 8-12 words. Similarly, in the Flickr-Ent dataset, approximately 65% of the descriptions have 8-16 words.

With our SSA data (blue bars), we augment the original datasets (gray bars) with highly focused control–caption pairs and diverse caption length, to construct a new dataset of spatially and linguistically diverse data for controllable image captioning, namely the COCO-Ent-SSA and Flickr-Ent-SSA datasets.

0.D.2 Original and SSA captions Mixtures

In this section, we delve deeper into the impact of our SSA augmentation on CIC models. To conduct our experiments, we utilize our mixing methodology described in our main paper (Section 4.1), to create various versions of COCO-Ent-SSA and Flickr-Ent-SSA datasets. We aim to examine the effect of our SSA examples, so we include all original samples in the mixed dataset $\mathcal{D}_{SSA}$ . Formally, we state that $\operatorname{sam}_{\mathcal{D}}(\tau_{\mathcal{D}},p_{\mathcal{D}})=\mathcal{D}$ .

We conducted six experiments for our augmentation function $\operatorname{sam}_{SSA}$ . In the first scenario, we randomly sampled $x\%$ of the SSA samples. This is equivalent to the ‘Random Sampling Strategy’ (R). To test the impact of parameter $p_{SSA}$ , we experimented with the following percentages: 0% (no SSA samples); 25% (all original and a random 25% of the generated SSA samples); 50%; 75%; and 100% (all original and all generated SSA samples). Additionally, we conducted an experiment using the ‘Uniform Coverage Sampling Strategy’ (U) for $\tau_{SSA}$ , with $p_{SSA}$ set to ten (10) uniform coverage bins ( $p_{SSA}=\{[0\%,10\%),[10\%,20\%),\dotsc,[90\%,100\%]\}$ ). We note that the ‘Uniform Coverage Sampling Strategy’ contains approximately the same number of SSA samples as the 50% random sampling strategy.
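The two mixing strategies can be sketched as below. The details of the sampling functions are defined in the main paper and are not reproduced here, so this is only one plausible reading: random sampling keeps a fixed fraction of the SSA samples, and uniform coverage sampling draws the same number of samples from every non-empty coverage bin (an assumption). All function names, and the representation of coverage as a value in [0, 1], are hypothetical.

```python
import random


def random_mix(original, ssa, fraction, seed=0):
    """All original samples plus a random fraction of the SSA samples."""
    rng = random.Random(seed)
    k = round(fraction * len(ssa))
    return list(original) + rng.sample(list(ssa), k)


def coverage_bin(coverage, num_bins=10):
    """Index of the coverage bin; the last bin is closed, i.e. [90%, 100%]."""
    return min(int(coverage * num_bins), num_bins - 1)


def uniform_coverage_mix(original, ssa, coverage_of, num_bins=10, seed=0):
    """All original samples plus equally many SSA samples per non-empty bin.

    Assumption: the per-bin quota is the size of the smallest non-empty bin.
    """
    rng = random.Random(seed)
    bins = {}
    for sample in ssa:
        bins.setdefault(coverage_bin(coverage_of(sample), num_bins), []).append(sample)
    if not bins:
        return list(original)
    quota = min(len(members) for members in bins.values())
    mixed = list(original)
    for index in sorted(bins):
        mixed.extend(rng.sample(bins[index], quota))
    return mixed
```

Because every original sample is always kept, only the SSA portion of the mixture changes between experiments, which isolates the effect of the augmentations.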

Table 4: Content (IoU, Hal) and length (L) controllability, text quality (G), diversity (D-1, D-2, sC), and harmonic mean (H) of (IoU, G, and sC) for our CIC-BART-SSA models evaluated only on the original Flickr-Ent test set. Each of our CIC-BART-SSA models is trained on a different Flickr-Ent-SSA mixture, described by the augmentation strategy type $\tau_{SSA}$ and parameters $p_{SSA}$. Each blended data version has all the original data but different percentages of our SSA augmentations. The row order of experiments corresponds to the included SSA percentage in Flickr-Ent ranging from 0 to 100%.

Model: CIC-BART-SSA | Evaluation: Flickr-Ent Test Set
$\tau_{SSA}$	$p_{SSA}$	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	L $\downarrow$	Hal $\downarrow$	D-1 $\uparrow$	D-2 $\uparrow$
R	0%	69.8	54.0	85.0	78.6	1.24	36.5	43.6	58.2
R	25%	69.9	53.7	85.1	79.8	1.29	36.5	44.7	59.5
R	50%	70.3	53.9	85.6	80.5	1.23	36.2	45.3	60.6
U	10	70.6	54.3	85.6	80.5	1.07	35.6	45.9	61.0
R	75%	70.5	53.9	85.5	81.1	1.05	35.9	46.2	61.7
R	100%	71.3	55.0	86.0	81.7	1.05	34.1	47.0	62.6

Table 5: Content (IoU, Hal) controllability, text quality (G), diversity (D-1, D-2, sC), and harmonic mean (H) of (IoU, G, and sC) for our CIC-BART-SSA models evaluated only on the SSA data generated using the Flickr-Ent test set. Each of our CIC-BART-SSA models is trained on a different Flickr-Ent-SSA mixture, described by the augmentation strategy type $\tau_{SSA}$ and parameters $p_{SSA}$.

Model: CIC-BART-SSA | Evaluation: SSA Test Set
$\tau_{SSA}$	$p_{SSA}$	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	Hal $\downarrow$	D-1 $\uparrow$	D-2 $\uparrow$
R	0%	68.5	53.0	80.5	79.8	37.3	52.9	62.9
R	25%	70.5	54.4	81.4	84.0	35.3	55.5	67.2
U	10	71.3	55.4	82.6	83.7	33.9	56.1	67.4
R	50%	71.5	55.2	82.7	85.1	33.9	56.4	68.3
R	75%	71.6	55.4	82.8	84.8	33.3	56.4	68.7
R	100%	72.0	55.6	82.9	86.1	33.0	56.5	69.3

Table 6: Content (IoU, Hal) and length (L) controllability, text quality (G), diversity (D-1, D-2, sC) and harmonic mean (H) of (IoU, G, and sC) for our CIC-BART-SSA models evaluated only on the original COCO-Ent test set. Each of our CIC-BART-SSA models is trained on a different COCO-Ent-SSA mixture, described by the augmentation strategy type $\tau_{SSA}$ and parameters $p_{SSA}$.

Model: CIC-BART-SSA | Evaluation: COCO-Ent Test Set
$\tau_{SSA}$	$p_{SSA}$	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	L $\downarrow$	Hal $\downarrow$	D-1 $\uparrow$	D-2 $\uparrow$
R	0%	75.9	76.2	73.0	78.7	.490	19.0	38.0	56.2
R	25%	76.9	76.5	74.6	80.1	.148	18.5	42.6	59.7
R	50%	76.8	76.7	74.9	79.0	.163	17.8	42.9	59.0
U	10	76.7	77.0	74.0	79.4	.150	17.8	42.0	58.6
R	75%	77.8	77.2	74.0	82.6	.116	17.8	42.9	61.6
R	100%	78.3	77.2	74.8	82.5	.106	17.8	44.6	63.2

Table 7: Content (IoU, Hal) controllability, text quality (G), diversity (D-1, D-2, sC), and harmonic mean (H) of (IoU, G, and sC) for our CIC-BART-SSA models evaluated only on the SSA data generated using the COCO-Ent test set. Each of our CIC-BART-SSA models is trained on a different COCO-Ent-SSA mixture, described by the augmentation strategy type $\tau_{SSA}$ and parameters $p_{SSA}$.

Model: CIC-BART-SSA | Evaluation: SSA Test Set
$\tau_{SSA}$	$p_{SSA}$	$H\uparrow$	IoU $\uparrow$	G $\uparrow$	sC $\uparrow$	Hal $\downarrow$	D-1 $\uparrow$	D-2 $\uparrow$
R	0%	69.2	61.4	73.9	74.0	28.4	44.2	57.0
R	25%	74.4	65.1	80.1	80.0	23.0	50.1	63.1
U	10	74.6	65.0	80.3	80.8	23.2	51.2	64.4
R	50%	74.9	64.9	80.7	81.7	23.1	51.7	64.3
R	75%	74.9	64.9	80.7	81.6	23.1	51.7	65.1
R	100%	75.6	65.2	80.7	83.7	23.2	53.8	67.8

We repeat the procedure for both the COCO-Ent-SSA and Flickr-Ent-SSA datasets. We evaluate all models on content (IoU, Hal) and length (L) controllability, text quality (G), diversity (sC, D-1, D-2), and the harmonic mean (H) of IoU, G, and sC. We present the evaluation results for the COCO-Ent and Flickr-Ent test sets in Tabs. 6 and 4, respectively. We observe a similar trend in both datasets, where adding our SSA samples improves content and length controllability, text quality, and diversity. The significant improvement in diversity and length controllability is due to the linguistic diversity offered by our SSA augmentations. For example, the caption length histogram in Fig. 10 shows how narrow the length distribution is for the COCO-Ent dataset, which mainly contains captions with 11, 12, or 13 words, so it is difficult for models trained on COCO-Ent alone to generalize and generate captions of other lengths. In contrast, our models trained jointly with our SSA augmentations can generate captions faithful to the length control signal.

In Tabs. 7 and 5, we evaluate the performance of each model on the SSA augmentations derived from the COCO-Ent and Flickr-Ent test images, respectively. We exclude length (L) controllability, as it is the same as in Tabs. 6 and 4: it is computed on random control signals for each dataset's test images.

We observe an even more evident improvement across all metrics as we progressively include more of our focused SSA examples for CIC model training. This results from our SSA augmentations, which provide focused examples for training controllable image captioning models. This is exemplified in Figs. 10 and 11 coverage histograms, where the low coverage (high focus) regime is highly under-represented in the original COCO-Ent and Flickr-Ent datasets. Finally, our quantitative analysis demonstrates that training with our SSA augmentations improves controllability, text quality, and diversity performance. Particularly, the improvement is significant in cases where the CIC models need to focus on and describe a specific, small region of a complex and large scene.

0.D.3 Effect of SSA on Content Controllability

In this section, we analyze the content controllability of our models, CIC-BART and CIC-BART-SSA. The former is trained on the original datasets, COCO-Ent and Flickr-Ent, while the latter is trained on our proposed datasets, COCO-Ent-SSA and Flickr-Ent-SSA. We break down the content controllability (IoU, Hal) performance into coverage bands, where coverage refers to the percentage of the image covered by the control signal.

In Fig. 12, we show the results of our models on different evaluation test sets. The first row of the figure shows the performance of models trained on either the COCO-Ent or COCO-Ent-SSA dataset, while the second row shows models trained on Flickr-Ent or Flickr-Ent-SSA. The two columns correspond to the evaluation test sets. In the first column, we evaluate the models on the test sets of the original datasets (COCO-Ent, Flickr-Ent). In the second column, we evaluate them on our SSA augmentations derived from the test sets of the original datasets.

The orange line in all plots represents the percentage of examples in each coverage band. We observe that the test sets of the original datasets, COCO-Ent and Flickr-Ent, have more examples with control signals covering a broad aspect of the image. In contrast, the SSA test sets have more samples for focused control signals covering a small percentage of the image. We also notice that the test sets of the original and SSA datasets are consistent with their respective training set statistics presented in Figs. 10 and 11.

Furthermore, in Fig. 13, we present the corresponding coverage histograms for our Hallucinations (Hal) metric. We observe a similar trend in all cases, wherein our model CIC-BART-SSA, trained on our SSA augmentations, shows an improvement in content controllability performance. This means higher IoU and reduced Hal. We note that breaking down the content controllability metrics in coverage bands reveals the major improvement in the low coverage regions, where the original datasets, COCO-Ent and Flickr-Ent, have very few data points.

Table 8: Best-5 Diversity for randomly generated control signals of the COCO-Ent and Flickr-Ent testing images. Our model CIC-BART was trained on the original COCO-Ent and Flickr-Ent datasets while CIC-BART-SSA was trained with our COCO-Ent-SSA and Flickr-Ent-SSA augmented datasets. *ASG-type dataset was not released for Flickr-Ent, precluding us from evaluating its best-5 diversity scores.

	COCO-Ent		Flickr-Ent
Method	D-1 $\uparrow$	D-2 $\uparrow$	D-1 $\uparrow$	D-2 $\uparrow$
ASG [14]	43	56	-	-
CIC-BART	58	86	67	90
CIC-BART-SSA	67	92	68	93

0.D.4 Best-5 Diversity

In Tab. 8, we present the best-5 diversity (D-1, D-2), proposed in ASG [14], for our models CIC-BART and CIC-BART-SSA. We notice a substantial diversity improvement, especially for the COCO-Ent dataset, when we train our models using our SSA augmentations (CIC-BART-SSA).

0.D.5 Measuring Performance via Standard Captioning Metrics

Table 9: Captioning metrics on COCO-Ent and Flickr-Ent original test sets.

	COCO-Ent					Flickr-Ent
Model	B4 $\uparrow$	M $\uparrow$	R $\uparrow$	C $\uparrow$	S $\uparrow$	B4 $\uparrow$	M $\uparrow$	R $\uparrow$	C $\uparrow$	S $\uparrow$
LaBERT[20]	13.5	20.6	42.3	136.6	32.4	8.1	14.6	32.7	70.8	19.6
SCT[19]	22.3	25.6	55.3	209.7	48.5	12.5	16.8	38.9	84.0	23.5
ComPro[53]	24.0	27.3	56.1	232.2	50.4	11.9	17.3	37.8	89.4	23.9
ASG[14]	23.0	24.5	50.1	204.2	42.1	-	-	-	-	-
VSR[13]	25.4	28.8	57.8	265.0	49.8	12.3	19.8	40.9	131.4	22.4
CIC-BART	21.0	26.2	50.2	225.0	46.3	14.2	19.4	39.7	136.4	27.2
CIC-BART-SSA	20.0	25.5	48.9	216.2	46.1	13.0	18.9	37.8	123.3	27.0
CIC-BART +verb	36.2	33.7	62.9	366.8	53.7	26.6	27.2	53.9	275.1	32.4

Tab. 9 reports the results of all models in standard captioning metrics (i.e., B4, M, R, C and S). As we can see, both CIC-BART and CIC-BART-SSA perform comparably to the three baselines with respect to these metrics. Nonetheless, as we noted earlier, these scores reflect the match between a generated controlled caption and a ground-truth image-level caption. Given a focused control signal (e.g., one focusing on a subset of entities in an image), we expect a partial match between the generated controlled caption and the ground-truth caption. VSR has the best scores for most of these metrics, but this is partially due to this model using the exact verb as the control signal and is not necessarily an indicator of this model’s caption quality (as we saw earlier with the low G score). To understand the role of such descriptive control signals, we present results for a variation of our CIC-BART where we also input the verb as an additional control signal; see the last row of the table (CIC-BART +verb). Note that even with this additional information, our control signal is still simpler than that of the VSR, as we do not provide the verb-specific semantic roles. Nevertheless, by adding a verb as the control signal, we can see a substantial increase in all standard captioning metrics.

0.D.6 Qualitative Results

In Fig. 14, we present qualitative examples from COCO-Ent and Flickr-Ent test sets. In these examples, the control signals are extracted from the ground-truth captions. Each colored oval under ‘Cntl’ corresponds to a bounding box of the same color in the image. The collection of ovals identifies the entities of interest, that is, the control signal. (Note that we do not show the colored ovals for image (c), since both sets of control signals include all bounding boxes.) For example, in (a), the first control signal (at the top) focuses on the regions restaurant, man, and food, while the second control signal also includes the entity table. Each highlighted word in the generated controlled captions corresponds to the control entity of the same color, showing the match between the generated captions and the control signal.

We notice that our models outperform previous SOTA by substantially improving the quality of the generated controlled captions. This behavior was expected from our quantitative analysis, showing that our models have significantly higher text quality (G) performance. In addition, more evidently in figures (b) and (d), our models have better content controllability performance by correctly referring to all entities of interest in the control signal. Especially in the highly challenging, complex scene (d) in which many objects are present, it successfully describes all entities of interest in the generated controllable caption.

Our CIC-BART-SSA model generates captions that are faithful to the control signal and better understand the relationships connecting the entities of interest. For example, in (a), it correctly identifies that the person is photographing his food rather than eating it and that the image is not black and white, or in (d) that the woman holds the purse and not the man in the background.

In Fig. 15 we present additional qualitative examples from COCO-Ent test set. We include the generated controllable captions from the baseline models (SCT, ASG, and VSR) and the proposed models CIC-BART and CIC-BART-SSA. Our qualitative examples also show that our models generate diverse, high-text-quality captions with improved content controllability when compared to the baseline models.

Next, in Fig. 16, we present examples using SSA-augmentation control signals derived from the Flickr-Ent test set. We notice that our CIC-BART-SSA better conveys the image concept without hallucinating. For example, in the second image, it correctly describes that the man is fixing the ticket booth, and that the woman carries a bag and is not walking.

Further, we conducted an experiment to evaluate our length control performance qualitatively. In the experiment, we generated controllable captions for a fixed image region and various caption length controls. We present some of our qualitative results in Fig. 17 showing that the generated captions were faithful to the length control signal, indicating that our model is effective in controlling the length of captions. Furthermore, we observe that our model generates a diverse set of captions for a specific image region.

0.D.7 SSA vs Other Augmentation Strategies

0.D.7.1 Augmentations via LLM Paraphrasing

To understand the impact of our SSA enhancements, we perform an experiment in which we augment the original training data with paraphrases generated using an LLM, Llama-2 [48]. We generate one paraphrase per original caption, effectively doubling the size of the training data (for 20% of the captions, Llama-2 generates paraphrases identical to the original sentences). Specifically, we instructed the Llama-2 model to rephrase the initial captions using few-shot prompting with a prompt like:

If the phrase ‘Children wearing team uniforms playing soccer in a grassy field’ can be paraphrased as ‘Kids in a grassy field playing soccer in uniforms’, and the phrase ‘A little girl sitting in the middle of a restaurant and smiling for picture’ can be paraphrased as ‘A smiling little girl taking a picture while sitting in a restaurant’, then the phrase ‘{caption}’ can be paraphrased as …

We replace {caption} with original dataset captions, relying on Llama-2 to paraphrase them. Since we only paraphrased the original sentence, we can assume that it pertains to the same set of bounding boxes since it refers to the same visual entities of interest, which is the only information required for our CIC-BART model.
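Filling the template can be sketched as follows. The two demonstration pairs are taken from the prompt quoted above; the function name is hypothetical, and the call to Llama-2 itself is deliberately left out, since the inference setup is not described here.

```python
# Demonstration pairs copied from the few-shot prompt in the text.
FEW_SHOT = [
    ("Children wearing team uniforms playing soccer in a grassy field",
     "Kids in a grassy field playing soccer in uniforms"),
    ("A little girl sitting in the middle of a restaurant and smiling for picture",
     "A smiling little girl taking a picture while sitting in a restaurant"),
]


def build_paraphrase_prompt(caption):
    """Fill the few-shot template; the model is expected to continue the text."""
    first, *rest = FEW_SHOT
    parts = [f"If the phrase ‘{first[0]}’ can be paraphrased as ‘{first[1]}’"]
    parts += [
        f"and the phrase ‘{source}’ can be paraphrased as ‘{target}’"
        for source, target in rest
    ]
    parts.append(f"then the phrase ‘{caption}’ can be paraphrased as")
    return ", ".join(parts)


prompt = build_paraphrase_prompt("A man riding a horse on a beach")
```

The prompt ends right where the paraphrase should begin, so the model's continuation can be used as the augmented caption and paired with the bounding boxes of the original caption.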

The results in Tab. 10 (bottom panel) demonstrate that CIC-BART-SSA outperforms CIC-BART-par on all controllable captioning metrics. We conclude that the improved performance of CIC-BART-SSA is not just due to the increase in training data size; the model benefits from the intricate structured and visually grounded guidance of our SSA.

Table 10: Performance of our SSA augmentations (CIC-BART-SSA) vs LLM paraphrases (CIC-BART-par). All models are evaluated on the original COCO-Ent test set.

Model	$H\uparrow$	IoU $\uparrow$	Hal $\downarrow$	G $\uparrow$	sC $\uparrow$	D-1 $\uparrow$	D-2 $\uparrow$	L $\downarrow$
CIC-BART-par	74.3	76.2	18.8	72.0	74.9	36.1	52.7	.19
CIC-BART-SSA	78.3	77.2	17.8	74.8	82.5	44.6	63.2	.11

In Figs. 20, 18 and 19, we present some examples of the Llama-2 paraphrases of the original COCO captions. For example, the paraphrase of 1-O) is 1-Llama-2), the paraphrase of 2-O) is 2-Llama-2), and so on. We have included the SSA-generated focused captions for each image, along with the corresponding synthetic caption and its GRUEN score. SSA uses this metric to filter out poor-quality sentences (in our experiments, we set the GRUEN threshold to 0.7).

0.D.7.2 Scene Graphs with LLM Paraphrasing Augmentations

The scope of this section is to illustrate the benefits of AMRs when compared to scene graphs for CIC augmentation. Figs. 20, 22 and 24 depict examples from COCO-Ent and Flickr-Ent that contrast Original Captions, LLM-paraphrased captions, SSA-augmented captions, and CLID-augmented captions.

Nature of captured relations and entities in AMR vs. Scene Graph representations.

As stated in our main paper, prior analysis [1, 57] has shown that existing scene graph annotations focus mainly on geometric and possessive relations. For example, in Fig. 21 examples of geometric relations are ‘a man in front of a door’, ‘one man next to another man,’ etc., and possessive ‘a man has hair,’ ‘the man has a head’ etc. Regarding entities, scene graphs focus mainly on object/body parts (hair, head, arm, etc.) and clothing (dress, shirt, etc.). On the contrary, the AMRs derived from the image captions contain a wide range of semantic relations that are inherited from the natural language image descriptions drawn from the image captioning datasets. For example, in Fig. 21, in the original captions, we can find the semantic relations, 1-O) ‘the men are hanging out in the yard’, 5-O) ‘the friends enjoy time spent together’ which will be inherited in their AMR representations and therefore in the SSA samples (i.e. 2-SSA), 3-SSA), 5-SSA), 7-SSA)).

Limitations of Scene Graph representations.

Scene graphs are useful in capturing the visual elements of a scene and their relationships, such as geometric and possessive relationships. However, they lack the ability to represent abstract concepts like time-related information, such as ‘a sunny morning’ or ‘a quiet afternoon’. For instance, in Fig. 24 caption 2-O), the phrase ‘during a sunny afternoon’ is not directly related to low-level semantics like individual visual objects and their relationships. Rather, it relates to higher-level reasoning, such as observing the shadows of the trees and the annotators’ experiences that helped them conclude that the picture was taken during a sunny afternoon.

Another important difference between AMRs and scene graphs is that the edges of an AMR carry linguistic information, which is not the case with scene graphs. In Fig. 7, we can see only a fraction of the available edge roles, which are crucial when converting vgAMRs to natural language sentences as they help generate accurate descriptions. On the other hand, scene graph edges lack the ability to convey additional linguistic information, except for the edge direction, which indicates the object and subject of a relation. This lack of additional information makes it challenging to augment captions using sampled scene graphs and forces reliance on LLM paraphrasing, which, as previously discussed in Sec. 0.D.7.1, can introduce errors and inaccuracies (increased hallucinations (Hal) and poorer text quality (G) as seen in Tab. 10). For instance, in the scene graph-based augmentation of CLID [23], they first place the sampled scene graph node names in a sequence and then ask an LLM paraphraser to generate a sentence.

In Figs. 18, 19, 20, 24, 21, 22 and 23, we can see instances from our SSA (AMR-based) and CLID (scene graph-based) augmentations for a particular image. Augmentations in the CLID dataset are not visually grounded and, hence, cannot be utilized for spatial CIC. However, our SSA approach includes visual grounding information, which enables it to effectively generate visually grounded augmentations, making them suitable for spatial CIC tasks as well.

CLID augmentations mainly describe the geometric and possessive relationships of the visual entities, and not their semantic relations. The main focus of these augmentations is on body parts (e.g., in Fig. 22, caption 1-CLID: ‘people with a leg’, ‘a girl with an arm and a leg’, and ‘a woman with a head’) as well as clothing. This is expected, since this type of information is typically captured in existing scene graphs. We also notice that most of the generated sentences are difficult to read and contain redundancies and hallucinations. This may be the result of scene graph annotation or generation errors, or of LLM-induced hallucinations and poor-quality text generation. On the other hand, the SSA augmentations are based on AMR representations and capture various types of relations, including semantic, geometric, and possessive ones. They also provide a more natural and human-like description of visual entities, as they are derived from the human-annotated captions of the original dataset.