The Role of Network and Identity in the Diffusion of Hashtags

Although the spread of behaviors is influenced by many social factors, existing literature tends to study the effects of single factors (most often, properties of the social network) on the final cascade. In order to move towards a more integrated view of cascades, this paper offers the first comprehensive investigation into the role of two social factors in the diffusion of 1,337 popular hashtags representing the production of novel culture on Twitter: 1) the topology of the Twitter social network and 2) the performance of each user’s probable demographic identity. Here, we show that cascades are best modeled using a combination of network and identity, rather than either factor alone. This combined model best reproduces a composite index of ten cascade properties across all 1,337 hashtags. However, there is important heterogeneity in what social factors are required to reproduce different properties of hashtag cascades. For instance, while a combined network+identity model best predicts the popularity of cascades, a network-only model performs better in predicting cascade growth and an identity-only model in predicting adopter composition. We are able to predict what type of hashtag is best modeled by each combination of features and use this to further improve performance. Additionally, consistent with prior literature, the combined network+identity model most outperforms the single-factor counterfactuals among hashtags used for expressing racial or regional identity, stance-taking, talking about sports, or variants of existing cultural trends with very slow- or fast-growing communicative need. In sum, our results imply the utility of multi-factor models in predicting cascades, in order to account for the varied ways in which network, identity, and other social factors play a role in the diffusion of hashtags on Twitter.

Keywords: hashtags, cascade prediction, cascade evaluation, social network, social identity.

1.   Introduction

Roughly 1 in 5 posts on Twitter (now known as $\mathbb{X}$) contain hashtags. The ubiquity of hashtags likely stems from their pragmatic and social functions in the process of cultural production on Twitter: 1) to facilitate and codify the creation of new culture and 2) to enable easy dissemination of new culture that is produced (Bruns and Burgess, 2011, 2015; Rambukkana, 2015; La Rocca, 2020). As such, modeling the spread of cultural innovation on Twitter requires a strong mechanistic understanding of how hashtags spread among users on the platform, as well as the diverse mechanisms underlying the creation and adoption of hashtags.

The spread of behaviors, known as cascades, has been primarily studied through the lens of social networks, including analyzing the effects of different network topologies and contagion processes (Kupavskii et al., 2012; Krishnan et al., 2016; Pramanik et al., 2017), or how these effects vary by properties like the hashtag’s topic and semantics (Romero et al., 2011; Lehmann et al., 2012). Prior work suggests that many types of social context help in shaping how far, how fast, and to whom artifacts diffuse (Chang, 2010; Lehmann et al., 2012; Eisenstein et al., 2014; Zhang et al., 2016a; Goel et al., 2016a; Li et al., 2018). For instance, hashtags are often used to explicitly signal the user’s social identity or affiliation (Berger, 2008; Smith and Smith, 2012; Evans, 2016; Barron and Bollen, 2022); in these cases, the Twitter network affords users exposure to the hashtag, but each user’s identity helps determine whether they adopt the hashtag and, therefore, also shapes future exposures (Palloni, 2001; Sharma, 2013; Merle et al., 2019; Mueller et al., 2021). For example, Sharma (2013) theorizes that the spread of hashtags on Black Twitter is driven by a combination of network and identity. Users coin these hashtags to “perform” their racial identity online. Adopters are often part of the Black Twitter network and, as such, continued adoption largely occurs within this community because 1) these users are more likely to be exposed to the hashtag, 2) exposed users outside the community tend not to adopt the hashtag if it does not signal their racial identity, which 3) minimizes exposure and adoption outside the community. As this example illustrates, the dynamics underlying network-only diffusion likely differ from the dynamics when diffusion involves a combination of network and other social factors like identity. As such, models of hashtag dissemination would likely benefit from including multiple interacting social factors. However, even in the context of studying identity-related hashtags, prior work has largely focused on the impact of a single factor (e.g., either network or identity) when modeling the diffusion of hashtags.

In this work, we investigate the role of two social factors in the adoption of innovation: 1) the topology of Twitter’s social network and 2) the probable demographic identity of users. We study the spread of 1,337 popular hashtags in a network of nearly 3M users on Twitter. These hashtags represent the production of novel culture (e.g., #learnlife, #gocavs). In order to compare the effects of network and identity, we simulate the diffusion of each hashtag using: 1) a Network-only model where hashtags spread through the Twitter network using a modified linear-threshold model, 2) an Identity-only model where hashtags diffuse between users who share relevant identities, and 3) a combined Network+Identity model that includes both social factors. We evaluate how well these models reproduce ten commonly studied properties of cascades, including their popularity, growth, and adopter composition. Overall, the combination of network and identity best reproduces a composite measure of all ten properties. However, there is important heterogeneity in the role of network and identity; for instance, hashtags expressing regional or racial identity, and those discussing sports or news, have the highest comparative advantage with the Network+Identity model. We create a better-performing customized model by selecting whether to study each hashtag using network alone, identity alone, or network and identity together. Our work underscores the importance of building models that integrate multiple social factors.

2.   Related Work

2.1 Modeling Cultural Diffusion Online.

The diffusion of behavior and information online is a topic of significant study in the literature. See Zhou et al. (2021), Li et al. (2021), and Raponi et al. (2022) for recent reviews of this literature. Some key points from these reviews: Empirical models often aim to predict some property of the final cascade given some information about its initial adopters. Many such papers adapt models developed to simulate offline behaviors from first principles, including the Susceptible-Infectious-Recovered (SIR) compartment model, the linear threshold model of complex contagion, and stochastic simulations like Hawkes models or Poisson processes. Other papers use deep learning for the predictive task, including graph representation learning and predictive models from features of the network, adopters, and early parts of the cascade. Our work builds on these studies by using a more recent agent-based model of diffusion that accounts for diffusion dynamics particular to Twitter. For instance, by adopting a usage-based instead of adopter-based model, our framework accounts for frequency effects in adoption (Ellis, 2002; Beckner et al., 2009); and by modeling the fading of attention online, our model allows for cultural artifacts to stop being used over time (e.g., to model hashtags that are used temporarily and then exit the lexicon; Bruns and Burgess, 2011). Using a first-principles model also allows us to test the specific mechanisms associated with network and identity that are encoded in the model, and to explicitly test the effects of network and identity in cultural diffusion rather than simply using network or identity features in our model. We also introduce a novel dataset of hashtag cascades and a ten-factor evaluation framework to support future work in this area.

2.2 Social Factors in the Adoption of Hashtags.

Prior work often attributes hashtag adoption to factors related to network, identity, lifecycle, and discourse. Network factors include the position of initial adopters in the social network and simulating the diffusion of innovation through a social network (Li et al., 2021; Fink et al., 2016). Identity factors include wanting to join or signal membership in a certain community (Raponi et al., 2022; Page, 2012; Yang et al., 2012). Lifecycle factors include the hashtag’s growth trajectory (Cheng et al., 2014; Fink et al., 2016; Lin et al., 2013). Discursive factors include the hashtag’s relevance, topicality, and ease of use (e.g., length) (Yang et al., 2012; Fink et al., 2016; Lin et al., 2013; Cunha et al., 2011; Giaxoglou, 2018). In addition to individual social factors, some theoretical models of diffusion posit that the interaction of multiple social factors may play a role in the diffusion of hashtags. For instance, qualitative studies of sports hashtags (Smith and Smith, 2012) and racial hashtags (Sharma, 2013) have suggested that hashtags pertaining to specific communities on Twitter may spread via network effects (someone is exposed to a hashtag, a precursor for adoption, when a member of their network uses it) and identity effects (an exposed user chooses to use the hashtag if they are a member of this community and want to signal this identity). However, most empirical models do not incorporate the interaction of network and identity. For instance, Raponi et al. (2022) note a number of articles that, separately, describe the effect of “network factors” and “user factors” (e.g., identity) on the propagation of misinformation, but none that describe the effects of both network and user factors. Similarly, Zhou et al. (2021) list several papers that model adoption decisions based on either “neighboring relations” (i.e., the network) or “individual/group characteristics” (like identity), but not both, while Kwon and Ha (2023) describe hashtags as being either “identity-based” or “bond-based.” Our work builds on this prior literature by empirically modeling the interaction of two social factors, network and identity, in cultural diffusion. In addition, we explore the conditions under which the interaction of these two social factors is especially important in modeling diffusion, and propose a combined model to predict which factors (network and/or identity) best model the properties of a given hashtag’s cascade.

3.   Methods

In order to test the roles of network and identity in the diffusion of hashtags, we 1) collect cascades of popular hashtags from Twitter, 2) estimate the identity of each user and the network among the users, and 3) use an agent-based model to simulate these cascades using both network and identity. To do this, we adapt the methods of Ananthasubramaniam et al. (2024). We summarize the key points of these methods in this section; the original paper has full details.

3.1 Modeling Diffusion of Innovation

Testing our study’s hypotheses requires comparing empirical cascades against cascades simulated using the Twitter network and users’ demographic identities. In this section, we describe how we produce the needed synthetic data.

3.1.1 Simulation Formalism

The diffusion of hashtags on Twitter is modeled using a common agent-based setup: a set of initial adopters use the hashtag at time $t=0$; and at each subsequent timestep, each agent will decide whether to use the hashtag depending on prior adoption by other agents. In the popular linear threshold model, an agent will use the hashtag if the (weighted) fraction of their network neighbors who are already adopters crosses a certain threshold (Centola and Macy, 2007). To better simulate the dynamics underlying cultural production, Ananthasubramaniam et al. (2024) adapt the classic linear threshold model in two ways that are well-suited for our research question:

First, adoption is usage-based rather than user-based. That is, rather than representing adoption as a binary property of the agent (i.e., an agent is either “an adopter” or “not an adopter”), each exposed agent $i$ could use the hashtag $h$ at each timestep with some time-varying probability $p_{ih}$. Therefore, unlike the linear threshold model, an agent can 1) use the hashtag multiple times, or 2) decide not to use it in one timestep but then decide to use it later. This assumption is consistent with the role of repeated exposure in the adoption of textual innovation (Tomasello, 2000; Ellis, 2019).

Second, the model uses not only the topology of the social network but also the identity of agents to model the diffusion of innovation. As shown in Equation 1, the probability $p_{ih}$ of an agent $i$’s adoption of hashtag $h$ is proportional to (i) the similarity between their identity and the hashtag’s identity, $\delta_{ih}$; and (ii) the fraction of their neighbors $j$ who adopted the hashtag, weighted by tie strength $w_{ij}$ and similarity in identity $\delta_{ij}$.

$$p_{ih} \sim S \cdot \delta_{ih} \, \frac{\sum\limits_{j \,\in\, \text{neighbors who adopted}} w_{ji}\,\delta_{ji}}{\sum\limits_{k \,\in\, \text{all neighbors}} w_{ki}\,\delta_{ki}} \tag{1}$$

Therefore, consistent with prior work on adoption of innovation (Centola and Macy, 2007; Bakshy et al., 2012), the network influences 1) the hashtags an agent is exposed to (opportunity to adopt) and 2) the agents’ level of exposure (likelihood of adopting). Consistent with prior work on identity performance and signalling (Goffman et al., 1978; Eckert, 2012), the effects of identity are modeled in two ways: 1) agents preferentially use hashtags that match their own identity, and 2) agents give higher weight to exposure from demographically similar network neighbors.
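To make Equation 1 concrete, the following minimal Python sketch computes the adoption probability for a single agent. It is an illustration only and not the authors' implementation: the function name, the dictionary-based data layout, and the clipping of the result to at most 1 are assumptions made here, since the text only states that the probability is proportional to the right-hand side.

```python
def adoption_probability(i, adopted, neighbors, w, delta_user, delta_hashtag, stickiness):
    """Sketch of Equation 1 for agent i (data layout is hypothetical).

    neighbors[i]       -> iterable of i's network neighbors
    w[(j, i)]          -> tie strength of the edge from j to i
    delta_user[(j, i)] -> identity similarity between j and i
    delta_hashtag[i]   -> similarity between i's identity and the hashtag's identity
    """
    # Denominator: identity-weighted tie strength over all neighbors.
    total = sum(w[(k, i)] * delta_user[(k, i)] for k in neighbors[i])
    if total == 0:
        return 0.0  # no weighted exposure is possible
    # Numerator: the same quantity restricted to neighbors who already adopted.
    exposed = sum(w[(j, i)] * delta_user[(j, i)] for j in neighbors[i] if j in adopted)
    # Assumption: clip to 1 so that the value can be used as a probability.
    return min(1.0, stickiness * delta_hashtag[i] * exposed / total)


# Toy example: agent "a" has two neighbors, one of whom has adopted.
neighbors = {"a": ["b", "c"]}
w = {("b", "a"): 2.0, ("c", "a"): 1.0}
delta_user = {("b", "a"): 0.5, ("c", "a"): 1.0}
p = adoption_probability("a", {"b"}, neighbors, w, delta_user, {"a": 0.8}, stickiness=0.5)
```

Setting every similarity to 1 in this sketch leaves only the tie-strength-weighted fraction of adopting neighbors, scaled by stickiness.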

3.1.2 Model Parameters

Each hashtag has a different propensity to be used on Twitter, due to differences in factors like the size of potential audience, communicative need, and novelty (e.g., a hashtag about a TV show with a small audience is likely to get fewer uses than a hashtag about a TV show with a large audience) (Bakshy et al., 2011; Sharma, 2013). Accordingly, in Equation 1, each hashtag is associated with a different constant of proportionality $S$. The $S$ parameter is termed stickiness because larger values of this parameter bias the model towards higher levels of (or “stickier”) adoption. The stickiness of each hashtag is calibrated to the empirical cascade size (number of uses) using a nested grid search on a parameter space of $[0.1, 1]$: first, identifying the interval of width 0.1 in which the model best approximates the empirical cascade size; and second, identifying the best-fitting stickiness value in that interval using a grid search with step size 0.01. Grid searches are performed using one run of the model at each value of stickiness.
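The nested grid search can be sketched as follows. This is a hedged illustration: `simulate_size` is a hypothetical stand-in for one run of the diffusion model, and the rule used to pick the coarse interval (smallest summed error at its two endpoints) is an assumption, since the text does not specify how the best interval of width 0.1 is identified.

```python
def calibrate_stickiness(simulate_size, target_size):
    """Two-stage grid search over stickiness values in [0.1, 1].

    simulate_size(s) -> cascade size from one model run at stickiness s
                        (hypothetical callable, supplied by the caller)
    target_size      -> empirical cascade size to match
    """
    def error(s):
        return abs(simulate_size(s) - target_size)

    # Stage 1: coarse grid 0.1, 0.2, ..., 1.0.
    coarse = [round(0.1 * k, 2) for k in range(1, 11)]
    coarse_err = [error(s) for s in coarse]
    # Assumption: score each width-0.1 interval by the error at its endpoints.
    best = min(range(len(coarse) - 1), key=lambda k: coarse_err[k] + coarse_err[k + 1])
    low = coarse[best]

    # Stage 2: fine grid with step 0.01 inside the chosen interval.
    fine = [round(low + 0.01 * k, 2) for k in range(11)]
    return min(fine, key=error)


# Toy deterministic stand-in for the simulation (not the real model).
best_s = calibrate_stickiness(lambda s: 10_000 * s * s, target_size=2_025)
```

Because the real model is stochastic and is run once per stickiness value, repeated calibrations may select slightly different values; the toy stand-in above is deterministic only for illustration.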

The model has three hyperparameters that apply across all hashtags. These are taken from the original paper, which tuned the parameters to the empirical cascade size with the same set of users.

3.1.3 Comparing Network and Identity

To understand the effects of network and identity, we compare the full Network+Identity model described above against two counterfactuals: 1) the Network-only model, where we simulate the spread of the hashtag through just the network with no identity effects (this is achieved by setting $\delta_{ij}=1$ and $\delta_{ih}=1$) and 2) the Identity-only model, where we eliminate the effects of homophily by running simulations on a configuration model random graph with the same users and degree distribution as the original network.
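As a rough illustration of the Identity-only counterfactual, the sketch below rewires an edge list by random stub matching, the standard construction of a configuration model. It is a simplification under stated assumptions: edges are treated as undirected and unweighted, self-loops and repeated edges are allowed, and the function names are hypothetical. The network used in the paper is directed and weighted, so the actual procedure may differ.

```python
import random
from collections import Counter


def configuration_model_edges(edges, seed=0):
    """Randomly re-pair edge endpoints ("stubs") while keeping every node's degree."""
    rng = random.Random(seed)
    # Each edge contributes one stub per endpoint.
    stubs = [node for edge in edges for node in edge]
    rng.shuffle(stubs)
    # Pair consecutive stubs; this may create self-loops or duplicate edges.
    return list(zip(stubs[0::2], stubs[1::2]))


def degrees(edges):
    """Degree of each node, counting a self-loop twice."""
    return Counter(node for edge in edges for node in edge)


original = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
rewired = configuration_model_edges(original, seed=1)
```

The rewired graph keeps who has how many ties but scrambles who is tied to whom, which removes any tendency for similar users to be connected.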

3.2 Network and Identity Estimation

This section elaborates on how network and identity are estimated. Each agent in this model is a user on Twitter who is likely located in the U.S.A., based on the geographic coordinates tagged on their tweets (Compton et al., 2014). There are 3,959,711 such users in the Twitter Decahose, a 10% sample of tweets from 2012 to 2022. Since we use the same agent identities and network to model the diffusion of all hashtags during this ten-year period, the network and identities are inferred from 2018 data, which is at the midpoint of this timeframe (e.g., identities are from the 2018 American Community Survey and House of Representatives elections; the network is inferred from interactions between 2012 and 2018).

3.2.1 Agent Identities

In this model, identity includes an agent’s affiliations towards 25 identities within five demographic categories: (i) race/ethnicity (identities include different racial and ethnic groups such as non-Hispanic white, Black/African American, etc.); (ii) socioeconomic status (identities include categories of income level, educational attainment, and labor-force status); (iii) languages spoken (identities include the top six languages spoken in the U.S.A.: English, Spanish, French, Chinese, Vietnamese, Tagalog); (iv) political affiliation (identities are Democrat, Republican, or Other Party); and (v) geographic location. Each agent’s demographic identity is modeled as a vector $\Upsilon \in [0,1]^{25}$ whose entries represent the composition of the user’s Census tract and Congressional district. An agent’s location is inferred from the geographic coordinates they tweeted from, using the high-precision algorithm from Compton et al. (2014). An agent’s political affiliation is the fraction of votes each party got in the agent’s Congressional district during the 2018 House of Representatives election. An agent’s race, socioeconomic status, and languages spoken are the fraction of the Census tract with the corresponding identity.

3.2.2 Network

This study uses a weighted Twitter mutual mention network, which has been shown to model information diffusion well (Adamic and Adar, 2003; Huberman et al., 2008; Romero et al., 2013). In particular, the nodes in this network are all agents and there is an edge between agents $i$ and $j$ if both users mentioned the other at least once in the Twitter Decahose sample. The strength of the edge from $i$ to $j$ is proportional to the number of times user $i$ mentioned user $j$ in the sample. Although all ties are reciprocated, the network is directed because the strength of the edge from $i$ to $j$ may not match the strength of the edge from $j$ to $i$. This network contains 2,937,405 users and 29,153,138 edges.
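The construction of the mutual mention network can be sketched in a few lines. This is a minimal illustration, assuming mentions arrive as (source, target) pairs; the function name and the use of raw mention counts as edge strengths are assumptions (the text only says strength is proportional to the count).

```python
from collections import Counter


def mutual_mention_network(mentions):
    """Build directed, weighted edges only between users who mentioned each other.

    mentions -> iterable of (source_user, target_user) pairs, one per mention
    returns  -> dict mapping (i, j) to the number of times i mentioned j
    """
    counts = Counter(mentions)
    weights = {}
    for (i, j), n in counts.items():
        # Keep the edge i -> j only if j also mentioned i at least once.
        if i != j and (j, i) in counts:
            weights[(i, j)] = n
    return weights


edges = mutual_mention_network([("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")])
```

In the toy input, "a" and "b" mention each other, so both directed edges are kept with different strengths, while the unreciprocated mention of "c" produces no edge.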

3.3 Hashtags

This study models the spread of 1,337 popular hashtags between 2013 and 2022. This section describes how hashtags and their initial adopters and identities are selected.

3.3.1 Definition

As this paper seeks to study the roles of network and identity in the lifecycle of the production of novel culture, we select hashtags where:

  1. Hashtag is well-adopted: Sufficiently popular hashtags can be considered cultural objects, used to allow Twitter users to position their own thoughts in the context of a broader conversation (La Rocca, 2020; Page, 2012). To ensure that the hashtag was popular enough to be considered part of a “broader conversation,” we included only hashtags with 1,000 or more uses in our Decahose sample.

  2. Hashtag is new: Our goal is to model the spread of hashtags from when they are coined to their dissemination. Therefore, we include hashtags that have had low adoption before the data collection window (i.e., they are novel) and whose initial adopters we can identify in our data.

  3. Hashtag represents truly innovative culture: We are interested in modeling the diffusion of cultural innovation on Twitter. Therefore, hashtags of interest do not reference common words or phrases, and are not simply the names of existing named entities (e.g., celebrities, movie titles). Instead, they are neologisms or novel phrases that are partly or wholly created by the community.

3.3.2 Identification

We apply the above definition to systematically select hashtags from the Twitter Decahose sample between January 2012 and December 2022. First, we collect all tweets from the Decahose sample that were posted by the 2,937,405 users in our network. These tweets contain 198,988 hashtags that were used at least 100 times. Next, we filter these hashtags, as follows:

  1. Popularity: To limit our study to hashtags that eventually became popular, we eliminate 116,477 hashtags that were used fewer than 1,000 times between 2013 and 2022. Frequencies are counted without considering case (e.g., #GoSox is considered the same hashtag as #gosox). While some studies may also consider less popular hashtags, we eliminate these because many of the properties we are interested in cannot be calculated or are too noisy on small cascades.

  2. Novelty: To limit our study to newly coined hashtags, we eliminate 77,134 hashtags that were used more than 100 times in 2012 (e.g., #obama2012, #sup, #sobad, #sandlot).

  3. Innovativeness: To ensure the hashtag represents production of novel culture (e.g., it is not a reference to some named entity, a common phrase, or a dictionary word), we eliminate 3,144 hashtags that were entries in the Merriam-Webster English-language dictionary (e.g., #explore, #dirt) or in Wikidata, a repository of popular named entities and phrases (e.g., #domesticviolence, #billcosby, #interiordesign). Since hashtags cannot contain certain characters that might appear in the dictionary and Wikidata (e.g., spaces, apostrophes, periods), we replaced these characters with both spaces and underscores to ensure that we eliminate hashtags using these different conventions. Two authors reviewed a sample of 100 of these hashtags and determined that 84% of them were examples of novel cultural production, rather than references to entities, dictionary words, or other non-cultural or existing cultural references (annotation guidelines in Appendix A).

  4. Presence of Seed Nodes: To ensure that the hashtag was coined between 2013 and 2022, we eliminate 896 hashtags whose cascade began before 2013 (e.g., #theedmsoundofla, #southernstreets, #rastafarijams). The procedure to identify seed nodes is described in Section 3.3.3.

After this filtering, we were left with 1,337 hashtags.
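The counts reported for the four filters can be checked against the starting pool of 198,988 hashtags with simple arithmetic. The short Python check below uses only numbers stated in this section; the variable names are illustrative.

```python
# Starting pool: hashtags used at least 100 times by users in the network.
candidates = 198_988

# Number of hashtags eliminated at each filtering step, as reported above.
removed = {
    "popularity": 116_477,
    "novelty": 77_134,
    "innovativeness": 3_144,
    "seed nodes": 896,
}

remaining = candidates
for step, count in removed.items():
    remaining -= count
    print(f"after {step} filter: {remaining:,} hashtags")

# The funnel ends at the 1,337 hashtags studied in the paper.
assert remaining == 1_337
```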

3.3.3 Initial Adopters

Each cascade’s initial adopters are the users whose adoption of the hashtag 1) was likely not influenced by prior usage on Twitter and 2) likely influenced future adoption of the hashtag. To identify these users, we first find instances where each hashtag had a period of contiguous usage, by looking for periods of time when the hashtag was used at least 100 times in the Decahose sample (likely at least 1,000 times overall) with less than a month’s gap between successive uses. Prior work has determined that after one month of inactivity, the subsequent usage of a hashtag is likely not in response to any prior usage (Ananthasubramaniam et al., 2024). Additionally, a hashtag’s prior-period usage is more likely to be remembered if it was used more frequently in the prior period (Lehmann et al., 2012; Lorenz-Spreen et al., 2019). As such, we assume that the cascade starts during the first period where the hashtag was used more than 1,000 times, for two reasons: 1) any usage prior to this start date is likely unrelated to the cascade, because the hashtag was used too infrequently for users in the cascade to have a high likelihood of adoption; and 2) adopters after the start date are likely to remember the usage in this first period because of its high frequency. The hashtag’s initial adopters are the first ten users to use the hashtag after the start date.
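The seed-identification procedure can be sketched as below. This is an approximation under stated assumptions: uses are given as time-sorted (timestamp, user) pairs from the sample, a month is taken to be 30 days, the threshold is applied to sample counts, and all function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def find_cascade_start(uses, min_uses=100, max_gap=timedelta(days=30)):
    """Index of the first use in the first contiguous period with >= min_uses uses.

    uses -> list of (timestamp, user) pairs sorted by timestamp
    A period ends when the gap to the next use is at least max_gap.
    """
    start = 0
    for idx in range(1, len(uses) + 1):
        period_ended = idx == len(uses) or uses[idx][0] - uses[idx - 1][0] >= max_gap
        if period_ended:
            if idx - start >= min_uses:
                return start
            start = idx  # too little usage: discard this period, try the next
    return None


def initial_adopters(uses, n_seeds=10, **period_kwargs):
    """First n_seeds distinct users in the first qualifying period."""
    start = find_cascade_start(uses, **period_kwargs)
    if start is None:
        return []
    seeds = []
    for _, user in uses[start:]:
        if user not in seeds:
            seeds.append(user)
        if len(seeds) == n_seeds:
            break
    return seeds


# Toy data: three sporadic early uses, a long gap, then a burst of five uses.
early = [(datetime(2013, 1, 1) + timedelta(days=d), u) for d, u in [(0, "x"), (1, "y"), (2, "z")]]
burst = [(datetime(2013, 4, 1) + timedelta(days=d), f"u{d}") for d in range(5)]
seeds = initial_adopters(early + burst, n_seeds=3, min_uses=5)
```

In the toy data, the early sporadic uses are discarded because they are followed by more than a month of inactivity, so the seeds are drawn from the later burst.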

3.3.4 Hashtag Identity

Each hashtag signals an identity, determined by the composition of its initial adopters. Initial adopters who are more strongly aligned with a particular identity are more likely to coin hashtags that signal that identity (Agha, 2005). Accordingly, if the median initial adopter is sufficiently extreme in any given register of identity (in the top 25th percentile of that identity, using the threshold from Ananthasubramaniam et al., 2024), the hashtag signals that identity.
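One way to read this rule is sketched below: for each identity, take the median value among the initial adopters and compare it with a population-level cutoff. The cutoffs are passed in by the caller because their exact computation follows the original paper and is not reproduced here; the identity names, values, and function name are purely illustrative.

```python
from statistics import median


def hashtag_identities(seed_vectors, cutoffs):
    """Identities signaled by a hashtag, given its initial adopters.

    seed_vectors -> list of dicts mapping identity name to a value in [0, 1],
                    one dict per initial adopter
    cutoffs      -> dict mapping identity name to the population-level threshold
                    (assumed to be precomputed, e.g., a percentile over all agents)
    """
    return {
        name
        for name, cutoff in cutoffs.items()
        if median(vector[name] for vector in seed_vectors) >= cutoff
    }


# Hypothetical seeds and cutoffs, for illustration only.
seed_vectors = [
    {"identity_a": 0.9, "identity_b": 0.10},
    {"identity_a": 0.7, "identity_b": 0.20},
    {"identity_a": 0.8, "identity_b": 0.05},
]
signaled = hashtag_identities(seed_vectors, {"identity_a": 0.6, "identity_b": 0.3})
```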

4.   Evaluating Simulated Cascades

When comparing empirical and simulated adoption, researchers often choose to focus on reproducing certain desired properties of a cascade rather than predicting exactly which individuals will adopt the focal behavior, because there is a high degree of stochasticity in adoption decisions (Hofman et al., 2017; Li et al., 2021; Zhou et al., 2021). However, the properties used in the literature vary widely and are often uncorrelated in their performance (common metrics include measures like cascade size, growth, properties of the adopter subgraph, and virality). In order to comprehensively study the effects of network and identity on the diffusion of hashtags on Twitter, we develop a framework to analyze a model’s ability to reproduce ten different properties of cascades, related to a cascade’s popularity, growth, and adopter composition. This requires evaluating models across all ten measures and then combining the ten evaluation scores into a composite Cascade Match Index (CMI) to measure the overall performance across the ten measures. To enable error analysis, we do not compare the distribution of properties over all trials; instead, we calculate the CMI score for each pair of simulated and empirical cascades and then average errors over all simulations.

For each of the ten metrics, we explain 1) what property of the hashtag is being measured and 2) how comparisons between pairs of simulated and empirical cascades are made.

4.1 Popularity

Cascades are often modeled with the goal of understanding the dynamics underlying popularity (Kupavskii et al., 2012; Zhou et al., 2021). More popular hashtags experience high levels of adoption or adoption in parts of the social network that are very distant from the initial adopters, increasing the influence they have on popular culture.

M1: Level of Usage

One of the most common metrics used to measure the popularity of a new behavior is simply how often the behavior is used. M1 calculates the number of times a hashtag is used in each cascade, including repeated usage by a user. Comparing simulated and empirical usage requires a measure that operates on a logarithmic rather than a linear scale (e.g., not relative error), because the level of usage could span several orders of magnitude. For instance, if the empirical cascade had 1,000 uses in the Decahose sample (or an expected 10,000 uses on all of Twitter), simulation 1 had 5,000 uses, and simulation 2 had 20,000 uses, a measure like relative error would show that simulation 1 has smaller error than simulation 2 ($|10{,}000 - 5{,}000|$ vs. $|10{,}000 - 20{,}000|$); however, since cascades often grow exponentially (Zhang et al., 2016b), it would be better for both to have the same magnitude of error, since one is half as big and the other is twice as big as the empirical cascade. Therefore, we compare the ratio of simulated to empirical usage on a logarithmic scale, $\left|\log\left(\frac{M1_{sim}}{10 \cdot M1_{emp}}\right)\right|$, henceforth referred to as the log-ratio error. We compare $M1_{sim}$ to $10 \cdot M1_{emp}$ because the empirical cascades are drawn from a 10% sample of Twitter and, therefore, we expect $M1_{emp}$ to be 10 times larger on all of Twitter.
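The worked example in this paragraph can be reproduced with a short Python sketch of the log-ratio error. The function name and the `scale` parameter are illustrative; `scale=10` reflects the 10% sampling described above, and the natural logarithm is assumed since the base is not stated.

```python
import math


def log_ratio_error(simulated, empirical, scale=10):
    """|log(simulated / (scale * empirical))|, with scale=10 for a 10% sample."""
    return abs(math.log(simulated / (scale * empirical)))


# Empirical cascade: 1,000 uses in the sample (an expected 10,000 overall).
error_half = log_ratio_error(5_000, 1_000)     # simulation is half the expected size
error_double = log_ratio_error(20_000, 1_000)  # simulation is twice the expected size
```

Both simulations receive the same error (the logarithm of 2), whereas a difference-based measure would penalize the larger simulation twice as heavily.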

M2: Number of Adopters

In addition to the level of usage, another popular way of measuring popularity is the number of unique adopters in a cascade. M2 looks at the number of unique users in the downsampled cascade who adopted each hashtag. Unlike M1, M2 does not consider repeated usage and may be much lower than M1 when a cascade experiences a high volume of usage by a small group of users (e.g., for niche cascades that are really popular among a small group of users); however, in many cases, M1 and M2 are likely to be correlated. Since, like M1, the number of adopters also scales exponentially, comparisons between empirical and simulated cascades are made using the log-ratio error.

M3: Structural Virality

Another way of measuring the popularity of a hashtag is to assess how deeply the hashtag has permeated the network (Goel et al., 2016b). Structural virality measures exactly this. When initial adopters are not known, structural virality is operationalized as the mean distance between all pairs of adopters (the Wiener index). However, as initial adopters are known in our models, structural virality is defined as the average distance between each adopter and the nearest seed node. Unlike M1 and M2, path lengths in a network vary in a smaller range (e.g., prior work has found that paths are usually between 3 and 12 hops long (Leskovec and Horvitz, 2008)). Therefore, comparisons between the structural virality of each simulated and empirical cascade are made using relative error with respect to the empirical cascade, $\frac{|M3_{sim} - M3_{emp}|}{M3_{emp}}$.
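A minimal sketch of this seed-based definition uses a multi-source breadth-first search, which gives each node its hop distance to the nearest seed. Several details are assumptions made here rather than statements from the text: distances are unweighted hops in the full network, adopters that cannot be reached from any seed are skipped, and the function names are hypothetical.

```python
from collections import deque


def structural_virality(adjacency, seeds, adopters):
    """Mean hop distance from each adopter to its nearest seed node.

    adjacency -> dict mapping each node to an iterable of its neighbors
    """
    # Multi-source BFS: all seeds start at distance 0.
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    # Assumption: adopters unreachable from every seed are ignored.
    reachable = [dist[a] for a in adopters if a in dist]
    return sum(reachable) / len(reachable) if reachable else 0.0


def relative_error(simulated, empirical):
    """|simulated - empirical| / empirical."""
    return abs(simulated - empirical) / empirical


# Toy path graph a - b - c - d with a single seed "a".
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
virality = structural_virality(path, seeds=["a"], adopters=["b", "c", "d"])
```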

4.2 Growth

In order to understand how hashtags become viral, many studies look not just at the popularity of a hashtag but also at how its adoption shifts over time (Chang, 2010; Sarkar et al., 2017). There are a number of commonly studied properties of cascades that measure how they grow.

M4: Shape of Adoption Curve

The shape of a hashtag’s adoption curve (or the number of uses over time) is indicative of different mechanisms that may promote or inhibit a cascade’s growth (Pemberton, 1936; Sarkar et al., 2017). M4 is modeled by splitting both the simulated and empirical time series into $T$ evenly-spaced intervals, where $T$ is the smaller of a) the number of timesteps in the simulation and b) the number of hours in the empirical cascade. To make the empirical curve comparable to the simulated curve, we first truncate the adoption curve’s right tail once adoption levels remain low for a sustained period of time, to match the simulation’s stopping criteria. We compare the empirical and simulated curves using the dynamic time warping (DTW) distance between them.
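For readers unfamiliar with DTW, the following sketch implements the textbook dynamic-programming version on two numeric series. The absolute difference as the pointwise cost and the absence of a warping window or normalization are assumptions; the text does not state which DTW variant was used.

```python
def dtw_distance(x, y):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # x advances, y stays on the same point
                cost[i][j - 1],      # y advances, x stays on the same point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two toy adoption curves with the same shape, one delayed by a timestep.
simulated_curve = [0, 1, 2, 1, 0]
empirical_curve = [0, 0, 1, 2, 1, 0]
distance = dtw_distance(simulated_curve, empirical_curve)
```

Because DTW may stretch one series against the other, the delayed copy aligns perfectly with the original, which makes the measure sensitive to the shape of the curve rather than its exact timing.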

M5: Usage per Adopter

Hashtags where users tend to use the hashtag more often have different growth patterns than hashtags where each adopter uses the hashtag fewer times (Ellis, 2002). M5 calculates the average number of times each adopter used the hashtag. Comparisons between simulated and empirical cascades are made with the relative error.

M6: Edge Density

The structure of the adopter subgraph of the network often reflects how a cascade grows and spreads through the network (Aiello u. a., 2012). M6 is operationalized as the number of edges, or edge density, within the adopter subgraph. (Another commonly studied property of the adopter subgraph is the number of connected components. We chose not to use it because the corresponding error was reasonably correlated with edge density, so the two did not seem like sufficiently different measures; additionally, unlike edge density, the connected components often change dramatically after downsampling.) Since the adopter subgraph can be very sparse or very dense and these scenarios change the number of edges by several orders of magnitude, the empirical and simulated edge densities are compared using the log-ratio error.
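A minimal sketch of M6 follows. It assumes an undirected edge list in which each edge appears once, and it assumes that the log-ratio error is the absolute base-10 logarithm of the ratio of simulated to empirical values; the base of the logarithm and the handling of zero counts are assumptions of this sketch, not details taken from the text.

```python
import math


def adopter_subgraph_edges(edges, adopters):
    """Count edges whose two endpoints are both adopters."""
    adopters = set(adopters)
    return sum(1 for u, v in edges if u in adopters and v in adopters)


def log_ratio_error(simulated, empirical):
    """|log10(simulated / empirical)|; both values must be positive in this sketch."""
    if simulated <= 0 or empirical <= 0:
        raise ValueError("log-ratio error is undefined for non-positive values")
    return abs(math.log10(simulated / empirical))
```

On this scale, a simulated subgraph with 100 times too many edges and one with 100 times too few both receive an error of 2, which is why a log ratio suits quantities spanning several orders of magnitude.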

M7: Growth Predictivity

In many cases, it is useful to be able to predict how big a cascade will become based on a small set of initial adopters (Kupavskii u. a., 2012; Cheng u. a., 2014; Li u. a., 2017). In order to test how well each model achieves this task, we attempt to predict the size of each empirical cascade based on the characteristics of the first 100 adopters in each simulation using a multi-layer perceptron regression with 100 hidden layers, an Adam optimizer, and ReLU activation. Predictors include a set of 711 attributes from Cheng u. a. (2014) that are not directly used by our models: the timestep at which each of the first 100 adopters used the hashtag; the degree of each adopter in the full network and adopter subgraph (note that the identity-only model preserves degrees of each agent); and the age and gender of each adopter, inferred using Wang u. a. (2019)’s demographic inference algorithm, etc. Simulated and empirical cascades are compared using the relative error of the predicted cascade size.

4.3 Adopters

In addition to modeling popularity and growth, there has been significant research on how certain subpopulations come to adopt new culture (Anderson u. a., 2015; Cheng u. a., 2016). We identify a set of three measures of how well a simulated cascade reproduces the composition of adopters.

M8: Demographic Similarity

New culture is often adopted in demographically (e.g., racially, socioeconomically, linguistically) homogenous groups. This may occur when the cultural artifact is explicitly signaling an affiliation with the demographic identity (e.g., #strugglesofbeingblack) or when the artifact does not explicitly acknowledge an identity but ends up being used more in one group than another by convention (e.g., Democrats use more swear words online (Sylwester und Purver, 2015)) (Eckert, 2008; Sylwester und Purver, 2015; Abitbol u. a., 2018; Stewart u. a., 2018). We compare the distribution of demographic attributes from Section 3.2.1 in adopters from empirical and simulated cascades. Since there are many demographic attributes, we construct a one-dimensional measure of these attributes using a propensity score. This propensity score is the predicted probability obtained from a logistic regression of a binary variable, indicating whether the user is from the simulated or empirical cascade, on the demographic attributes. This propensity score has two important properties: 1) users that are adopters in both cascades will not factor into the construction of the propensity score since they are represented as both 1’s and 0’s in the logistic regression; and 2) if the empirical and simulated adopters have similar demographic distributions, the propensity scores of adopters in the empirical cascade will have a similar distribution as the propensity scores of adopters in the simulated cascade (Rosenbaum und Rubin, 1983). The difference in demographics between simulated and empirical cascades is measured using the Kullback–Leibler (KL) divergence between the distributions of the empirical adopters’ and simulated adopters’ propensity scores.
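The propensity-score comparison can be sketched as below. This is an illustration under stated assumptions rather than the procedure used in this paper: the logistic regression is fit with plain batch gradient descent, the score distributions are discretized into ten equal-width bins with a small smoothing constant, and the direction of the KL divergence (empirical relative to simulated) is chosen arbitrarily. All function names and defaults are hypothetical.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Logistic regression (bias + weights) via batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = propensity(w, xi)
            grad[0] += p - yi
            for j, xj in enumerate(xi):
                grad[j + 1] += (p - yi) * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def propensity(w, x):
    """Predicted probability that a user with attributes x is an empirical adopter."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def kl_divergence(p_scores, q_scores, bins=10, eps=1e-6):
    """KL(P || Q) between binned propensity scores in [0, 1]."""
    def hist(scores):
        counts = [eps] * bins  # smoothing keeps every bin non-zero
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        total = sum(counts)
        return [c / total for c in counts]

    p, q = hist(p_scores), hist(q_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the two adopter sets have identical attributes, the fitted weights stay at zero, every propensity score equals 0.5, and the divergence is zero, which mirrors the second property described above.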

M9: Geographic Similarity

Another property of interest is whether a model can reproduce where adopters of a hashtag are located in the U.S.A. (Eisenstein u. a., 2012; Huang u. a., 2016). The location of adopters is modeled as a smoothed county-level distribution of the fraction of users in the county who adopted the hashtag. Geographic similarity is measured as Lee’s $L$ spatial correlation (Lee, 2001) between the spatial distributions of empirical and simulated usage (cf. Ananthasubramaniam u. a., 2024).

M10: Network Property Similarity

Another property of cascades is the position of adopters within the network (e.g., the communities they belong to, their centrality) (Watts, 2002; Jalili und Perc, 2017). We calculate each user’s position in the network along four relatively low-correlated (Pearson’s $R < 0.5$) network properties, including PageRank, eigencentrality, transitivity, and community membership (using the Louvain community detection algorithm (Blondel u. a., 2008)). Similar to M8, we represent the adopters’ network positions using a propensity score, and compare the distribution of empirical adopters’ and simulated adopters’ propensity scores using KL divergence.

4.4 Composite Metric

In order to evaluate how well each model reproduces properties M1 through M10, we construct a composite Cascade Match Index (cmi) encompassing all ten metrics. The cmi is defined as the normalized similarity between simulated and empirical cascades, averaged over all ten metrics. See Appendix B for details. The ten measures comprising the cmi are overall poorly correlated with each other (Figure S1), suggesting that M1-M10 do, in fact, measure distinct properties of the cascade and are not redundant.

5.   Network and Identity Model Different Attributes of a Cascade

Now we use the methods from the prior sections to test our main hypothesis: That network and identity together better predict properties of cascades compared to network or identity alone.

5.1 Experimental Setup

To test our hypothesis, we simulate hashtag cascades using the Network+Identity, Network-only, and Identity-only models, and determine which one best reproduces properties of the empirical cascades. For each of the 1,337 hashtags and three models, we 1) seed the model at the hashtag’s initial adopters, 2) fit the stickiness parameter, 3) run five simulations at this parameter, and 4) compare properties of the simulated and empirical cascades. Then we construct the cmi and compare values across the three models.

5.2 Results

Figure 1 panels: (a) Network+Identity vs. Network-only vs. Identity-only; (b) Performance of Customized Models.
Figure 1: a) The Network+Identity model outperforms the Network-only and Identity-only baselines (this color scheme is used throughout the paper) and b) a customized model that selects among the three does even better. Models are evaluated on the full cmi and on just the subsets of indices corresponding to popularity, growth, and adopter characteristics. Higher cmi scores correspond to better performance.

Figure 1(a) shows that the Network+Identity model outperforms the Network-only and Identity-only counterfactuals on the composite cmi—suggesting that, on the whole, hashtag cascades are best modeled using a combination of network and identity.

However, our results also suggest that, while models involving both network and identity are most performant overall, there is important heterogeneity in what social factors are required to reproduce different properties of hashtag cascades. Thus, while incorporating network structure and identity into the model leads to the highest overall performance, the network-only or identity-only model may be a better choice for some features of interest.

As shown in Table S1, the Network+Identity model had the top performance on a larger number of individual metrics (5 of 10) than the Network-only (2) or Identity-only (3) models. Notably, however, the Network+Identity model did not have the top performance on all metrics. Overall, the Network-only model tended to perform best on popularity-related metrics; it had the highest score on M2 and M3, as well as a higher score on a composite index of the three popularity-related measures (Figure 1). On the other hand, the Identity-only model tended to perform better on adopter-related metrics, while growth-related metrics were best modeled by a combination of both factors. Moreover, on the whole, the Network+Identity model had the highest score on the cmi in 42% (2,791) of trials, while the Network-only model had the highest score in 30% (1,992) and the Identity-only model in 28% (1,902) of trials.

A possible explanation for the heterogeneity in performance is that different mechanisms are responsible for different properties of cascades. For instance, when we select the model that has the highest score on the cmi for each trial (we’ll call this the optimal customized model), the average score on the cmi improves from 0.06 in the Network+Identity model to 0.27 with the optimal customized model (this is a jump of 0.21 points, in contrast to a jump of 0.11 points between the Network+Identity and Identity-only models) (Figure 1(b), pink vs. dark blue bars); moreover, the performance on popularity, growth, and adopter characteristics each improves in this optimal customized model as well. This suggests that the mechanisms underlying the diffusion of hashtags are likely heterogeneous: most hashtags are best modeled by a combination of network and identity, but some are better modeled by network alone or identity alone. Identifying which of these three mechanisms best applies to the hashtag can lead to significant predictive gains. Since our goal is to produce a unified model that simultaneously reproduces all properties of cascades, one option is to identify conditions under which network and identity are needed, that is, to create a predicted customized model that uses features of the hashtag and early adopters to decide whether to use network or identity or both, instead of an optimal customized model where the model selection is performed post-hoc. We explore this idea in the next section.
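The post-hoc selection behind the optimal customized model amounts to taking, for each trial, the maximum cmi across the three models. The sketch below illustrates this with toy values (not results from this paper); the function names and the list-of-dictionaries layout are assumptions.

```python
from collections import Counter


def optimal_customized(cmi_by_trial):
    """For every trial, pick the model with the highest cmi (post-hoc selection)."""
    winners = [max(trial, key=trial.get) for trial in cmi_by_trial]
    best_scores = [trial[w] for trial, w in zip(cmi_by_trial, winners)]
    return winners, best_scores


def win_shares(winners):
    """Fraction of trials in which each model had the highest cmi."""
    counts = Counter(winners)
    return {model: counts[model] / len(winners) for model in counts}


# Toy values for two trials, for illustration only.
toy_trials = [
    {"Network+Identity": 0.5, "Network-only": 0.25, "Identity-only": -0.25},
    {"Network+Identity": 0.0, "Network-only": 0.75, "Identity-only": 0.25},
]
toy_winners, toy_best = optimal_customized(toy_trials)
```

Because this selection uses the cmi itself, it is an upper bound on what any selection rule based only on features known in advance could achieve.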

6.   Roles of Network and Identity in Different Contexts

The diffusion of hashtags specifically, as well as the process of cultural production more generally, varies across contexts. For instance, hashtags with demographically homogenous initial adopters are more likely to be used to signal identity (Agha, 2005; Smith und Smith, 2012). Additionally, hashtags have different patterns of diffusion depending on their topic or semantic context (Romero u. a., 2011; Lehmann u. a., 2012). The goals of this section are to understand whether information about the hashtag and its initial adopters 1) are associated with model performance and 2) can be used to develop a predicted customized model.

6.1 Experimental Setup

In order to understand the relationship between the context in which each hashtag was coined and the role of network and identity, we run a linear regression to test the association between the cmi and several properties of the hashtag. As shown in Equation 2, we estimate the effect of each covariate $c_i$ on the cmi of each model (e.g., $\beta_1$ estimates the effect of the first covariate on cmi in the Network+Identity model; $\beta_1 + \beta_1^N$ estimates the effect in the Network-only model). Our regression estimates the effect of each property after controlling for all other properties (e.g., the effect of racial similarity in initial adopters is independent of the effect of their geographic proximity, even though these two factors are correlated).

$$CMI \sim \beta_0 + \sum_i \beta_i c_i + \sum_i \beta_i^I c_i * \mathbf{1}_{Id-only} + \sum_i \beta_i^N c_i * \mathbf{1}_{Net-only} \qquad (2)$$
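To show how the interaction terms in Equation 2 are laid out, the sketch below builds one design-matrix row per (hashtag, model) observation and reads off the effect of a covariate under each model. The function names are hypothetical, and, following Equation 2 as written, the model indicators enter only through interactions with the covariates.

```python
MODELS = ("Network+Identity", "Network-only", "Identity-only")


def design_row(covariates, model):
    """Row layout: [intercept, c_1..c_k, c_1..c_k * 1_Id-only, c_1..c_k * 1_Net-only]."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    is_id = 1.0 if model == "Identity-only" else 0.0
    is_net = 1.0 if model == "Network-only" else 0.0
    covariates = list(covariates)
    return ([1.0] + covariates
            + [c * is_id for c in covariates]
            + [c * is_net for c in covariates])


def covariate_effect(beta, beta_id, beta_net, model):
    """Effect of one covariate on cmi: beta_i, beta_i + beta_i^I, or beta_i + beta_i^N."""
    if model == "Identity-only":
        return beta + beta_id
    if model == "Network-only":
        return beta + beta_net
    return beta
```

With this layout the Network+Identity model is the reference category, so the interaction coefficients measure how much a covariate's association with the cmi differs in each single-factor model.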

Covariates are four sets of properties of the hashtag’s context (the distribution of each property is in Figure S1):


Topic.

The topic of a hashtag (e.g., whether it is related to sports, pop culture, or some other subject matter) may be associated with the extent to which network and identity play a role in its diffusion. For instance, prior work has shown that hashtags related to different topics may diffuse at different scales and via different mechanisms (Romero u. a., 2011). Therefore, we include each hashtag’s topic, measured using the model from Antypas u. a. (2022), as a covariate in Equation 2. Appendix C has more details.

Figure 2: Although the Network+Identity model never underperforms the others, the performance of all three models and the relative advantage of the Network+Identity model varies by the topic of the hashtag. Effects are estimated by running a regression, controlling for other variables related to the hashtag’s context.

Communicative Need.

Properties of hashtag cascades may also be attributable to differences in the hashtags’ semantic roles (Lehmann u. a., 2012). For instance, hashtags that are in higher demand (e.g., because they belong to a fast-growing subset of the semantic space) or lower supply (e.g., because there are fewer alternatives to choose among) may have higher levels of communicative need and, therefore, different social factors may be responsible for their spread (Stewart und Eisenstein, 2018; Karjus u. a., 2020). Ryskina u. a. (2020) quantified communicative need using two measures: 1) semantic sparsity, or how many similar hashtags existed in the lexicon when the focal hashtag was introduced (a hashtag in a sparse space may be in higher demand since there are fewer hashtags that can serve the same function); and 2) semantic growth, or the growth in the semantic space over time (a hashtag in a high-growth space may be in higher demand since it serves a purpose of increasing popularity). For instance, a hashtag like #broncosnation (signifying support for the city of Denver’s local football team) has low semantic sparsity, because many cities had similar sports hashtags when it was coined; it also has low semantic growth because, while sports team hashtags are popular, the use of these sorts of hashtags has remained fairly stable over time. Appendix C has details on how these measures are operationalized.


Initial Adopter Identity.

As described in Section 3.3.4, each hashtag’s identity is based on the demographics of the first ten adopters. Since the identities of early adopters may influence the perception of the hashtag (Agha, 2005), and since having more homogenous initial adopters may lead to stronger perceptions, covariates include the mean similarity of initial adopters within each component of identity (location, race, socioeconomic status, languages spoken, political affiliation).

Initial Network Position.

Another factor in a hashtag’s diffusion is where in the network the hashtag is introduced (Watts, 2002; Jalili und Perc, 2017). For instance, more central initial adopters or those belonging to larger communities in the network may be able to spread the hashtag more broadly because of their influence. Therefore, covariates include the median initial adopter’s PageRank, eigencentrality, and how many initial adopters fall into each of the network’s communities.

Figure 3 panels: (a) Location; (b) Race; (c) Languages Spoken; (d) Socioeconomic Status; (e) Political Affiliation.
Figure 3: The comparative advantage of modeling cascades using both network and identity is highest when a) initial adopters are located very close to each other; b) initial adopters have a high degree of racial similarity; and c-e) initial adopters have a moderate degree of linguistic, socioeconomic, and political similarity. Effects are estimated by running a regression, controlling for other variables related to the hashtag’s context.

6.2 Results

Figures 2–5 show the results of the regression model (Equation 2); the $y$-axes plot the predicted cmi for each model, corresponding to different levels of each covariate, conditional on all other covariates. In general, the Network+Identity model performs as well as or better than the other models under all conditions. This suggests that the conclusions from Section 5 (that network and identity better predict cascades together than separately) are robust. In addition, there are three key takeaways about model performance.

First, the Network+Identity model tends to outperform the other models in cases where there is a theoretical expectation that network and identity would each contribute a portion of the underlying diffusion mechanism. For instance, when initial adopters have a high level of racial similarity, the Network+Identity model’s performance improves while other models get worse; this is consistent with the theoretical framework of Sharma (2013), where hashtags used to signal racial identity on Black Twitter diffuse via a mechanism that combines network and identity. Similarly, regional hashtags may require network and identity to constrain adopters to the local area (Schwartz und Halegoua, 2015); consistent with this expectation, the Network+Identity model has its highest performance among hashtags that promote regional culture, including sports hashtags (which often express support for local teams), news hashtags (which are often related to regional events), and hashtags whose initial adopters are located near each other. The Network+Identity model also has the best performance on geographic distribution of adoption, suggesting a connection between this model and the ability to predict geographic localization (Labov, 2007; Ananthasubramaniam u. a., 2024). Similarly, hashtags related to certain topics (sports, film/TV/video, diaries/daily life, and news/social concern) tend to be better modeled by the Network+Identity simulations, and prior work has shown that network and identity contribute to their growth. For instance, identity mediates the spread of sports hashtags on the Twitter network, so only fans of a specific team adopt the hashtag even though the hashtag can still be seen by supporters of rival teams (Smith und Smith, 2012). The other types of hashtags are often used in conversations that involve stance-taking and, in the process, identity signaling (e.g., sharing their opinion on issues of social concern, their favorite TV show, and aspects of daily life) (Evans, 2016).

Second, the Network+Identity model may outperform the Network-only and Identity-only models because hashtags that diffuse via two mechanisms are more likely to become popular than hashtags diffusing via just one (Han u. a., 2020; Hoang und Lim, 2012). For instance, the Network+Identity model outperforms baselines among hashtags in very slow- or very fast-growing areas, but not among hashtags with moderate growth (Figure 4). Similarly, the model has its highest comparative advantage when initial adopters are moderately central. In cases of extreme growth or moderate initial adopter centrality, hashtags that diffuse via multiple mechanisms (network and identity) may be overrepresented in our sample of popular hashtags. These hashtags are also likely to pertain to a smaller

Third, the Network+Identity model often has its strongest comparative advantage when the Network-only and Identity-only models perform well. For instance, all three models perform well when the hashtag is related to topics like sports, film, pop culture, and daily life; or in moderate ranges of covariates like linguistic, socioeconomic, and political identity, centrality, and semantic growth. This suggests that, even when single-variable models have relatively high predictive power, combining multiple social factors can improve performance.

Figure 4 panels: (a) Semantic Sparsity; (b) Semantic Growth.
Figure 4: Hashtags a) that convey a similar meaning as a moderate number of other hashtags and b) whose meaning is not becoming increasingly popular over time tend to have the highest comparative advantage from the Network+Identity model. Effects are estimated by running a regression, controlling for other variables related to the hashtag’s context.

Figure 5 panels: (a) PageRank; (b) Eigencentrality.
Figure 5: The Network+Identity model has its highest comparative advantage when the hashtag’s initial adopters have moderate to high a) PageRank and b) eigencentrality. Effects are estimated by running a regression, controlling for other variables related to the hashtag’s context.

6.3 Selecting Among Models

Since there are associations between the characteristics of the hashtags and the relative performance of the three models, we develop a predicted customized model that uses these characteristics to determine whether network alone, identity alone, or both together would perform best on the cmi. Using the features described in Section 6.1, we trained a random forest classifier to predict whether each hashtag would be best predicted by the Network+Identity, Network-only, or Identity-only model. Predictions were obtained using a repeated 5-fold cross-validation (the model was trained on sets 2-5 and predictions generated for set 1; then trained on sets 1 and 3-5 and predictions generated for set 2; and so on).
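The out-of-fold prediction scheme described above can be sketched as follows. To stay self-contained, the sketch uses a trivial majority-class classifier as a stand-in for the random forest, and it performs a single (not repeated) pass over the folds; the function names, the fold assignment, and the random seed are assumptions of this sketch.

```python
import random
from collections import Counter


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def out_of_fold_predictions(features, labels, fit, predict, k=5, seed=0):
    """Predict each item with a model trained only on the other k-1 folds."""
    preds = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            preds[i] = predict(model, features[i])
    return preds


# Stand-in classifier (NOT a random forest): always predicts the majority label.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    return model
```

The majority-class stand-in plays the same role as the baseline that always selects the Network+Identity model; replacing `fit_majority` and `predict_majority` with a real classifier leaves the fold logic unchanged.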

The random forest classifier weakly outperforms a baseline that always selects the Network+Identity model (0.44 vs. 0.41 accuracy); however, in spite of this, the predicted customized model significantly outperforms the Network+Identity model on the cmi (Figure 1(b), light blue bars), suggesting that the classifier may be picking out examples of hashtags that are “obviously” or “easily” identifiable as being better-modeled by network or identity alone and where the single-variable models are associated with significant predictive improvements over the Network+Identity model. This predicted customized model achieves its gain in performance by better reproducing properties related to popularity (where it equals the Network-only model’s performance) and adopter characteristics, and trading off slightly lower performance on the growth-related measures (Figure 1(b), comparing light and dark blue bars). These results suggest that the initial characteristics of cascades can, in some cases, signal the driving mechanism behind the hashtag’s diffusion and therefore the best model to estimate the cascade.

7.   Discussion

Our work suggests that modeling cultural production and the adoption of cultural innovation requires explicitly incorporating the role of multiple social factors in the process of diffusion. This study examines the role of network and identity in the diffusion of novel hashtags on Twitter. In order to test the roles of network and identity in diffusion, we evaluate whether a model containing network and identity better reproduces properties of each hashtag’s cascade than models containing just network or just identity, and whether this holds across different types of hashtags. The results support our hypothesis from three standpoints. First, the model with both identity and network better reproduces an aggregate of cascade properties than models with identity or network alone. Second, many individual properties are also better modeled with network and identity together. Third, these findings are true across many different types of hashtags (different topics, identities, etc.). These findings are significant because most existing work has focused on the effects of single factors (e.g., network or identity) rather than creating a model that combines multiple social factors to explain the diffusion of behaviors and culture. Our work suggests that there is value in adding extra complexity by modeling multiple interacting factors.

However, our analysis also reveals that there is important heterogeneity in the roles network and identity play in cultural production. For instance, network structure does a worse job modeling the adopter composition of cascades, while identity underperforms at modeling a cascade’s popularity. Additionally, there are several contexts where the network and identity likely offer non-duplicative conditions for diffusion or jointly confer some selective advantage to new hashtags. Under these conditions, it is especially important for models of cascades to combine both factors.

Finally, our analysis has two limitations that can be addressed by future work: First, our model only considered network and identity, but did not integrate other social factors known to influence the spread of innovation (e.g., the type of relationships between users or the perception or planned use of a hashtag). This limitation could be responsible for some heterogeneity in performance (perhaps factors other than network and identity are required to model hashtag cascades and are particularly important for reproducing certain properties or in the extremes of some hashtag characteristics’ parameter space). However, such factors are difficult to model at scale and, thus, were outside the scope of the paper.

Second, our Network+Identity model always used both network and identity rather than selecting which features would work best for each hashtag. Our work was a first step towards developing such a customized model. However, future work could likely improve upon this initial model. In order to facilitate future work, we release a database of the 1,337 hashtags included in this study, which were coined between 2013 and 2022, used frequently, and likely to represent novel cultural production; using a 10% sample of Twitter, we develop a database of each hashtag’s adoption and a rich set of features like the hashtag’s topic, embedding, communicative need, and the identities of adopters. We also release a composite cmi that evaluates the performance of a model on its ability to reproduce frequently-studied properties of cascades: based on a comprehensive literature review, we identify ten frequently-modeled properties of cascades related to their popularity (e.g., cascade size), growth (e.g., shape of the growth curve), and adopter composition (e.g., demographic similarity), and the cmi compares empirical and simulated cascades across all ten properties.


| Metric | Network+Identity (raw / z-score) | Network-only (raw / z-score) | Identity-only (raw / z-score) |
| --- | --- | --- | --- |
| M1. Level of Usage | 3.10 / 0.09 | 3.21 / 0.01 | 3.37 / -0.10 |
| M2. # Adopters | 1.30 / -0.02 | 1.09 / 0.15 | 1.42 / -0.13 |
| M3. Structural Virality | 0.15 / 0.05 | 0.14 / 0.07 | 0.19 / -0.12 |
| M4. Shape of Adoption Curve | 0.12 / 0.05 | 0.13 / -0.16 | 0.11 / 0.12 |
| M5. # Uses per Adopter | 2.18 / 0.14 | 2.49 / -0.11 | 2.39 / -0.02 |
| M6. Adopter Connectedness | 1.83 / 0.29 | 2.08 / 0.11 | 2.80 / -0.40 |
| M7. Growth Predictivity | 1.83 / 0.12 | 2.08 / 0.01 | 2.80 / -0.04 |
| M8. Demographic Difference | 0.98 / -0.31 | 0.62 / 0.10 | 0.52 / 0.22 |
| M9. Geographic Difference | 0.09 / 0.32 | 0.03 / -0.12 | 0.03 / -0.16 |
| M10. Network Difference | 0.31 / 0.04 | 0.33 / 0.16 | 0.26 / -0.20 |
Table S1: The performance of each model on each metric in our cmi. This includes the raw comparison (e.g., log-ratio error, relative error, similarity score) and normalized comparison (z-score) for each measure.
Figure S1 panels: (a) Network+Identity; (b) Network-only; (c) Identity-only.
Figure S1: Correlations between cascade evaluation measures M1–M10 are relatively low, suggesting that these capture distinct properties that can be effectively combined by the cmi.

Appendix A Annotation Prompt

Would the coining of this hashtag be an example of cultural production (Yes/No)? In this case, cultural production is the process of creating and disseminating new, innovative culture. While “culture” is a broad term, our definition excludes hashtags that make reference to entities by their official name (e.g., a person by their full name or stage name, a location, a song title), common phrases, and single words, since those hashtags do not seem innovative. However, the following types of hashtags can and should be considered examples of cultural production, because their existence requires innovative choices and combinations of words: nicknames or fan-created names for entities, slogans, combinations of dictionary words, slogans, and acronyms. Examples of hashtags to say ‘Yes’ to: #goravens, #rio2016, #votefreddie, #blacklivesmatter, #myboyfriendnotallowedto, #incomingfreshmenadvice

Appendix B Constructing the Cascade Match Index

Since M1-M7 are compared using a measure of distance or error (i.e., closer to 0 is better) and M8-M10 are compared using similarity scores, we convert M1 - M7 from difference scores into similarity scores by taking their additive inverse. This means that higher values of the cmi correspond to better fit between empirical and simulated cascades. Additionally, since each measure is on a different scale, we standardize all similarities using a z-score; to facilitate cross-model comparisons, z-scores are calculated across all three models (Network+Identity, Network-only, Identity-only) rather than within each model. Finally, since model parameters are calibrated to the cascade size, and since empirical cascades (which came from the Twitter Decahose) are expected to be 10% the size of simulated cascades, we downsample the larger cascade to match the size of the smaller one for properties M2 - M10 (e.g., if the simulated cascade ends up being 10 times bigger than the empirical cascade, we randomly sample 10% of the simulated cascade and compare that downsampled cascade to the empirical cascade). This downsampling ensures that the comparison between the empirical and simulated cascade is independent of size—e.g., that certain models do not better match properties because they were easier to calibrate to the correct cascade size.
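The construction above can be sketched as follows: distance-type measures are negated, each measure is z-scored over the pooled values of all three models, and the standardized measures are averaged per trial. The nested-dictionary layout, the function name, and the use of the population standard deviation are assumptions of this sketch; the downsampling step is omitted.

```python
from statistics import mean, pstdev


def cascade_match_index(scores, distance_metrics):
    """scores[model][metric] -> list of raw comparison values, one per trial.

    Returns cmi[model] -> list with one composite value per trial.
    """
    models = list(scores)
    metrics = list(scores[models[0]])
    n_trials = len(scores[models[0]][metrics[0]])
    z = {m: {} for m in models}
    for metric in metrics:
        # Distances/errors become similarities via their additive inverse.
        sign = -1.0 if metric in distance_metrics else 1.0
        sims = {m: [sign * v for v in scores[m][metric]] for m in models}
        # z-score across all models pooled together, not within each model.
        pooled = [v for m in models for v in sims[m]]
        mu, sd = mean(pooled), pstdev(pooled)  # assumes sd > 0
        for m in models:
            z[m][metric] = [(v - mu) / sd for v in sims[m]]
    return {m: [mean(z[m][k][t] for k in metrics) for t in range(n_trials)]
            for m in models}
```

Pooling before standardizing is what makes the resulting values comparable across models: a model that is uniformly worse on a measure receives negative z-scores rather than being re-centered at zero.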

Appendix C Hashtag Characteristics

C.1 Topic

We define a hashtag’s topic as the most frequent topic of the tweets it appears in, where tweet topics are inferred using Antypas u. a. (2022)’s supervised multi-label topic classifier. From the original set of 23 topics, we combine categories containing fewer than 50 hashtags into other categories that they most frequently co-occur with (e.g., Learning & Educational with Youth & Student Life), and end up with seven categories: diaries and daily life (379 hashtags, e.g., #relationshipwontworkif, #learnlife, #birthdaybehavior), sports (269 hashtags, e.g., #seahawksnation, #throwupthex, #dunkcity), celebrity and pop culture (213 hashtags, e.g., #freesosa, #beyoncebowl, #kikifollowspree), film/TV/video (154 hashtags, e.g., #iveseeneveryepisodeof, #betterbatmanthanbenaffleck, #doctorwho50th), news and social concern (130 hashtags, e.g., #impeachmentday, #getcovered, #saysomethingliberalin4words), music (103 hashtags, e.g., #lyricsthatmeanalottome, #nameanamazingband, #flawlessremix), and other hobbies (89 hashtags, e.g., #camsbookclub, #amazoncart, #polyvorestyle).

C.2 Semantic Sparsity and Growth

Semantic sparsity and growth are measured as follows. Each hashtag’s 250-dimensional embedding is constructed by training the word2vec algorithm over a window of 5 tokens and 800 epochs; in order to ensure that the hashtags in our study have high enough token frequency to be included in the final model, word2vec was trained on all tweets containing the 1,337 hashtags in our sample and a random sample of 20 million other tweets containing hashtags in our Twitter Decahose sample. Using the resulting word embeddings, semantic sparsity is the number of hashtags that were used in similar contexts at the time when the hashtag was coined, representing the supply of similar hashtags; two hashtags are considered similar if the cosine similarity of their embeddings is at least 0.3 (this threshold is slightly lower than the threshold of 0.35 used in the original paper, so that more hashtags have neighbors). Semantic growth is the Spearman rank correlation between the frequency of all tokens that are similar to the hashtag and the month (where a correlation of 1 means that words that are similar to the hashtag are becoming more popular over time, and 0 means the hashtag is used in contexts of static popularity).
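Given embeddings, the two measures reduce to a thresholded neighbor count and a rank correlation. The sketch below illustrates both; it does not train word2vec, it ignores the restriction to hashtags that existed when the focal hashtag was coined, and the function names and the dictionary layout are assumptions of this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def semantic_neighbors(target_vec, other_vecs, threshold=0.3):
    """Names of hashtags whose embedding has cosine similarity >= threshold."""
    return [name for name, vec in other_vecs.items()
            if cosine(target_vec, vec) >= threshold]


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var)
```

Under these assumptions, the neighbor count is `len(semantic_neighbors(...))`, and semantic growth is `spearman(months, neighbor_frequencies)`, where `months` is a list of month indices and `neighbor_frequencies` the monthly usage of the neighboring tokens.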

Figure S1 panels: (a) Topic; (b) Eigencentrality; (c) PageRank; (d) Semantic Popularity; (e) Semantic Growth; (f) Initial Adopter Identity.
Figure S1: The distributions of all hashtag characteristics over the 1,337 hashtags.