-
Shaping History: Advanced Machine Learning Techniques for the Analysis and Dating of Cuneiform Tablets over Three Millennia
Authors:
Danielle Kapon,
Michael Fire,
Shai Gordin
Abstract:
Cuneiform tablets, emerging in ancient Mesopotamia around the late fourth millennium BCE, represent one of humanity's earliest writing systems. Characterized by wedge-shaped marks on clay tablets, these artifacts provided insight into Mesopotamian civilization across various domains. Traditionally, the analysis and dating of these tablets rely on subjective assessment of shape and writing style, leading to uncertainties in pinpointing their exact temporal origins. Recent advances in digitization have revolutionized the study of cuneiform by enhancing accessibility and analytical capabilities. Our research uniquely focuses on the silhouette of tablets as significant indicators of their historical periods, diverging from most studies that concentrate on textual content. Utilizing an unprecedented dataset of over 94,000 images from the Cuneiform Digital Library Initiative collection, we apply deep learning methods to classify cuneiform tablets, covering over 3,000 years of history. By leveraging statistical, computational techniques, and generative modeling through Variational Auto-Encoders (VAEs), we achieve substantial advancements in the automatic classification of these ancient documents, focusing on the tablets' silhouettes as key predictors. Our classification approach begins with a Decision Tree using height-to-width ratios and culminates with a ResNet50 model, achieving a 61% macro F1-score for tablet silhouettes. Moreover, we introduce novel VAE-powered tools to enhance explainability and enable researchers to explore changes in tablet shapes across different eras and genres. This research contributes to document analysis and diplomatics by demonstrating the value of large-scale data analysis combined with statistical methods. These insights offer valuable tools for historians and epigraphists, enriching our understanding of cuneiform tablets and the cultures that produced them.
Submitted 6 June, 2024;
originally announced June 2024.
-
A Novel Method for News Article Event-Based Embedding
Authors:
Koren Ishlach,
Itzhak Ben-David,
Michael Fire,
Lior Rokach
Abstract:
Embedding news articles is a crucial tool for multiple fields, such as media bias detection, identifying fake news, and making news recommendations. However, existing news embedding methods are not optimized to capture the latent context of news events. Most embedding methods rely on full-text information and neglect time-relevant embedding generation. In this paper, we propose a novel lightweight method that optimizes news embedding generation by focusing on entities and themes mentioned in articles and their historical connections to specific events. We suggest a method composed of three stages. First, we process and extract events, entities, and themes from the given news articles. Second, we generate periodic time embeddings for themes and entities by training time-separated GloVe models on current and historical data. Lastly, we concatenate the news embeddings generated by two distinct approaches: Smooth Inverse Frequency (SIF) for article-level vectors and Siamese Neural Networks for embeddings with nuanced event-related information. We leveraged over 850,000 news articles and 1,000,000 events from the GDELT project to test and evaluate our method. We conducted a comparative analysis of different news embedding generation methods for validation. Our experiments demonstrate that our approach can both improve and outperform state-of-the-art methods on shared event detection tasks.
Submitted 2 August, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
Analyzing Key Users' behavior trends in Volunteer-Based Networks
Authors:
Nofar Piterman,
Tamar Makov,
Michael Fire
Abstract:
Online social network usage has increased significantly in the last decade and continues to grow in popularity. Multiple social platforms use volunteers as a central component. The behavior of volunteers in volunteer-based networks has been studied extensively in recent years. Here, we explore the development of volunteer-based social networks, primarily focusing on their key users' behaviors and activities. We developed two novel algorithms: the first reveals key user behavior patterns over time; the second utilizes machine learning methods to generate a forecasting model that can predict the future behavior of key users, including whether they will remain active donors or change their behavior to become mainly recipients, and vice versa. These algorithms allowed us to analyze the factors that significantly influence behavior predictions.
To evaluate our algorithms, we utilized data from over 2.4 million users on a peer-to-peer food-sharing online platform. Using our algorithm, we identified four main types of key user behavior patterns that occur over time. Moreover, we succeeded in forecasting future active donor key users and predicting the key users that would change their behavior to donors, with an accuracy of up to 89.6%. These findings provide valuable insights into the behavior of key users in volunteer-based social networks and pave the way for more effective community building in the future, using the potential of machine learning toward this goal.
Submitted 4 October, 2023;
originally announced October 2023.
-
Short Run Transit Route Planning Decision Support System Using a Deep Learning-Based Weighted Graph
Authors:
Nadav Shalit,
Michael Fire,
Dima Kagan,
Eran Ben-Elia
Abstract:
Public transport routing plays a crucial role in transit network design, ensuring a satisfactory level of service for passengers. However, current routing solutions rely on traditional operational research heuristics, which can be time-consuming to implement and lack the ability to provide quick solutions. Here, we propose a novel deep learning-based methodology for a decision support system that enables public transport (PT) planners to identify short-term route improvements rapidly. By seamlessly adjusting specific sections of routes between two stops during specific times of the day, our method effectively reduces times and enhances PT services. Leveraging diverse data sources such as GTFS and smart card data, we extract features and model the transportation network as a directed graph. Using self-supervision, we train a deep learning model for predicting lateness values for road segments.
These lateness values are then utilized as edge weights in the transportation graph, enabling efficient path searching. Evaluating the method on Tel Aviv, we were able to reduce times on more than 9% of the routes. The improved routes included both intraurban and suburban routes, highlighting the model's versatility. The findings emphasize the potential of our data-driven decision support system to enhance public transport and city logistics, promoting greater efficiency and reliability in PT services.
Submitted 24 August, 2023;
originally announced August 2023.
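The abstract above describes using predicted lateness values as edge weights in a directed transportation graph and then searching for paths. The following is only a minimal sketch of that idea, not the authors' implementation: it assumes non-negative weights, uses plain Dijkstra search from the standard library, and the stop names and lateness values are hypothetical.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra search over a directed graph given as {node: {neighbor: weight}}.

    Assumes every weight is non-negative (an assumption of this sketch;
    real lateness predictions may need shifting or clipping first).
    Returns (path, total_weight), or (None, inf) if target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


# Hypothetical stops with hypothetical predicted lateness per road segment.
predicted_lateness = {
    "A": {"B": 4.0, "C": 1.0},
    "B": {"D": 1.0},
    "C": {"B": 1.5, "D": 5.0},
    "D": {},
}

path, total = shortest_path(predicted_lateness, "A", "D")
print(path, total)
```

In this toy graph the detour through stop C has a lower summed weight than the direct segment from A to B, which is the kind of between-stops adjustment the abstract refers to.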
-
Interruptions detection in video conferences
Authors:
Shmuel Horowitz,
Dima Kagan,
Galit Fuhrmann Alpert,
Michael Fire
Abstract:
In recent years, video conferencing (VC) popularity has skyrocketed for a wide range of activities. As a result, the number of VC users surged sharply. The sharp increase in VC usage has been accompanied by various newly emerging privacy and security challenges. VC meetings became a target for various security attacks, such as Zoombombing. Other VC-related challenges also emerged. For example, during COVID lockdowns, educators had to teach in online environments and struggled to keep students engaged for extended periods. In parallel, the number of available VC videos has grown exponentially. Thus, users and companies are limited in finding abnormal segments in VC meetings within the converging volumes of data. Such abnormal events that affect most meeting participants may be indicators of interesting points in time, including security attacks or other changes in meeting climate, like someone joining a meeting or sharing dramatic content. Here, we present a novel algorithm for detecting abnormal events in VC data. We curated publicly available VC recordings, including meetings with interruptions. We analyzed the videos using our algorithm, extracting time windows where abnormal occurrences were detected. Our algorithm is a pipeline that combines multiple methods in several steps to detect users' faces in each video frame, track face locations during the meeting, and generate vector representations of the facial expression of each face in each frame. Vector representations are used to monitor changes in facial expressions throughout the meeting for each participant. The overall change in meeting climate is quantified using those parameters across all participants and translated into event anomaly detection. This is the first open pipeline for automatically detecting anomalous events in VC meetings. Our model detects abnormal events with 92.3% precision over the collected dataset.
Submitted 25 February, 2023;
originally announced March 2023.
-
Open Framework for Analyzing Public Parliaments Data
Authors:
Shai Berkovitz,
Amit Mazuz,
Michael Fire
Abstract:
Open information of government organizations is a subject that should interest all citizens who care about the functionality of their governments. Large-scale open governmental data open the door to new opportunities for citizens and researchers to monitor their government's activities and to improve its transparency. Over the years, various projects and systems have been processing and analyzing governmental data using open government information. Here, we present the Collecting and Analyzing Parliament Data (CAPD) framework. This novel generic open framework enables the collection and analysis of large-scale public governmental data from multiple sources. We used the framework to collect over 64,000 parliaments' protocols from over 90 committees from three countries. Then, we parsed the collected data and calculated structured features from it. Next, using the calculated features, we utilized anomaly detection and time series analysis to uncover various insights into the committees' activities. We demonstrate that the CAPD framework can be used to identify anomalous meetings and detect dates of events that affect the parliaments' functionality, and help to monitor their activities.
Submitted 21 May, 2023; v1 submitted 2 October, 2022;
originally announced October 2022.
-
Malicious Source Code Detection Using Transformer
Authors:
Chen Tsfaty,
Michael Fire
Abstract:
Using open source code is considered a common practice in modern software development. However, reusing others' code allows bad actors to reach a wide developer community and, hence, the products that rely on it. Those attacks are categorized as supply chain attacks. Recent years saw a growing number of supply chain attacks that leverage open source during software development, relying on the download and installation procedures, whether automatic or manual. Over the years, many approaches have been invented for detecting vulnerable packages. However, it is uncommon to detect malicious code within packages. Those detection approaches can be broadly categorized as analyses that use (dynamic) and do not use (static) code execution. Here, we introduce the Malicious Source code Detection using Transformers (MSDT) algorithm. MSDT is a novel static analysis based on a deep learning method that detects real-world code injection cases in source code packages. In this study, we used MSDT and a dataset with over 600,000 different functions to embed various functions and applied a clustering algorithm to the resulting vectors, detecting the malicious functions by detecting the outliers. We evaluated MSDT's performance by conducting extensive experiments and demonstrated that our algorithm is capable of detecting functions injected with malicious code with precision@k values of up to 0.909.
Submitted 16 September, 2022;
originally announced September 2022.
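The abstract above reports results as precision@k. As a reminder of what that metric measures, here is a minimal sketch using the common definition (the fraction of the top k ranked items that are truly relevant); whether the paper uses exactly this variant is an assumption, and the function identifiers below are hypothetical.

```python
def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k ranked items that belong to the relevant set.

    Uses the common convention of dividing by k even when fewer than
    k items are available.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical outlier ranking (most suspicious first) and ground truth.
ranked_functions = ["f3", "f7", "f1", "f9", "f2"]
malicious = {"f3", "f1", "f4"}

p_at_3 = precision_at_k(ranked_functions, malicious, 3)
p_at_5 = precision_at_k(ranked_functions, malicious, 5)
print(p_at_3, p_at_5)
```

Here two of the top three ranked functions are malicious, and the same two hits are spread over the top five, so precision drops as k grows.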
-
Ethnic Representation Analysis of Commercial Movie Posters
Authors:
Dima Kagan,
Mor Levy,
Michael Fire,
Galit Fuhrmann Alpert
Abstract:
In the last decades, global awareness of the importance of diverse representation has been increasing. The film industry is no exception to the lack of diversity and discrimination toward minorities. Here, we examine ethnic bias in the film industry through commercial posters, the industry's primary advertisement medium for decades. Movie posters are designed to establish the viewer's initial impression. We developed a novel approach for evaluating ethnic bias in the film industry by analyzing nearly 125,000 posters using state-of-the-art deep learning models. Our analysis shows that while ethnic biases still exist, there is a trend of reduction of bias, as seen by several parameters. Particularly in English-speaking movies, the ethnic distribution of characters on posters from the last couple of years is approaching the actual ethnic composition of the US population. An automatic approach to monitor ethnic diversity in the film industry, potentially integrated with financial value, may be of significant use for producers and policymakers.
Submitted 17 July, 2022;
originally announced July 2022.
-
Large-Scale Shill Bidder Detection in E-commerce
Authors:
Michael Fire,
Rami Puzis,
Dima Kagan,
Yuval Elovici
Abstract:
User feedback is one of the most effective methods to build and maintain trust in electronic commerce platforms. Unfortunately, dishonest sellers often bend over backward to manipulate users' feedback or place phony bids in order to increase their own sales and harm competitors. The black market of user feedback, supported by a plethora of shill bidders, prospers on top of legitimate electronic commerce. In this paper, we investigate the ecosystem of shill bidders based on large-scale data by analyzing hundreds of millions of users who performed billions of transactions, and we propose a machine-learning-based method for identifying communities of users that methodically provide dishonest feedback. Our results show that (1) shill bidders can be identified with high precision based on their transaction and feedback statistics; and (2) in contrast to legitimate buyers and sellers, shill bidders form cliques to support each other.
Submitted 21 April, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Co-Membership-based Generic Anomalous Communities Detection
Authors:
Shay Lapid,
Dima Kagan,
Michael Fire
Abstract:
Nowadays, detecting anomalous communities in networks is an essential task in research, as it helps discover insights into community-structured networks. Most of the existing methods leverage either information regarding attributes of vertices or the topological structure of communities. In this study, we introduce the Co-Membership-based Generic Anomalous Communities Detection Algorithm (referred to as CMMAC), a novel and generic method that utilizes the information on vertices' co-membership in multiple communities. CMMAC is domain-free and almost unaffected by communities' sizes and densities. Specifically, we train a classifier to predict the probability of each vertex in a community being a member of the community. We then rank the communities by the aggregated membership probabilities of each community's vertices. The lowest-ranked communities are considered to be anomalous. Furthermore, we present an algorithm for generating a community-structured random network enabling the infusion of anomalous communities to facilitate research in the field. We utilized it to generate two datasets, composed of thousands of labeled anomaly-infused networks, and published them. We experimented extensively on thousands of simulated and real-world networks infused with artificial anomalies. CMMAC outperformed other existing methods in a range of settings. Additionally, we demonstrated that CMMAC can identify abnormal communities in real-world unlabeled networks in different domains, such as Reddit and Wikipedia.
Submitted 30 March, 2022;
originally announced March 2022.
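The ranking step described in the abstract above (aggregate each community's per-vertex membership probabilities, then treat the lowest-ranked communities as anomalous) can be sketched as follows. This is an illustration only, not the CMMAC implementation: the classifier is not shown, the probabilities are hypothetical, and using the mean as the aggregation function is an assumption, since the abstract does not say which aggregation is used.

```python
from statistics import mean


def rank_communities(membership_probs, aggregate=mean):
    """Rank communities from most to least anomalous.

    membership_probs maps community -> {vertex: predicted probability that
    the vertex belongs to that community}. Communities with the lowest
    aggregated probability come first.
    """
    scores = {
        community: aggregate(list(probs.values()))
        for community, probs in membership_probs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical classifier outputs for three communities.
membership_probs = {
    "c1": {"u1": 0.9, "u2": 0.8},
    "c2": {"u3": 0.2, "u4": 0.4},
    "c3": {"u1": 0.7, "u5": 0.6},
}

ranking = rank_communities(membership_probs)
most_anomalous = ranking[0][0]
print(ranking)
```

Passing a different `aggregate` (for example `min` or `statistics.median`) changes how strongly a single unlikely member pulls a community toward the anomalous end of the ranking.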
-
CompanyName2Vec: Company Entity Matching Based on Job Ads
Authors:
Ran Ziv,
Ilan Gronau,
Michael Fire
Abstract:
Entity Matching is an essential part of all real-world systems that take in structured and unstructured data coming from different sources. Typically, no common key is available for connecting records. Massive data cleaning and integration processes require completion before any data analytics or further processing can be performed. Although record linkage is frequently regarded as a somewhat tedious but necessary step, it reveals valuable insights, supports data visualization, and guides further analytic approaches to the data. Here, we focus on organization entity matching. We introduce CompanyName2Vec, a novel algorithm to solve company entity matching (CEM) using a neural network model to learn company name semantics from a job ad corpus, without relying on any information on the matched company besides its name. Based on real-world data, we show that CompanyName2Vec outperforms other evaluated methods and solves the CEM challenge with an average success rate of 89.3%.
Submitted 12 January, 2022;
originally announced January 2022.
-
Automatic Large Scale Detection of Red Palm Weevil Infestation using Aerial and Street View Images
Authors:
Dima Kagan,
Galit Fuhrmann Alpert,
Michael Fire
Abstract:
The spread of the Red Palm Weevil has dramatically affected date growers, homeowners, and governments, forcing them to deal with a constant threat to their palm trees. Early detection of palm tree infestation has been proven to be critical in order to allow treatment that may save trees from irreversible damage, and is most commonly performed by local physical access for individual tree monitoring. Here, we present a novel method for surveillance of Red Palm Weevil-infested palm trees utilizing state-of-the-art deep learning algorithms, with aerial and street-level imagery data. To detect infested palm trees, we analyzed over 100,000 aerial and street-level images, mapping the location of palm trees in urban areas. Using this procedure, we discovered and verified infested palm trees at various locations.
Submitted 9 April, 2021; v1 submitted 6 April, 2021;
originally announced April 2021.
-
Using Data Mining for Infrastructure and Safety Violations Discovery in Cities
Authors:
Doron Laadan,
Eyal Arviv,
Michael Fire
Abstract:
In city planning and maintenance, the ability to quickly identify infrastructure violations - such as missing or misplaced fire hydrants - can be crucial for maintaining a safe city; it can even save lives. In this work, we aim to provide an analysis of such violations, and to demonstrate the potential of data-driven approaches for quickly locating and addressing them. We conduct an analytical study based upon data from the city of Beer-Sheva's public records of fire hydrants, bomb shelters, and other public facilities. The results of our analysis are presented along with an interactive exploration tool, which allows for easy exploration and identification of different facilities around the city that violate regulations.
Submitted 16 July, 2020;
originally announced July 2020.
-
Zooming Into Video Conferencing Privacy and Security Threats
Authors:
Dima Kagan,
Galit Fuhrmann Alpert,
Michael Fire
Abstract:
The COVID-19 pandemic outbreak, with its related social distancing and shelter-in-place measures, has dramatically affected ways in which people communicate with each other, forcing people to find new ways to collaborate, study, celebrate special occasions, and meet with family and friends. One of the most popular solutions that have emerged is the use of video conferencing applications to replace face-to-face meetings with virtual meetings. This resulted in unprecedented growth in the number of video conferencing users. In this study, we explored the privacy risks of attending virtual meetings. We extracted private information from collage images of meeting participants that are publicly posted on the Web. We used image processing, text recognition tools, as well as social network analysis to explore our web crawling curated dataset of over 15,700 collage images, and over 142,000 face images of meeting participants. We demonstrate that video conference users are facing prevalent security and privacy threats. Our results indicate that it is relatively easy to collect thousands of publicly available images of video conference meetings and extract personal information about the participants, including their face images, age, gender, usernames, and sometimes even full names. This type of extracted data can vastly and easily jeopardize people's security and privacy both online and in the real world, affecting not only adults but also more vulnerable segments of society, such as young children and older adults. Finally, we show that cross-referencing facial image data with social network data may put participants at additional privacy risks they may not be aware of and that it is possible to identify users that appear in several video conference meetings, thus providing a potential to maliciously aggregate different sources of information about a target individual.
Submitted 2 July, 2020;
originally announced July 2020.
-
How Does That Sound? Multi-Language SpokenName2Vec Algorithm Using Speech Generation and Deep Learning
Authors:
Aviad Elyashar,
Rami Puzis,
Michael Fire
Abstract:
Searching for information about a specific person is an online activity frequently performed by many users. In most cases, users submit queries containing a name to Web search engines to find what they are looking for. Typically, Web search engines provide just a few accurate results associated with a name-containing query. Currently, most solutions for suggesting synonyms in online search are based on pattern matching and phonetic encoding; however, the performance of such solutions is very often less than optimal. In this paper, we propose SpokenName2Vec, a novel and generic approach that addresses the similar name suggestion problem by utilizing automated speech generation and deep learning to produce spoken name embeddings. These sophisticated and innovative embeddings capture the way people pronounce names in any language and accent. Utilizing name pronunciation can be helpful for both differentiating and detecting names that sound alike but are written differently. The proposed approach was demonstrated on a large-scale dataset consisting of 250,000 forenames and evaluated using a machine learning classifier and 7,399 names with their verified synonyms. The performance of the proposed approach was found to be superior to 10 other algorithms evaluated in this study, including widely used phonetic and string similarity algorithms, and two recently proposed algorithms. The results obtained suggest that the proposed approach could serve as a useful and valuable tool for solving the similar name suggestion problem.
Submitted 21 July, 2020; v1 submitted 24 May, 2020;
originally announced May 2020.
-
A Supervised Machine Learning Model For Imputing Missing Boarding Stops In Smart Card Data
Authors:
Nadav Shalit,
Michael Fire,
Eran Ben-Elia
Abstract:
Public transport has become an essential part of urban existence with increased population densities and environmental awareness. Large quantities of data are currently generated, allowing for more robust methods to understand travel behavior by harvesting smart card usage. However, public transport datasets suffer from data integrity problems; boarding stop information may be missing due to imperfect acquisition processes or inadequate reporting. We developed a supervised machine learning method to impute missing boarding stops based on ordinal classification using GTFS timetable, smart card, and geospatial datasets. A new metric, Pareto Accuracy, is suggested to evaluate algorithms where classes have an ordinal nature. Results are based on a case study in the city of Beer Sheva, Israel, consisting of one month of smart card data. We show that our proposed method is robust to irregular travelers and significantly outperforms well-known imputation methods without the need to mine any additional datasets. Validation on data from another Israeli city using transfer learning shows the presented model is general and context-free. The implications for transportation planning and travel behavior research are further discussed.
Submitted 9 September, 2021; v1 submitted 10 March, 2020;
originally announced March 2020.
-
It Runs in the Family: Searching for Synonyms Using Digitized Family Trees
Authors:
Aviad Elyashar,
Rami Puzis,
Michael Fire
Abstract:
Searching for a person's name is a common online activity. However, Web search engines provide few accurate results to queries containing names. In contrast to a general word, which has only one correct spelling, there are several legitimate spellings of a given name. Today, most techniques used to suggest synonyms in online search are based on pattern matching and phonetic encoding; however, they often perform poorly. As a result, there is a need for an effective tool for improved synonym suggestion. In this paper, we propose a revolutionary approach for tackling the problem of synonym suggestion. Our novel algorithm, GRAFT, utilizes historical data collected from genealogy websites, along with network algorithms. GRAFT is a general algorithm that suggests synonyms using a graph based on names derived from digitized ancestral family trees. Synonyms are extracted from this graph, which is constructed using generic ordering functions that outperform other algorithms that suggest synonyms based on a single dimension, a factor that limits their performance. We evaluated GRAFT's performance on three ground truth datasets of forenames and surnames, including a large-scale online genealogy dataset with over 16 million profiles and more than 700,000 unique forenames and 500,000 surnames. We compared GRAFT's performance at suggesting synonyms to 10 other algorithms, including phonetic encoding, string similarity algorithms, and machine and deep learning algorithms. The results show GRAFT's superiority with respect to both forenames and surnames and demonstrate its use as a tool to improve synonym suggestion.
Submitted 29 January, 2021; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Using Data Science to Understand the Film Industry's Gender Gap
Authors:
Dima Kagan,
Thomas Chesney,
Michael Fire
Abstract:
Data science can offer answers to a wide range of social science questions. Here we turn attention to the portrayal of women in movies, an industry that has a significant influence on society, impacting such aspects of life as self-esteem and career choice. To this end, we fused data from the online movie database IMDb with a dataset of movie dialogue subtitles to create the largest available corpus of movie social networks (15,540 networks). Analyzing this data, we investigated gender bias in on-screen female characters over the past century. We find a trend of improvement in all aspects of women's roles in movies, including a constant rise in the centrality of female characters. There has also been an increase in the number of movies that pass the well-known Bechdel test, a popular, albeit flawed, measure of women in fiction. Here we propose a new and better alternative to this test for evaluating female roles in movies. Our study introduces fresh data, an open-code framework, and novel techniques that present new opportunities in the research and analysis of movies.
Submitted 6 August, 2019; v1 submitted 15 March, 2019;
originally announced March 2019.
-
Over-Optimization of Academic Publishing Metrics: Observing Goodhart's Law in Action
Authors:
Michael Fire,
Carlos Guestrin
Abstract:
The academic publishing world is changing significantly, with ever-growing numbers of publications each year and shifting publishing patterns. However, the metrics used to measure academic success, such as the number of publications, citation number, and impact factor, have not changed for decades. Moreover, recent studies indicate that these metrics have become targets and follow Goodhart's Law, according to which "when a measure becomes a target, it ceases to be a good measure." In this study, we analyzed over 120 million papers to examine how the academic publishing world has evolved over the last century. Our study shows that the validity of citation-based measures is being compromised and their usefulness is lessening. In particular, the number of publications has ceased to be a good metric as a result of longer author lists, shorter papers, and surging publication numbers. Citation-based metrics, such as citation number and h-index, are likewise affected by the flood of papers, self-citations, and lengthy reference lists. Measures such as a journal's impact factor have also ceased to be good metrics due to the soaring numbers of papers that are published in top journals, particularly from the same pool of authors. Moreover, by analyzing properties of over 2600 research fields, we observed that citation-based metrics are not beneficial for comparing researchers in different fields, or even in the same department. Academic publishing has changed considerably; now we need to reconsider how we measure success.
Submitted 20 September, 2018;
originally announced September 2018.
-
The Rise and Fall of Network Stars: Analyzing 2.5 million graphs to reveal how high-degree vertices emerge over time
Authors:
Michael Fire,
Carlos Guestrin
Abstract:
Trends change rapidly in today's world, prompting this key question: What is the mechanism behind the emergence of new trends? By representing real-world dynamic systems as complex networks, the emergence of new trends can be symbolized by vertices that "shine." That is, at a specific time interval in a network's life, certain vertices become increasingly connected to other vertices. This process creates new high-degree vertices, i.e., network stars. Thus, to study trends, we must look at how networks evolve over time and determine how the stars behave. In our research, we constructed the largest publicly available network evolution dataset to date, which contains 38,000 real-world networks and 2.5 million graphs. Then, we performed the first precise wide-scale analysis of the evolution of networks with various scales. Three primary observations resulted: (a) links are most prevalent among vertices that join a network at a similar time; (b) the rate that new vertices join a network is a central factor in molding a network's topology; and (c) the emergence of network stars (high-degree vertices) is correlated with fast-growing networks. We applied our learnings to develop a flexible network-generation model based on large-scale, real-world data. This model gives a better understanding of how stars rise and fall within networks, and is applicable to dynamic systems both in nature and society.
Submitted 13 October, 2018; v1 submitted 20 June, 2017;
originally announced June 2017.
-
Generic Anomalous Vertices Detection Utilizing a Link Prediction Algorithm
Authors:
Dima Kagan,
Yuval Elovici,
Michael Fire
Abstract:
In the past decade, network structures have penetrated nearly every aspect of our lives. The detection of anomalous vertices in these networks has become increasingly important, such as in exposing computer network intruders or identifying fake online reviews. In this study, we present a novel unsupervised two-layered meta-classifier that can detect irregular vertices in complex networks solely by using features extracted from the network topology. Following the reasoning that a vertex with many improbable links has a higher likelihood of being anomalous, we employed our method on 10 networks of various scales, from a network of several dozen students to online social networks with millions of users. In every scenario, we were able to identify anomalous vertices with lower false positive rates and higher AUCs compared to other prevalent methods. Moreover, we demonstrated that the presented algorithm is efficient both in revealing fake users and in disclosing the most influential people in social networks.
Submitted 6 June, 2017; v1 submitted 24 October, 2016;
originally announced October 2016.
-
Time Is of the Essence: Analyzing the Effect of Vertex-Joining Time on Complex Network Evolution
Authors:
Michael Fire,
Carlos Guestrin
Abstract:
Complex networks have non-trivial characteristics and appear in many real-world systems. Such networks are vitally important, but their full underlying dynamics are not completely understood. Utilizing new data sources, however, can unveil the evolution process of these networks.
This study uses the recently published Reddit dataset, containing over 1.65 billion comments, to construct the largest publicly available social network corpus to date. We used this dataset to deeply examine the network evolution process, which resulted in two key observations: First, links are more likely to be created among users who join a network at a similar time. Second, the rate at which new users join a network is a central factor in molding a network's topology; i.e., different user-join patterns create different topological properties.
Based on these observations, we developed the Temporal Preferential Attachment random network generation model. This model produces not only scale-free networks that have relatively high clustering coefficients, but also networks that are sensitive to both the rate and the time at which users join the network. This results in a more accurate and flexible model of how complex networks evolve, one which more closely represents real-world data.
Submitted 25 August, 2016; v1 submitted 24 March, 2016;
originally announced March 2016.
-
Exploring Online Ad Images Using a Deep Convolutional Neural Network Approach
Authors:
Michael Fire,
Jonathan Schler
Abstract:
Online advertising is a huge, rapidly growing advertising market in today's world. One common form of online advertising is using image ads. A decision is made (often in real time) every time a user sees an ad, and the advertiser is eager to determine the best ad to display. Consequently, many algorithms have been developed that calculate the optimal ad to show to the current user at the present time. Typically, these algorithms focus on variations of the ad, optimizing among different properties such as background color, image size, or set of images. However, there is a more fundamental layer. Our study looks at new qualities of ads that can be determined before an ad is shown (rather than online optimization) and defines which ads are most likely to be successful.
We present a set of novel algorithms that utilize deep-learning image processing, machine learning, and graph theory to investigate online advertising and to construct prediction models which can foresee an image ad's success. We evaluated our algorithms on a dataset with over 260,000 ad images, as well as a smaller dataset specifically related to the automotive industry, and we succeeded in constructing regression models for ad image click rate prediction. The obtained results emphasize the great potential of using deep-learning algorithms to effectively and efficiently analyze image ads and to create better and more innovative online ads. Moreover, the algorithms presented in this paper can help predict ad success and can be applied to analyze other large-scale image corpora.
Submitted 2 September, 2015;
originally announced September 2015.
-
Matching Entities Across Online Social Networks
Authors:
Olga Peled,
Michael Fire,
Lior Rokach,
Yuval Elovici
Abstract:
Online Social Networks (OSNs), such as Facebook and Twitter, have become an integral part of our daily lives. There are hundreds of OSNs, each with its own focus in that each offers particular services and functionalities. Recent studies show that many OSN users create several accounts on multiple OSNs using the same or different personal information. Collecting all the available data of an individual from several OSNs and fusing it into a single profile can be useful for many purposes. In this paper, we introduce novel machine learning based methods for solving Entity Resolution (ER), a problem for matching user profiles across multiple OSNs. The presented methods are able to match between two user profiles from two different OSNs based on supervised learning techniques, which use features extracted from each one of the user profiles. By using the extracted features and supervised learning techniques, we developed classifiers which can perform entity matching between two profiles for the following scenarios: (a) matching entities across two OSNs; (b) searching for a user by similar name; and (c) de-anonymizing a user's identity.
The constructed classifiers were tested by using data collected from two popular OSNs, Facebook and Xing. We then evaluated the classifiers' performances using various evaluation measures, such as true and false positive rates, accuracy, and the area under the receiver operating characteristic curve (AUC). The classifiers performed strongly, with an AUC of up to 0.982 and an accuracy of up to 95.9% in identifying user profiles across two OSNs.
Submitted 4 November, 2014; v1 submitted 24 October, 2014;
originally announced October 2014.
-
Quantitative Analysis of Genealogy Using Digitised Family Trees
Authors:
Michael Fire,
Thomas Chesney,
Yuval Elovici
Abstract:
Driven by the popularity of television shows such as Who Do You Think You Are?, many millions of users have uploaded their family tree to web projects such as WikiTree. Analysis of this corpus enables us to investigate genealogy computationally. The study of heritage in the social sciences has led to an increased understanding of ancestry and descent, but such efforts are hampered by difficult-to-access data. Genealogical research is typically a tedious process involving trawling through sources such as birth and death certificates, wills, letters and land deeds. Decades of research have developed and examined hypotheses on population sex ratios, marriage trends, fertility, lifespan, and the frequency of twins and triplets. These can now be tested on vast datasets containing many billions of entries using machine learning tools. Here we survey the use of genealogy data mining using family trees dating back centuries and featuring profiles on nearly 7 million individuals based in over 160 countries. These data are not typically created by trained genealogists and so we verify them with reference to third party censuses. We present results on a range of aspects of population dynamics. Our approach extends the boundaries of genealogy inquiry to precise measurement of underlying human phenomena.
Submitted 30 August, 2014; v1 submitted 24 August, 2014;
originally announced August 2014.
-
Data Mining of Online Genealogy Datasets for Revealing Lifespan Patterns in Human Population
Authors:
Michael Fire,
Yuval Elovici
Abstract:
Online genealogy datasets contain extensive information about millions of people and their past and present family connections. This vast amount of data can assist in identifying various patterns in human population. In this study, we present methods and algorithms which can assist in identifying variations in lifespan distributions of human population in the past centuries, in detecting social and genetic features which correlate with human lifespan, and in constructing predictive models of human lifespan based on various features which can easily be extracted from genealogy datasets.
We have evaluated the presented methods and algorithms on a large online genealogy dataset with over a million profiles and over 9 million connections, all of which were collected from the WikiTree website. Our findings indicate that significant but small positive correlations exist between the parents' lifespan and their children's lifespan. Additionally, we found slightly higher and significant correlations between the lifespans of spouses. We also discovered a very small positive and significant correlation between longevity and reproductive success in males, and a small and significant negative correlation between longevity and reproductive success in females. Moreover, our machine learning algorithms achieved better-than-random classification results in predicting which people who outlive the age of 50 will also outlive the age of 80.
We believe that this study will be the first of many studies which utilize the wealth of data on human populations, existing in online genealogy datasets, to better understand factors which influence human lifespan. Understanding these factors can assist scientists in providing solutions for successful aging.
Submitted 5 January, 2014; v1 submitted 18 November, 2013;
originally announced November 2013.
-
Ethical Considerations when Employing Fake Identities in OSN for Research
Authors:
Yuval Elovici,
Michael Fire,
Amir Herzberg,
Haya Shulman
Abstract:
Online Social Networks (OSNs) have rapidly become a prominent and widely used service, offering a wealth of personal and sensitive information with significant security and privacy implications. Hence, OSNs are also an important - and popular - subject for research. To perform research based on real-life evidence, however, researchers may need to access OSN data, such as texts and files uploaded by users and connections among users. This raises significant ethical problems. Currently, there are no clear ethical guidelines, and researchers may end up (unintentionally) performing ethically questionable research, sometimes even when more ethical research alternatives exist. For example, several studies have employed "fake identities" to collect data from OSNs, but fake identities may be used for attacks and are considered a security issue. Is it legitimate to use fake identities for studying OSNs or for collecting OSN data for research? We present a taxonomy of the ethical challenges facing researchers of OSNs and compare different approaches. We demonstrate how ethical considerations have been taken into account in previous studies that used fake identities. In addition, several possible approaches are offered to reduce or avoid ethical misconduct. We hope this work will stimulate the development and use of ethical practices and methods in the research of online social networks.
Submitted 6 October, 2013;
originally announced October 2013.
-
Facebook Applications' Installation and Removal: A Temporal Analysis
Authors:
Dima Kagan,
Michael Fire,
Aviad Elyashar,
Yuval Elovici
Abstract:
Facebook applications are one of the reasons for Facebook's attractiveness. Unfortunately, numerous users are not aware of the fact that many malicious Facebook applications exist. To educate users, to raise users' awareness and to improve Facebook users' security and privacy, we developed a Firefox add-on that alerts users to the number of installed applications on their Facebook profiles. In this study, we present the temporal analysis of the Facebook applications' installation and removal dataset collected by our add-on. This dataset consists of information from 2,945 users, collected during a period of over a year. We used linear regression to analyze our dataset and discovered a linear connection between the average percentage change of newly installed Facebook applications and the number of days passed since the user initially installed our add-on. Additionally, we found that users who used our Firefox add-on became more aware of their security and privacy, installing on average fewer new applications. Finally, we discovered that on average 86.4% of Facebook users install an additional application every 4.2 days.
Submitted 16 September, 2013;
originally announced September 2013.
-
Online Social Networks: Threats and Solutions
Authors:
Michael Fire,
Roy Goldschmidt,
Yuval Elovici
Abstract:
Many online social network (OSN) users are unaware of the numerous security risks that exist in these networks, including privacy violations, identity theft, and sexual harassment, just to name a few. According to recent studies, OSN users readily expose personal and private details about themselves, such as relationship status, date of birth, school name, email address, phone number, and even home address. This information, if put into the wrong hands, can be used to harm users both in the virtual world and in the real world. These risks become even more severe when the users are children. In this paper we present a thorough review of the different security and privacy risks which threaten the well-being of OSN users in general, and children in particular. In addition, we present an overview of existing solutions that can provide better protection, security, and privacy for OSN users. We also offer simple-to-implement recommendations for OSN users which can improve their security and privacy when using these platforms. Furthermore, we suggest future research directions.
Submitted 23 July, 2014; v1 submitted 15 March, 2013;
originally announced March 2013.
-
Friend or Foe? Fake Profile Identification in Online Social Networks
Authors:
Michael Fire,
Dima Kagan,
Aviad Elyashar,
Yuval Elovici
Abstract:
The amount of personal information unwillingly exposed by users on online social networks is staggering, as shown in recent research. Moreover, recent reports indicate that these networks are infested with tens of millions of fake user profiles, which may jeopardize the users' security and privacy. To identify fake users in such networks and to improve users' security and privacy, we developed the Social Privacy Protector software for Facebook. This software contains three protection layers, which improve user privacy by implementing different methods. The software first identifies a user's friends who might pose a threat and then restricts this "friend's" exposure to the user's personal information. The second layer is an expansion of Facebook's basic privacy settings based on different types of social network usage profiles. The third layer alerts users about the number of installed applications on their Facebook profile, which have access to their private information. An initial version of the Social Privacy Protector software received high media coverage, and more than 3,000 users from more than twenty countries have installed the software, out of which 527 used the software to restrict more than nine thousand friends. In addition, we estimate that more than a hundred users accepted the software's recommendations and removed at least 1,792 Facebook applications from their profiles. By analyzing the unique dataset obtained by the software in combination with machine learning techniques, we developed classifiers that are able to predict which Facebook profiles have high probabilities of being fake and therefore threaten the user's well-being. Moreover, in this study, we present statistics on users' privacy settings and statistics of the number of applications installed on Facebook profiles...
Submitted 15 March, 2013;
originally announced March 2013.
-
Organization Mining Using Online Social Networks
Authors:
Michael Fire,
Rami Puzis,
Yuval Elovici
Abstract:
Mature social networking services are one of the greatest assets of today's organizations. This valuable asset, however, can also be a threat to an organization's confidentiality. Members of social networking websites expose not only their personal information, but also details about the organizations for which they work. In this paper we analyze several commercial organizations by mining data which their employees have exposed on Facebook, LinkedIn, and other publicly available sources. Using a web crawler designed for this purpose, we extract a network of informal social relationships among employees of a given target organization. Our results, obtained using centrality analysis and machine learning techniques applied to the structure of the informal relationships network, show that it is possible to identify leadership roles within the organization solely by this means. It is also possible to gain valuable non-trivial insights on an organization's structure by clustering its social network and gathering publicly available information on the employees within each cluster. Organizations wanting to conceal their internal structure, identity of leaders, location and specialization of branch offices, etc., must enforce strict policies to control the use of social media by their employees.
Submitted 2 September, 2013; v1 submitted 15 March, 2013;
originally announced March 2013.
-
Social Network Based Search for Experts
Authors:
Yehonatan Bitton,
Michael Fire,
Dima Kagan,
Bracha Shapira,
Lior Rokach,
Judit Bar-Ilan
Abstract:
Our system illustrates how information retrieved from social networks can be used for suggesting experts for specific tasks. The system is designed to facilitate the task of finding the appropriate person(s) for a job, as a conference committee member, an advisor, etc. This short description will demonstrate how the system works in the context of the HCIR2012 published tasks.
Submitted 14 December, 2012;
originally announced December 2012.
-
Incremental Learning with Accuracy Prediction of Social and Individual Properties from Mobile-Phone Data
Authors:
Yaniv Altshuler,
Nadav Aharony,
Michael Fire,
Yuval Elovici,
Alex Pentland
Abstract:
Mobile phones are quickly becoming the primary source for social, behavioral, and environmental sensing and data collection. Today's smartphones are equipped with increasingly more sensors and accessible data types that enable the collection of literally dozens of signals related to the phone, its user, and its environment. A great deal of research effort in academia and industry is put into mining this raw data for higher level sense-making, such as understanding user context, inferring social networks, learning individual features, predicting outcomes, and so on. In this work we investigate the properties of learning and inference of real world data collected via mobile phones over time. In particular, we look at the dynamic learning process over time, and how the ability to predict individual parameters and social links is incrementally enhanced with the accumulation of additional data. To do this, we use the Friends and Family dataset, which contains rich data signals gathered from the smartphones of 140 adult members of a young-family residential community for over a year, and is one of the most comprehensive mobile phone datasets gathered in academia to date. We develop several models that predict social and individual properties from sensed mobile phone data, including detection of life-partners, ethnicity, and whether a person is a student or not. Then, for this set of diverse learning tasks, we investigate how the prediction accuracy evolves over time, as new data is collected. Finally, based on gained insights, we propose a method for advance prediction of the maximal learning accuracy possible for the learning task at hand, based on an initial set of measurements. This has practical implications, like informing the design of mobile data collection campaigns, or evaluating analysis strategies.
Submitted 20 November, 2011;
originally announced November 2011.