Spherical Vision Transformer for 360-degree Video Saliency Prediction

Authors: Mert Cokelek, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem

Abstract: The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has increased the importance of 360-degree saliency prediction in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectional videos named SalViT360 that leverages tangent image representations. We introduce a spherical geometry-aware spatiotemporal self-attention mechanism that is capable of effective omnidirectional video understanding. Furthermore, we present a consistency-based unsupervised regularization term for projection-based 360-degree dense-prediction models to reduce artefacts in the predictions that occur after inverse projection. Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.

Submitted 24 August, 2023; originally announced August 2023.

Comments: 12 pages, 4 figures, accepted to BMVC 2023

arXiv:2303.06907 [pdf, other]

ST360IQ: No-Reference Omnidirectional Image Quality Assessment with Spherical Vision Transformers

Authors: Nafiseh Jabbari Tofighi, Mohamed Hedi Elfkir, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem

Abstract: Omnidirectional images, aka 360 images, can deliver immersive and interactive visual experiences. As their popularity has increased dramatically in recent years, evaluating the quality of 360 images has become a problem of interest since it provides insights for capturing, transmitting, and consuming this new medium. However, directly adapting quality assessment methods proposed for standard natural images to omnidirectional data poses certain challenges. These models need to deal with very high-resolution data and implicit distortions due to the spherical form of the images. In this study, we present a method for no-reference 360 image quality assessment. Our proposed ST360IQ model extracts tangent viewports from the salient parts of the input omnidirectional image and employs a vision-transformer-based module that processes saliency-selective patches/tokens and estimates a quality score from each viewport. Then, it aggregates these scores to give a final quality score. Our experiments on two benchmark datasets, namely the OIQA and CVIQ datasets, demonstrate that, compared to the state-of-the-art, our approach predicts the quality of an omnidirectional image in a way that correlates with human-perceived image quality. The code is available at https://github.com/Nafiseh-Tofighi/ST360IQ

Submitted 13 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2103.08467 [pdf, other]

doi 10.3390/e24020211

Ensemble approach for detection of depression using EEG features

Authors: Egils Avots, Klavs Jermakovs, Maie Bachmann, Laura Paeske, Cagri Ozcinar, Gholamreza Anbarjafari

Abstract: Depression is a public health issue which severely affects one's well-being and causes negative social and economic effects for society. To raise awareness of these problems, this publication aims to determine if long-lasting effects of depression can be determined from electroencephalographic (EEG) signals. The article contains an accuracy comparison for SVM, LDA, NB, kNN and D3 binary classifiers which were trained using linear (relative band powers, APV, SASI) and non-linear (HFD, LZC, DFA) EEG features. The age- and gender-matched dataset consisted of 10 healthy subjects and 10 subjects with a depression diagnosis at some point in their lifetime. Several of the proposed feature selection and classifier combinations reached an accuracy of 90%, where all models were evaluated using 10-fold cross validation and averaged over 100 repetitions with random sample permutations.

Submitted 7 March, 2021; originally announced March 2021.

Comments: 8 pages, 2 figures

arXiv:2101.10396 [pdf, other]

Quality Assessment of Super-Resolved Omnidirectional Image Quality Using Tangential Views

Authors: Cagri Ozcinar, Aakanksha Rana

Abstract: Omnidirectional images (ODIs), also known as 360-degree images, enable viewers to explore all directions of a given 360-degree scene from a fixed point. Designing an immersive imaging system with ODIs is challenging, as such systems require very large resolution coverage of the entire 360 viewing space to provide an enhanced quality of experience (QoE). Despite remarkable progress on single image super-resolution (SISR) methods with deep-learning techniques, no study for quality assessment of super-resolved ODIs exists to analyze the quality of such SISR techniques. This paper proposes an objective, full-reference quality assessment framework which studies quality measurement for ODIs generated by GAN-based and CNN-based SISR methods. The quality assessment framework utilizes tangential views to cope with the spherical nature of a given ODI. The generated tangential views are distortion-free and can be efficiently scaled to high-resolution spherical data for SISR quality measurement. We extensively evaluate two state-of-the-art SISR methods using widely used full-reference SISR quality metrics adapted to our designed framework. In addition, our study reveals that most objective metrics show high performance over CNN-based SISR, while subjective tests favor GAN-based architectures.

Submitted 25 January, 2021; originally announced January 2021.

Comments: Paper Accepted at Electronic Imaging

arXiv:2010.12540 [pdf, other]

Comprehensive Empirical Evaluation of Deep Learning Approaches for Session-based Recommendation in E-Commerce

Authors: Mohamed Maher, Perseverance Munga Ngoy, Aleksandrs Rebriks, Cagri Ozcinar, Josue Cuevas, Rajasekhar Sanagavarapu, Gholamreza Anbarjafari

Abstract: Boosting sales of e-commerce services is guaranteed once users find more items matching their interests in a short time. Consequently, recommendation systems have become a crucial part of any successful e-commerce service. Although various recommendation techniques could be used in e-commerce, a considerable amount of attention has been drawn to session-based recommendation systems in recent years. This growing interest is due to the security concerns in collecting personalized user behavior data, especially after the recent general data protection regulations. In this work, we present a comprehensive evaluation of the state-of-the-art deep learning approaches used in session-based recommendation. In session-based recommendation, a recommendation system counts on the sequence of events made by a user within the same session to predict and endorse other items that are more likely to correlate with his/her preferences. Our extensive experiments investigate baseline techniques (e.g., nearest neighbors and pattern mining algorithms) and deep learning approaches (e.g., recurrent neural networks, graph neural networks, and attention-based networks). Our evaluations show that advanced neural-based models and session-based nearest neighbor algorithms outperform the baseline techniques in most of the scenarios. However, we found that these models suffer more in the case of long sessions when there is drift in user interests, and when there is not enough data to model different items correctly during training.
Our study suggests that using hybrid models of different approaches combined with baseline algorithms could lead to substantial results in session-based recommendations based on dataset characteristics. We also discuss the drawbacks of current session-based recommendation algorithms and further open research directions in this field.

Submitted 17 October, 2020; originally announced October 2020.

Comments: 48 pages, 17 figures, journal

arXiv:2008.03195 [pdf, other]

A Study on Visual Perception of Light Field Content

Authors: Ailbhe Gill, Emin Zerman, Cagri Ozcinar, Aljosa Smolic

Abstract: The effective design of visual computing systems depends heavily on the anticipation of visual attention, or saliency. While visual attention is well investigated for conventional 2D images and video, it is nevertheless a very active research area for emerging immersive media. In particular, visual attention of light fields (light rays of a scene captured by a grid of cameras or micro lenses) has only recently become a focus of research. As they may be rendered and consumed in various ways, a primary challenge that arises is the definition of what visual perception of light field content should be. In this work, we present a visual attention study on light field content. We conducted perception experiments displaying them to users in various ways and collected corresponding visual attention data. Our analysis highlights characteristics of user behaviour in light field imaging applications. The light field data set and attention data are provided with this paper.

Submitted 7 August, 2020; originally announced August 2020.

Comments: To appear in Irish Machine Vision and Image Processing (IMVIP) 2020

ACM Class: I.2.10; I.4; I.5

arXiv:2008.01116 [pdf, other]

Sub-Pixel Back-Projection Network For Lightweight Single Image Super-Resolution

Authors: Supratik Banerjee, Cagri Ozcinar, Aakanksha Rana, Aljosa Smolic, Michael Manzke

Abstract: Convolutional neural network (CNN)-based methods have achieved great success for single-image super-resolution (SISR). However, most models attempt to improve reconstruction accuracy while increasing the number of model parameters required. To tackle this problem, in this paper, we study reducing the number of parameters and computational cost of CNN-based SISR methods while maintaining the accuracy of super-resolution reconstruction. To this end, we introduce a novel network architecture for SISR, which strikes a good trade-off between reconstruction quality and low computational complexity. Specifically, we propose an iterative back-projection architecture using sub-pixel convolution instead of deconvolution layers. We evaluate the computational performance and reconstruction accuracy of our proposed model with extensive quantitative and qualitative evaluations. Experimental results reveal that our proposed method uses fewer parameters and reduces the computational cost while maintaining reconstruction accuracy against state-of-the-art SISR methods over four well-known SR benchmark datasets. Code is available at https://github.com/supratikbanerjee/SubPixel-BackProjection_SuperResolution.

Submitted 3 August, 2020; originally announced August 2020.

Comments: To appear in IMVIP 2020

arXiv:1908.06752 [pdf, other]

doi 10.1109/ICASSP.2019.8683318

Towards Generating Ambisonics Using Audio-Visual Cue for Virtual Reality

Authors: Aakanksha Rana, Cagri Ozcinar, Aljoscha Smolic

Abstract: Ambisonics, i.e., full-sphere surround sound, is quintessential with 360-degree visual content to provide a realistic virtual reality (VR) experience. While 360-degree visual content capture gained a tremendous boost recently, the estimation of corresponding spatial sound is still challenging due to the required sound-field microphones or information about the sound-source locations. In this paper, we introduce a novel problem of generating Ambisonics in 360-degree videos using the audio-visual cue. With this aim, firstly, a novel 360-degree audio-visual video dataset of 265 videos is introduced with annotated sound-source locations. Secondly, a pipeline is designed for an automatic Ambisonic estimation problem. Benefiting from the deep learning-based audio-visual feature-embedding and prediction modules, our pipeline estimates the 3D sound-source locations and further uses such locations to encode to the B-format. To benchmark our dataset and pipeline, we additionally propose evaluation criteria to investigate the performance using different 360-degree input representations. Our results demonstrate the efficacy of the proposed pipeline and open up a new area of research in 360-degree audio-visual analysis for future investigations.

Submitted 16 August, 2019; originally announced August 2019.

Comments: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1908.04297 [pdf, other]

Super-resolution of Omnidirectional Images Using Adversarial Learning

Authors: Cagri Ozcinar, Aakanksha Rana, Aljosa Smolic

Abstract: An omnidirectional image (ODI) enables viewers to look in every direction from a fixed point through a head-mounted display providing an immersive experience compared to that of a standard image. Designing immersive virtual reality systems with ODIs is challenging as they require high resolution content. In this paper, we study super-resolution for ODIs and propose an improved generative adversarial network based model which is optimized to handle the artifacts obtained in the spherical observational space. Specifically, we propose to use a fast PatchGAN discriminator, as it needs fewer parameters and improves the super-resolution at a fine scale. We also explore the generative models with adversarial learning by introducing a spherical-content specific loss function, called 360-SS. To train and test the performance of our proposed model we prepare a dataset of 4500 ODIs. Our results demonstrate the efficacy of the proposed method and identify new challenges in ODI super-resolution for future investigations.

Submitted 12 August, 2019; originally announced August 2019.

arXiv:1902.07653 [pdf, other]

On the effect of age perception biases for real age regression

Authors: Julio C. S. Jacques Junior, Cagri Ozcinar, Marina Marjanovic, Xavier Baró, Gholamreza Anbarjafari, Sergio Escalera

Abstract: Automatic age estimation from facial images represents an important task in computer vision. This paper analyses the effect of gender, age, ethnic, makeup and expression attributes of faces as sources of bias to improve deep apparent age prediction. Following recent works where it is shown that apparent age labels benefit real age estimation, rather than direct real to real age regression, our main contribution is the integration, in an end-to-end architecture, of face attributes for apparent age prediction with an additional loss for real age regression. Experimental results on the APPA-REAL dataset indicate the proposed network successfully takes advantage of the adopted attributes to improve both apparent and real age estimation. Our model outperformed a state-of-the-art architecture proposed to separately address apparent and real age regression. Finally, we present preliminary results and discussion of a proof of concept application using the proposed model to regress the apparent age of an individual based on the gender of an external observer.

Submitted 20 February, 2019; originally announced February 2019.

Comments: Accepted in the 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019)

arXiv:1805.03105 [pdf, other]

Optimization of Occlusion-Inducing Depth Pixels in 3-D Video Coding

Authors: Pan Gao, Cagri Ozcinar, Aljosa Smolic

Abstract: The optimization of occlusion-inducing depth pixels in depth map coding has received little attention in the literature, since their associated texture pixels are occluded in the synthesized view and their effect on the synthesized view is considered negligible. However, the occlusion-inducing depth pixels still need to consume the bits to be transmitted, and will induce geometry distortion that inherently exists in the synthesized view. In this paper, we propose an efficient depth map coding scheme specifically for the occlusion-inducing depth pixels by using allowable depth distortions. Firstly, we formulate a problem of minimizing the overall geometry distortion in the occlusion subject to the bit rate constraint, for which the depth distortion is properly adjusted within the set of allowable depth distortions that introduce the same disparity error as the initial depth distortion. Then, we propose a dynamic programming solution to find the optimal depth distortion vector for the occlusion. The proposed algorithm can improve the coding efficiency without alteration of the occlusion order. Simulation results confirm the performance improvement compared to other existing algorithms.

Submitted 8 May, 2018; originally announced May 2018.

arXiv:1801.08863 [pdf, other]

3D Scanning: A Comprehensive Survey

Authors: Morteza Daneshmand, Ahmed Helmi, Egils Avots, Fatemeh Noroozi, Fatih Alisinanoglu, Hasan Sait Arslan, Jelena Gorbova, Rain Eric Haamer, Cagri Ozcinar, Gholamreza Anbarjafari

Abstract: This paper provides an overview of 3D scanning methodologies and technologies proposed in the existing scientific and industrial literature. Throughout the paper, various types of the related techniques are reviewed, which consist, mainly, of close-range, aerial, structure-from-motion and terrestrial photogrammetry, and mobile, terrestrial and airborne laser scanning, as well as time-of-flight, structured-light and phase-comparison methods, along with comparative and combinational studies, the latter being intended to help make a clearer distinction on the relevance and reliability of the possible choices. Moreover, outlier detection and surface fitting procedures are discussed concisely, which are necessary post-processing stages.

Submitted 23 January, 2018; originally announced January 2018.

Comments: 18 pages, 3 figures

arXiv:1711.03362 [pdf, other]

Estimation of optimal encoding ladders for tiled 360° VR video in adaptive streaming systems

Authors: Cagri Ozcinar, Ana De Abreu, Sebastian Knorr, Aljosa Smolic

Abstract: Given the significant industrial growth of demand for virtual reality (VR), 360° video streaming is one of the most important VR applications that require cost-optimal solutions to achieve widespread proliferation of VR technology. Because of its inherent variability of data-intensive content types and its tile-based encoding and streaming, 360° video requires new encoding ladders in adaptive streaming systems to achieve cost-optimal and immersive streaming experiences. In this context, this paper targets both the provider's and client's perspectives and introduces a new content-aware encoding ladder estimation method for tiled 360° VR video in adaptive streaming systems. The proposed method first categorizes a given 360° video using its features of encoding complexity and estimates the visual distortion and resource cost of each bitrate level based on the proposed distortion and resource cost models. An optimal encoding ladder is then formed using the proposed integer linear programming (ILP) algorithm by considering practical constraints. Experimental results of the proposed method are compared with the recommended encoding ladders of professional streaming service providers. Evaluations show that the proposed encoding ladders deliver better results compared to the recommended encoding ladders in terms of objective quality for 360° video, providing optimal encoding ladders using a set of service provider's constraint parameters.

Submitted 9 November, 2017; originally announced November 2017.

Comments: The 19th IEEE International Symposium on Multimedia (ISM 2017), Taichung, Taiwan

Journal ref: The 19th IEEE International Symposium on Multimedia (ISM 2017), Taichung, Taiwan

arXiv:1711.02386 [pdf, other]

Viewport-aware adaptive 360° video streaming using tiles for virtual reality

Authors: Cagri Ozcinar, Ana De Abreu, Aljosa Smolic

Abstract: 360° video is attracting an increasing amount of attention in the context of Virtual Reality (VR). Owing to its very high-resolution requirements, existing professional streaming services for 360° video suffer from severe drawbacks. This paper introduces a novel end-to-end streaming system from encoding to displaying, to transmit 8K resolution 360° video and to provide an enhanced VR experience using Head Mounted Displays (HMDs). The main contributions of the proposed system are tiling, integration of the MPEG-Dynamic Adaptive Streaming over HTTP (DASH) standard, and viewport-aware bitrate level selection. Tiling and adaptive streaming enable the proposed system to deliver very high-resolution 360° video at good visual quality. Further, the proposed viewport-aware bitrate assignment selects an optimum DASH representation for each tile in a viewport-aware manner. The quality performance of the proposed system is verified in simulations with varying network bandwidth using realistic view trajectories recorded from user experiments. Our results show that the proposed streaming system compares favorably with existing methods in terms of PSNR and SSIM inside the viewport.

Submitted 7 November, 2017; originally announced November 2017.

Comments: IEEE International Conference on Image Processing (ICIP) 2017

Journal ref: 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 2017

Showing 1–14 of 14 results for author: Ozcinar, C