Search | arXiv e-print repository

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Authors: LLM-jp, :, Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano , et al. (57 additional authors not shown)

Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its… ▽ More This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2401.08140 [pdf, other]

ProvNeRF: Modeling per Point Provenance in NeRFs as a Stochastic Process

Authors: Kiyohiro Nakayama, Mikaela Angelina Uy, Yang You, Ke Li, Leonidas Guibas

Abstract: Neural radiance fields (NeRFs) have gained popularity across various applications. However, they face challenges in the sparse view setting, lacking sufficient constraints from volume rendering. Reconstructing and understanding a 3D scene from sparse and unconstrained cameras is a long-standing problem in classical computer vision with diverse applications. While recent works have explored NeRFs i… ▽ More Neural radiance fields (NeRFs) have gained popularity across various applications. However, they face challenges in the sparse view setting, lacking sufficient constraints from volume rendering. Reconstructing and understanding a 3D scene from sparse and unconstrained cameras is a long-standing problem in classical computer vision with diverse applications. While recent works have explored NeRFs in sparse, unconstrained view scenarios, their focus has been primarily on enhancing reconstruction and novel view synthesis. Our approach takes a broader perspective by posing the question: "from where has each point been seen?" -- which gates how well we can understand and reconstruct it. In other words, we aim to determine the origin or provenance of each 3D point and its associated information under sparse, unconstrained views. We introduce ProvNeRF, a model that enriches a traditional NeRF representation by incorporating per-point provenance, modeling likely source locations for each point. We achieve this by extending implicit maximum likelihood estimation (IMLE) for stochastic processes. Notably, our method is compatible with any pre-trained NeRF model and the associated training camera poses. We demonstrate that modeling per-point provenance offers several advantages, including uncertainty estimation, criteria-based view selection, and improved novel view synthesis, compared to state-of-the-art methods. Please visit our project page at https://provnerf.github.io △ Less

Submitted 18 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2312.09609 [pdf, other]

Semantic-Aware Transformation-Invariant RoI Align

Authors: Guo-Ye Yang, George Kiyohiro Nakayama, Zi-Kai Xiao, Tai-Jiang Mu, Xiaolei Huang, Shi-Min Hu

Abstract: Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to the use of region of interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robus… ▽ More Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to the use of region of interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robust and accurate. However, previous RoI feature extractors can only extract invariant features under limited transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module to adaptively determine different sampling areas by leveraging the global and local semantic relationship within the RoI. We also propose a Dynamic Feature Sampler which dynamically samples features based on the RoI aspect ratio to enhance the efficiency of SRA, and a new position embedding, \ie Area Embedding, to provide more accurate position information for SRA through an improved sampling area representation. Experiments show that our model significantly outperforms baseline models with slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods. △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2310.20685 [pdf, other]

NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

Authors: Mikaela Angelina Uy, Kiyohiro Nakayama, Guandao Yang, Rahul Krishna Thomas, Leonidas Guibas, Ke Li

Abstract: Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views. Volume rendering requires evaluating an integral along each ray, which is numerically approximated with a finite sum that corresponds to the exact integral along the ray under piecewise constant volume density. As a consequence, the rendered result is unstable w.r.t. the choice of samples along the ray, a phenomenon… ▽ More Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views. Volume rendering requires evaluating an integral along each ray, which is numerically approximated with a finite sum that corresponds to the exact integral along the ray under piecewise constant volume density. As a consequence, the rendered result is unstable w.r.t. the choice of samples along the ray, a phenomenon that we dub quadrature instability. We propose a mathematically principled solution by reformulating the sample-based rendering equation so that it corresponds to the exact integral under piecewise linear volume density. This simultaneously resolves multiple issues: conflicts between samples along different rays, imprecise hierarchical sampling, and non-differentiability of quantiles of ray termination distances w.r.t. model parameters. We demonstrate several benefits over the classical sample-based rendering equation, such as sharper textures, better geometric reconstruction, and stronger depth supervision. Our proposed formulation can be also be used as a drop-in replacement to the volume rendering equation of existing NeRF-based methods. Our project page can be found at pl-nerf.github.io. △ Less

Submitted 19 January, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

Comments: Neurips 2023

arXiv:2310.07376 [pdf, other]

Point Cloud Denoising and Outlier Detection with Local Geometric Structure by Dynamic Graph CNN

Authors: Kosuke Nakayama, Hiroto Fukuta, Hiroshi Watanabe

Abstract: The digitalization of society is rapidly developing toward the realization of the digital twin and metaverse. In particular, point clouds are attracting attention as a media format for 3D space. Point cloud data is contaminated with noise and outliers due to measurement errors. Therefore, denoising and outlier detection are necessary for point cloud processing. Among them, PointCleanNet is an effe… ▽ More The digitalization of society is rapidly developing toward the realization of the digital twin and metaverse. In particular, point clouds are attracting attention as a media format for 3D space. Point cloud data is contaminated with noise and outliers due to measurement errors. Therefore, denoising and outlier detection are necessary for point cloud processing. Among them, PointCleanNet is an effective method for point cloud denoising and outlier detection. However, it does not consider the local geometric structure of the patch. We solve this problem by applying two types of graph convolutional layer designed based on the Dynamic Graph CNN. Experimental results show that the proposed methods outperform the conventional method in AUPR, which indicates outlier detection accuracy, and Chamfer Distance, which indicates denoising accuracy. △ Less

Submitted 21 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE 2023)

arXiv:2305.01921 [pdf, other]

DiffFacto: Controllable Part-Based 3D Point Cloud Generation with Cross Diffusion

Authors: Kiyohiro Nakayama, Mikaela Angelina Uy, Jiahui Huang, Shi-Min Hu, Ke Li, Leonidas J Guibas

Abstract: While the community of 3D point cloud generation has witnessed a big growth in recent years, there still lacks an effective way to enable intuitive user control in the generation process, hence limiting the general utility of such methods. Since an intuitive way of decomposing a shape is through its parts, we propose to tackle the task of controllable part-based point cloud generation. We introduc… ▽ More While the community of 3D point cloud generation has witnessed a big growth in recent years, there still lacks an effective way to enable intuitive user control in the generation process, hence limiting the general utility of such methods. Since an intuitive way of decomposing a shape is through its parts, we propose to tackle the task of controllable part-based point cloud generation. We introduce DiffFacto, a novel probabilistic generative model that learns the distribution of shapes with part-level control. We propose a factorization that models independent part style and part configuration distributions and presents a novel cross-diffusion network that enables us to generate coherent and plausible shapes under our proposed factorization. Experiments show that our method is able to generate novel shapes with multiple axes of control. It achieves state-of-the-art part-level generation quality and generates plausible and coherent shapes while enabling various downstream editing applications such as shape interpolation, mixing, and transformation editing. Project website: https://difffacto.github.io/ △ Less

Submitted 20 August, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

arXiv:2203.04557 [pdf, other]

Neighborhood persistency of the linear optimization relaxation of integer linear optimization

Authors: Kei Kimura, Kotaro Nakayama

Abstract: For an integer linear optimization (ILO) problem, persistency of its linear optimization (LO) relaxation is a property that for every optimal solution of the relaxation that assigns integer values to some variables, there exists an optimal solution of the ILO problem in which these variables retain the same values. Although persistency has been used to develop heuristic, approximation, and fixed-p… ▽ More For an integer linear optimization (ILO) problem, persistency of its linear optimization (LO) relaxation is a property that for every optimal solution of the relaxation that assigns integer values to some variables, there exists an optimal solution of the ILO problem in which these variables retain the same values. Although persistency has been used to develop heuristic, approximation, and fixed-parameter algorithms for special cases of ILO, its applicability remains unknown in the literature. In this paper we reveal a maximal subclass of ILO such that its LO relaxation has persistency. Specifically, we show that the LO relaxation of ILO on unit-two-variable-per-inequality (UTVPI) systems has persistency and is (in a certain sense) maximal among such ILO. Our persistency result generalizes the results of Nemhauser and Trotter, Hochbaum et al., and Fiorini et al. Even more, we propose a stronger property called \emph{neighborhood persistency} and show that the LO relaxation of ILO on UTVPI systems in general has this property. Using this stronger result, we obtain a fixed-parameter algorithm (where the parameter is the solution size) and another proof of two-approximability for ILO on UTVPI systems where objective functions and variables are non-negative. △ Less

Submitted 9 March, 2022; originally announced March 2022.

Comments: 17 pages

arXiv:2004.13846 [pdf, ps, other]

Character-level Japanese Text Generation with Attention Mechanism for Chest Radiography Diagnosis

Authors: Kenya Sakka, Kotaro Nakayama, Nisei Kimura, Taiki Inoue, Yusuke Iwasawa, Ryohei Yamaguchi, Yosimasa Kawazoe, Kazuhiko Ohe, Yutaka Matsuo

Abstract: Chest radiography is a general method for diagnosing a patient's condition and identifying important information; therefore, radiography is used extensively in routine medical practice in various situations, such as emergency medical care and medical checkup. However, a high level of expertise is required to interpret chest radiographs. Thus, medical specialists spend considerable time in diagnosi… ▽ More Chest radiography is a general method for diagnosing a patient's condition and identifying important information; therefore, radiography is used extensively in routine medical practice in various situations, such as emergency medical care and medical checkup. However, a high level of expertise is required to interpret chest radiographs. Thus, medical specialists spend considerable time in diagnosing such huge numbers of radiographs. In order to solve these problems, methods for generating findings have been proposed. However, the study of generating chest radiograph findings has primarily focused on the English language, and to the best of our knowledge, no studies have studied Japanese data on this subject. There are two challenges involved in generating findings in the Japanese language. The first challenge is that word splitting is difficult because the boundaries of Japanese word are not clear. The second challenge is that there are numerous orthographic variants. For deal with these two challenges, we proposed an end-to-end model that generates Japanese findings at the character-level from chest radiographs. In addition, we introduced the attention mechanism to improve not only the accuracy, but also the interpretation ability of the results. We evaluated the proposed method using a public dataset with Japanese findings. The effectiveness of the proposed method was confirmed using the Bilingual Evaluation Understudy score. And, we were confirmed from the generated findings that the proposed method was able to consider the orthographic variants. Furthermore, we confirmed via visual inspection that the attention mechanism captures the features and positional information of radiographs. △ Less

Submitted 8 June, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: 8 pages, 3 figures, 2 tables

arXiv:1909.07452 [pdf, other]

BAFFLE : Blockchain Based Aggregator Free Federated Learning

Authors: Paritosh Ramanan, Kiyoshi Nakayama

Abstract: A key aspect of Federated Learning (FL) is the requirement of a centralized aggregator to maintain and update the global model. However, in many cases orchestrating a centralized aggregator might be infeasible due to numerous operational constraints. In this paper, we introduce BAFFLE, an aggregator free, blockchain driven, FL environment that is inherently decentralized. BAFFLE leverages Smart Co… ▽ More A key aspect of Federated Learning (FL) is the requirement of a centralized aggregator to maintain and update the global model. However, in many cases orchestrating a centralized aggregator might be infeasible due to numerous operational constraints. In this paper, we introduce BAFFLE, an aggregator free, blockchain driven, FL environment that is inherently decentralized. BAFFLE leverages Smart Contracts (SC) to coordinate the round delineation, model aggregation and update tasks in FL. BAFFLE boosts computational performance by decomposing the global parameter space into distinct chunks followed by a score and bid strategy. In order to characterize the performance of BAFFLE, we conduct experiments on a private Ethereum network and use the centralized and aggregator driven methods as our benchmark. We show that BAFFLE significantly reduces the gas costs for FL on the blockchain as compared to a direct adaptation of the aggregator based method. Our results also show that BAFFLE achieves high scalability and computational efficiency while delivering similar accuracy as the benchmark methods. △ Less

Submitted 18 October, 2020; v1 submitted 16 September, 2019; originally announced September 2019.

arXiv:1805.08962 [pdf, ps, other]

Tool Exchangeable Grasp/Assembly Planner

Authors: Kensuke Harada, Kento Nakayama, Weiwei Wan, Kazuyuki Nagata, Natsuki Yamanobe, Ixchel G. Ramirez-Alpizar

Abstract: This paper proposes a novel assembly planner for a manipulator which can simultaneously plan assembly sequence, robot motion, grasping configuration, and exchange of grippers. Our assembly planner assumes multiple grippers and can automatically selects a feasible one to assemble a part. For a given AND/OR graph of an assembly task, we consider generating the assembly graph from which assembly moti… ▽ More This paper proposes a novel assembly planner for a manipulator which can simultaneously plan assembly sequence, robot motion, grasping configuration, and exchange of grippers. Our assembly planner assumes multiple grippers and can automatically selects a feasible one to assemble a part. For a given AND/OR graph of an assembly task, we consider generating the assembly graph from which assembly motion of a robot can be planned. The edges of the assembly graph are composed of three kinds of paths, i.e., transfer/assembly paths, transit paths and tool exchange paths. In this paper, we first explain the proposed method for planning assembly motion sequence including the function of gripper exchange. Finally, the effectiveness of the proposed method is confirmed through some numerical examples and a physical experiment. △ Less

Submitted 23 May, 2018; originally announced May 2018.

Comments: This is to appear Int. Conf. on Intelligent Autonomous Systems

Journal ref: Int. Conf. on Intelligent Autonomous Systems, 2018

arXiv:1801.08702 [pdf, ps, other]

Improving Bi-directional Generation between Different Modalities with Variational Autoencoders

Authors: Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo

Abstract: We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. A major approach to achieve this objective is to train a model that integrates all the information of different modalities into a joint representation and then to generate one modality from the corresponding other modality via this joint rep… ▽ More We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. A major approach to achieve this objective is to train a model that integrates all the information of different modalities into a joint representation and then to generate one modality from the corresponding other modality via this joint representation. We simply applied this approach to variational autoencoders (VAEs), which we call a joint multimodal variational autoencoder (JMVAE). However, we found that when this model attempts to generate a large dimensional modality missing at the input, the joint representation collapses and this modality cannot be generated successfully. Furthermore, we confirmed that this difficulty cannot be resolved even using a known solution. Therefore, in this study, we propose two models to prevent this difficulty: JMVAE-kl and JMVAE-h. Results of our experiments demonstrate that these methods can prevent the difficulty above and that they generate modalities bi-directionally with equal or higher likelihood than conventional VAE methods, which generate in only one direction. Moreover, we confirm that these methods can obtain the joint representation appropriately, so that they can generate various variations of modality by moving over the joint representation or changing the value of another modality. △ Less

Submitted 26 January, 2018; originally announced January 2018.

Comments: Updated version of arXiv:1611.01891

arXiv:1711.06406 [pdf, other]

Predicting Driver Attention in Critical Situations

Authors: Ye Xia, Danqing Zhang, Jinkyu Kim, Ken Nakayama, Karl Zipser, David Whitney

Abstract: Robust driver attention prediction for critical situations is a challenging computer vision problem, yet essential for autonomous driving. Because critical driving moments are so rare, collecting enough data for these situations is difficult with the conventional in-car data collection protocol---tracking eye movements during driving. Here, we first propose a new in-lab driver attention collection… ▽ More Robust driver attention prediction for critical situations is a challenging computer vision problem, yet essential for autonomous driving. Because critical driving moments are so rare, collecting enough data for these situations is difficult with the conventional in-car data collection protocol---tracking eye movements during driving. Here, we first propose a new in-lab driver attention collection protocol and introduce a new driver attention dataset, Berkeley DeepDrive Attention (BDD-A) dataset, which is built upon braking event videos selected from a large-scale, crowd-sourced driving video dataset. We further propose Human Weighted Sampling (HWS) method, which uses human gaze behavior to identify crucial frames of a driving dataset and weights them heavily during model training. With our dataset and HWS, we built a driver attention prediction model that outperforms the state-of-the-art and demonstrates sophisticated behaviors, like attending to crossing pedestrians but not giving false alarms to pedestrians safely walking on the sidewalk. Its prediction results are nearly indistinguishable from ground-truth to humans. Although only being trained with our in-lab attention data, the model also predicts in-car driver attention data of routine driving with state-of-the-art accuracy. This result not only demonstrates the performance of our model but also proves the validity and usefulness of our dataset and data collection protocol. △ Less

Submitted 5 December, 2018; v1 submitted 16 November, 2017; originally announced November 2017.

Comments: ACCV 2018

arXiv:1706.03038 [pdf, other]

Okutama-Action: An Aerial View Video Dataset for Concurrent Human Action Detection

Authors: Mohammadamin Barekatain, Miquel Martí, Hsueh-Fu Shih, Samuel Murray, Kotaro Nakayama, Yutaka Matsuo, Helmut Prendinger

Abstract: Despite significant progress in the development of human action detection datasets and algorithms, no current dataset is representative of real-world aerial view scenarios. We present Okutama-Action, a new video dataset for aerial view concurrent human action detection. It consists of 43 minute-long fully-annotated sequences with 12 action classes. Okutama-Action features many challenges missing i… ▽ More Despite significant progress in the development of human action detection datasets and algorithms, no current dataset is representative of real-world aerial view scenarios. We present Okutama-Action, a new video dataset for aerial view concurrent human action detection. It consists of 43 minute-long fully-annotated sequences with 12 action classes. Okutama-Action features many challenges missing in current datasets, including dynamic transition of actions, significant changes in scale and aspect ratio, abrupt camera movement, as well as multi-labeled actors. As a result, our dataset is more challenging than existing ones, and will help push the field forward to enable real-world applications. △ Less

Submitted 15 June, 2017; v1 submitted 9 June, 2017; originally announced June 2017.

Comments: Computer Vision and Pattern Recognition Workshops (CVPRW), Hawaii, USA, 2017

arXiv:1611.08459 [pdf, other]

Neural Machine Translation with Latent Semantic of Image and Text

Authors: Joji Toyama, Masanori Misono, Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo

Abstract: Although attention-based Neural Machine Translation have achieved great success, attention-mechanism cannot capture the entire meaning of the source sentence because the attention mechanism generates a target word depending heavily on the relevant parts of the source sentence. The report of earlier studies has introduced a latent variable to capture the entire meaning of sentence and achieved impr… ▽ More Although attention-based Neural Machine Translation have achieved great success, attention-mechanism cannot capture the entire meaning of the source sentence because the attention mechanism generates a target word depending heavily on the relevant parts of the source sentence. The report of earlier studies has introduced a latent variable to capture the entire meaning of sentence and achieved improvement on attention-based Neural Machine Translation. We follow this approach and we believe that the capturing meaning of sentence benefits from image information because human beings understand the meaning of language not only from textual information but also from perceptual information such as that gained from vision. As described herein, we propose a neural machine translation model that introduces a continuous latent variable containing an underlying semantic extracted from texts and images. Our model, which can be trained end-to-end, requires image information only when training. Experiments conducted with an English--German translation task show that our model outperforms over the baseline. △ Less

Submitted 25 November, 2016; originally announced November 2016.

arXiv:1611.01891 [pdf, ps, other]

Joint Multimodal Learning with Deep Generative Models

Authors: Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo

Abstract: We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generat… ▽ More We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally. △ Less

Submitted 6 November, 2016; originally announced November 2016.

Showing 1–15 of 15 results for author: Nakayama, K