Showing 1–26 of 26 results for author: Polyak, A

Searching in archive cs.
  1. arXiv:2403.09334  [pdf, other]

    cs.CV

    Video Editing via Factorized Diffusion Distillation

    Authors: Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman

    Abstract: We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing we introduce a new unsupervised distillation procedure,…

    Submitted 24 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  2. arXiv:2311.10089  [pdf, other]

    cs.CV cs.AI cs.LG

    Emu Edit: Precise Image Editing via Recognition and Generation Tasks

    Authors: Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman

    Abstract: Instruction-based image editing holds immense potential for a variety of applications, as it enables users to perform any editing operation using a natural language instruction. However, current models in this domain often struggle with accurately executing user instructions. We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editin…

    Submitted 16 November, 2023; originally announced November 2023.

  3. arXiv:2309.02591  [pdf, other]

    cs.LG cs.CL cs.CV

    Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

    Authors: Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, et al. (2 additional authors not shown)

    Abstract: We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted fr…

    Submitted 5 September, 2023; originally announced September 2023.

  4. arXiv:2305.01569  [pdf, other]

    cs.CV cs.AI

    Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

    Authors: Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy

    Abstract: The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' pref…

    Submitted 23 November, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  5. arXiv:2303.01000  [pdf, other]

    cs.CV cs.AI

    X&Fuse: Fusing Visual Information in Text-to-Image Generation

    Authors: Yuval Kirstain, Omer Levy, Adam Polyak

    Abstract: We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark, gaining a state-of-the-art…

    Submitted 2 March, 2023; originally announced March 2023.

  6. arXiv:2301.11280  [pdf, other]

    cs.CV cs.AI cs.LG

    Text-To-4D Dynamic Scene Generation

    Authors: Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman

    Abstract: We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera locat…

    Submitted 26 January, 2023; originally announced January 2023.

  7. arXiv:2211.01223  [pdf, other]

    cs.SD eess.AS

    Audio Language Modeling using Perceptually-Guided Discrete Representations

    Authors: Felix Kreuk, Yaniv Taigman, Adam Polyak, Jade Copet, Gabriel Synnaeve, Alexandre Défossez, Yossi Adi

    Abstract: In this work, we study the task of Audio Language Modeling, in which we aim at learning probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model, to encode audio to discrete representations. Next, we train a transformer-based causal language model using these representations. At inference time, we perform a…

    Submitted 4 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

  8. arXiv:2209.15352  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    AudioGen: Textually Guided Audio Generation

    Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

    Abstract: We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differe…

    Submitted 5 March, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023

  9. arXiv:2209.14792  [pdf, other]

    cs.CV cs.AI cs.LG

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Authors: Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman

    Abstract: We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V…

    Submitted 29 September, 2022; originally announced September 2022.

  10. arXiv:2204.02849  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    KNN-Diffusion: Image Generation via Large-Scale Retrieval

    Authors: Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, Yaniv Taigman

    Abstract: Recent text-to-image models have achieved impressive results. However, since they require large-scale datasets of text-image pairs, it is impractical to train them on new domains where data is scarce or not labeled. In this work, we propose using large-scale retrieval methods, in particular, efficient k-Nearest-Neighbors (kNN), which offers novel capabilities: (1) training a substantially small an…

    Submitted 2 October, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

  11. arXiv:2203.13131  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

    Authors: Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman

    Abstract: Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unanswered, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mech…

    Submitted 24 March, 2022; originally announced March 2022.

  12. arXiv:2112.05080  [pdf, other]

    cs.CV cs.AI

    Locally Shifted Attention With Early Global Integration

    Authors: Shelly Sheynin, Sagie Benaim, Adam Polyak, Lior Wolf

    Abstract: Recent work has shown the potential of transformers for computer vision applications. An image is first partitioned into patches, which are then used as input tokens for the attention mechanism. Due to the expensive quadratic cost of the attention mechanism, either a large patch size is used, resulting in coarse-grained global interactions, or alternatively, attention is applied only on a local re…

    Submitted 22 December, 2021; v1 submitted 9 December, 2021; originally announced December 2021.

  13. arXiv:2111.07402  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Textless Speech Emotion Conversion using Discrete and Decomposed Representations

    Authors: Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, a…

    Submitted 13 December, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: Paper was published at EMNLP 2022

  14. arXiv:2109.06912  [pdf, other]

    eess.AS cs.CL cs.SD

    fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit

    Authors: Changhan Wang, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Ann Lee, Peng-Jen Chen, Jiatao Gu, Juan Pino

    Abstract: This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable training speech synthesis models with less curated data, a number of preprocessing tools are built and their importance is shown empirically. To facilitate faster iteration of development and analysis,…

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021 Demo

  15. arXiv:2109.03264  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Text-Free Prosody-Aware Generative Spoken Language Modeling

    Authors: Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu

    Abstract: Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-lik…

    Submitted 10 May, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

    Comments: ACL 2022

  16. arXiv:2107.05604  [pdf, other]

    cs.CL cs.LG eess.AS

    Direct speech-to-speech translation with discrete units

    Authors: Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, Wei-Ning Hsu

    Abstract: We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representa…

    Submitted 21 March, 2022; v1 submitted 12 July, 2021; originally announced July 2021.

    Comments: Accepted to ACL 2022 (long paper)

  17. arXiv:2104.00355  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

    Authors: Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and she…

    Submitted 27 July, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: In Proceedings of Interspeech 2021

  18. arXiv:2102.01192  [pdf, other]

    cs.CL

    Generative Spoken Language Modeling from Raw Audio

    Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text u…

    Submitted 9 September, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

  19. arXiv:2102.00429  [pdf, other]

    cs.SD cs.LG eess.AS

    High Fidelity Speech Regeneration with Application to Speech Enhancement

    Authors: Adam Polyak, Lior Wolf, Yossi Adi, Ori Kabeli, Yaniv Taigman

    Abstract: Speech enhancement has seen great improvement in recent years mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, p…

    Submitted 31 January, 2021; originally announced February 2021.

  20. arXiv:2008.02830  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised Cross-Domain Singing Voice Conversion

    Authors: Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman

    Abstract: We present a wav-to-wav generative model for the task of singing voice conversion from any identity. Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator. The proposed generative architecture is invariant to the speaker's identity and can be trained to generate target singers fr…

    Submitted 6 August, 2020; originally announced August 2020.

  21. arXiv:1904.08983  [pdf, other]

    cs.SD cs.LG stat.ML

    TTS Skins: Speaker Conversion via ASR

    Authors: Adam Polyak, Lior Wolf, Yaniv Taigman

    Abstract: We present a fully convolutional wav-to-wav network for converting between speakers' voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition, and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated a…

    Submitted 26 July, 2020; v1 submitted 18 April, 2019; originally announced April 2019.

  22. arXiv:1805.07848  [pdf, other]

    cs.SD cs.AI cs.LG stat.ML

    A Universal Music Translation Network

    Authors: Noam Mor, Lior Wolf, Adam Polyak, Yaniv Taigman

    Abstract: We present a method for translating music across musical instruments, genres, and styles. This method is based on a multi-domain wavenet autoencoder, with a shared encoder and a disentangled latent space that is trained end-to-end on waveforms. Employing a diverse training dataset and large net capacity, the domain-independent encoder allows us to translate even from musical domains that were not…

    Submitted 23 May, 2018; v1 submitted 20 May, 2018; originally announced May 2018.

  23. arXiv:1802.06984  [pdf, other]

    cs.LG cs.SD eess.AS

    Fitting New Speakers Based on a Short Untranscribed Sample

    Authors: Eliya Nachmani, Adam Polyak, Yaniv Taigman, Lior Wolf

    Abstract: Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method that is designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that given an audio sample, place…

    Submitted 20 February, 2018; originally announced February 2018.

  24. arXiv:1707.06588  [pdf, other]

    cs.LG cs.CL cs.SD

    VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

    Authors: Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani

    Abstract: We present a new neural text to speech (TTS) method that is able to transform text to speech in voices that are sampled in the wild. Unlike other systems, our solution is able to deal with unconstrained voice samples and without requiring aligned phonemes or linguistic features. The network architecture is simpler than those in the existing literature and is based on a novel shifting buffer workin…

    Submitted 1 February, 2018; v1 submitted 20 July, 2017; originally announced July 2017.

  25. arXiv:1704.05693  [pdf, other]

    cs.CV cs.LG

    Unsupervised Creation of Parameterized Avatars

    Authors: Lior Wolf, Yaniv Taigman, Adam Polyak

    Abstract: We study the problem of mapping an input image to a tied pair consisting of a vector of parameters and an image that is created using a graphical engine from the vector of parameters. The mapping's objective is to have the output image as similar as possible to the input image. During training, no supervision is given in the form of matching inputs and outputs. This learning problem extends two…

    Submitted 9 July, 2017; v1 submitted 19 April, 2017; originally announced April 2017.

    Comments: v2 -- a change in the references due to a request from authors

  26. arXiv:1611.02200  [pdf, other]

    cs.CV

    Unsupervised Cross-Domain Image Generation

    Authors: Yaniv Taigman, Adam Polyak, Lior Wolf

    Abstract: We study the problem of transferring a sample in one domain to an analog sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given function f, which accepts inputs in either domains, would remain unchanged. Other than the function f, the training data is unsupervised…

    Submitted 7 November, 2016; originally announced November 2016.