
MobileDiffusion: Rapid text-to-image generation on-device


Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advancements in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.

To that end, in “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, we introduce a novel approach with the potential for rapid text-to-image generation on-device. MobileDiffusion is an efficient latent diffusion model specifically designed for mobile devices. We also adopt DiffusionGAN to achieve one-step sampling during inference, which fine-tunes a pre-trained diffusion model while leveraging a GAN to model the denoising step. We tested MobileDiffusion on premium iOS and Android devices, where it can generate a high-quality 512x512 image in half a second. Its comparatively small model size of just 520M parameters makes it uniquely suited for mobile deployment.

Rapid text-to-image generation on-device.


The relative inefficiency of text-to-image diffusion models arises from two primary challenges. First, the inherent design of diffusion models requires iterative denoising to generate images, necessitating multiple evaluations of the model. Second, the complexity of the network architecture in text-to-image diffusion models involves a substantial number of parameters, regularly reaching into the billions and resulting in computationally expensive evaluations. As a result, despite the potential benefits of deploying generative models on mobile devices, such as enhancing user experience and addressing emerging privacy concerns, on-device deployment remains relatively unexplored in the current literature.

The optimization of inference efficiency in text-to-image diffusion models has been an active research area. Previous studies predominantly concentrate on addressing the first challenge, seeking to reduce the number of function evaluations (NFEs). Leveraging advanced numerical solvers (e.g., DPM) or distillation techniques (e.g., progressive distillation, consistency distillation), the number of necessary sampling steps has been significantly reduced from several hundred to single digits. Some recent techniques, like DiffusionGAN and Adversarial Diffusion Distillation, even reduce this to a single step.
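
The NFE count is simply the number of times the sampler calls the denoising network, one call per step. A minimal sketch of a generic iterative sampling loop makes this cost model concrete; the "denoiser" here is a toy stand-in function, not a real diffusion model:

```python
# Sketch: NFE = one denoiser call per sampling step. A DDPM-style
# sampler with 1000 steps costs 1000 full network evaluations, while
# a distilled few-step or one-step sampler costs single digits.

def sample(denoiser, steps):
    """Run a generic iterative sampling loop and count denoiser calls."""
    nfe = 0
    x = 1.0  # stand-in for the initial noise tensor
    for t in range(steps, 0, -1):
        x = denoiser(x, t)  # each step is one full network evaluation
        nfe += 1
    return x, nfe

denoiser = lambda x, t: 0.9 * x  # toy stand-in for the UNet

_, nfe_ddpm = sample(denoiser, 1000)    # classic ancestral sampling
_, nfe_distilled = sample(denoiser, 8)  # after step distillation
_, nfe_one_step = sample(denoiser, 1)   # DiffusionGAN / ADD regime

print(nfe_ddpm, nfe_distilled, nfe_one_step)  # 1000 8 1
```

Since each evaluation runs the full UNet, cutting steps from 1000 to 1 cuts inference cost by three orders of magnitude before any architectural optimization.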

However, on mobile devices, even a small number of evaluation steps can be slow due to the complexity of the model architecture. Thus far, the architectural efficiency of text-to-image diffusion models has received comparatively less attention. A handful of earlier works briefly touch upon this matter, for instance by removing redundant neural network blocks (e.g., SnapFusion). However, these efforts lack a comprehensive analysis of each component within the model architecture, thereby falling short of providing a holistic guide for designing highly efficient architectures.


Effectively overcoming the challenges imposed by the limited computational power of mobile devices requires an in-depth and holistic exploration of the model's architectural efficiency. In pursuit of this objective, our research undertakes a detailed examination of each constituent and computational operation within Stable Diffusion’s UNet architecture. We present a comprehensive guide for crafting highly efficient text-to-image diffusion models, culminating in MobileDiffusion.

The design of MobileDiffusion follows that of latent diffusion models. It contains three components: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, we use CLIP ViT-L/14, which is a small model (125M parameters) suitable for mobile. We then turn our focus to the diffusion UNet and image decoder.

Diffusion UNet

As illustrated in the figure below, diffusion UNets commonly interleave transformer blocks and convolution blocks. We conduct a comprehensive investigation of these two fundamental building blocks. Throughout the study, we control the training pipeline (e.g., data, optimizer) to study the effects of different architectures.

In classic text-to-image diffusion models, a transformer block consists of a self-attention layer (SA) for modeling long-range dependencies among visual features, a cross-attention layer (CA) to capture interactions between text conditioning and visual features, and a feed-forward layer (FF) to post-process the output of attention layers. These transformer blocks hold a pivotal role in text-to-image diffusion models, serving as the primary components responsible for text comprehension. However, they also pose a significant efficiency challenge, given the computational expense of the attention operation, which is quadratic in the sequence length. We follow the idea of the UViT architecture, which places more transformer blocks at the bottleneck of the UNet. This design choice is motivated by the fact that the attention computation is less resource-intensive at the bottleneck due to its lower dimensionality.

Our UNet architecture incorporates more transformers in the middle, and skips self-attention (SA) layers at higher resolutions.
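
The quadratic cost of self-attention is easy to quantify with a back-of-the-envelope model. For a 512x512 image the latent is 64x64, and each UNet downsampling halves the spatial size. The channel widths below are illustrative assumptions (roughly Stable Diffusion-like), not MobileDiffusion's actual configuration:

```python
# Rough cost model for one self-attention layer at each UNet level.
# The QK^T matmul and the attention-weighted-V matmul each cost about
# (h*w)^2 * dim multiply-adds, so cost grows quadratically with the
# number of spatial tokens. Channel widths are illustrative only.

def attention_flops(h, w, dim):
    """Approximate FLOPs of one self-attention layer over an h*w map."""
    seq_len = h * w
    return 2 * seq_len * seq_len * dim

levels = [(64, 320), (32, 640), (16, 1280), (8, 1280)]  # (res, width)
costs = {res: attention_flops(res, res, dim) for res, dim in levels}

for res, flops in costs.items():
    print(f"{res}x{res}: {flops / 1e9:.2f} GFLOPs per attention layer")
```

A self-attention layer at the 64x64 level comes out roughly 1000x more expensive than one at the 8x8 bottleneck, which is why concentrating transformer blocks in the middle and skipping self-attention at high resolutions pays off so well.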

Convolution blocks, in particular ResNet blocks, are deployed at each level of the UNet. While these blocks are instrumental for feature extraction and information flow, the associated computational costs, especially at high-resolution levels, can be substantial. One proven approach in this context is separable convolution. We observed that replacing regular convolution layers with lightweight separable convolution layers in the deeper segments of the UNet yields similar performance.
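
The saving from separable convolution follows directly from parameter counts: a regular 3x3 convolution couples every input channel to every output channel, while a depthwise-separable replacement splits this into a cheap per-channel 3x3 filter plus a 1x1 pointwise mix. A sketch, using an arbitrary example width (not MobileDiffusion's actual channel count):

```python
# Parameter counts for a regular 3x3 convolution vs. a
# depthwise-separable replacement (3x3 depthwise + 1x1 pointwise),
# ignoring biases.

def conv3x3_params(c_in, c_out):
    return 3 * 3 * c_in * c_out

def separable_params(c_in, c_out):
    depthwise = 3 * 3 * c_in   # one 3x3 filter per input channel
    pointwise = c_in * c_out   # 1x1 conv to mix channels
    return depthwise + pointwise

c = 1280  # illustrative width for a deep UNet level
regular = conv3x3_params(c, c)      # 14,745,600 parameters
separable = separable_params(c, c)  #  1,649,920 parameters
print(f"regular: {regular:,}  separable: {separable:,} "
      f"(~{regular / separable:.0f}x fewer parameters)")
```

At this width the separable version uses roughly 9x fewer parameters (and proportionally fewer FLOPs), which is why the substitution is attractive in the deeper, wider segments of the UNet.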

In the figure below, we compare the UNets of several diffusion models. Our MobileDiffusion exhibits superior efficiency in terms of FLOPs (floating-point operations) and number of parameters.

Comparison of some diffusion UNets.

Image decoder

In addition to the UNet, we also optimized the image decoder. We trained a variational autoencoder (VAE) to encode an RGB image into an 8-channel latent variable with 8× smaller spatial dimensions; the decoder maps a latent variable back to an image, upsampling by 8× in each spatial dimension. To further enhance efficiency, we design a lightweight decoder architecture by pruning the width and depth of the original. The resulting lightweight decoder leads to a significant performance boost, with nearly 50% latency improvement and better quality. For more details, please refer to our paper.
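
The encode/decode shape arithmetic described above can be sanity-checked in a few lines, using the figures from the text (8 latent channels, 8× spatial downsampling):

```python
# Shape bookkeeping for the VAE described above: an RGB image is
# encoded to an 8-channel latent at 1/8 the spatial resolution, and
# the decoder maps the latent back up by 8x per spatial dimension.

DOWNSAMPLE = 8
LATENT_CHANNELS = 8

def encode_shape(h, w):
    assert h % DOWNSAMPLE == 0 and w % DOWNSAMPLE == 0
    return (h // DOWNSAMPLE, w // DOWNSAMPLE, LATENT_CHANNELS)

def decode_shape(h, w):
    return (h * DOWNSAMPLE, w * DOWNSAMPLE, 3)

latent = encode_shape(512, 512)           # (64, 64, 8)
image = decode_shape(latent[0], latent[1])  # (512, 512, 3)
print(latent, image)
```

The diffusion UNet therefore only ever operates on 64x64x8 tensors when generating 512x512 images, which is the core efficiency advantage of latent diffusion over pixel-space diffusion.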

VAE reconstruction. Our VAE decoders have better visual quality than SD (Stable Diffusion).

Decoder     #Params (M)   PSNR↑   SSIM↑   LPIPS↓
SD          49.5          26.7    0.76    0.037
Ours        39.3          30.0    0.83    0.032
Ours-Lite    9.8          30.2    0.84    0.032

Quality evaluation of VAE decoders. Our lite decoder is much smaller than SD, with better quality metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).

One-step sampling

In addition to optimizing the model architecture, we adopt a DiffusionGAN hybrid to achieve one-step sampling. Training DiffusionGAN hybrid models for text-to-image generation encounters several intricacies. Notably, the discriminator, a classifier distinguishing real data and generated data, must make judgments based on both texture and semantics. Moreover, the cost of training text-to-image models can be extremely high, particularly in the case of GAN-based models, where the discriminator introduces additional parameters. Purely GAN-based text-to-image models (e.g., StyleGAN-T, GigaGAN) confront similar complexities, resulting in highly intricate and expensive training.

To overcome these challenges, we use a pre-trained diffusion UNet to initialize the generator and discriminator. This design enables seamless initialization with the pre-trained diffusion model. We postulate that the internal features within the diffusion model contain rich information of the intricate interplay between textual and visual data. This initialization strategy significantly streamlines the training.

The figure below illustrates the training procedure. After initialization, a noisy image is sent to the generator for one-step diffusion. The result is evaluated against ground truth with a reconstruction loss, similar to diffusion model training. We then add noise to the output and send it to the discriminator, whose result is evaluated with a GAN loss, effectively adopting the GAN to model a denoising step. By using pre-trained weights to initialize the generator and the discriminator, the training becomes a fine-tuning process, which converges in less than 10K iterations.

Illustration of DiffusionGAN fine-tuning.
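
The training step described above can be sketched in a few lines of NumPy. The generator and discriminator here are tiny linear stand-ins so the sketch stays runnable; in the actual method both are UNets initialized from pre-trained diffusion weights, and all shapes and the equal loss weighting below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear stand-ins for the generator and discriminator (the real
# models are diffusion UNets; a single weight matrix each keeps this
# sketch self-contained).
d = 16
gen_w = rng.normal(size=(d, d)) * 0.1
disc_w = rng.normal(size=d) * 0.1

def generator(x_noisy):
    return x_noisy @ gen_w  # one-step denoising prediction

def discriminator(x):
    # "Realness" logit squashed to (0, 1) with a sigmoid.
    return 1.0 / (1.0 + np.exp(-(x @ disc_w)))

x_real = rng.normal(size=d)            # ground-truth training image
x_noisy = x_real + rng.normal(size=d)  # forward-diffused input

# 1) One-step generation, scored with a diffusion-style
#    reconstruction loss against the ground truth.
x_pred = generator(x_noisy)
recon_loss = np.mean((x_pred - x_real) ** 2)

# 2) Re-noise the output and score it with the discriminator; the GAN
#    loss is what models the denoising step adversarially.
x_pred_noisy = x_pred + rng.normal(size=d)
gan_loss = -np.log(discriminator(x_pred_noisy) + 1e-8)

total_loss = recon_loss + gan_loss  # illustrative equal weighting
print(float(recon_loss), float(gan_loss))
```

Because both networks start from pre-trained diffusion weights, this loop is a fine-tuning procedure rather than training from scratch, which is what lets it converge in under 10K iterations.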


Below we show example images generated by our MobileDiffusion with DiffusionGAN one-step sampling. With such a compact model (520M parameters in total), MobileDiffusion can generate high-quality diverse images for various domains.

Images generated by our MobileDiffusion

We measured the performance of our MobileDiffusion on both iOS and Android devices, using different runtime optimizers. The latency numbers are reported below. We see that MobileDiffusion is very efficient and can run within half a second to generate a 512x512 image. This lightning speed potentially enables many interesting use cases on mobile devices.

Latency measurements (s) on mobile devices.


With superior efficiency in terms of latency and size, MobileDiffusion has the potential to be a very friendly option for mobile deployment, enabling a rapid image generation experience while the user types text prompts. And we will ensure any application of this technology is in line with Google’s responsible AI practices.


We would like to thank our collaborators and contributors who helped bring MobileDiffusion on-device: Zhisheng Xiao, Yanwu Xu, Jiuqiang Tang, Haolin Jia, Lutz Justen, Daniel Fenner, Ronald Wotzlaw, Jianing Wei, Raman Sarokin, Juhyun Lee, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann.


Saturday Morning Breakfast Cereal - Wish



She sells printer ink cheaper than anyone, but you DO NOT ASK HOW.

Nipple genie? I wish for broad spectrum antivirals and antifungals that are safe, inexpensive to reproduce, and do not create resistance in their target pathogens.
2nd wish is similar but anticancer.

We need to tell people ChatGPT will lie to them, not debate linguistics


ChatGPT lies to people. This is a serious bug that has so far resisted all attempts at a fix. We need to prioritize helping people understand this, not debating the most precise terminology to use to describe it.

We accidentally invented computers that can lie to us

I tweeted (and tooted) this:

Mainly I was trying to be pithy and amusing, but this thought was inspired by reading Sam Bowman's excellent review of the field, Eight Things to Know about Large Language Models. In particular this:

More capable models can better recognize the specific circumstances under which they are trained. Because of this, they are more likely to learn to act as expected in precisely those circumstances while behaving competently but unexpectedly in others. This can surface in the form of problems that Perez et al. (2022) call sycophancy, where a model answers subjective questions in a way that flatters their user’s stated beliefs, and sandbagging, where models are more likely to endorse common misconceptions when their user appears to be less educated.

Sycophancy and sandbagging are my two favourite new pieces of AI terminology!

What I find fascinating about this is that these extremely problematic behaviours are not the system working as intended: they are bugs! And we haven't yet found a reliable way to fix them.

(Here's the paper that snippet references: Discovering Language Model Behaviors with Model-Written Evaluations from December 2022.)

"But a machine can't deliberately tell a lie"

I got quite a few replies complaining that it's inappropriate to refer to LLMs as "lying", because to do so anthropomorphizes them and implies a level of intent which isn't possible.

I completely agree that anthropomorphism is bad: these models are fancy matrix arithmetic, not entities with intent and opinions.

But in this case, I think the visceral clarity of being able to say "ChatGPT will lie to you" is a worthwhile trade.

Science fiction has been presenting us with a model of "artificial intelligence" for decades. It's firmly baked into our culture that an "AI" is an all-knowing computer, incapable of lying and able to answer any question with pin-point accuracy.

Large language models like ChatGPT, on first encounter, seem to fit that bill. They appear astonishingly capable, and their command of human language can make them seem like a genuine intelligence, at least at first glance.

But the more time you spend with them, the more that illusion starts to fall apart.

They fail spectacularly when prompted with logic puzzles, or basic arithmetic, or when asked to produce citations or link to sources for the information they present.

Most concerningly, they hallucinate or confabulate: they make things up! My favourite example of this remains their ability to entirely imagine the content of a URL. I still see this catching people out every day. It's remarkably convincing.

Why ChatGPT and Bing Chat are so good at making things up is an excellent in-depth exploration of this issue from Benj Edwards at Ars Technica.

We need to explain this in straightforward terms

We're trying to solve two problems here:

  1. ChatGPT cannot be trusted to provide factual information. It has a very real risk of making things up, and if people don't understand this they are guaranteed to be misled.
  2. Systems like ChatGPT are not sentient, or even intelligent systems. They do not have opinions, or feelings, or a sense of self. We must resist the temptation to anthropomorphize them.

I believe that the most direct form of harm caused by LLMs today is the way they mislead their users. The first problem needs to take precedence.

It is vitally important that new users understand that these tools cannot be trusted to provide factual answers. We need to help people get there as quickly as possible.

Which of these two messages do you think is more effective?

ChatGPT will lie to you


ChatGPT doesn't lie, lying is too human and implies intent. It hallucinates. Actually no, hallucination still implies human-like thought. It confabulates. That's a term used in psychiatry to describe when someone replaces a gap in one's memory by a falsification that one believes to be true - though of course these things don't have human minds so even confabulation is unnecessarily anthropomorphic. I hope you've enjoyed this linguistic detour!

Let's go with the first one. We should be shouting this message from the rooftops: ChatGPT will lie to you.

That doesn't mean it's not useful - it can be astonishingly useful, for all kinds of purposes... but seeking truthful, factual answers is very much not one of them. And everyone needs to understand that.

Convincing people that these aren't a sentient AI out of a science fiction story can come later. Once people understand their flaws this should be an easier argument to make!

Should we warn people off or help them on?

This situation raises an ethical conundrum: if these tools can't be trusted, and people are demonstrably falling for their traps, should we encourage people not to use them at all, or even campaign to have them banned?

Every day I personally find new problems that I can solve more effectively with the help of large language models. Some recent examples from just the last few weeks:

Each of these represents a problem I could have solved without ChatGPT... but at a time cost that would have been prohibitively expensive, to the point that I wouldn't have bothered.

I wrote more about this in AI-enhanced development makes me more ambitious with my projects.

Honestly, at this point using ChatGPT in the way that I do feels like a massively unfair competitive advantage. I'm not worried about AI taking people's jobs: I'm worried about the impact of AI-enhanced developers like myself.

It genuinely feels unethical for me not to help other people learn to use these tools as effectively as possible. I want everyone to be able to do what I can do with them, as safely and responsibly as possible.

I think the message we should be emphasizing is this:

These are incredibly powerful tools. They are far harder to use effectively than they first appear. Invest the effort, but approach with caution: we accidentally invented computers that can lie to us and we can't figure out how to make them stop.

There's a time for linguistics, and there's a time for grabbing the general public by the shoulders and shouting "It lies! The computer lies to you! Don't trust anything it says!"

They don't lie any more than a pen or a paintbrush, a typewriter or a stone. It's a goddamn tool. Don't give a tool agency by expecting truth from it. This pearl clutching is nauseating.
I get that "AI lies" is a concise message. Let me offer a better message: "AI bullshits." Why? Because liars know what the truth is. Bullshitters don't. A liar has an ulterior motive to deceive. A bullshitter really hopes they got the right answer. It also points to a solution: AI telling the user how confident it is about any given sentence. If AI could own up to its bullshit, well, maybe we would create something better than mankind.

Musk admits NPR isn’t state-affiliated after asking questions he could have Googled

NPR's Twitter profile as of April 9, 2023.

When Elon Musk slapped NPR's Twitter account with a "US state-affiliated media" label last week, it quickly became clear he didn't know much about how NPR operates or how it's funded. After admitting the state-affiliated label was wrong, Musk changed NPR's tag yesterday to "Government Funded Media"—even though NPR gets less than 1 percent of its annual funding directly from the US government.

The state-affiliated tag took NPR and many others by surprise, in part because it contradicted Twitter's own policy that cited NPR and the BBC as examples of state-financed media organizations that retain editorial independence. Twitter has historically applied its state-affiliated tag to state-controlled news organizations like Russia's RT and China's Xinhua.

Twitter changed its policy to remove the reference to editorial independence at NPR and the BBC, but didn't scrub the old language from another Twitter help page that still describes both NPR and the BBC as editorially independent. The BBC's main Twitter account is also newly labeled as "Government Funded Media" after previously having no label.

In emails with NPR reporter Bobby Allyn, Musk asked basic questions that he could have found answers to with a quick Internet search. "He didn't seem to understand the difference between public media and state-controlled media," Allyn said Friday in an interview with Mary Louise Kelly on the show All Things Considered.

Allyn continued:

He asked me at one point, quote, "what's the breakdown of NPR's annual funding?" And he asked, "who appoints leadership at NPR?" These are questions you can get by Googling, but for some reason he wanted to ask me. And also, let's take a moment and pause on these questions, Mary Louise, because he made a major policy decision, right? And after doing so, he is just now asking for the basic facts. This is not exactly how most CEOs in America operate. Anyway, I answered his questions. About 1 percent of NPR's budget is from federal grants, and an independent board appoints NPR's CEO, who picks leadership.

Musk: Label “might not be accurate”

Musk could have gotten the NPR funding information from this NPR page, which says, "On average, less than 1 percent of NPR's annual operating budget comes in the form of grants from CPB [Corporation for Public Broadcasting] and federal agencies and departments."

Corporate sponsorships are the top contributor to NPR funding, accounting for 39 percent of average annual revenue between 2018 and 2022. NPR gets another 31 percent of its funding in programming fees from member organizations. Federal funding indirectly contributes to the latter category because the publicly funded CPB provides annual grants to public radio stations that pay NPR for programming.

Musk's emails were further detailed in an article by Allyn. After Allyn told Musk that NPR gets only 1 percent of its money from the government, Musk replied, "Well, then we should fix it."

"The operating principle at new Twitter is simply fair and equal treatment, so if we label non-US accounts as govt, then we should do the same for US, but it sounds like that might not be accurate here," Musk wrote in another email to Allyn.

NPR's current government-funded label links to Twitter's policy, which includes Twitter's definition of state-affiliated media accounts but doesn't provide a definition of government-funded.

Ex-Twitter exec explains pre-Musk labeling

Allyn's article quoted a former Twitter executive who helped develop the state-affiliation labels. The executive "said that editorial independence had long been the deciding factor in whether to issue the designation." The article continued:

The People's Daily in China, and Sputnik and RT in Russia, for instance, received the labels, but outlets with editorial autonomy that received some government funding did not.

"In the end, [we] felt that the most fair and balanced way to implement labels was to call out state connections that had a demonstrated track record of influencing content of news reporting," the former Twitter executive said.

That meant that NPR, the government-funded outlet Voice of America, "and even Al Jazeera didn't qualify under our designation," the former employee said. The point of the labels, the former executive said, was to help users understand what they're seeing on the platform.

Al Jazeera's Twitter accounts are not labeled as either state-affiliated or government-funded. Twitter added a government-funded label to the US-owned Voice of America's Twitter account sometime this weekend.

We contacted NPR about the new "Government Funded Media" label and will update this article if we get a response. NPR has stopped posting tweets since getting the state-affiliated tag, and updated its bio to read, "NPR is an independent news organization committed to informing the public about the world around us. You can find us every other place you read the news."

Twitter no longer a “credible platform”

KCRW, an NPR member station in Santa Monica, California, emailed listeners to tell them that KCRW will no longer post on Twitter from its main accounts. KCRW President Jennifer Ferro noted that the state-affiliated media tag is "a term the platform applies to propaganda outlets in countries without a free press, a guaranteed right in the United States."

"There is a chance that Twitter will remove the label from NPR. Even so, we no longer have confidence that Twitter is a credible platform," the email said.

PEN America, a 100-year-old nonprofit that advocates for free expression through literature, criticized Twitter for labeling NPR as state-affiliated media.

"Twitter has inexplicably added a warning to NPR's Twitter account, labeling the venerated news outlet as state sponsored media, on par with Russia Today and other mouthpieces for authoritarian regimes," the group said. PEN America pointed to Twitter's definition of state-affiliated media as "outlets where the state exercises control over editorial content through financial resources, direct or indirect political pressures, and/or control over production and distribution."

"That is unquestionably not NPR, which assiduously maintains editorial independence... the US government exercises no editorial control over NPR whatsoever," the group said.


Good, burn twitter to the ground. Mass media should not be free anonymously.

ChatGPT is making up fake Guardian articles. Here’s how we’re responding | Chris Moran


The risks inherent in the technology, plus the speed of its take-up, demonstrate why it’s so vital that we keep track of it

  • Chris Moran is the Guardian’s head of editorial innovation

Last month one of our journalists received an interesting email. A researcher had come across mention of a Guardian article, written by the journalist on a specific subject from a few years before. But the piece was proving elusive on our website and in search. Had the headline perhaps been changed since it was launched? Had it been removed intentionally from the website because of a problem we’d identified? Or had we been forced to take it down by the subject of the piece through legal means?

The reporter couldn’t remember writing the specific piece, but the headline certainly sounded like something they would have written. It was a subject they were identified with and had a record of covering. Worried that there may have been some mistake at our end, they asked colleagues to go back through our systems to track it down. Despite the detailed records we keep of all our content, and especially around deletions or legal issues, they could find no trace of its existence.

Continue reading...
Why are people accusing ChatGPT of saying true things? That's not how it works.

'Slavery was wrong' among things teachers can't say anymore - The Washington Post

Wouldn't it be slavery "is" wrong, seeing that there's still more than a bit of it going on?