Google's Gemini announcement

Yesterday, Google and DeepMind announced Gemini, their most capable, all-in-one AI model, so I want to share what I found most interesting about it and how it challenges OpenAI's GPT-4.

Gemini, Google and DeepMind's largest and most capable AI model.

Multimodality

Gemini's main characteristic is that it has multimodality baked in from its conception.

Multimodality means the model can seamlessly understand and generate content in more than one format.

According to Google's release, Gemini can operate across formats: text, code, audio, image, and video.

In comparison, most of OpenAI's models, like most other available models, are single-modality models.

For example, GPT-3.5 is a text model, referred to as a Large Language Model (LLM), because it can only understand and generate text.

Another example is DALL-E 3, an image model of sorts, since it can only generate images from a text prompt.

OpenAI describes GPT-4 as a multimodal model because it can operate with two input formats: text and images.
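To make that concrete, here is a minimal sketch of what a text-plus-image request to GPT-4 looks like with OpenAI's Python SDK; the model name and image URL are placeholders, so check OpenAI's docs for the current ones.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One request carrying two input formats: a text question and an image URL.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # GPT-4 variant that accepts image inputs
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # the answer still comes back as text
```

Note that even with an image in the input, the output is text only.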

Now, ChatGPT in its GPT-4 version can take images, audio, and text as inputs and produce outputs in those same formats.

This might look like a single multimodal model, but what happens under the hood is that OpenAI runs an integration, or coordination, layer.

This layer's job is to stitch together the communication between different models to produce the results we see in ChatGPT's UI, but that doesn't mean any one of those models is multimodal on its own.

In contrast, Gemini was developed to take all of these input types and produce whatever output best fits the user's request.

This might not seem like a big deal, but behind the scenes the difference matters: it's an all-in-one solution Google can offer to companies to create outstanding experiences.

If you are looking to build something similar with OpenAI's offerings, you need to develop and maintain that integration layer yourself, besides the conversational UI of course.
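To give a feel for what that integration layer involves, here is a toy sketch assuming OpenAI's Python SDK; the routing rules and model choices are illustrative, not how OpenAI actually wires ChatGPT together.

```python
from openai import OpenAI

client = OpenAI()

def handle_request(text, image_url=None, audio_path=None, wants_image=False):
    """Toy coordination layer: route each input/output type to a specialized model."""
    # 1. Audio input -> transcribe it to text first with a speech model.
    if audio_path:
        with open(audio_path, "rb") as f:
            transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
        text = f"{text}\n\nTranscribed audio: {transcript.text}"

    # 2. The user asked for a picture -> call the image-generation model.
    if wants_image:
        image = client.images.generate(model="dall-e-3", prompt=text)
        return image.data[0].url

    # 3. Otherwise, send the text (and an optional image) to a text/vision model.
    if image_url:
        model = "gpt-4-vision-preview"
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    else:
        model = "gpt-4"
        content = text

    chat = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return chat.choices[0].message.content
```

Every branch here is another model, another failure mode, and another thing to monitor and pay for, which is exactly the maintenance burden a natively multimodal model promises to remove.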

On top of all of this, Gemini beats GPT-4 in 17 out of 18 tests, according to Google's own benchmarks. The fight is on!

Bespoke Interfaces

Another thing I found interesting, and honestly mind-blowing, was the generation of a user interface in response to the user's prompts.

Having developed web UIs throughout my career, and knowing the complexity involved, I can only say this was quite impressive.

It starts as a chat interface, but after identifying the user's intent, it creates a "visually rich experience" 🤯 on the fly, meaning a web UI populated with relevant data, to better guide the user through their request.

You'd better watch the video if you want to see it for yourself 👉 here

To generate the code, it uses Flutter, the cross-platform UI framework created by Google, which is neither the most complex nor the most widely adopted framework among web developers, but the result is still impressive.

It's also interesting how, in the video, Gemini discloses part of the reasoning process and the considerations behind the actions it takes.

Vision Companion

The last thing I will point out, among a sea of possible use cases and applications, is how Gemini could provide vision-impaired people with real-time descriptions of their environment. Quite impressive!

This is the video everybody has seen (1.8 million views less than 48 hours after its release), but in case you haven't, it's a fun one to watch.

Availability

Soon, Bard will be powered by an optimized version of Gemini, but Gemini's full capabilities aren't available to the public or to companies yet.

At least not until next week, on December 13th, when Google hosts the Google Cloud Applied AI Summit and will probably announce the API release. I'll surely attend the sessions and write about it.

That’s all for this one!