multimodal Archives - AI News
https://www.artificialintelligence-news.com/tag/multimodal/

GPT-4o delivers human-like AI interaction with text, audio, and vision integration
AI News | Tue, 14 May 2024
https://www.artificialintelligence-news.com/2024/05/14/gpt-4o-human-like-ai-interaction-text-audio-vision-integration/

OpenAI has launched its new flagship model, GPT-4o, which seamlessly integrates text, audio, and visual inputs and outputs, promising to enhance the naturalness of machine interactions.

GPT-4o, where the “o” stands for “omni,” is designed to cater to a broader spectrum of input and output modalities. “It accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs,” OpenAI announced.

GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is comparable to human conversational response times.

Pioneering capabilities

The introduction of GPT-4o marks a leap from its predecessors by processing all inputs and outputs through a single neural network. This approach enables the model to retain critical information and context that were previously lost in the separate model pipeline used in earlier versions.

Prior to GPT-4o, ‘Voice Mode’ could handle audio interactions with latencies of 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4. The previous setup involved three distinct models: one for transcribing audio to text, another for generating textual responses, and a third for converting text back to audio. This segmentation led to the loss of nuances such as tone, multiple speakers, and background noise.
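The handoff described above can be sketched as a chain of three separate models, each passing only plain text to the next. The function names here are illustrative stand-ins, not OpenAI's actual components:

```python
# Sketch of the pre-GPT-4o 'Voice Mode' pipeline: three separate models
# chained through plain-text handoffs. Everything that is not text,
# such as tone, speaker identity, and background noise, is discarded at
# the first boundary and never reaches the response model.

def voice_mode_pipeline(audio, transcribe, respond, synthesise):
    """Chain audio -> text -> text -> audio through three models."""
    text_in = transcribe(audio)    # e.g. a speech-to-text model
    text_out = respond(text_in)    # e.g. GPT-3.5 or GPT-4, text only
    return synthesise(text_out)    # e.g. a text-to-speech model

# Stub models make the information loss visible: only the words survive.
reply = voice_mode_pipeline(
    audio=b"<waveform>",
    transcribe=lambda a: "hello there",           # nuance discarded here
    respond=lambda t: f"You said: {t}",
    synthesise=lambda t: f"<speech:{t}>".encode(),
)
print(reply)
```

GPT-4o collapses these three stages into a single network, so the non-textual signal remains available end to end.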

As an integrated solution, GPT-4o boasts notable improvements in vision and audio understanding. It can perform more complex tasks such as harmonising songs, providing real-time translations, and even generating outputs with expressive elements like laughter and singing. Examples of its broad capabilities include preparing for interviews, translating languages on the fly, and generating customer service responses.

Nathaniel Whittemore, Founder and CEO of Superintelligent, commented: “Product announcements are going to inherently be more divisive than technology announcements because it’s harder to tell if a product is going to be truly different until you actually interact with it. And especially when it comes to a different mode of human-computer interaction, there is even more room for diverse beliefs about how useful it’s going to be.

“That said, the fact that there wasn’t a GPT-4.5 or GPT-5 announced is also distracting people from the technological advancement that this is a natively multimodal model. It’s not a text model with a voice or image addition; it is a multimodal token in, multimodal token out. This opens up a huge array of use cases that are going to take some time to filter into the consciousness.”

Performance and safety

GPT-4o matches GPT-4 Turbo performance levels in English text and coding tasks but significantly outperforms it in non-English languages, making it a more inclusive and versatile model. It sets a new benchmark in reasoning with a high score of 88.7% on the 0-shot CoT MMLU (general knowledge questions) and 87.2% on the 5-shot no-CoT MMLU.

The model also excels in audio and translation benchmarks, surpassing previous state-of-the-art models like Whisper-v3. In multilingual and vision evaluations, it demonstrates superior performance, enhancing OpenAI’s multilingual, audio, and vision capabilities.

OpenAI has built robust safety measures into GPT-4o by design, using techniques to filter training data and refining behaviour through post-training safeguards. The model has been assessed under OpenAI’s Preparedness Framework and complies with the company’s voluntary commitments. Evaluations in areas like cybersecurity, persuasion, and model autonomy indicate that GPT-4o does not exceed a ‘Medium’ risk level in any category.

Further safety assessments involved extensive external red teaming with over 70 experts in various domains, including social psychology, bias, fairness, and misinformation. This comprehensive scrutiny aims to mitigate risks introduced by the new modalities of GPT-4o.

Availability and future integration

Starting today, GPT-4o’s text and image capabilities are available in ChatGPT, including on the free tier, with extended features for Plus users. A new Voice Mode powered by GPT-4o will enter alpha testing within ChatGPT Plus in the coming weeks.

Developers can access GPT-4o through the API for text and vision tasks, benefiting from its doubled speed, halved price, and enhanced rate limits compared to GPT-4 Turbo.
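As a hedged sketch, a combined text-and-vision request to GPT-4o through the OpenAI Python SDK might look as follows. The image URL is a placeholder, and the network call is only attempted when an API key is configured:

```python
# Minimal sketch of a text-and-vision request to GPT-4o via the OpenAI
# Chat Completions API. The image URL below is a placeholder, not a
# real asset.
import os

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/photo.jpg"}},
    ]}
]

# Only send the request when credentials are available.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-4o",
                                              messages=messages)
    print(response.choices[0].message.content)
```

Text and image parts travel in the same message, reflecting the "multimodal token in" design described above.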

OpenAI plans to expand GPT-4o’s audio and video functionalities to a select group of trusted partners via the API, with broader rollout expected in the near future. This phased release strategy aims to ensure thorough safety and usability testing before making the full range of capabilities publicly available.

“It’s hugely significant that they’ve made this model available for free to everyone, as well as making the API 50% cheaper. That is a massive increase in accessibility,” explained Whittemore.

OpenAI invites community feedback to continuously refine GPT-4o, emphasising the importance of user input in identifying and closing gaps where GPT-4 Turbo might still outperform.

(Image Credit: OpenAI)

See also: OpenAI takes steps to boost AI-generated content transparency

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Google’s next-gen AI model Gemini outperforms GPT-4
AI News | Wed, 06 Dec 2023
https://www.artificialintelligence-news.com/2023/12/06/google-next-gen-ai-model-gemini-outperforms-gpt-4/

Google has unveiled Gemini, a cutting-edge AI model that stands as the company’s most capable and versatile to date.

Demis Hassabis, CEO and Co-Founder of Google DeepMind, introduced Gemini as a multimodal model that is capable of seamlessly understanding and combining various types of information, including text, code, audio, image, and video.

Gemini comes in three optimised versions: Ultra, Pro, and Nano. The Ultra model boasts state-of-the-art performance, surpassing human experts in language understanding and demonstrating unprecedented capabilities in tasks ranging from coding to multimodal benchmarks.

What sets Gemini apart is its native multimodality, which eliminates the need to stitch together separate components for different modalities. This approach, refined through large-scale collaborative efforts across Google teams, positions Gemini as a flexible and efficient model capable of running on everything from data centres to mobile devices.

One of Gemini’s standout features is its sophisticated multimodal reasoning, enabling it to extract insights from vast datasets with remarkable precision. The model’s prowess extends to understanding and generating high-quality code in popular programming languages.

However, as Google ventures into this new era of AI, responsibility and safety remain paramount. Gemini undergoes rigorous safety evaluations, including assessments for bias and toxicity. Google is actively collaborating with external experts to address potential blind spots and ensure the model’s ethical deployment.

Gemini 1.0 is now rolling out across various Google products – including the Bard chatbot – with plans for integration into Search, Ads, Chrome, and Duet AI. However, the Bard upgrade will not be released in Europe pending clearance from regulators.

Developers and enterprise customers can access Gemini Pro via the Gemini API in Google AI Studio or Google Cloud Vertex AI. Android developers will also be able to build with Gemini Nano via AICore, a new system capability available in Android 14.
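For illustration, a minimal Gemini Pro call through the google-generativeai Python SDK (the Google AI Studio route) might look like this. The prompt is an example of our own, and the request is only sent when an API key is configured:

```python
# Sketch of a Gemini Pro text request via the google-generativeai SDK.
# Guarded so it only runs when a Google AI Studio API key is set.
import os

model_name = "gemini-pro"
prompt = "Briefly explain what a multimodal model is."

if os.environ.get("GOOGLE_API_KEY"):
    import google.generativeai as genai  # pip install google-generativeai
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    print(response.text)
```

Vertex AI offers the same model behind Google Cloud's authentication and quota machinery, which enterprise customers may prefer.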

(Image Credit: Google)

See also: AI & Big Data Expo: AI’s impact on decision-making in marketing

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Cyber Security & Cloud Expo and Digital Transformation Week.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.
