Close Menu
    What's Hot

    Appeals Court Rejects Prisoner’s Lawsuit Over Alleged $354M Bitcoin Loss

    November 6, 2025

    You Could’ve Made a 100x Prediction Market Bet on Last Night’s Elections

    November 6, 2025

    Could Dogecoin’s Bear Market Be Starting?

    November 6, 2025
    Facebook X (Twitter) Instagram
    Facebook X (Twitter) Instagram
    CryptoMarketVision
    • Home
    • AI News
    • Altcoin
    • Bitcoin
    • Business
    • Market Analysis
    • Mining
    • Trending Cryptos
    • Moneyprofitt
    • More
      • About Us
      • Contact Us
      • Terms and Conditions
      • Privacy Policy
      • Disclaimer
    CryptoMarketVision
    Home»AI News»World's largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that connects documents, audio and video
    World's largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that connects documents, audio and video
    AI News

    World's largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that connects documents, audio and video

    adminBy adminOctober 17, 2025No Comments7 Mins Read
    Share
    Facebook Twitter LinkedIn Pinterest Email



    AI models are only as good as the data they're trained on. That data generally needs to be labeled, curated and organized before models can learn from it in an effective way.

    One of the big missing links in the AI ecosystem has been the availability of a large high-quality open-source multimodal dataset. That changes today with the debut of the EMM-1 dataset which is comprised of 1 billion data pairs and 100M data groups across 5 modalities: text, image, video, audio and 3d point clouds. Multimodal datasets combine different types of data that AI systems can process together. This mirrors how humans perceive the world using multiple senses simultaneously. These datasets enable AI systems to make richer inferences by understanding relationships across data types, rather than processing each modality in isolation.

    EMM-1 is developed by data labeling platform vendor Encord. The company's platform enables teams to curate, label and manage training data at scale using both automated and human-in-the-loop workflows. Alongside the new model, Encord developed the EBind training methodology that prioritizes data quality over raw computational scale. The approach enabled a compact 1.8 billion parameter model to match the performance of models up to 17 times larger while slashing training time from days to hours on a single GPU rather than GPU clusters.

    "The big trick for us was to really focus on the data and to make the data very, very high quality," Encord Co-Founder and CEO Eric Landau told VentureBeat in an exclusive interview. "We were able to get to the same level of performance as models 20 times larger, not because we were super clever on the architecture, but because we trained it with really good data overall."

    The data quality advantage

    Encord's dataset is 100 times larger than the next comparable multimodal dataset, according to Landau. It operates at petabyte scale with terabytes of raw data and over 1 million human annotations.

    But scale alone doesn't explain the performance gains. The technical innovation centers on addressing what Landau calls an "under-appreciated" problem in AI training: data leakage between training and evaluation sets.

    "The leakage problem was one which we spent a lot of time on," Landau explained. "In a lot of data sets, there is a kind of leakage between different subsets of the data. Leakage actually boosts your results. It makes your evaluations look better. But it's one thing that we were quite diligent about."

    Data leakage occurs when information from test data inadvertently appears in training data, artificially inflating model performance metrics. Many benchmark datasets suffer from this contamination. Encord deployed hierarchical clustering techniques to ensure clean separation while maintaining representative distribution across data types. The company also used clustering to address bias and ensure diverse representation.

    How EBind boosts efficiency

    The data quality improvements work in tandem with an architectural approach designed for efficiency

    Encord's EBind extends the CLIP (Contrastive Language-Image Pre-training) approach (originally developed by OpenAI) from two modalities to five. CLIP learns to associate images and text in a shared representation space, enabling tasks like searching for images using text descriptions.

    Where CLIP learns to associate images and text in a shared latent space, EBind does the same across images, text, audio, 3D point clouds and video.

    The architectural choice prioritizes parameter efficiency. Rather than deploying separate specialized models for each modality pair, EBind uses a single base model with one encoder per modality.

    "Other methodologies, what they do is they use a bunch of different models, and they route to the best model for embedding these pairs, so they tend to explode in the number of parameters," Landau said. "We found we could use a single base model and just train one encoder per modality, so keeping it very simple and very parameter efficient, if we fed that overall architecture really, really good data."

    The resulting model rivals OmniBind, a much larger competitor in the multimodal space, but requires dramatically fewer computational resources for both training and inference. This makes EBind deployable in resource-constrained environments including edge devices for robotics and autonomous systems.

    The enterprise value of a multi-modal dataset

    Multimodal models enable enterprise use cases that span different data types.

    Most organizations store different data types in separate systems: documents in content management platforms, audio recordings in communication tools, training videos in learning management systems and structured data in databases. Multimodal models can search and retrieve across all of these simultaneously.

    "Enterprises have all different types of data. They don't just have documents. They have audio recordings, and they have training videos, and they have CSV files," Landau said. "Let's say you're a lawyer and you have a case file that has video evidence and also documents and recordings, and it's all scattered across a lot of silos of data. You can use EBind to pick all of the relevant data and bundle together to search and surface the right data much quicker than you would have before."

    The same principle applies across verticals. Healthcare providers can link patient imaging data to clinical notes and diagnostic audio. Financial services firms can connect transaction records to compliance call recordings and customer communications. Manufacturing operations can tie equipment sensor data to maintenance video logs and inspection reports.

    Beyond office environments, physical AI represents another frontier. Landau highlighted autonomous vehicles that benefit from both visual perception and audio cues like emergency sirens. In manufacturing and warehousing, robots that combine visual recognition with audio feedback and spatial awareness can operate more safely and effectively than vision-only systems.

    Enterprise use case: Extending computer vision with multimodal context

    Captur AI, an Encord customer, illustrates how companies are planning to use the dataset for specific business applications. The startup provides on-device image verification for mobile apps, validating photos in real-time for authenticity, compliance and quality before upload. The company works with shared mobility providers like Lime and delivery companies capturing billions of package photos.

    Captur AI processes over 100 million images on-device and specializes in distilling models to 6-10 megabytes so they can run on smartphones without cloud connectivity. But CEO Charlotte Bax sees multimodal capabilities as critical for expanding into higher-value use cases.

    "The market for us is massive. You submit photos for returns and retails. You submit photos to insurance companies for claims. You submit photos when you're listing something on eBay," Bax told VentureBeat in an exclusive interview. "Some of those use cases are very high risk or high value if something goes wrong, like insurance, the image only captures part of the context and audio can be an important signal."

    Bax cited digital vehicle inspections as a prime example. When customers photograph vehicle damage for insurance claims, they often describe what happened verbally while capturing images. Audio context can significantly improve claim accuracy and reduce fraud.

    "As you're doing that, oftentimes the customer is actually describing what's happened," Bax said. "A few of our potential prospects in InsurTech have asked us if we can actually do audio as well, because then that adds this additional bit of context for the user who's submitting the claim."

    The challenge lies in maintaining Captur AI's core advantage: running models efficiently on-device rather than requiring cloud processing. The company plans to use Encord's dataset to train compact multimodal models that preserve real-time, offline capabilities while adding audio and sequential image context.

    "The most important thing you can do is try and get as much context as possible," Bax said. "Can you get LLMs to be small enough to run on a device within the next three years, or can you run multimodal models on the device? Solving data quality before image upload is the interesting frontier."

    What this means for enterprises

    Encord's results challenge fundamental assumptions about AI development and suggest that the next competitive battleground may be data operations rather than infrastructure scale.

    Multimodal datasets unlock new capabilities. The ability to train models that understand relationships across data types opens use cases that single-modality systems cannot address.

    Data operations deserve equal investment with compute infrastructure. The 17x parameter efficiency gain from better data curation represents orders of magnitude in cost savings. Organizations pouring resources into GPU clusters while treating data quality as an afterthought may be optimizing the wrong variable.

    For enterprises building multimodal AI systems, Landau's assessment captures the strategic shift.

     "We were able to get to the same level of performance as models much  larger, not because we were super clever on the architecture, but because we trained it with really good data overall," he said.



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    admin
    • Website

    Related Posts

    The Hyundai Metaplant: A New Era in EV Manufacturing

    November 5, 2025

    Databricks research reveals that building better AI judges isn't just a technical concern, it's a people problem

    November 5, 2025

    Snowflake builds new intelligence that goes beyond RAG to query and aggregate thousands of documents at once

    November 4, 2025

    OpenAI spreads $600B cloud AI bet across AWS, Oracle, Microsoft

    November 4, 2025
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    Appeals Court Rejects Prisoner’s Lawsuit Over Alleged $354M Bitcoin Loss

    November 6, 2025

    You Could’ve Made a 100x Prediction Market Bet on Last Night’s Elections

    November 6, 2025

    Could Dogecoin’s Bear Market Be Starting?

    November 6, 2025

    Subscribe to Updates

    Get the latest sports news from SportsSite about soccer, football and tennis.

    Welcome to Crypto Market Vision – your trusted source for everything crypto Our mission is simple: to make the world of cryptocurrency clear, accessible, and actionable for everyone. Whether you are a beginner exploring Bitcoin for the first time or a seasoned trader looking for market insights, our goal is to keep you informed, empowered, and ahead of the curve.

    Facebook X (Twitter) Instagram Pinterest YouTube
    Top Insights

    Appeals Court Rejects Prisoner’s Lawsuit Over Alleged $354M Bitcoin Loss

    November 6, 2025

    You Could’ve Made a 100x Prediction Market Bet on Last Night’s Elections

    November 6, 2025

    Could Dogecoin’s Bear Market Be Starting?

    November 6, 2025
    Get Informed

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    Facebook X (Twitter) Instagram Pinterest
    • Contact Us
    • About Us
    • Terms and Conditions
    • Privacy Policy
    • Disclaimer

    © 2025 cryptomarketvision.com. All rights reserved. Designed by DD.

    Type above and press Enter to search. Press Esc to cancel.

    ethereum
    Ethereum (ETH) $ 3,440.50
    tether
    Tether (USDT) $ 1.00
    bitcoin
    Bitcoin (BTC) $ 103,777.81
    xrp
    XRP (XRP) $ 2.35
    bnb
    BNB (BNB) $ 960.45
    solana
    Wrapped SOL (SOL) $ 162.32
    usd-coin
    USDC (USDC) $ 0.999974
    dogecoin
    Dogecoin (DOGE) $ 0.166912