The state of AI in 2021: Machine learning in production, MLOps and data-centric AI – ZDNet

With lessons learned from operationalizing AI, the emphasis is shifting from shiny new models to perhaps more mundane, but practical aspects such as data quality and data pipeline management
It’s that time of year again: Reports on the state of AI for 2021 are out. A few days back, it was the Machine Learning, Artificial Intelligence and Data report by Matt Turck, which ZDNet Big on Data colleague Tony Baer covered. This week, it’s the State of AI 2021 report by Nathan Benaich and Ian Hogarth.
After releasing what was probably the most comprehensive report on the State of AI in 2020, Air Street Capital and RAAIS founder Nathan Benaich and AI angel investor and UCL IIPP visiting professor Ian Hogarth are back for more.
In what is becoming a valued yearly tradition, we caught up with Benaich and Hogarth to discuss topics that stood out for us in the report.
First off, there is overlap with the topics that Turck covered and Baer reported on, and for good reason. As Baer pointed out, the wave of IPOs and proliferation of unicorns is turning this market into its own sector, and that is impossible to ignore. For an overview of market trends, we encourage readers to have a look at Baer’s coverage.
That said, our feeling is that the State of AI 2021 report covers more topics: the latest developments in AI research, industry, talent, and politics, while also venturing predictions. In fact, Benaich and Hogarth keep track of their predictions, and they are doing pretty well. For example, in 2020 they correctly predicted the obstacles to Nvidia’s acquisition of Arm, as well as AI- and biotech-related IPOs.
As Benaich noted, by virtue of being investors in machine learning companies at different, mostly early, stages, they have access to major AI labs, academic groups, up-and-coming startups, bigger companies, as well as people who work in government. They try to synthesize all those different angles into a public-good product that is open source and aims to holistically inform all stakeholders.
We picked some overarching themes that stood out for us in the report, as we have also identified them throughout the year. The first one is MLOps — the art and science of bringing machine learning to production. In operationalizing AI, the emphasis is shifting from shiny new models to perhaps more mundane, but practical aspects.
With the increasing power and availability of machine learning models, gains from model improvements have become marginal. In this context, the machine learning community is growing increasingly aware of the importance of better data practices, and more generally better MLOps, to build reliable machine learning products.
Benaich noted that they thought it important to highlight renewed attention in more industry-minded academic work on data quality and the various issues that can reside within data and ultimately propagate to ML models, determining whether models predict well or not:
“A lot of academia was focused on competing on static benchmarks, showing model performance offline on these benchmarks, and then moving into industry. So generation one was a lot about — let’s just get a model that works for a specific problem, and then deal with any issues or any changes whenever they happen.
There’s been a huge amount of money and interest and engineering time that’s been thrown into MLOps. And this is motivated by the idea that machine learning is not like a static software product that you can write once and forget about. You have to constantly update it, and it’s not just [about] updating the model.
You have to look at how your classes might drift over time, or if you’re still using the right benchmarks to determine whether a new model that you trained is going to work in production or not. You may see issues like choosing different random seeds for your model and then seeing completely different behavior on real world data, or even that data that you’ve been using is garbage”.
That sounds intuitively right, and probably resonates with anyone who has worked with machine learning models and data pipelines. Now people are giving names to those phenomena, such as distribution shifts (mismatches between dataset versions) and data cascades (data issues that influence downstream operations). As naming things is the first step toward analyzing them and taking them more seriously, that’s a good thing.
A distribution shift happens when data at test/deployment time is different from the training data. In production, this often happens in the form of concept drifts, where the test data gradually changes over time.
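To make the idea concrete, here is a minimal sketch of how a team might flag such a shift, comparing each numeric feature’s training distribution against recent production data with a two-sample Kolmogorov-Smirnov test. The function, feature names, and threshold are illustrative, not taken from the report:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_features, live_features, feature_names, alpha=0.01):
    """Flag features whose live distribution differs from training.

    Runs a two-sample Kolmogorov-Smirnov test per numeric feature and reports
    those where 'same distribution' is rejected at significance level alpha.
    (Illustrative helper, not part of any specific MLOps toolkit.)
    """
    drifted = []
    for i, name in enumerate(feature_names):
        stat, p_value = ks_2samp(train_features[:, i], live_features[:, i])
        if p_value < alpha:
            drifted.append((name, stat, p_value))
    return drifted

# Toy example: the 'price' feature drifts upward in production.
rng = np.random.default_rng(0)
train = np.column_stack([rng.normal(100, 10, 5000), rng.normal(0, 1, 5000)])
live = np.column_stack([rng.normal(120, 10, 1000), rng.normal(0, 1, 1000)])

print(detect_feature_drift(train, live, ["price", "noise"]))
```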
As machine learning is increasingly used in real-world applications, the need for a solid understanding of distributional shifts becomes paramount. This begins with designing challenging benchmarks, Benaich and Hogarth state in the report.
Benaich believes that it’s hard to pin down specific real-world examples of distribution shift, because organizations would probably not want the world to know they were affected by such issues. But one area it could affect is pricing on various retail websites.
Frequently, there is a machine learning-powered dynamic pricing engine in the back end, and its output depends on how much information the retailer has about you, noted Benaich. So distribution shift may mean you end up getting a very different price for the product you’re looking at, depending on which data is being used. Interestingly, this exact practice is targeted by China’s market regulator.
Benaich emphasized the fact that at least two major new datasets were released with the aim of dealing with distribution shifts: WILDS, developed by a number of American and Japanese universities and companies, and Shifts, developed by Yandex.
Having more industry-oriented datasets used in academia means that ultimately academic projects are more likely to succeed in production, because there is less distribution shift when moving between academia and industry, noted Benaich.
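As a rough illustration, this is how a WILDS benchmark can be loaded with the open-source wilds Python package, following its documented interface; the dataset choice and exact arguments are illustrative and may differ between package versions:

```python
# Sketch of loading a WILDS benchmark (iWildCam) with the `wilds` package.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

# Download the dataset and get the in-distribution training split.
dataset = get_dataset(dataset="iwildcam", download=True)
train_data = dataset.get_subset(
    "train",
    transform=transforms.Compose(
        [transforms.Resize((448, 448)), transforms.ToTensor()]
    ),
)

# Standard loader; WILDS also provides out-of-distribution test splits
# (e.g. camera traps unseen during training) for measuring distribution shift.
train_loader = get_train_loader("standard", train_data, batch_size=16)
```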
Google researchers define data cascades as “compounding events causing negative, downstream effects from data issues”. Supported by a survey of 53 practitioners from the US, India, East and West African countries, they warn that current practices undervalue data quality and result in data cascades.
It’s a fairly intuitive idea, like the domino effect: if you have a problem at the start, it will likely have compounded by the time you get to the last domino. What’s notable is that the overwhelming majority of surveyed practitioners report having experienced at least one of these issues.
When trying to attribute why these issues happened, practitioners mostly pointed to a lack of recognition of the importance of data within their AI work, a lack of training in the domain, or not getting access to enough specialized data for the particular problem they were solving.
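One practical response the data cascades framing points to is cheap validation at the top of the pipeline, so problems are caught before they propagate downstream. A minimal sketch, with hypothetical checks, column names, and thresholds:

```python
import pandas as pd

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Cheap upstream checks; catching problems here keeps them from
    cascading into training and serving further down the pipeline.
    (Checks, column names, and thresholds are illustrative.)"""
    problems = []
    if df.empty:
        problems.append("batch is empty")
        return problems
    # Schema check: required columns present.
    required = {"user_id", "price", "label"}
    missing = required - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    # Completeness: a high null rate usually signals an upstream ingestion bug.
    null_rate = df["label"].isna().mean()
    if null_rate > 0.05:
        problems.append(f"label null rate {null_rate:.1%} exceeds 5%")
    # Range check: negative prices should be impossible in this domain.
    if (df["price"] < 0).any():
        problems.append("negative prices found")
    return problems

batch = pd.DataFrame({"user_id": [1, 2], "price": [9.99, -1.0], "label": [0, None]})
print(validate_batch(batch))
```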
What that points to is that in the world of machine learning there is more nuance than “good data” and “bad data”. As datasets are multi-faceted, with different subsets used in different contexts and different versions evolving over time, context is key in defining data quality. The insights from machine learning in production prompt a shift of focus from model-centric to data-centric AI.
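A small example of what that context-sensitivity looks like in practice is evaluating a model per data slice rather than only in aggregate, since a healthy overall metric can hide a subset on which the model fails. A sketch with illustrative column names:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def evaluate_by_slice(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """Report accuracy per data slice; an aggregate metric can hide a
    subset on which the model fails. (Column names are illustrative.)"""
    rows = []
    for value, group in df.groupby(slice_col):
        rows.append({
            slice_col: value,
            "n": len(group),
            "accuracy": accuracy_score(group["label"], group["prediction"]),
        })
    return pd.DataFrame(rows)

# Toy predictions: the model looks fine overall but fails on the 'mobile' slice.
df = pd.DataFrame({
    "channel": ["web"] * 8 + ["mobile"] * 2,
    "label":      [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "prediction": [1, 0, 1, 0, 1, 0, 1, 0, 0, 1],
})
print(evaluate_by_slice(df, "channel"))
print("overall accuracy:", accuracy_score(df["label"], df["prediction"]))
```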
Data-centric AI is a notion developed by Hazy Research, Chris Ré’s research group at Stanford. As they note, the importance of data is not new — there are well-established mathematical, algorithmic, and systems techniques for working with data, which have been developed over decades.
What is new is how to build on and re-examine these techniques in light of modern AI models and methods. Just a few years ago, we did not have long-lived AI systems or the current breed of powerful deep models.
Join us next week as we continue the conversation with Benaich and Hogarth, to cover topics such as language models, AI commercialization, and AI-powered biotechnology.