Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI and machine learning: predictive maintenance. I will be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each one. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance.
I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance problem. All right, so without any further ado, let's jump into it. Let's start off with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence.
But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance in a specific task, and the task we usually want to optimize is making accurate predictions about future outcomes based on the analysis of historical data. So essentially, machine learning is about making predictions about the future, or what we call predictive analytics.
There are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of the different algorithms and their use cases in various areas of industry; different algorithms are suited to different use cases. Deep learning is an advanced form of machine learning based on something called an artificial neural network, or ANN for short. An ANN loosely simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information.
Deep learning is the foundational technology for most of the popular AI tools you have probably heard of today. I'm sure you have heard of ChatGPT, if you haven't been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and it's based on deep learning. Likewise, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, also use this particular form of machine learning called deep learning. So here is an example of an artificial neural network.
For example, here I have an image of a bird that's fed into this artificial neural network, and the output from the network is a classification of the image into one of three potential categories. If the ANN has been trained properly and we feed in this image, it should correctly classify the image as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just as in machine learning generally, there is a variety of deep learning algorithms available under the categories of supervised learning and unsupervised learning. All right, so this is how we can categorize all of this.
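As a quick aside, the bird-classification step described above can be sketched as a tiny forward pass through a feed-forward network. Everything here is invented for illustration: the image is random numbers, the layer sizes are arbitrary, and the weights are random stand-ins where a real network would have learned values from training data.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Hidden-layer activation: neurons "fire" only on positive input.
    return np.maximum(0.0, x)

def softmax(z):
    # Turn raw output scores into probabilities over the classes.
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Pretend input: a 28x28 grayscale image flattened to 784 values.
image = rng.random(784)

# One hidden layer of 32 neurons, output layer of 3 neurons (one per class).
W1, b1 = rng.normal(0, 0.1, (32, 784)), np.zeros(32)
W2, b2 = rng.normal(0, 0.1, (3, 32)), np.zeros(3)

hidden = relu(W1 @ image + b1)     # weighted inputs flow through the hidden layer
probs = softmax(W2 @ hidden + b2)  # probability for each of the 3 categories

classes = ["bird", "cat", "dog"]
print(dict(zip(classes, probs.round(3))))
print("predicted:", classes[int(np.argmax(probs))])
```

With random weights the prediction is meaningless; the point is the shape of the computation — training adjusts `W1`, `b1`, `W2`, `b2` so that bird images reliably push the "bird" probability highest.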
You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a sub-specialization of machine learning using a particular architecture called an artificial neural network. And generative AI, so ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI are basically large language models, and they are a further subcategory within the area of deep learning. There are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application.
I'm going to guess that the vast majority of you watching this video are coming from the manufacturing industry. In manufacturing, some of the standard use cases for machine learning and deep learning are predicting potential problems, which we sometimes call predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens; monitoring systems; automating your manufacturing assembly line or production line; smart scheduling; and detecting anomalies on your production line. Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance?
Here's the long definition: predictive maintenance is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning to reliably assess when failures are more likely to occur, including which components on your production or assembly line are more likely to be affected. So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that production and assembly lines in factories tended to handle maintenance issues, say, 10 or 20 years ago.
You would probably start off with the most basic mode, which is reactive maintenance: you just wait until your machine breaks down and then you repair it. The simplest, but of course, if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline. Then you're going to have a backlog of orders and run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever.
This is great, but the problem is that sometimes you're doing maintenance that isn't really necessary, and it still doesn't totally prevent a failure that occurs outside of your planned maintenance window. So it's a bit of an improvement, but not that much better. The last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring, and then use that data to build a machine learning model which helps us predict, with a reasonable level of accuracy, when the next failure is going to happen on a specific component or machine on your assembly or production line.
You want to predict, to a high level of accuracy, maybe the specific day, even the specific hour or minute, when you expect that particular product or machine to fail. These are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data that you're getting.
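The sensor-to-prediction idea just described can be sketched in a few lines. This is a toy, not a production recipe: the sensor readings, the failure rule, and the feature names below are all invented, and a simple logistic regression trained by gradient descent stands in for whatever model a real project would choose.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Invented historical sensor readings for n machine-days.
temperature = rng.normal(70, 10, n)  # degrees C
vibration = rng.normal(3, 1, n)      # mm/s RMS
# Invented ground truth: machines that ran hot AND shaky tended to fail.
failed = ((temperature > 78) & (vibration > 3.5)).astype(float)

# Standardize features so gradient descent behaves, then add a bias column.
feats = np.column_stack([temperature, vibration])
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
X = np.column_stack([np.ones(n), feats])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression: learn weights that map sensor readings to
# a probability of imminent failure.
w = np.zeros(3)
for _ in range(3000):
    grad = X.T @ (sigmoid(X @ w) - failed) / n
    w -= 0.5 * grad

pred = sigmoid(X @ w) > 0.5
accuracy = (pred == failed.astype(bool)).mean()
print(f"training accuracy: {accuracy:.2%}")
```

In a real deployment the model would be trained on logged history and then scored against live sensor streams, flagging machines whose failure probability crosses a threshold so maintenance can be scheduled before the breakdown.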
Now we're going to take a look at some real-life use cases. There are a bunch of links here; if you navigate to them, you'll be able to see some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, and you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and mining. And this next website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance.
Here it tells you about an industry survey of predictive maintenance. We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, and that predictive maintenance is essential for the manufacturing industry and will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow.
And we can see here the kind of ROI that companies report on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. Best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, Ecoplant. You can jump over here and take a look at some of these use cases. Let me try to open one up, for example, Mondi.
Mondi has used MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they used it and how it works: the challenge and the problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB, they were able to create a deep learning model to solve this particular issue for their domain.
So if you're interested, I strongly encourage you to read up on all these real-life customer stories, which showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases. In this next topic, I'm going to talk about machine learning basics, what is actually involved in machine learning, and I'm going to give a very quick, conceptual, high-level overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category, which is supervised learning.
The particular use case I'm going to be discussing, predictive maintenance, is basically a form of supervised learning. So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labelled dataset as input to a machine learning algorithm. This dataset contains a set of independent or feature variables, and one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your dataset that influence the dependent or target variable.
So this process Dialogue: 0,0:13:07.76,0:13:09.12,Default,,0000,0000,0000,,that I've just described is called Dialogue: 0,0:13:09.12,0:13:11.60,Default,,0000,0000,0000,,training the machine learning model, and Dialogue: 0,0:13:11.60,0:13:14.28,Default,,0000,0000,0000,,the model is fundamentally a Dialogue: 0,0:13:14.28,0:13:16.40,Default,,0000,0000,0000,,mathematical function that best Dialogue: 0,0:13:16.40,0:13:18.40,Default,,0000,0000,0000,,approximates the relationship between Dialogue: 0,0:13:18.40,0:13:20.64,Default,,0000,0000,0000,,the independent variables and the Dialogue: 0,0:13:20.64,0:13:22.64,Default,,0000,0000,0000,,dependent variable. All right, so that's Dialogue: 0,0:13:22.64,0:13:24.48,Default,,0000,0000,0000,,quite a bit of a mouthful, so let's jump Dialogue: 0,0:13:24.48,0:13:26.32,Default,,0000,0000,0000,,into a diagram that maybe illustrates Dialogue: 0,0:13:26.32,0:13:27.88,Default,,0000,0000,0000,,this more clearly. So let's say you have Dialogue: 0,0:13:27.88,0:13:30.00,Default,,0000,0000,0000,,a dataset here, an Excel spreadsheet, Dialogue: 0,0:13:30.00,0:13:32.16,Default,,0000,0000,0000,,right? And this Excel spreadsheet has a Dialogue: 0,0:13:32.16,0:13:34.04,Default,,0000,0000,0000,,bunch of columns here and a bunch of Dialogue: 0,0:13:34.04,0:13:36.80,Default,,0000,0000,0000,,rows, okay? So these rows here represent Dialogue: 0,0:13:36.80,0:13:39.00,Default,,0000,0000,0000,,observations, or these rows are what Dialogue: 0,0:13:39.00,0:13:40.96,Default,,0000,0000,0000,,we call observations or samples or data Dialogue: 0,0:13:40.96,0:13:43.12,Default,,0000,0000,0000,,points in our data set, okay? So let's Dialogue: 0,0:13:43.12,0:13:46.88,Default,,0000,0000,0000,,assume this data set is gathered by a Dialogue: 0,0:13:46.88,0:13:49.96,Default,,0000,0000,0000,,marketing manager at a mall, at a retail Dialogue: 0,0:13:49.96,0:13:52.28,Default,,0000,0000,0000,,mall, all right? 
So they've got all this Dialogue: 0,0:13:52.28,0:13:54.92,Default,,0000,0000,0000,,information about the customers who Dialogue: 0,0:13:54.92,0:13:56.80,Default,,0000,0000,0000,,purchase products at this mall, all right? Dialogue: 0,0:13:56.80,0:13:58.52,Default,,0000,0000,0000,,So some of the information they've Dialogue: 0,0:13:58.52,0:14:00.00,Default,,0000,0000,0000,,gotten about the customers are their Dialogue: 0,0:14:00.00,0:14:01.84,Default,,0000,0000,0000,,gender, their age, their income, and the Dialogue: 0,0:14:01.84,0:14:03.60,Default,,0000,0000,0000,,number of children. So all this Dialogue: 0,0:14:03.60,0:14:05.68,Default,,0000,0000,0000,,information about the customers, we call Dialogue: 0,0:14:05.68,0:14:07.36,Default,,0000,0000,0000,,this the independent or the feature Dialogue: 0,0:14:07.36,0:14:10.08,Default,,0000,0000,0000,,variables, all right? And based on all Dialogue: 0,0:14:10.08,0:14:12.76,Default,,0000,0000,0000,,this information about the customer, we Dialogue: 0,0:14:12.76,0:14:16.20,Default,,0000,0000,0000,,also managed to get some or we record Dialogue: 0,0:14:16.20,0:14:17.60,Default,,0000,0000,0000,,the information about how much the Dialogue: 0,0:14:17.60,0:14:20.48,Default,,0000,0000,0000,,customer spends, all right? So this Dialogue: 0,0:14:20.48,0:14:22.08,Default,,0000,0000,0000,,information or these numbers here, we call Dialogue: 0,0:14:22.08,0:14:23.84,Default,,0000,0000,0000,,this the target variable or the Dialogue: 0,0:14:23.84,0:14:26.60,Default,,0000,0000,0000,,dependent variable, right? 
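The labelled data set layout being described here, rows as samples and feature columns plus one target column, can be sketched in code. A minimal sketch; the customer values below are made up for illustration, not taken from the slide:

```python
# A toy labelled data set, one dict per row (sample / observation / data point).
# Column values are invented for illustration.
rows = [
    {"gender": "F", "age": 34, "income": 42, "children": 2, "spend": 120},
    {"gender": "M", "age": 51, "income": 75, "children": 0, "spend": 310},
    {"gender": "F", "age": 27, "income": 30, "children": 1, "spend": 95},
]

feature_names = ["gender", "age", "income", "children"]  # independent / feature variables
target_name = "spend"                                    # dependent / target variable, the label

# Split each row into its feature values (X) and its label (y).
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(X[0])  # ['F', 34, 42, 2]
print(y)     # [120, 310, 95]
```

Each entry of `X` holds one row's feature values, and the matching entry of `y` holds that row's label, exactly the pairing that supervised learning trains on.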
So one Dialogue: 0,0:14:26.60,0:14:29.52,Default,,0000,0000,0000,,single row, one single sample, one Dialogue: 0,0:14:29.52,0:14:32.56,Default,,0000,0000,0000,,single data point, contains all the data Dialogue: 0,0:14:32.56,0:14:35.04,Default,,0000,0000,0000,,for the feature variables and one single Dialogue: 0,0:14:35.04,0:14:37.80,Default,,0000,0000,0000,,value for the label or the target Dialogue: 0,0:14:37.80,0:14:41.20,Default,,0000,0000,0000,,variable, okay? And the primary purpose of Dialogue: 0,0:14:41.20,0:14:43.24,Default,,0000,0000,0000,,the machine learning model is to create Dialogue: 0,0:14:43.24,0:14:45.52,Default,,0000,0000,0000,,a mapping from all your feature Dialogue: 0,0:14:45.52,0:14:48.16,Default,,0000,0000,0000,,variables to your target variable, so Dialogue: 0,0:14:48.16,0:14:50.92,Default,,0000,0000,0000,,somehow there's going to be a function, Dialogue: 0,0:14:50.92,0:14:52.16,Default,,0000,0000,0000,,okay, this will be a mathematical Dialogue: 0,0:14:52.16,0:14:54.80,Default,,0000,0000,0000,,function that maps all the values of Dialogue: 0,0:14:54.80,0:14:57.04,Default,,0000,0000,0000,,your feature variables to the value of Dialogue: 0,0:14:57.04,0:14:59.64,Default,,0000,0000,0000,,your target variable. In other words, this Dialogue: 0,0:14:59.64,0:15:01.28,Default,,0000,0000,0000,,function represents the relationship Dialogue: 0,0:15:01.28,0:15:03.36,Default,,0000,0000,0000,,between your feature variables and your Dialogue: 0,0:15:03.36,0:15:07.08,Default,,0000,0000,0000,,target variable, okay? So this whole thing, Dialogue: 0,0:15:07.08,0:15:08.56,Default,,0000,0000,0000,,this training process, we call this Dialogue: 0,0:15:08.56,0:15:11.32,Default,,0000,0000,0000,,fitting the model.
And the target Dialogue: 0,0:15:11.32,0:15:13.24,Default,,0000,0000,0000,,variable or the label, this thing here, Dialogue: 0,0:15:13.24,0:15:15.12,Default,,0000,0000,0000,,this column here, or the values here, Dialogue: 0,0:15:15.12,0:15:17.40,Default,,0000,0000,0000,,these are critical for providing a Dialogue: 0,0:15:17.40,0:15:19.00,Default,,0000,0000,0000,,context to do the fitting or the Dialogue: 0,0:15:19.00,0:15:21.16,Default,,0000,0000,0000,,training of the model. And once you've Dialogue: 0,0:15:21.16,0:15:23.36,Default,,0000,0000,0000,,got a trained and fitted model, you can Dialogue: 0,0:15:23.36,0:15:25.96,Default,,0000,0000,0000,,then use the model to make an accurate Dialogue: 0,0:15:25.96,0:15:28.32,Default,,0000,0000,0000,,prediction of target values Dialogue: 0,0:15:28.32,0:15:30.24,Default,,0000,0000,0000,,corresponding to new feature values that Dialogue: 0,0:15:30.24,0:15:32.52,Default,,0000,0000,0000,,the model has yet to encounter or yet to Dialogue: 0,0:15:32.52,0:15:34.76,Default,,0000,0000,0000,,see, and this, as I've already said Dialogue: 0,0:15:34.76,0:15:36.24,Default,,0000,0000,0000,,earlier, this is called predictive Dialogue: 0,0:15:36.24,0:15:38.48,Default,,0000,0000,0000,,analytics, okay? 
So let's see what's Dialogue: 0,0:15:38.48,0:15:40.12,Default,,0000,0000,0000,,actually happening here, you take your Dialogue: 0,0:15:40.12,0:15:43.08,Default,,0000,0000,0000,,training data, all right, so this is this Dialogue: 0,0:15:43.08,0:15:44.88,Default,,0000,0000,0000,,whole bunch of data, this data set here Dialogue: 0,0:15:44.88,0:15:47.44,Default,,0000,0000,0000,,consisting of a thousand rows of Dialogue: 0,0:15:47.44,0:15:49.92,Default,,0000,0000,0000,,data, 10,000 rows of data, you take this Dialogue: 0,0:15:49.92,0:15:52.04,Default,,0000,0000,0000,,entire data set, all right, this entire Dialogue: 0,0:15:52.04,0:15:54.00,Default,,0000,0000,0000,,data set, you jam it into your machine Dialogue: 0,0:15:54.00,0:15:56.52,Default,,0000,0000,0000,,learning algorithm, and a couple of hours Dialogue: 0,0:15:56.52,0:15:58.08,Default,,0000,0000,0000,,later your machine learning algorithm Dialogue: 0,0:15:58.08,0:16:01.36,Default,,0000,0000,0000,,comes up with a model. And the model is Dialogue: 0,0:16:01.36,0:16:04.20,Default,,0000,0000,0000,,essentially a function that maps all Dialogue: 0,0:16:04.20,0:16:05.96,Default,,0000,0000,0000,,your feature variables which is these Dialogue: 0,0:16:05.96,0:16:08.20,Default,,0000,0000,0000,,four columns here, to your target Dialogue: 0,0:16:08.20,0:16:10.44,Default,,0000,0000,0000,,variable which is this one single column Dialogue: 0,0:16:10.44,0:16:14.28,Default,,0000,0000,0000,,here, okay? So once you have the model, you Dialogue: 0,0:16:14.28,0:16:17.04,Default,,0000,0000,0000,,can put in a new data point. So basically Dialogue: 0,0:16:17.04,0:16:19.08,Default,,0000,0000,0000,,the new data point represents data about a Dialogue: 0,0:16:19.08,0:16:20.96,Default,,0000,0000,0000,,new customer, a new customer that you Dialogue: 0,0:16:20.96,0:16:23.12,Default,,0000,0000,0000,,have never seen before. 
So let's say Dialogue: 0,0:16:23.12,0:16:25.08,Default,,0000,0000,0000,,you've already got information about Dialogue: 0,0:16:25.08,0:16:27.56,Default,,0000,0000,0000,,10,000 customers that have visited this Dialogue: 0,0:16:27.56,0:16:29.92,Default,,0000,0000,0000,,mall and how much each of these 10,000 Dialogue: 0,0:16:29.92,0:16:31.52,Default,,0000,0000,0000,,customers have spent when they are at this Dialogue: 0,0:16:31.52,0:16:34.04,Default,,0000,0000,0000,,mall. So now you have a totally new Dialogue: 0,0:16:34.04,0:16:35.80,Default,,0000,0000,0000,,customer that comes in the mall, this Dialogue: 0,0:16:35.80,0:16:37.80,Default,,0000,0000,0000,,customer has never come into this mall Dialogue: 0,0:16:37.80,0:16:39.84,Default,,0000,0000,0000,,before, and what we know about this Dialogue: 0,0:16:39.84,0:16:42.68,Default,,0000,0000,0000,,customer is that he is a male, the age is Dialogue: 0,0:16:42.68,0:16:45.20,Default,,0000,0000,0000,,50, the income is 18, and they have nine Dialogue: 0,0:16:45.20,0:16:48.16,Default,,0000,0000,0000,,children. So now when you take this data Dialogue: 0,0:16:48.16,0:16:50.52,Default,,0000,0000,0000,,and you pump that into your model, your Dialogue: 0,0:16:50.52,0:16:52.92,Default,,0000,0000,0000,,model is going to make a prediction, it's Dialogue: 0,0:16:52.92,0:16:55.72,Default,,0000,0000,0000,,going to say, hey, you know what? 
Based on Dialogue: 0,0:16:55.72,0:16:57.28,Default,,0000,0000,0000,,everything that I have been trained on before Dialogue: 0,0:16:57.28,0:16:59.36,Default,,0000,0000,0000,,and based on the model I've developed, Dialogue: 0,0:16:59.36,0:17:01.96,Default,,0000,0000,0000,,I am going to predict that a customer Dialogue: 0,0:17:01.96,0:17:04.88,Default,,0000,0000,0000,,that is of a male gender, of the age 50 Dialogue: 0,0:17:04.88,0:17:08.28,Default,,0000,0000,0000,,with the income of 18, and nine children, Dialogue: 0,0:17:08.28,0:17:12.40,Default,,0000,0000,0000,,that customer is going to spend 25 ringgit Dialogue: 0,0:17:12.40,0:17:15.84,Default,,0000,0000,0000,,at the mall. And this is it, this is what Dialogue: 0,0:17:15.84,0:17:18.60,Default,,0000,0000,0000,,you want. Right there, right here, Dialogue: 0,0:17:18.60,0:17:21.32,Default,,0000,0000,0000,,can you see here? That is the final Dialogue: 0,0:17:21.32,0:17:23.48,Default,,0000,0000,0000,,output of your machine learning model. Dialogue: 0,0:17:23.48,0:17:27.36,Default,,0000,0000,0000,,It's going to make a prediction about Dialogue: 0,0:17:27.36,0:17:29.76,Default,,0000,0000,0000,,something that it has not ever seen Dialogue: 0,0:17:29.76,0:17:32.92,Default,,0000,0000,0000,,before, okay? That is the core, this is Dialogue: 0,0:17:32.92,0:17:35.52,Default,,0000,0000,0000,,essentially the core of machine learning. Dialogue: 0,0:17:35.52,0:17:38.64,Default,,0000,0000,0000,,Predictive analytics, making predictions Dialogue: 0,0:17:38.64,0:17:40.12,Default,,0000,0000,0000,,about the future Dialogue: 0,0:17:41.17,0:17:43.80,Default,,0000,0000,0000,,based on a historical data set. Dialogue: 0,0:17:44.38,0:17:47.44,Default,,0000,0000,0000,,Okay, so there are two areas of Dialogue: 0,0:17:47.44,0:17:49.48,Default,,0000,0000,0000,,supervised learning, regression and Dialogue: 0,0:17:49.48,0:17:51.40,Default,,0000,0000,0000,,classification.
So regression is used to Dialogue: 0,0:17:51.40,0:17:53.44,Default,,0000,0000,0000,,predict a numerical target variable, such Dialogue: 0,0:17:53.44,0:17:55.32,Default,,0000,0000,0000,,as the price of a house or the salary of Dialogue: 0,0:17:55.32,0:17:57.80,Default,,0000,0000,0000,,an employee, whereas classification is Dialogue: 0,0:17:57.80,0:17:59.92,Default,,0000,0000,0000,,used to predict a categorical target Dialogue: 0,0:17:59.92,0:18:03.56,Default,,0000,0000,0000,,variable or class label, okay? So for Dialogue: 0,0:18:03.56,0:18:05.80,Default,,0000,0000,0000,,classification you can have either Dialogue: 0,0:18:05.80,0:18:08.68,Default,,0000,0000,0000,,binary or multiclass, so, for example, Dialogue: 0,0:18:08.68,0:18:11.56,Default,,0000,0000,0000,,binary will be just true or false, zero Dialogue: 0,0:18:11.56,0:18:14.84,Default,,0000,0000,0000,,or one. So whether your machine is going Dialogue: 0,0:18:14.84,0:18:17.36,Default,,0000,0000,0000,,to fail or is it not going to fail, right? Dialogue: 0,0:18:17.36,0:18:19.00,Default,,0000,0000,0000,,So just two classes, two possible, Dialogue: 0,0:18:19.00,0:18:21.64,Default,,0000,0000,0000,,outcomes, or is the customer going to Dialogue: 0,0:18:21.64,0:18:23.68,Default,,0000,0000,0000,,make a purchase or is the customer not Dialogue: 0,0:18:23.68,0:18:26.16,Default,,0000,0000,0000,,going to make a purchase. We call this Dialogue: 0,0:18:26.16,0:18:28.12,Default,,0000,0000,0000,,binary classification. And then for Dialogue: 0,0:18:28.12,0:18:29.68,Default,,0000,0000,0000,,multiclass, when there are more than two Dialogue: 0,0:18:29.68,0:18:32.56,Default,,0000,0000,0000,,classes or types of values. So, for Dialogue: 0,0:18:32.56,0:18:34.04,Default,,0000,0000,0000,,example, here this would be a Dialogue: 0,0:18:34.04,0:18:35.76,Default,,0000,0000,0000,,classification problem. 
So if you have a Dialogue: 0,0:18:35.76,0:18:37.96,Default,,0000,0000,0000,,data set here, you've got information Dialogue: 0,0:18:37.96,0:18:39.36,Default,,0000,0000,0000,,about your customers, you've got your Dialogue: 0,0:18:39.36,0:18:41.16,Default,,0000,0000,0000,,gender of the customer, the age of the Dialogue: 0,0:18:41.16,0:18:42.92,Default,,0000,0000,0000,,customer, the salary of the customer, and Dialogue: 0,0:18:42.92,0:18:44.64,Default,,0000,0000,0000,,you also have a record of whether the Dialogue: 0,0:18:44.64,0:18:47.68,Default,,0000,0000,0000,,customer made a purchase or not, okay? So Dialogue: 0,0:18:47.68,0:18:50.08,Default,,0000,0000,0000,,you can take this data set to train a Dialogue: 0,0:18:50.08,0:18:52.44,Default,,0000,0000,0000,,classification model, and then the Dialogue: 0,0:18:52.44,0:18:54.12,Default,,0000,0000,0000,,classification model can then make a Dialogue: 0,0:18:54.12,0:18:56.32,Default,,0000,0000,0000,,prediction about a new customer, and Dialogue: 0,0:18:56.32,0:18:58.80,Default,,0000,0000,0000,,it's going to predict zero which Dialogue: 0,0:18:58.80,0:19:00.48,Default,,0000,0000,0000,,means the customer didn't make a Dialogue: 0,0:19:00.48,0:19:03.16,Default,,0000,0000,0000,,purchase or one which means the customer Dialogue: 0,0:19:03.16,0:19:06.32,Default,,0000,0000,0000,,made a purchase, right?
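A minimal sketch of this binary classification idea, using scikit-learn's DecisionTreeClassifier. The customer records and the new customer below are invented for illustration, not the talk's actual data:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical customer records: [gender (0 = female, 1 = male), age, salary]
X_train = [
    [0, 25, 30],
    [1, 47, 85],
    [0, 33, 40],
    [1, 52, 90],
    [0, 29, 35],
    [1, 41, 70],
]
y_train = [0, 1, 0, 1, 0, 1]  # label: 1 = made a purchase, 0 = did not

# Train (fit) a classification model on the labelled data set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict for a new customer the model has never seen before.
prediction = clf.predict([[1, 45, 80]])[0]
print("purchase" if prediction == 1 else "no purchase")
```

The model outputs a class label, zero or one, rather than a number, which is what distinguishes classification from regression.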
And regression, Dialogue: 0,0:19:06.32,0:19:08.60,Default,,0000,0000,0000,,this is regression, so let's say you want Dialogue: 0,0:19:08.60,0:19:11.28,Default,,0000,0000,0000,,to predict the wind speed, and you've got Dialogue: 0,0:19:11.28,0:19:13.80,Default,,0000,0000,0000,,historical data about all these four Dialogue: 0,0:19:13.80,0:19:16.56,Default,,0000,0000,0000,,other independent variables or feature Dialogue: 0,0:19:16.56,0:19:18.04,Default,,0000,0000,0000,,variables, so you have recorded Dialogue: 0,0:19:18.04,0:19:19.64,Default,,0000,0000,0000,,temperature, the pressure, the relative Dialogue: 0,0:19:19.64,0:19:21.80,Default,,0000,0000,0000,,humidity, and the wind direction for the Dialogue: 0,0:19:21.80,0:19:24.80,Default,,0000,0000,0000,,past 10 days, 15 days, or whatever, okay? So Dialogue: 0,0:19:24.80,0:19:26.76,Default,,0000,0000,0000,,now you are going to train your machine Dialogue: 0,0:19:26.76,0:19:28.72,Default,,0000,0000,0000,,learning model using this data set, and Dialogue: 0,0:19:28.72,0:19:31.68,Default,,0000,0000,0000,,the target variable column, okay, this Dialogue: 0,0:19:31.68,0:19:33.76,Default,,0000,0000,0000,,column here, the label is basically a Dialogue: 0,0:19:33.76,0:19:37.08,Default,,0000,0000,0000,,number, right? So now with this number, Dialogue: 0,0:19:37.08,0:19:39.60,Default,,0000,0000,0000,,this is a regression model, and so now Dialogue: 0,0:19:39.60,0:19:41.76,Default,,0000,0000,0000,,you can put in a new data point, so a new Dialogue: 0,0:19:41.76,0:19:45.08,Default,,0000,0000,0000,,data point means a new set of values for Dialogue: 0,0:19:45.08,0:19:46.96,Default,,0000,0000,0000,,temperature, pressure, relative humidity, Dialogue: 0,0:19:46.96,0:19:48.60,Default,,0000,0000,0000,,and wind direction, and your machine Dialogue: 0,0:19:48.60,0:19:50.68,Default,,0000,0000,0000,,learning model will then predict the Dialogue: 0,0:19:50.68,0:19:53.64,Default,,0000,0000,0000,,wind speed for that new data point, okay? 
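The regression example just described might look like this in code, here with scikit-learn's LinearRegression. The weather readings (temperature, pressure, relative humidity, wind direction) and wind speeds are fabricated for illustration:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical historical readings: [temperature, pressure, humidity, wind_direction]
X_train = [
    [30.1, 1012, 70, 180],
    [28.4, 1015, 82, 200],
    [31.0, 1009, 65, 170],
    [27.8, 1018, 88, 210],
    [29.5, 1013, 75, 190],
]
y_train = [12.0, 8.5, 14.2, 6.3, 10.1]  # recorded wind speed: the numeric label

# Fit the regression model: learn a function from the four features to wind speed.
model = LinearRegression().fit(X_train, y_train)

# A new data point means a new set of values for the four feature variables.
new_point = [[29.0, 1014, 78, 195]]
predicted_wind_speed = model.predict(new_point)[0]
print(f"predicted wind speed: {predicted_wind_speed:.1f}")
```

Because the target is a continuous number, the model's output is a number too; that is the defining trait of regression.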
Dialogue: 0,0:19:53.64,0:19:57.48,Default,,0000,0000,0000,,So that's a regression model. Dialogue: 0,0:19:59.16,0:20:02.28,Default,,0000,0000,0000,,All right. So in this particular topic Dialogue: 0,0:20:02.28,0:20:04.92,Default,,0000,0000,0000,,I'm going to talk about the workflow Dialogue: 0,0:20:04.92,0:20:07.96,Default,,0000,0000,0000,,that's involved in machine learning. So Dialogue: 0,0:20:07.96,0:20:12.64,Default,,0000,0000,0000,,in the previous slides, I talked about Dialogue: 0,0:20:12.64,0:20:14.60,Default,,0000,0000,0000,,developing the model, all right? But Dialogue: 0,0:20:14.60,0:20:16.36,Default,,0000,0000,0000,,that's just one part of the entire Dialogue: 0,0:20:16.36,0:20:19.08,Default,,0000,0000,0000,,workflow. So in real life when you use Dialogue: 0,0:20:19.08,0:20:20.48,Default,,0000,0000,0000,,machine learning, there's an end-to-end Dialogue: 0,0:20:20.48,0:20:22.48,Default,,0000,0000,0000,,workflow that's involved. So the first Dialogue: 0,0:20:22.48,0:20:24.16,Default,,0000,0000,0000,,thing, of course, is you need to get your Dialogue: 0,0:20:24.16,0:20:26.88,Default,,0000,0000,0000,,data, and then you need to clean your Dialogue: 0,0:20:26.88,0:20:29.00,Default,,0000,0000,0000,,data, and then you need to explore your Dialogue: 0,0:20:29.00,0:20:30.80,Default,,0000,0000,0000,,data. You need to see what's going on in Dialogue: 0,0:20:30.80,0:20:33.28,Default,,0000,0000,0000,,your data set, right?
And your data set, Dialogue: 0,0:20:33.28,0:20:35.72,Default,,0000,0000,0000,,real life data sets are not trivial, they Dialogue: 0,0:20:35.72,0:20:38.76,Default,,0000,0000,0000,,are hundreds of rows, thousands of rows, Dialogue: 0,0:20:38.76,0:20:40.64,Default,,0000,0000,0000,,sometimes millions of rows, billions of Dialogue: 0,0:20:40.64,0:20:43.08,Default,,0000,0000,0000,,rows, we're talking about billions or Dialogue: 0,0:20:43.08,0:20:45.12,Default,,0000,0000,0000,,millions of data points especially if Dialogue: 0,0:20:45.12,0:20:47.12,Default,,0000,0000,0000,,you're using an IoT sensor to get data Dialogue: 0,0:20:47.12,0:20:49.00,Default,,0000,0000,0000,,in real time. So you've got all these Dialogue: 0,0:20:49.00,0:20:51.32,Default,,0000,0000,0000,,super large data sets, you need to clean Dialogue: 0,0:20:51.32,0:20:53.40,Default,,0000,0000,0000,,them, and explore them, and then you need Dialogue: 0,0:20:53.40,0:20:56.36,Default,,0000,0000,0000,,to prepare them into the right format so Dialogue: 0,0:20:56.36,0:20:59.60,Default,,0000,0000,0000,,that you can put them into the training Dialogue: 0,0:20:59.60,0:21:01.52,Default,,0000,0000,0000,,process to create your machine learning Dialogue: 0,0:21:01.52,0:21:04.80,Default,,0000,0000,0000,,model, and then subsequently you check Dialogue: 0,0:21:04.80,0:21:07.56,Default,,0000,0000,0000,,how good is the model, right? How accurate Dialogue: 0,0:21:07.56,0:21:10.08,Default,,0000,0000,0000,,is the model in terms of its ability to Dialogue: 0,0:21:10.08,0:21:12.56,Default,,0000,0000,0000,,generate predictions for the Dialogue: 0,0:21:12.56,0:21:14.96,Default,,0000,0000,0000,,future, right? How accurate are the Dialogue: 0,0:21:14.96,0:21:16.68,Default,,0000,0000,0000,,predictions that are coming from your Dialogue: 0,0:21:16.68,0:21:18.40,Default,,0000,0000,0000,,machine learning model.
So that's Dialogue: 0,0:21:18.40,0:21:20.76,Default,,0000,0000,0000,,validating or evaluating your model, and Dialogue: 0,0:21:20.76,0:21:22.56,Default,,0000,0000,0000,,then subsequently if you determine that Dialogue: 0,0:21:22.56,0:21:25.40,Default,,0000,0000,0000,,your model is of adequate accuracy to Dialogue: 0,0:21:25.40,0:21:27.24,Default,,0000,0000,0000,,meet whatever your domain use case Dialogue: 0,0:21:27.24,0:21:29.40,Default,,0000,0000,0000,,requirements are, right? So let's say the Dialogue: 0,0:21:29.40,0:21:31.44,Default,,0000,0000,0000,,accuracy that's required for your domain Dialogue: 0,0:21:31.44,0:21:32.44,Default,,0000,0000,0000,,use case is Dialogue: 0,0:21:32.44,0:21:35.32,Default,,0000,0000,0000,,85%, okay? If my machine learning model Dialogue: 0,0:21:35.32,0:21:38.52,Default,,0000,0000,0000,,can give an 85% accuracy rate, I think Dialogue: 0,0:21:38.52,0:21:40.16,Default,,0000,0000,0000,,it's good enough, then I'm going to Dialogue: 0,0:21:40.16,0:21:42.88,Default,,0000,0000,0000,,deploy it into a real world use case. So Dialogue: 0,0:21:42.88,0:21:45.00,Default,,0000,0000,0000,,here the machine learning model gets Dialogue: 0,0:21:45.00,0:21:48.44,Default,,0000,0000,0000,,deployed on the server, and then other, Dialogue: 0,0:21:48.44,0:21:50.76,Default,,0000,0000,0000,,you know, other data sources are going to Dialogue: 0,0:21:50.76,0:21:52.56,Default,,0000,0000,0000,,be captured from somewhere. That data is Dialogue: 0,0:21:52.56,0:21:54.20,Default,,0000,0000,0000,,pumped into the machine learning model.
The Dialogue: 0,0:21:54.20,0:21:55.44,Default,,0000,0000,0000,,machine learning model generates Dialogue: 0,0:21:55.44,0:21:57.76,Default,,0000,0000,0000,,predictions, and those predictions are Dialogue: 0,0:21:57.76,0:21:59.60,Default,,0000,0000,0000,,then used to make decisions on the Dialogue: 0,0:21:59.60,0:22:02.00,Default,,0000,0000,0000,,factory floor in real time or in any Dialogue: 0,0:22:02.00,0:22:04.56,Default,,0000,0000,0000,,other particular scenario. And then you Dialogue: 0,0:22:04.56,0:22:06.84,Default,,0000,0000,0000,,constantly monitor and update the model, Dialogue: 0,0:22:06.84,0:22:09.36,Default,,0000,0000,0000,,you get more new data, and then the Dialogue: 0,0:22:09.36,0:22:11.96,Default,,0000,0000,0000,,entire cycle repeats itself. So that's Dialogue: 0,0:22:11.96,0:22:14.48,Default,,0000,0000,0000,,your machine learning workflow, okay, in a Dialogue: 0,0:22:14.48,0:22:16.92,Default,,0000,0000,0000,,nutshell. Here's another example of Dialogue: 0,0:22:16.92,0:22:18.52,Default,,0000,0000,0000,,the same thing maybe in a slightly Dialogue: 0,0:22:18.52,0:22:20.04,Default,,0000,0000,0000,,different format, so, again, you have your Dialogue: 0,0:22:20.04,0:22:22.16,Default,,0000,0000,0000,,data collection and preparation. Here we Dialogue: 0,0:22:22.16,0:22:24.36,Default,,0000,0000,0000,,talk more about the different kinds of Dialogue: 0,0:22:24.36,0:22:26.52,Default,,0000,0000,0000,,algorithms that are available to create a Dialogue: 0,0:22:26.52,0:22:28.12,Default,,0000,0000,0000,,model, and I'll talk about this more in Dialogue: 0,0:22:28.12,0:22:30.00,Default,,0000,0000,0000,,detail when we look at the real world Dialogue: 0,0:22:30.00,0:22:32.32,Default,,0000,0000,0000,,example of an end-to-end machine learning Dialogue: 0,0:22:32.32,0:22:34.56,Default,,0000,0000,0000,,workflow for the predictive maintenance Dialogue: 0,0:22:34.56,0:22:36.88,Default,,0000,0000,0000,,use case.
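That cycle, get data, clean it, train, evaluate against an accuracy requirement, and only then deploy, can be compressed into a small sketch. The synthetic data set below and the 85% threshold (borrowed from the talk's example) stand in for a real predictive maintenance data set and requirement:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the "get data" step: a synthetic labelled data set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out part of the data to evaluate how well the model generalises.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train (fit) the model on the training portion only.
model = LogisticRegression().fit(X_train, y_train)

# Evaluate: how accurate are predictions on data the model has not seen?
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy = {accuracy:.2f}")

# Deploy only if accuracy meets the domain requirement (85% in the talk's example).
REQUIRED_ACCURACY = 0.85
if accuracy >= REQUIRED_ACCURACY:
    print("good enough -> deploy")
else:
    print("not good enough -> back to data collection / training")
```

In production, the deploy branch would be followed by monitoring and retraining on new data, which restarts the loop.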
So once you have chosen the Dialogue: 0,0:22:36.88,0:22:38.84,Default,,0000,0000,0000,,appropriate algorithm, you then have Dialogue: 0,0:22:38.84,0:22:41.24,Default,,0000,0000,0000,,trained your model, you then have Dialogue: 0,0:22:41.24,0:22:44.08,Default,,0000,0000,0000,,selected the appropriate trained model Dialogue: 0,0:22:44.08,0:22:46.44,Default,,0000,0000,0000,,among the multiple models. You are Dialogue: 0,0:22:46.44,0:22:47.80,Default,,0000,0000,0000,,probably going to develop multiple Dialogue: 0,0:22:47.80,0:22:49.56,Default,,0000,0000,0000,,models from multiple algorithms, you're Dialogue: 0,0:22:49.56,0:22:51.68,Default,,0000,0000,0000,,going to evaluate them all, and then Dialogue: 0,0:22:51.68,0:22:53.20,Default,,0000,0000,0000,,you're going to say, hey, you know what? Dialogue: 0,0:22:53.20,0:22:55.28,Default,,0000,0000,0000,,After I've evaluated and tested them all, Dialogue: 0,0:22:55.28,0:22:57.48,Default,,0000,0000,0000,,I've chosen the best model, I'm going to Dialogue: 0,0:22:57.48,0:22:59.64,Default,,0000,0000,0000,,deploy the model, all right, so this is Dialogue: 0,0:22:59.64,0:23:02.64,Default,,0000,0000,0000,,for real life production use, okay? Real Dialogue: 0,0:23:02.64,0:23:04.28,Default,,0000,0000,0000,,life sensor data is going to be pumped Dialogue: 0,0:23:04.28,0:23:06.04,Default,,0000,0000,0000,,into my model, my model is going to Dialogue: 0,0:23:06.04,0:23:08.04,Default,,0000,0000,0000,,generate predictions, the predicted data Dialogue: 0,0:23:08.04,0:23:10.12,Default,,0000,0000,0000,,is going to be used immediately in real Dialogue: 0,0:23:10.12,0:23:12.84,Default,,0000,0000,0000,,time for real life decision making, and Dialogue: 0,0:23:12.84,0:23:15.00,Default,,0000,0000,0000,,then I'm going to monitor, right, the Dialogue: 0,0:23:15.00,0:23:17.44,Default,,0000,0000,0000,,results.
So somebody's using the Dialogue: 0,0:23:17.44,0:23:19.28,Default,,0000,0000,0000,,predictions from my model, if the Dialogue: 0,0:23:19.28,0:23:21.88,Default,,0000,0000,0000,,predictions are lousy, that goes into the Dialogue: 0,0:23:21.88,0:23:23.44,Default,,0000,0000,0000,,monitoring, the monitoring system Dialogue: 0,0:23:23.44,0:23:25.28,Default,,0000,0000,0000,,captures that. If the predictions are Dialogue: 0,0:23:25.28,0:23:27.72,Default,,0000,0000,0000,,fantastic, well that is also captured by the Dialogue: 0,0:23:27.72,0:23:29.80,Default,,0000,0000,0000,,monitoring system, and that gets Dialogue: 0,0:23:29.80,0:23:32.36,Default,,0000,0000,0000,,fed back again into the next cycle of my Dialogue: 0,0:23:32.36,0:23:33.68,Default,,0000,0000,0000,,machine learning Dialogue: 0,0:23:33.68,0:23:35.96,Default,,0000,0000,0000,,pipeline. Okay, so that's the kind of Dialogue: 0,0:23:35.96,0:23:38.36,Default,,0000,0000,0000,,overall view, and here are the kind of Dialogue: 0,0:23:38.36,0:23:41.56,Default,,0000,0000,0000,,key phases of your workflow. So one of Dialogue: 0,0:23:41.56,0:23:43.96,Default,,0000,0000,0000,,the important phases is called EDA, Dialogue: 0,0:23:43.96,0:23:47.52,Default,,0000,0000,0000,,exploratory data analysis, and in this Dialogue: 0,0:23:47.52,0:23:49.88,Default,,0000,0000,0000,,particular phase, you're going to Dialogue: 0,0:23:49.88,0:23:53.12,Default,,0000,0000,0000,,do a lot of stuff, primarily just to Dialogue: 0,0:23:53.12,0:23:54.88,Default,,0000,0000,0000,,understand your data set.
So like I said, Dialogue: 0,0:23:54.88,0:23:56.56,Default,,0000,0000,0000,,real life data sets, they tend to be very Dialogue: 0,0:23:56.56,0:23:59.32,Default,,0000,0000,0000,,complex, and they tend to have various Dialogue: 0,0:23:59.32,0:24:01.04,Default,,0000,0000,0000,,statistical properties, all right, Dialogue: 0,0:24:01.04,0:24:02.68,Default,,0000,0000,0000,,statistics is a very important component Dialogue: 0,0:24:02.68,0:24:05.60,Default,,0000,0000,0000,,of machine learning. So an EDA helps you Dialogue: 0,0:24:05.60,0:24:07.48,Default,,0000,0000,0000,,to kind of get an overview of your data Dialogue: 0,0:24:07.48,0:24:09.68,Default,,0000,0000,0000,,set, get an overview of any problems in Dialogue: 0,0:24:09.68,0:24:11.52,Default,,0000,0000,0000,,your data set like any data that's Dialogue: 0,0:24:11.52,0:24:13.44,Default,,0000,0000,0000,,missing, the statistical properties of your Dialogue: 0,0:24:13.44,0:24:15.16,Default,,0000,0000,0000,,data set, the distribution of your data Dialogue: 0,0:24:15.16,0:24:17.28,Default,,0000,0000,0000,,set, the statistical correlation of Dialogue: 0,0:24:17.28,0:24:19.19,Default,,0000,0000,0000,,variables in your data set, etc, Dialogue: 0,0:24:19.19,0:24:23.40,Default,,0000,0000,0000,,etc. 
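A few lines of pandas cover the EDA steps just listed: summary statistics, missing-value counts, and correlations between variables. The tiny data frame below is fabricated for illustration, including one deliberately missing value:

```python
import numpy as np
import pandas as pd

# A small made-up data set with a deliberately missing value in "age".
df = pd.DataFrame({
    "age":    [34, 51, 27, 45, np.nan],
    "income": [42, 75, 30, 60, 55],
    "spend":  [120, 310, 95, 240, 180],
})

print(df.describe())                   # summary statistics per column
print(df.isnull().sum())               # count of missing values per column
print(df[["income", "spend"]].corr())  # statistical correlation of variables
```

On a real data set with thousands or millions of rows, these same calls give a first overview of distributions, data quality problems, and which features move together.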
Okay, then we have data cleaning or Dialogue: 0,0:24:23.40,0:24:25.28,Default,,0000,0000,0000,,sometimes you call it data cleansing, and Dialogue: 0,0:24:25.28,0:24:27.60,Default,,0000,0000,0000,,in this phase what you want to do is Dialogue: 0,0:24:27.60,0:24:29.44,Default,,0000,0000,0000,,primarily, you want to kind of do things Dialogue: 0,0:24:29.44,0:24:31.96,Default,,0000,0000,0000,,like remove duplicate records or rows in Dialogue: 0,0:24:31.96,0:24:33.68,Default,,0000,0000,0000,,your table, you want to make sure that Dialogue: 0,0:24:33.68,0:24:36.80,Default,,0000,0000,0000,,your data or your data Dialogue: 0,0:24:36.80,0:24:39.40,Default,,0000,0000,0000,,points or your samples have appropriate IDs, Dialogue: 0,0:24:39.40,0:24:41.08,Default,,0000,0000,0000,,and most importantly, you want to make Dialogue: 0,0:24:41.08,0:24:43.04,Default,,0000,0000,0000,,sure there aren't too many missing values Dialogue: 0,0:24:43.04,0:24:44.88,Default,,0000,0000,0000,,in your data set. So what I mean by Dialogue: 0,0:24:44.88,0:24:46.32,Default,,0000,0000,0000,,missing values are things like this, Dialogue: 0,0:24:46.32,0:24:48.20,Default,,0000,0000,0000,,right? You have got a data set, and for Dialogue: 0,0:24:48.20,0:24:51.64,Default,,0000,0000,0000,,some reason there are some cells or Dialogue: 0,0:24:51.64,0:24:54.56,Default,,0000,0000,0000,,locations in your data set which are Dialogue: 0,0:24:54.56,0:24:56.52,Default,,0000,0000,0000,,missing values, right? And if you have a Dialogue: 0,0:24:56.52,0:24:58.68,Default,,0000,0000,0000,,lot of these missing values, then you've Dialogue: 0,0:24:58.68,0:25:00.44,Default,,0000,0000,0000,,got a poor quality data set, and you're Dialogue: 0,0:25:00.44,0:25:02.20,Default,,0000,0000,0000,,not going to be able to build a good Dialogue: 0,0:25:02.20,0:25:04.16,Default,,0000,0000,0000,,model from this data set.
You're not Dialogue: 0,0:25:04.16,0:25:06.00,Default,,0000,0000,0000,,going to be able to train a good machine Dialogue: 0,0:25:06.00,0:25:08.12,Default,,0000,0000,0000,,learning model from a data set with a Dialogue: 0,0:25:08.12,0:25:10.20,Default,,0000,0000,0000,,lot of missing values like this. So you Dialogue: 0,0:25:10.20,0:25:11.88,Default,,0000,0000,0000,,have to figure out whether there are a Dialogue: 0,0:25:11.88,0:25:13.40,Default,,0000,0000,0000,,lot of missing values in your data set, Dialogue: 0,0:25:13.40,0:25:15.40,Default,,0000,0000,0000,,how do you handle them. Another thing Dialogue: 0,0:25:15.40,0:25:16.92,Default,,0000,0000,0000,,that's important in data cleansing is Dialogue: 0,0:25:16.92,0:25:18.80,Default,,0000,0000,0000,,figuring out the outliers in your data Dialogue: 0,0:25:18.80,0:25:21.92,Default,,0000,0000,0000,,set. So outliers are things like this, Dialogue: 0,0:25:21.92,0:25:24.04,Default,,0000,0000,0000,,you know, data points that are very far from Dialogue: 0,0:25:24.04,0:25:26.44,Default,,0000,0000,0000,,the general trend of data points in your Dialogue: 0,0:25:26.44,0:25:29.56,Default,,0000,0000,0000,,data set, right? And so there are also Dialogue: 0,0:25:29.56,0:25:31.92,Default,,0000,0000,0000,,several ways to detect outliers in your Dialogue: 0,0:25:31.92,0:25:34.20,Default,,0000,0000,0000,,data set, and there are several ways to Dialogue: 0,0:25:34.20,0:25:36.64,Default,,0000,0000,0000,,handle outliers in your data set. Dialogue: 0,0:25:36.64,0:25:38.20,Default,,0000,0000,0000,,Similarly as well, there are several ways Dialogue: 0,0:25:38.20,0:25:39.96,Default,,0000,0000,0000,,to handle missing values in your data Dialogue: 0,0:25:39.96,0:25:42.88,Default,,0000,0000,0000,,set. 
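Two of the many techniques mentioned here can be sketched with the standard library alone: mean imputation for missing values, and the common IQR (interquartile range) rule for flagging outliers. The numeric column below is invented, with one missing value and one obvious outlier:

```python
import statistics

# Toy numeric column with a missing value (None) and one extreme outlier.
values = [12.0, 14.5, 13.2, None, 15.1, 13.8, 98.0]

# Handle missing values: one common technique is mean imputation.
present = [v for v in values if v is not None]
mean = sum(present) / len(present)
imputed = [v if v is not None else mean for v in values]

# Detect outliers with the IQR rule: flag points outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < low or v > high]
cleaned = [v for v in imputed if low <= v <= high]

print(outliers)  # the point far from the general trend of the data
```

Dropping rows, median imputation, or capping instead of deleting outliers are equally common choices; which one is right depends on the domain.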
So handling missing values, handling Dialogue: 0,0:25:42.88,0:25:45.68,Default,,0000,0000,0000,,outliers, those are really two very key Dialogue: 0,0:25:45.68,0:25:47.28,Default,,0000,0000,0000,,aspects of data Dialogue: 0,0:25:47.28,0:25:49.12,Default,,0000,0000,0000,,cleansing, and there are many, many Dialogue: 0,0:25:49.12,0:25:50.76,Default,,0000,0000,0000,,techniques to handle this, so a data Dialogue: 0,0:25:50.76,0:25:52.00,Default,,0000,0000,0000,,scientist needs to be acquainted with Dialogue: 0,0:25:52.00,0:25:55.36,Default,,0000,0000,0000,,all of this. All right, why do I need to Dialogue: 0,0:25:55.36,0:25:58.00,Default,,0000,0000,0000,,do data cleansing? Well, here is the key Dialogue: 0,0:25:58.00,0:25:59.36,Default,,0000,0000,0000,,point. Dialogue: 0,0:25:59.36,0:26:02.80,Default,,0000,0000,0000,,If you have a very poor quality data set, Dialogue: 0,0:26:02.80,0:26:04.88,Default,,0000,0000,0000,,which means you've got a lot of outliers Dialogue: 0,0:26:04.88,0:26:06.72,Default,,0000,0000,0000,,which are errors in your data set, or you Dialogue: 0,0:26:06.72,0:26:08.16,Default,,0000,0000,0000,,got a lot of missing values in your data Dialogue: 0,0:26:08.16,0:26:10.84,Default,,0000,0000,0000,,set, even though you've got a fantastic Dialogue: 0,0:26:10.84,0:26:13.04,Default,,0000,0000,0000,,algorithm, you've got a fantastic model, Dialogue: 0,0:26:13.04,0:26:15.72,Default,,0000,0000,0000,,the predictions that your model is going Dialogue: 0,0:26:15.72,0:26:18.96,Default,,0000,0000,0000,,to give are absolutely rubbish. It's kind Dialogue: 0,0:26:18.96,0:26:22.08,Default,,0000,0000,0000,,of like taking water and putting it Dialogue: 0,0:26:22.08,0:26:26.00,Default,,0000,0000,0000,,into the tank of a Mercedes-Benz.
So Dialogue: 0,0:26:26.00,0:26:28.44,Default,,0000,0000,0000,,Mercedes-Benz is a great car, but if you Dialogue: 0,0:26:28.44,0:26:30.08,Default,,0000,0000,0000,,take water and put it into your Dialogue: 0,0:26:30.08,0:26:33.40,Default,,0000,0000,0000,,Mercedes-Benz, it will just die, right? Your Dialogue: 0,0:26:33.40,0:26:36.52,Default,,0000,0000,0000,,car will just die, it can't run on water, Dialogue: 0,0:26:36.52,0:26:38.28,Default,,0000,0000,0000,,right? On the other hand, if you have a Dialogue: 0,0:26:38.28,0:26:41.56,Default,,0000,0000,0000,,Myvi, a Myvi is just a lousy, cheap car, but if Dialogue: 0,0:26:41.56,0:26:44.84,Default,,0000,0000,0000,,you take a high octane, good petrol and Dialogue: 0,0:26:44.84,0:26:47.24,Default,,0000,0000,0000,,you put it into a Myvi, the Myvi will just go at, Dialogue: 0,0:26:47.24,0:26:49.48,Default,,0000,0000,0000,,you know, 100 miles an hour. It would just Dialogue: 0,0:26:49.48,0:26:51.16,Default,,0000,0000,0000,,completely destroy the Mercedes-Benz in Dialogue: 0,0:26:51.16,0:26:53.36,Default,,0000,0000,0000,,terms of performance, so it Dialogue: 0,0:26:53.36,0:26:54.80,Default,,0000,0000,0000,,doesn't really matter what model you're Dialogue: 0,0:26:54.80,0:26:57.08,Default,,0000,0000,0000,,using here, right? So you can be using the most Dialogue: 0,0:26:57.08,0:26:58.68,Default,,0000,0000,0000,,fantastic model, like the Dialogue: 0,0:26:58.68,0:27:01.20,Default,,0000,0000,0000,,Mercedes-Benz of machine learning, but if Dialogue: 0,0:27:01.20,0:27:03.08,Default,,0000,0000,0000,,your data is lousy quality, your Dialogue: 0,0:27:03.08,0:27:06.48,Default,,0000,0000,0000,,predictions are also going to be rubbish, Dialogue: 0,0:27:06.48,0:27:10.00,Default,,0000,0000,0000,,okay?
So cleansing the data set is, in fact, Dialogue: 0,0:27:10.00,0:27:11.88,Default,,0000,0000,0000,,probably the most important thing that Dialogue: 0,0:27:11.88,0:27:13.64,Default,,0000,0000,0000,,data scientists need to do and that's Dialogue: 0,0:27:13.64,0:27:15.52,Default,,0000,0000,0000,,what they spend most of the time doing. Dialogue: 0,0:27:15.52,0:27:17.60,Default,,0000,0000,0000,,Building the model, training the Dialogue: 0,0:27:17.60,0:27:20.24,Default,,0000,0000,0000,,model, getting the right algorithms, and Dialogue: 0,0:27:20.24,0:27:23.24,Default,,0000,0000,0000,,so on, that's really a small portion of Dialogue: 0,0:27:23.24,0:27:25.20,Default,,0000,0000,0000,,the actual machine learning workflow, Dialogue: 0,0:27:25.20,0:27:27.36,Default,,0000,0000,0000,,right? In the actual machine learning Dialogue: 0,0:27:27.36,0:27:29.68,Default,,0000,0000,0000,,workflow, the vast majority of time is on Dialogue: 0,0:27:29.68,0:27:31.56,Default,,0000,0000,0000,,cleaning and organizing your Dialogue: 0,0:27:31.56,0:27:33.36,Default,,0000,0000,0000,,data. Then you have something called Dialogue: 0,0:27:33.36,0:27:35.08,Default,,0000,0000,0000,,feature engineering, which is where you Dialogue: 0,0:27:35.08,0:27:37.00,Default,,0000,0000,0000,,preprocess the feature variables of Dialogue: 0,0:27:37.00,0:27:38.92,Default,,0000,0000,0000,,your original data set prior to using Dialogue: 0,0:27:38.92,0:27:40.60,Default,,0000,0000,0000,,them to train the model, and this is Dialogue: 0,0:27:40.60,0:27:41.96,Default,,0000,0000,0000,,either through addition, deletion, Dialogue: 0,0:27:41.96,0:27:43.60,Default,,0000,0000,0000,,combination, or transformation of these Dialogue: 0,0:27:43.60,0:27:45.40,Default,,0000,0000,0000,,variables.
And then the idea is you want Dialogue: 0,0:27:45.40,0:27:47.00,Default,,0000,0000,0000,,to improve the predictive accuracy of Dialogue: 0,0:27:47.00,0:27:49.32,Default,,0000,0000,0000,,the model, and also, because some models Dialogue: 0,0:27:49.32,0:27:51.08,Default,,0000,0000,0000,,can only work with numeric data, you Dialogue: 0,0:27:51.08,0:27:53.72,Default,,0000,0000,0000,,need to transform categorical data into Dialogue: 0,0:27:53.72,0:27:57.04,Default,,0000,0000,0000,,numeric data. All right, so just now, in Dialogue: 0,0:27:57.04,0:27:58.80,Default,,0000,0000,0000,,the earlier slides, I showed you that you Dialogue: 0,0:27:58.80,0:28:00.76,Default,,0000,0000,0000,,take your original data set, you pump it Dialogue: 0,0:28:00.76,0:28:03.20,Default,,0000,0000,0000,,into the algorithm, and then a couple of hours Dialogue: 0,0:28:03.20,0:28:05.20,Default,,0000,0000,0000,,later, you get a machine learning model, Dialogue: 0,0:28:05.20,0:28:08.64,Default,,0000,0000,0000,,right? So you didn't do anything to your Dialogue: 0,0:28:08.64,0:28:10.16,Default,,0000,0000,0000,,data set, to the feature variables in Dialogue: 0,0:28:10.16,0:28:12.16,Default,,0000,0000,0000,,your data set, before you pumped it into a Dialogue: 0,0:28:12.16,0:28:14.40,Default,,0000,0000,0000,,machine learning algorithm. So Dialogue: 0,0:28:14.40,0:28:15.84,Default,,0000,0000,0000,,what I showed you earlier is you just Dialogue: 0,0:28:15.84,0:28:18.92,Default,,0000,0000,0000,,take the data set exactly as it is and Dialogue: 0,0:28:18.92,0:28:20.80,Default,,0000,0000,0000,,you just pump it into the algorithm, and a Dialogue: 0,0:28:20.80,0:28:23.12,Default,,0000,0000,0000,,couple of hours later, you get a model, Dialogue: 0,0:28:23.12,0:28:27.64,Default,,0000,0000,0000,,right? But that's not what generally Dialogue: 0,0:28:27.64,0:28:29.60,Default,,0000,0000,0000,,happens in real life.
In real life, Dialogue: 0,0:28:29.60,0:28:31.56,Default,,0000,0000,0000,,you're going to take all the original Dialogue: 0,0:28:31.56,0:28:34.32,Default,,0000,0000,0000,,feature variables from your data set and Dialogue: 0,0:28:34.32,0:28:36.72,Default,,0000,0000,0000,,you're going to transform them in some Dialogue: 0,0:28:36.72,0:28:38.96,Default,,0000,0000,0000,,way. So you can see here these are the Dialogue: 0,0:28:38.96,0:28:42.12,Default,,0000,0000,0000,,columns of data from my original data set, Dialogue: 0,0:28:42.12,0:28:46.04,Default,,0000,0000,0000,,and before I actually put all these data Dialogue: 0,0:28:46.04,0:28:48.24,Default,,0000,0000,0000,,points from my original data set into my Dialogue: 0,0:28:48.24,0:28:50.72,Default,,0000,0000,0000,,algorithm to train and get my model, I Dialogue: 0,0:28:50.72,0:28:54.96,Default,,0000,0000,0000,,will actually transform them, okay? So the Dialogue: 0,0:28:54.96,0:28:57.60,Default,,0000,0000,0000,,transformation of these feature variable Dialogue: 0,0:28:57.60,0:29:00.60,Default,,0000,0000,0000,,values, we call this feature engineering. Dialogue: 0,0:29:00.60,0:29:02.44,Default,,0000,0000,0000,,And there are many, many techniques to do Dialogue: 0,0:29:02.44,0:29:04.96,Default,,0000,0000,0000,,feature engineering, so one-hot encoding, Dialogue: 0,0:29:04.96,0:29:08.28,Default,,0000,0000,0000,,scaling, log transformation, Dialogue: 0,0:29:08.28,0:29:10.48,Default,,0000,0000,0000,,discretization, date extraction, boolean Dialogue: 0,0:29:10.48,0:29:12.04,Default,,0000,0000,0000,,logic, etc, etc. Dialogue: 0,0:29:12.04,0:29:14.88,Default,,0000,0000,0000,,Okay, then finally we do something Dialogue: 0,0:29:14.88,0:29:16.80,Default,,0000,0000,0000,,called a train-test split, so where we Dialogue: 0,0:29:16.80,0:29:19.44,Default,,0000,0000,0000,,take our original dataset, right? 
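A rough sketch of a few of the feature-engineering techniques just listed (one-hot encoding, scaling, and a log transformation), using pandas on a toy table. The column names here only loosely echo the demo dataset and are assumptions, not the real CSV's headers.

```python
import pandas as pd
import numpy as np

# Hypothetical columns modeled loosely on the demo dataset.
df = pd.DataFrame({
    "type": ["L", "M", "H", "L"],            # categorical quality grade
    "rotational_speed": [1500, 1400, 2800, 1550],
})

# One-hot encoding: turn the categorical column into 0/1 indicator columns,
# since many algorithms can only work with numeric data.
encoded = pd.get_dummies(df, columns=["type"])

# Scaling: min-max scale the numeric column into the range [0, 1].
col = encoded["rotational_speed"]
encoded["speed_scaled"] = (col - col.min()) / (col.max() - col.min())

# Log transformation: compress a right-skewed numeric column.
encoded["speed_log"] = np.log(encoded["rotational_speed"])
```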
So this Dialogue: 0,0:29:19.44,0:29:21.36,Default,,0000,0000,0000,,was the original dataset, and we break Dialogue: 0,0:29:21.36,0:29:23.72,Default,,0000,0000,0000,,it into two parts, so one is called the Dialogue: 0,0:29:23.72,0:29:25.76,Default,,0000,0000,0000,,training dataset and the other is Dialogue: 0,0:29:25.76,0:29:28.12,Default,,0000,0000,0000,,called the test dataset. And the primary Dialogue: 0,0:29:28.12,0:29:30.00,Default,,0000,0000,0000,,purpose for this is when we feed and Dialogue: 0,0:29:30.00,0:29:31.40,Default,,0000,0000,0000,,train the machine learning model, we're Dialogue: 0,0:29:31.40,0:29:32.64,Default,,0000,0000,0000,,going to use what is called the training Dialogue: 0,0:29:32.64,0:29:35.56,Default,,0000,0000,0000,,dataset, and when we want to evaluate Dialogue: 0,0:29:35.56,0:29:37.40,Default,,0000,0000,0000,,the accuracy of the model, we're going to use the test dataset. So this Dialogue: 0,0:29:37.40,0:29:40.96,Default,,0000,0000,0000,,is the key part of your machine learning Dialogue: 0,0:29:40.96,0:29:43.64,Default,,0000,0000,0000,,life cycle, because you are not just Dialogue: 0,0:29:43.64,0:29:45.44,Default,,0000,0000,0000,,going to have one possible model, Dialogue: 0,0:29:45.44,0:29:47.72,Default,,0000,0000,0000,,because there is a vast range of Dialogue: 0,0:29:47.72,0:29:50.08,Default,,0000,0000,0000,,algorithms that you can use to create a Dialogue: 0,0:29:50.08,0:29:53.00,Default,,0000,0000,0000,,model. So fundamentally you have a wide Dialogue: 0,0:29:53.00,0:29:55.68,Default,,0000,0000,0000,,range of choices, right, like a wide range Dialogue: 0,0:29:55.68,0:29:57.64,Default,,0000,0000,0000,,of cars, right?
You want to buy a car, you Dialogue: 0,0:29:57.64,0:30:00.56,Default,,0000,0000,0000,,can buy a Myvi, you can buy a Perodua, Dialogue: 0,0:30:00.56,0:30:02.64,Default,,0000,0000,0000,,you can buy a Honda, you can buy a Dialogue: 0,0:30:02.64,0:30:05.04,Default,,0000,0000,0000,,Mercedes-Benz, you can buy an Audi, you can Dialogue: 0,0:30:05.04,0:30:07.76,Default,,0000,0000,0000,,buy a beamer, many, many different cars Dialogue: 0,0:30:07.76,0:30:09.24,Default,,0000,0000,0000,,that are available for you if you want Dialogue: 0,0:30:09.24,0:30:11.68,Default,,0000,0000,0000,,to buy a car, right? Same thing with a Dialogue: 0,0:30:11.68,0:30:14.36,Default,,0000,0000,0000,,machine learning model: there is a vast Dialogue: 0,0:30:14.36,0:30:16.72,Default,,0000,0000,0000,,variety of algorithms that you can Dialogue: 0,0:30:16.72,0:30:19.48,Default,,0000,0000,0000,,choose from in order to create a model, Dialogue: 0,0:30:19.48,0:30:21.52,Default,,0000,0000,0000,,and so once you create a model from a Dialogue: 0,0:30:21.52,0:30:24.48,Default,,0000,0000,0000,,given algorithm you need to ask, hey, how Dialogue: 0,0:30:24.48,0:30:26.44,Default,,0000,0000,0000,,accurate is this model that I've created Dialogue: 0,0:30:26.44,0:30:28.64,Default,,0000,0000,0000,,from this algorithm? And different Dialogue: 0,0:30:28.64,0:30:30.40,Default,,0000,0000,0000,,algorithms are going to create different Dialogue: 0,0:30:30.40,0:30:33.72,Default,,0000,0000,0000,,models with different rates of accuracy.
Dialogue: 0,0:30:33.72,0:30:35.68,Default,,0000,0000,0000,,And so the primary purpose of the test Dialogue: 0,0:30:35.68,0:30:38.20,Default,,0000,0000,0000,,dataset is to evaluate the accuracy Dialogue: 0,0:30:38.20,0:30:41.48,Default,,0000,0000,0000,,of the model, to see, hey, is this model Dialogue: 0,0:30:41.48,0:30:43.36,Default,,0000,0000,0000,,that I've created using this algorithm, Dialogue: 0,0:30:43.36,0:30:45.88,Default,,0000,0000,0000,,is it adequate for me to use in a real Dialogue: 0,0:30:45.88,0:30:48.60,Default,,0000,0000,0000,,life production use case? Okay? So that's Dialogue: 0,0:30:48.60,0:30:52.32,Default,,0000,0000,0000,,what it's all about. Okay, so this is my Dialogue: 0,0:30:52.32,0:30:54.28,Default,,0000,0000,0000,,original dataset, I break it into my Dialogue: 0,0:30:54.28,0:30:56.56,Default,,0000,0000,0000,,feature dataset and Dialogue: 0,0:30:56.56,0:30:58.52,Default,,0000,0000,0000,,also my target variable column, so the Dialogue: 0,0:30:58.52,0:31:00.64,Default,,0000,0000,0000,,feature variable columns and the target Dialogue: 0,0:31:00.64,0:31:02.20,Default,,0000,0000,0000,,variable column, and then I further break Dialogue: 0,0:31:02.20,0:31:04.24,Default,,0000,0000,0000,,it into a training dataset and a test Dialogue: 0,0:31:04.24,0:31:06.60,Default,,0000,0000,0000,,dataset. The training dataset is used Dialogue: 0,0:31:06.60,0:31:08.32,Default,,0000,0000,0000,,to train, to create, the machine learning Dialogue: 0,0:31:08.32,0:31:10.48,Default,,0000,0000,0000,,model. And then once the machine learning Dialogue: 0,0:31:10.48,0:31:12.20,Default,,0000,0000,0000,,model is created, I then use the test Dialogue: 0,0:31:12.20,0:31:15.08,Default,,0000,0000,0000,,dataset to evaluate the accuracy of the Dialogue: 0,0:31:15.08,0:31:17.26,Default,,0000,0000,0000,,machine learning model. Dialogue: 0,0:31:17.26,0:31:21.00,Default,,0000,0000,0000,,All right.
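The feature/target and train/test splits described above are typically one call to scikit-learn's `train_test_split`. Here is a minimal sketch on a ten-row toy table; the column names are stand-ins, not the demo dataset's exact headers.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset: two feature columns and one target column.
df = pd.DataFrame({
    "torque":    [40, 42, 39, 41, 38, 43, 40, 44, 39, 45],
    "tool_wear": [10, 200, 15, 180, 12, 210, 20, 190, 18, 220],
    "target":    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# First split: feature variable columns vs the target variable column.
X = df[["torque", "tool_wear"]]
y = df["target"]

# Second split: hold out 20% of the rows as the test dataset; stratify
# keeps the 0/1 class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

The model would then be trained on `X_train`/`y_train` only, and its accuracy measured on the held-out `X_test`/`y_test`.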
And then finally we can Dialogue: 0,0:31:21.00,0:31:23.20,Default,,0000,0000,0000,,see what are the different parts or Dialogue: 0,0:31:23.20,0:31:26.08,Default,,0000,0000,0000,,aspects that go into a successful model, Dialogue: 0,0:31:26.08,0:31:29.52,Default,,0000,0000,0000,,so EDA about 10%, data cleansing about Dialogue: 0,0:31:29.52,0:31:32.36,Default,,0000,0000,0000,,20%, feature engineering about Dialogue: 0,0:31:32.36,0:31:36.32,Default,,0000,0000,0000,,25%, selecting a specific algorithm about Dialogue: 0,0:31:36.32,0:31:39.12,Default,,0000,0000,0000,,10%, and then training the model from Dialogue: 0,0:31:39.12,0:31:41.64,Default,,0000,0000,0000,,that algorithm about 15%, and then Dialogue: 0,0:31:41.64,0:31:43.68,Default,,0000,0000,0000,,finally evaluating the model, deciding Dialogue: 0,0:31:43.68,0:31:45.96,Default,,0000,0000,0000,,which is the best model with the highest Dialogue: 0,0:31:45.96,0:31:51.82,Default,,0000,0000,0000,,accuracy rate, that's about 20%. Dialogue: 0,0:31:54.08,0:31:56.92,Default,,0000,0000,0000,,All right, so we have reached the Dialogue: 0,0:31:56.92,0:31:58.88,Default,,0000,0000,0000,,most interesting part of this Dialogue: 0,0:31:58.88,0:32:01.04,Default,,0000,0000,0000,,presentation which is the demonstration Dialogue: 0,0:32:01.04,0:32:03.76,Default,,0000,0000,0000,,of an end-to-end machine learning workflow Dialogue: 0,0:32:03.76,0:32:06.08,Default,,0000,0000,0000,,on a real life dataset that Dialogue: 0,0:32:06.08,0:32:10.08,Default,,0000,0000,0000,,demonstrates the use case of predictive Dialogue: 0,0:32:10.08,0:32:13.52,Default,,0000,0000,0000,,maintenance. So for the data set for Dialogue: 0,0:32:13.52,0:32:16.24,Default,,0000,0000,0000,,this particular use case, I've used a Dialogue: 0,0:32:16.24,0:32:19.20,Default,,0000,0000,0000,,data set from Kaggle. 
So for those of you Dialogue: 0,0:32:19.20,0:32:21.40,Default,,0000,0000,0000,,who are not aware of this, Kaggle is the Dialogue: 0,0:32:21.40,0:32:24.88,Default,,0000,0000,0000,,world's largest open-source community Dialogue: 0,0:32:24.88,0:32:28.08,Default,,0000,0000,0000,,for data science and AI, and they have a Dialogue: 0,0:32:28.08,0:32:31.16,Default,,0000,0000,0000,,large collection of datasets from Dialogue: 0,0:32:31.16,0:32:34.44,Default,,0000,0000,0000,,various areas of industry and human Dialogue: 0,0:32:34.44,0:32:37.04,Default,,0000,0000,0000,,endeavor, and they also have a large Dialogue: 0,0:32:37.04,0:32:38.84,Default,,0000,0000,0000,,collection of models that have been Dialogue: 0,0:32:38.84,0:32:42.88,Default,,0000,0000,0000,,developed using these data sets. So here Dialogue: 0,0:32:42.88,0:32:47.04,Default,,0000,0000,0000,,we have a data set for the particular Dialogue: 0,0:32:47.04,0:32:50.52,Default,,0000,0000,0000,,use case, predictive maintenance, okay? So Dialogue: 0,0:32:50.52,0:32:52.92,Default,,0000,0000,0000,,this is some information about the data Dialogue: 0,0:32:52.92,0:32:56.44,Default,,0000,0000,0000,,set, so in case you do not know how Dialogue: 0,0:32:56.44,0:32:59.20,Default,,0000,0000,0000,,to get there, this is the URL to click Dialogue: 0,0:32:59.20,0:33:02.24,Default,,0000,0000,0000,,on, okay, to get to that dataset. So once Dialogue: 0,0:33:02.24,0:33:05.12,Default,,0000,0000,0000,,you're at the Dialogue: 0,0:33:05.12,0:33:07.40,Default,,0000,0000,0000,,page for this dataset, you can see Dialogue: 0,0:33:07.40,0:33:09.96,Default,,0000,0000,0000,,all the information about this data set, Dialogue: 0,0:33:09.96,0:33:12.96,Default,,0000,0000,0000,,and you can download the data set in Dialogue: 0,0:33:12.96,0:33:14.16,Default,,0000,0000,0000,,CSV format.
Dialogue: 0,0:33:14.16,0:33:16.36,Default,,0000,0000,0000,,Okay, so let's take a look at the Dialogue: 0,0:33:16.36,0:33:19.56,Default,,0000,0000,0000,,dataset. So this dataset has a total of Dialogue: 0,0:33:19.56,0:33:23.44,Default,,0000,0000,0000,,10,000 samples, okay? And these are the Dialogue: 0,0:33:23.44,0:33:26.28,Default,,0000,0000,0000,,feature variables, the type, the product Dialogue: 0,0:33:26.28,0:33:28.44,Default,,0000,0000,0000,,ID, the air temperature, process Dialogue: 0,0:33:28.44,0:33:30.90,Default,,0000,0000,0000,,temperature, rotational speed, torque, tool Dialogue: 0,0:33:30.90,0:33:34.80,Default,,0000,0000,0000,,wear, and this is the target variable, Dialogue: 0,0:33:34.80,0:33:36.72,Default,,0000,0000,0000,,all right? So the target variable is what Dialogue: 0,0:33:36.72,0:33:38.16,Default,,0000,0000,0000,,we are interested in, what we are Dialogue: 0,0:33:38.16,0:33:40.96,Default,,0000,0000,0000,,interested in using to train the machine Dialogue: 0,0:33:40.96,0:33:42.60,Default,,0000,0000,0000,,learning model, and also what we are Dialogue: 0,0:33:42.60,0:33:45.28,Default,,0000,0000,0000,,interested to predict, okay? So these are Dialogue: 0,0:33:45.28,0:33:47.96,Default,,0000,0000,0000,,the feature variables, they describe or Dialogue: 0,0:33:47.96,0:33:49.96,Default,,0000,0000,0000,,they provide information about this Dialogue: 0,0:33:49.96,0:33:52.88,Default,,0000,0000,0000,,particular machine on the production Dialogue: 0,0:33:52.88,0:33:55.08,Default,,0000,0000,0000,,line, on the assembly line, so you might Dialogue: 0,0:33:55.08,0:33:56.80,Default,,0000,0000,0000,,know the product ID, the type, the air Dialogue: 0,0:33:56.80,0:33:58.12,Default,,0000,0000,0000,,temperature, process temperature, Dialogue: 0,0:33:58.12,0:34:00.48,Default,,0000,0000,0000,,rotational speed, torque, tool wear, right? 
So Dialogue: 0,0:34:00.48,0:34:03.16,Default,,0000,0000,0000,,let's say you've got an IoT sensor system Dialogue: 0,0:34:03.16,0:34:06.12,Default,,0000,0000,0000,,that's basically capturing all this data Dialogue: 0,0:34:06.12,0:34:08.36,Default,,0000,0000,0000,,about a product or a machine on your Dialogue: 0,0:34:08.36,0:34:10.68,Default,,0000,0000,0000,,production or assembly line, okay? And Dialogue: 0,0:34:10.68,0:34:13.92,Default,,0000,0000,0000,,you've also captured information about Dialogue: 0,0:34:13.92,0:34:17.20,Default,,0000,0000,0000,,whether, for a specific sample, Dialogue: 0,0:34:17.20,0:34:19.84,Default,,0000,0000,0000,,that sample experienced a Dialogue: 0,0:34:19.84,0:34:23.04,Default,,0000,0000,0000,,failure or not, okay? So the target value Dialogue: 0,0:34:23.04,0:34:25.52,Default,,0000,0000,0000,,of zero, okay, indicates that there's no Dialogue: 0,0:34:25.52,0:34:28.00,Default,,0000,0000,0000,,failure. So zero means no failure, and we Dialogue: 0,0:34:28.00,0:34:30.20,Default,,0000,0000,0000,,can see that the vast majority of data Dialogue: 0,0:34:30.20,0:34:32.52,Default,,0000,0000,0000,,points in this data set are no failure. Dialogue: 0,0:34:32.52,0:34:34.00,Default,,0000,0000,0000,,And here we can see an example Dialogue: 0,0:34:34.00,0:34:36.72,Default,,0000,0000,0000,,where you have a case of a failure, so a Dialogue: 0,0:34:36.72,0:34:40.16,Default,,0000,0000,0000,,failure is marked as a one, positive, and Dialogue: 0,0:34:40.16,0:34:42.64,Default,,0000,0000,0000,,no failure is marked as zero, negative, Dialogue: 0,0:34:42.64,0:34:44.88,Default,,0000,0000,0000,,all right? So here we have one type of Dialogue: 0,0:34:44.88,0:34:47.04,Default,,0000,0000,0000,,failure, it's called a power failure.
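Since the vast majority of rows are "no failure", a quick way to see that class imbalance once the CSV is loaded into pandas is `value_counts`. This sketch uses a small hypothetical table with made-up proportions, not the real 10,000-row file, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: mostly no-failure rows,
# with a few failure rows of different types.
df = pd.DataFrame({
    "target": [0] * 97 + [1] * 3,
    "failure_type": ["No Failure"] * 97
                    + ["Power Failure", "Tool Wear Failure", "Overstrain Failure"],
})

# Count how many rows fall into each class of the binary target.
counts = df["target"].value_counts()
print(counts)  # class 0 (no failure) dominates, as in the real dataset

# Same idea for the multiclass failure-type label.
print(df["failure_type"].value_counts())
```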
And Dialogue: 0,0:34:47.04,0:34:49.00,Default,,0000,0000,0000,,if you scroll down the data set, you see Dialogue: 0,0:34:49.00,0:34:50.40,Default,,0000,0000,0000,,there are also other kinds of failures Dialogue: 0,0:34:50.40,0:34:52.84,Default,,0000,0000,0000,,like a tool wear Dialogue: 0,0:34:52.84,0:34:56.96,Default,,0000,0000,0000,,failure, we have a overstrain failure Dialogue: 0,0:34:56.96,0:34:58.68,Default,,0000,0000,0000,,here, for example, Dialogue: 0,0:34:58.68,0:35:00.76,Default,,0000,0000,0000,,we also have a power failure again, Dialogue: 0,0:35:00.76,0:35:02.20,Default,,0000,0000,0000,,and so on. So if you scroll down through Dialogue: 0,0:35:02.20,0:35:04.16,Default,,0000,0000,0000,,these 10,000 data points, or if Dialogue: 0,0:35:04.16,0:35:06.04,Default,,0000,0000,0000,,you're familiar with using Excel to Dialogue: 0,0:35:06.04,0:35:08.84,Default,,0000,0000,0000,,filter out values in a column, you can Dialogue: 0,0:35:08.84,0:35:12.28,Default,,0000,0000,0000,,see that in this particular column here Dialogue: 0,0:35:12.28,0:35:14.48,Default,,0000,0000,0000,,which is the so-called target variable Dialogue: 0,0:35:14.48,0:35:16.96,Default,,0000,0000,0000,,column, you are going to have the vast Dialogue: 0,0:35:16.96,0:35:18.92,Default,,0000,0000,0000,,majority of values as zero which means Dialogue: 0,0:35:18.92,0:35:22.76,Default,,0000,0000,0000,,no failure, and some of the rows or the Dialogue: 0,0:35:22.76,0:35:24.04,Default,,0000,0000,0000,,data points you are going to have a Dialogue: 0,0:35:24.04,0:35:26.36,Default,,0000,0000,0000,,value of one, and for those rows that you Dialogue: 0,0:35:26.36,0:35:28.12,Default,,0000,0000,0000,,have a value of one, for example, Dialogue: 0,0:35:28.12,0:35:31.28,Default,,0000,0000,0000,,here you are- Sorry, for example, here you Dialogue: 0,0:35:31.28,0:35:32.84,Default,,0000,0000,0000,,are going to have different types of Dialogue: 0,0:35:32.84,0:35:34.64,Default,,0000,0000,0000,,failures, so like I said just 
now power Dialogue: 0,0:35:34.64,0:35:38.96,Default,,0000,0000,0000,,failure, tool wear failure, etc, etc. So we are Dialogue: 0,0:35:38.96,0:35:40.64,Default,,0000,0000,0000,,going to go through the entire machine Dialogue: 0,0:35:40.64,0:35:43.76,Default,,0000,0000,0000,,learning workflow process with this dataset. Dialogue: 0,0:35:43.76,0:35:46.64,Default,,0000,0000,0000,,So to see an example of that, we are Dialogue: 0,0:35:46.64,0:35:50.40,Default,,0000,0000,0000,,going to go to the Dialogue: 0,0:35:50.40,0:35:52.28,Default,,0000,0000,0000,,code section here, all right, so if I Dialogue: 0,0:35:52.28,0:35:54.28,Default,,0000,0000,0000,,click on the code section here. And right Dialogue: 0,0:35:54.28,0:35:56.40,Default,,0000,0000,0000,,down here we can see what is called a Dialogue: 0,0:35:56.40,0:35:59.36,Default,,0000,0000,0000,,dataset notebook. So this is basically a Dialogue: 0,0:35:59.36,0:36:02.32,Default,,0000,0000,0000,,Jupyter notebook. Jupyter is basically a Dialogue: 0,0:36:02.32,0:36:05.28,Default,,0000,0000,0000,,Python application which allows you to Dialogue: 0,0:36:05.28,0:36:09.24,Default,,0000,0000,0000,,create a Python machine learning Dialogue: 0,0:36:09.24,0:36:11.68,Default,,0000,0000,0000,,program that basically builds your Dialogue: 0,0:36:11.68,0:36:14.52,Default,,0000,0000,0000,,machine learning model, assesses or Dialogue: 0,0:36:14.52,0:36:16.48,Default,,0000,0000,0000,,evaluates its accuracy, and generates Dialogue: 0,0:36:16.48,0:36:19.04,Default,,0000,0000,0000,,predictions from it, okay? So here we have Dialogue: 0,0:36:19.04,0:36:21.68,Default,,0000,0000,0000,,a whole bunch of Jupyter notebooks that Dialogue: 0,0:36:21.68,0:36:24.56,Default,,0000,0000,0000,,are available, and you can select any one Dialogue: 0,0:36:24.56,0:36:26.00,Default,,0000,0000,0000,,of them.
All these notebooks are Dialogue: 0,0:36:26.00,0:36:28.72,Default,,0000,0000,0000,,essentially going to process the data Dialogue: 0,0:36:28.72,0:36:31.72,Default,,0000,0000,0000,,from this particular dataset. So if I go Dialogue: 0,0:36:31.72,0:36:34.72,Default,,0000,0000,0000,,to this code page here, I've actually Dialogue: 0,0:36:34.72,0:36:37.32,Default,,0000,0000,0000,,selected a specific notebook that I'm Dialogue: 0,0:36:37.32,0:36:39.96,Default,,0000,0000,0000,,going to run through to demonstrate an Dialogue: 0,0:36:39.96,0:36:42.84,Default,,0000,0000,0000,,end-to-end machine learning workflow using Dialogue: 0,0:36:42.84,0:36:45.56,Default,,0000,0000,0000,,various machine learning libraries from Dialogue: 0,0:36:45.56,0:36:49.80,Default,,0000,0000,0000,,the Python programming language, okay? So Dialogue: 0,0:36:49.80,0:36:52.44,Default,,0000,0000,0000,,the particular notebook I'm going to Dialogue: 0,0:36:52.44,0:36:55.16,Default,,0000,0000,0000,,use is this particular notebook here, and Dialogue: 0,0:36:55.16,0:36:57.16,Default,,0000,0000,0000,,you can also get the URL for that Dialogue: 0,0:36:57.16,0:37:00.44,Default,,0000,0000,0000,,particular notebook from here. Dialogue: 0,0:37:00.44,0:37:03.76,Default,,0000,0000,0000,,Okay, so let's quickly do a quick Dialogue: 0,0:37:03.76,0:37:05.97,Default,,0000,0000,0000,,revision again. What are we trying to do Dialogue: 0,0:37:05.97,0:37:08.00,Default,,0000,0000,0000,,here? We're trying to build a machine Dialogue: 0,0:37:08.00,0:37:11.36,Default,,0000,0000,0000,,learning classification model, right? 
So Dialogue: 0,0:37:11.36,0:37:12.96,Default,,0000,0000,0000,,we said there are two primary areas of Dialogue: 0,0:37:12.96,0:37:14.56,Default,,0000,0000,0000,,supervised learning, one is regression Dialogue: 0,0:37:14.56,0:37:16.20,Default,,0000,0000,0000,,which is used to predict a numerical Dialogue: 0,0:37:16.20,0:37:18.64,Default,,0000,0000,0000,,target variable, and the second kind of Dialogue: 0,0:37:18.64,0:37:21.36,Default,,0000,0000,0000,,supervised learning is classification Dialogue: 0,0:37:21.36,0:37:23.08,Default,,0000,0000,0000,,which is what we're doing here. We're Dialogue: 0,0:37:23.08,0:37:25.84,Default,,0000,0000,0000,,trying to predict a categorical target Dialogue: 0,0:37:25.84,0:37:29.68,Default,,0000,0000,0000,,variable, okay? So in this particular Dialogue: 0,0:37:29.68,0:37:32.12,Default,,0000,0000,0000,,example, we actually have two kinds of Dialogue: 0,0:37:32.12,0:37:34.48,Default,,0000,0000,0000,,ways we can classify, either a binary Dialogue: 0,0:37:34.48,0:37:36.68,Default,,0000,0000,0000,,classification or a multiclass Dialogue: 0,0:37:36.68,0:37:39.52,Default,,0000,0000,0000,,classification. So for binary Dialogue: 0,0:37:39.52,0:37:41.44,Default,,0000,0000,0000,,classification, we are only going to Dialogue: 0,0:37:41.44,0:37:43.40,Default,,0000,0000,0000,,classify the product or machine as Dialogue: 0,0:37:43.40,0:37:47.16,Default,,0000,0000,0000,,either it failed or it did not fail, okay? Dialogue: 0,0:37:47.16,0:37:48.88,Default,,0000,0000,0000,,So if we go back to the dataset that I Dialogue: 0,0:37:48.88,0:37:50.84,Default,,0000,0000,0000,,showed you just now, if you look at this Dialogue: 0,0:37:50.84,0:37:52.68,Default,,0000,0000,0000,,target variable column, there are only Dialogue: 0,0:37:52.68,0:37:54.52,Default,,0000,0000,0000,,two possible values here. They are either Dialogue: 0,0:37:54.52,0:37:58.28,Default,,0000,0000,0000,,zero or one. Zero means there's no failure. 
Dialogue: 0,0:37:58.28,0:38:01.24,Default,,0000,0000,0000,,One means there's a failure, okay? So this Dialogue: 0,0:38:01.24,0:38:03.44,Default,,0000,0000,0000,,is an example of a binary classification. Dialogue: 0,0:38:03.44,0:38:07.24,Default,,0000,0000,0000,,Only two possible outcomes, zero or one, Dialogue: 0,0:38:07.24,0:38:10.12,Default,,0000,0000,0000,,didn't fail or fail, all right? Two Dialogue: 0,0:38:10.12,0:38:13.08,Default,,0000,0000,0000,,possible outcomes. And then we can also, Dialogue: 0,0:38:13.08,0:38:15.48,Default,,0000,0000,0000,,for the same dataset, we can extend it Dialogue: 0,0:38:15.48,0:38:18.08,Default,,0000,0000,0000,,and make it a multiclass classification Dialogue: 0,0:38:18.08,0:38:20.88,Default,,0000,0000,0000,,problem, all right? So if we kind of want Dialogue: 0,0:38:20.88,0:38:23.72,Default,,0000,0000,0000,,to drill down further, we can say that Dialogue: 0,0:38:23.72,0:38:26.80,Default,,0000,0000,0000,,not only is there a failure, we can Dialogue: 0,0:38:26.80,0:38:29.20,Default,,0000,0000,0000,,actually say there are different types of Dialogue: 0,0:38:29.20,0:38:32.44,Default,,0000,0000,0000,,failures, okay? So we have one category of Dialogue: 0,0:38:32.44,0:38:35.60,Default,,0000,0000,0000,,class that is basically no failure, okay? Dialogue: 0,0:38:35.60,0:38:37.40,Default,,0000,0000,0000,,Then we have a category for the Dialogue: 0,0:38:37.40,0:38:40.40,Default,,0000,0000,0000,,different types of failures, right? So you Dialogue: 0,0:38:40.40,0:38:43.92,Default,,0000,0000,0000,,can have a power failure, you could have Dialogue: 0,0:38:43.92,0:38:46.40,Default,,0000,0000,0000,,a tool wear failure, Dialogue: 0,0:38:46.40,0:38:48.92,Default,,0000,0000,0000,,you could have- let's go down Dialogue: 0,0:38:48.92,0:38:50.88,Default,,0000,0000,0000,,here, you could have a overstrain Dialogue: 0,0:38:50.88,0:38:53.76,Default,,0000,0000,0000,,failure, and etc, etc. 
So you can have Dialogue: 0,0:38:53.76,0:38:57.16,Default,,0000,0000,0000,,multiple classes of failure in addition Dialogue: 0,0:38:57.16,0:39:00.52,Default,,0000,0000,0000,,to the overall majority Dialogue: 0,0:39:00.52,0:39:04.32,Default,,0000,0000,0000,,class of no failure, and that would be a Dialogue: 0,0:39:04.32,0:39:06.68,Default,,0000,0000,0000,,multiclass classification problem. So Dialogue: 0,0:39:06.68,0:39:08.40,Default,,0000,0000,0000,,with this data set, we are going to see Dialogue: 0,0:39:08.40,0:39:11.04,Default,,0000,0000,0000,,how to make it a binary classification Dialogue: 0,0:39:11.04,0:39:12.80,Default,,0000,0000,0000,,problem and also a multiclass Dialogue: 0,0:39:12.80,0:39:15.08,Default,,0000,0000,0000,,classification problem. Okay, so let's Dialogue: 0,0:39:15.08,0:39:16.88,Default,,0000,0000,0000,,look at the workflow. So let's say we've Dialogue: 0,0:39:16.88,0:39:18.88,Default,,0000,0000,0000,,already got the data, so right now we do Dialogue: 0,0:39:18.88,0:39:20.84,Default,,0000,0000,0000,,have the dataset. This is the dataset Dialogue: 0,0:39:20.84,0:39:22.72,Default,,0000,0000,0000,,that we have, so let's assume we've Dialogue: 0,0:39:22.72,0:39:24.56,Default,,0000,0000,0000,,somehow managed to get this dataset Dialogue: 0,0:39:24.56,0:39:26.88,Default,,0000,0000,0000,,from some IoT sensors that are Dialogue: 0,0:39:26.88,0:39:29.12,Default,,0000,0000,0000,,monitoring real-time data in our Dialogue: 0,0:39:29.12,0:39:31.08,Default,,0000,0000,0000,,production environment. On the assembly Dialogue: 0,0:39:31.08,0:39:32.80,Default,,0000,0000,0000,,line, on the production line, we've got Dialogue: 0,0:39:32.80,0:39:34.68,Default,,0000,0000,0000,,sensors reading data that gives us all Dialogue: 0,0:39:34.68,0:39:37.96,Default,,0000,0000,0000,,the data that we have in this CSV file.
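The binary and multiclass framings described above can both be derived from the same failure-type column. This is a sketch on a few hypothetical rows; the column name `failure_type` and the exact label strings are assumptions standing in for whatever the CSV actually uses.

```python
import pandas as pd

# Hypothetical rows echoing the failure types mentioned in the video.
df = pd.DataFrame({
    "failure_type": ["No Failure", "Power Failure", "No Failure",
                     "Overstrain Failure", "Tool Wear Failure"],
})

# Binary framing: did the machine fail at all? (0 = no failure, 1 = failure)
df["target_binary"] = (df["failure_type"] != "No Failure").astype(int)

# Multiclass framing: keep the failure category itself as the label,
# encoded here as integer class codes.
df["target_multiclass"] = df["failure_type"].astype("category").cat.codes
```

Either derived column can then serve as the target variable `y` for training, depending on which of the two classification problems you want to solve.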
Dialogue: 0,0:39:37.96,0:39:40.08,Default,,0000,0000,0000,,Okay, so we've already got the data, we've Dialogue: 0,0:39:40.08,0:39:41.60,Default,,0000,0000,0000,,retrieved the data, now we're going to go Dialogue: 0,0:39:41.60,0:39:45.00,Default,,0000,0000,0000,,on to the cleaning and exploration part Dialogue: 0,0:39:45.00,0:39:47.52,Default,,0000,0000,0000,,of your machine learning life cycle. All Dialogue: 0,0:39:47.52,0:39:49.80,Default,,0000,0000,0000,,right, so let's look at the data cleaning Dialogue: 0,0:39:49.80,0:39:51.40,Default,,0000,0000,0000,,part. In the data cleaning part, we're Dialogue: 0,0:39:51.40,0:39:53.72,Default,,0000,0000,0000,,interested in checking for missing Dialogue: 0,0:39:53.72,0:39:56.20,Default,,0000,0000,0000,,values and maybe removing the rows with Dialogue: 0,0:39:56.20,0:39:58.08,Default,,0000,0000,0000,,missing values, okay? Dialogue: 0,0:39:58.08,0:39:59.76,Default,,0000,0000,0000,,So the kind of things we can Dialogue: 0,0:39:59.76,0:40:01.00,Default,,0000,0000,0000,,do with missing Dialogue: 0,0:40:01.00,0:40:02.88,Default,,0000,0000,0000,,values: we can remove the rows with missing Dialogue: 0,0:40:02.88,0:40:05.84,Default,,0000,0000,0000,,values, we can put in some new values, Dialogue: 0,0:40:05.84,0:40:08.00,Default,,0000,0000,0000,,some replacement values, which could be an Dialogue: 0,0:40:08.00,0:40:09.88,Default,,0000,0000,0000,,average of all the values in that Dialogue: 0,0:40:09.88,0:40:12.88,Default,,0000,0000,0000,,particular column, etc, etc. We could also try to Dialogue: 0,0:40:12.88,0:40:15.48,Default,,0000,0000,0000,,identify outliers in our data set, and Dialogue: 0,0:40:15.48,0:40:17.48,Default,,0000,0000,0000,,there are also a variety of ways to deal Dialogue: 0,0:40:17.48,0:40:19.48,Default,,0000,0000,0000,,with that.
So this is called data Dialogue: 0,0:40:19.48,0:40:21.36,Default,,0000,0000,0000,,cleansing, which is a really important Dialogue: 0,0:40:21.36,0:40:23.32,Default,,0000,0000,0000,,part of your machine learning workflow, Dialogue: 0,0:40:23.32,0:40:25.52,Default,,0000,0000,0000,,right? So that's where we are now, Dialogue: 0,0:40:25.52,0:40:26.84,Default,,0000,0000,0000,,we're doing cleansing, and then we're Dialogue: 0,0:40:26.84,0:40:27.94,Default,,0000,0000,0000,,going to follow up with Dialogue: 0,0:40:27.94,0:40:31.16,Default,,0000,0000,0000,,exploration. So let's look at the actual Dialogue: 0,0:40:31.16,0:40:33.16,Default,,0000,0000,0000,,code that does the cleansing here. So Dialogue: 0,0:40:33.16,0:40:35.80,Default,,0000,0000,0000,,here we are right at the start of the Dialogue: 0,0:40:35.80,0:40:38.25,Default,,0000,0000,0000,,machine learning life cycle, and Dialogue: 0,0:40:38.25,0:40:40.84,Default,,0000,0000,0000,,this is a Jupyter notebook. So here we Dialogue: 0,0:40:40.84,0:40:43.36,Default,,0000,0000,0000,,have a brief description of the problem Dialogue: 0,0:40:43.36,0:40:45.92,Default,,0000,0000,0000,,statement, all right? So this dataset Dialogue: 0,0:40:45.92,0:40:47.64,Default,,0000,0000,0000,,reflects real-life predictive Dialogue: 0,0:40:47.64,0:40:49.24,Default,,0000,0000,0000,,maintenance encountered in industry, with Dialogue: 0,0:40:49.24,0:40:50.48,Default,,0000,0000,0000,,measurements from real equipment. The Dialogue: 0,0:40:50.48,0:40:52.40,Default,,0000,0000,0000,,feature descriptions are taken directly Dialogue: 0,0:40:52.40,0:40:54.52,Default,,0000,0000,0000,,from the dataset's source.
So here we have Dialogue: 0,0:40:54.52,0:40:57.40,Default,,0000,0000,0000,,a description of the six key features in Dialogue: 0,0:40:57.40,0:40:59.60,Default,,0000,0000,0000,,our dataset: type, which is the quality Dialogue: 0,0:40:59.60,0:41:02.52,Default,,0000,0000,0000,,of the product, the air temperature, the Dialogue: 0,0:41:02.52,0:41:04.68,Default,,0000,0000,0000,,process temperature, the rotational speed, Dialogue: 0,0:41:04.68,0:41:06.60,Default,,0000,0000,0000,,the torque, and the tool wear, all right? So Dialogue: 0,0:41:06.60,0:41:08.88,Default,,0000,0000,0000,,these are the six feature variables, and Dialogue: 0,0:41:08.88,0:41:11.32,Default,,0000,0000,0000,,there are two target variables. So Dialogue: 0,0:41:11.32,0:41:13.12,Default,,0000,0000,0000,,I showed you just now that there's Dialogue: 0,0:41:13.12,0:41:15.12,Default,,0000,0000,0000,,one target variable which only has two Dialogue: 0,0:41:15.12,0:41:17.44,Default,,0000,0000,0000,,possible values, either zero or one, okay? Dialogue: 0,0:41:17.44,0:41:20.08,Default,,0000,0000,0000,,Zero means no failure and one means failure, Dialogue: 0,0:41:20.08,0:41:23.08,Default,,0000,0000,0000,,so that will be this column here, right? Dialogue: 0,0:41:23.08,0:41:24.88,Default,,0000,0000,0000,,So let me go all the way back up to here. Dialogue: 0,0:41:24.88,0:41:26.64,Default,,0000,0000,0000,,So this column here, we already saw it Dialogue: 0,0:41:26.64,0:41:29.44,Default,,0000,0000,0000,,only has two possible values, it's either zero or Dialogue: 0,0:41:29.44,0:41:32.68,Default,,0000,0000,0000,,one. And then we also have this column Dialogue: 0,0:41:32.68,0:41:35.04,Default,,0000,0000,0000,,here, and this column here is basically Dialogue: 0,0:41:35.04,0:41:38.08,Default,,0000,0000,0000,,the failure type.
And as I Dialogue: 0,0:41:38.08,0:41:40.80,Default,,0000,0000,0000,,already demonstrated just now, we do have Dialogue: 0,0:41:40.80,0:41:43.44,Default,,0000,0000,0000,,several categories of Dialogue: 0,0:41:43.44,0:41:45.56,Default,,0000,0000,0000,,failure, and so here we call this Dialogue: 0,0:41:45.56,0:41:46.24,Default,,0000,0000,0000,,multiclass Dialogue: 0,0:41:46.24,0:41:50.00,Default,,0000,0000,0000,,classification. So we can either build a Dialogue: 0,0:41:50.00,0:41:51.84,Default,,0000,0000,0000,,binary classification model for this Dialogue: 0,0:41:51.84,0:41:53.52,Default,,0000,0000,0000,,problem domain, or we can build a Dialogue: 0,0:41:53.52,0:41:54.49,Default,,0000,0000,0000,,multiclass Dialogue: 0,0:41:54.49,0:41:58.12,Default,,0000,0000,0000,,classification model, all right. So this Dialogue: 0,0:41:58.12,0:41:59.84,Default,,0000,0000,0000,,Jupyter notebook is going to demonstrate Dialogue: 0,0:41:59.84,0:42:02.32,Default,,0000,0000,0000,,both approaches to us. So first step, we Dialogue: 0,0:42:02.32,0:42:04.80,Default,,0000,0000,0000,,are going to write all this Python code Dialogue: 0,0:42:04.80,0:42:06.88,Default,,0000,0000,0000,,that's going to import all the libraries Dialogue: 0,0:42:06.88,0:42:09.08,Default,,0000,0000,0000,,that we need to use, okay? So this is Dialogue: 0,0:42:09.08,0:42:12.32,Default,,0000,0000,0000,,basically Python code, okay, and it's Dialogue: 0,0:42:12.32,0:42:15.12,Default,,0000,0000,0000,,importing the relevant Dialogue: 0,0:42:15.12,0:42:17.96,Default,,0000,0000,0000,,machine learning Dialogue: 0,0:42:17.96,0:42:20.60,Default,,0000,0000,0000,,libraries related to Dialogue: 0,0:42:20.60,0:42:23.52,Default,,0000,0000,0000,,our domain use case, okay? Then we load in Dialogue: 0,0:42:23.52,0:42:26.44,Default,,0000,0000,0000,,our dataset, okay, so this is our dataset.
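A minimal sketch of that loading step with pandas; in the notebook this would be a `pd.read_csv` on the CSV file from the sensors, so the file contents inlined here are hypothetical sample rows following the columns just described.

```python
import pandas as pd
from io import StringIO

# In the notebook this would be pd.read_csv("some_file.csv"); a couple of
# invented rows are inlined here so the sketch runs standalone.
csv = StringIO(
    "Type,Air temperature,Process temperature,Rotational speed,"
    "Torque,Tool wear,Target,Failure Type\n"
    "M,298.1,308.6,1551,42.8,0,0,No Failure\n"
    "L,298.2,308.7,1408,46.3,3,1,Power Failure\n"
)
df = pd.read_csv(csv)

print(df.shape)             # (rows, columns)
print(df["Target"].unique())
```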
Dialogue: 0,0:42:26.44,0:42:28.32,Default,,0000,0000,0000,,We describe it, we have some quick Dialogue: 0,0:42:28.32,0:42:30.92,Default,,0000,0000,0000,,insights into the dataset. And then Dialogue: 0,0:42:30.92,0:42:32.84,Default,,0000,0000,0000,,we just take a look at all the variables Dialogue: 0,0:42:32.84,0:42:36.00,Default,,0000,0000,0000,,of the feature variables, etc, and so on. Dialogue: 0,0:42:36.00,0:42:38.00,Default,,0000,0000,0000,,What we're doing now is just Dialogue: 0,0:42:38.00,0:42:39.80,Default,,0000,0000,0000,,doing a quick overview of the dataset, Dialogue: 0,0:42:39.80,0:42:41.56,Default,,0000,0000,0000,,so this all this Python code here that Dialogue: 0,0:42:41.56,0:42:43.76,Default,,0000,0000,0000,,we're writing is allowing us, the data Dialogue: 0,0:42:43.76,0:42:45.36,Default,,0000,0000,0000,,scientist, to get a quick overview of our Dialogue: 0,0:42:45.36,0:42:48.21,Default,,0000,0000,0000,,dataset, right, okay, like how many varia- Dialogue: 0,0:42:48.21,0:42:50.24,Default,,0000,0000,0000,,how many rows are there, how many columns Dialogue: 0,0:42:50.24,0:42:51.76,Default,,0000,0000,0000,,are there, what are the data types of the Dialogue: 0,0:42:51.76,0:42:53.44,Default,,0000,0000,0000,,columns, what are the name of the columns, Dialogue: 0,0:42:53.44,0:42:57.36,Default,,0000,0000,0000,,etc, etc. Okay, then we zoom in on to the Dialogue: 0,0:42:57.36,0:42:58.84,Default,,0000,0000,0000,,target variables. So we look at the Dialogue: 0,0:42:58.84,0:43:02.00,Default,,0000,0000,0000,,target variables, how many counts Dialogue: 0,0:43:02.00,0:43:04.52,Default,,0000,0000,0000,,there are of this target variable, and Dialogue: 0,0:43:04.52,0:43:06.44,Default,,0000,0000,0000,,so on. How many different types of Dialogue: 0,0:43:06.44,0:43:08.24,Default,,0000,0000,0000,,failures there are. 
Then you want to Dialogue: 0,0:43:08.24,0:43:09.00,Default,,0000,0000,0000,,check whether there are any Dialogue: 0,0:43:09.00,0:43:10.76,Default,,0000,0000,0000,,inconsistencies between the target and Dialogue: 0,0:43:10.76,0:43:13.56,Default,,0000,0000,0000,,the failure type, etc. Okay, so when you do Dialogue: 0,0:43:13.56,0:43:15.12,Default,,0000,0000,0000,,all this checking, you're going to Dialogue: 0,0:43:15.12,0:43:16.96,Default,,0000,0000,0000,,discover there are some discrepancies in Dialogue: 0,0:43:16.96,0:43:20.28,Default,,0000,0000,0000,,your dataset, so using a specific Python Dialogue: 0,0:43:20.28,0:43:21.84,Default,,0000,0000,0000,,code to do checking, you're going to say Dialogue: 0,0:43:21.84,0:43:23.48,Default,,0000,0000,0000,,hey, you know what? There's some errors Dialogue: 0,0:43:23.48,0:43:25.00,Default,,0000,0000,0000,,here, right? There are nine values that Dialogue: 0,0:43:25.00,0:43:26.60,Default,,0000,0000,0000,,classify as failure in target variable, Dialogue: 0,0:43:26.60,0:43:28.20,Default,,0000,0000,0000,,but as no failure in the failure type Dialogue: 0,0:43:28.20,0:43:29.72,Default,,0000,0000,0000,,variable, so that means there's a Dialogue: 0,0:43:29.72,0:43:33.20,Default,,0000,0000,0000,,discrepancy in your data point, right? Dialogue: 0,0:43:33.20,0:43:34.76,Default,,0000,0000,0000,,So these are all the ones that Dialogue: 0,0:43:34.76,0:43:36.36,Default,,0000,0000,0000,,are discrepancies because the target Dialogue: 0,0:43:36.36,0:43:39.00,Default,,0000,0000,0000,,variable says one, and we already know Dialogue: 0,0:43:39.00,0:43:41.24,Default,,0000,0000,0000,,that target variable one is supposed to Dialogue: 0,0:43:41.24,0:43:43.10,Default,,0000,0000,0000,,mean there is a failure, right? 
Target Dialogue: 0,0:43:43.10,0:43:44.88,Default,,0000,0000,0000,,variable one is supposed to mean there is Dialogue: 0,0:43:44.88,0:43:47.12,Default,,0000,0000,0000,,a failure, so we are kind of expecting to Dialogue: 0,0:43:47.12,0:43:49.68,Default,,0000,0000,0000,,see the failure classification, but some Dialogue: 0,0:43:49.68,0:43:51.40,Default,,0000,0000,0000,,rows actually say there's no failure Dialogue: 0,0:43:51.40,0:43:53.80,Default,,0000,0000,0000,,although the target type is one. Well here Dialogue: 0,0:43:53.80,0:43:55.92,Default,,0000,0000,0000,,is a classic example of an error that Dialogue: 0,0:43:55.92,0:43:58.64,Default,,0000,0000,0000,,can very well occur in a dataset, so now Dialogue: 0,0:43:58.64,0:44:00.56,Default,,0000,0000,0000,,the question is what do you do with Dialogue: 0,0:44:00.56,0:44:04.72,Default,,0000,0000,0000,,these errors in your dataset, right? So Dialogue: 0,0:44:04.72,0:44:06.24,Default,,0000,0000,0000,,here the data scientist says, I think it Dialogue: 0,0:44:06.24,0:44:07.52,Default,,0000,0000,0000,,would make sense to remove those Dialogue: 0,0:44:07.52,0:44:09.92,Default,,0000,0000,0000,,instances, and so they write some code Dialogue: 0,0:44:09.92,0:44:12.68,Default,,0000,0000,0000,,then to remove those instances or those Dialogue: 0,0:44:12.68,0:44:14.92,Default,,0000,0000,0000,,rows or data points from the overall Dialogue: 0,0:44:14.92,0:44:17.28,Default,,0000,0000,0000,,data set, and same thing we can, again, Dialogue: 0,0:44:17.28,0:44:19.24,Default,,0000,0000,0000,,check for other issues. So we find there's Dialogue: 0,0:44:19.24,0:44:21.16,Default,,0000,0000,0000,,another issue here with our data set which Dialogue: 0,0:44:21.16,0:44:24.08,Default,,0000,0000,0000,,is another warning, so, again, we can Dialogue: 0,0:44:24.08,0:44:26.24,Default,,0000,0000,0000,,possibly remove them. 
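A consistency check like the one just described, where Target says failure but the failure type column says no failure, can be sketched as a pandas boolean mask; the miniature data here is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the two target columns.
df = pd.DataFrame({
    "Target":       [0, 1, 1, 0, 1],
    "Failure Type": ["No Failure", "Power Failure", "No Failure",
                     "No Failure", "No Failure"],
})

# Rows flagged as a failure in Target but labelled "No Failure":
# the kind of discrepancy the notebook hunts for.
mask = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
discrepancies = df[mask]
print(len(discrepancies))

# Dropping those rows, as the data scientist decides to do here:
df_clean = df[~mask]
```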
So you're going to Dialogue: 0,0:44:26.24,0:44:31.28,Default,,0000,0000,0000,,remove 27 instances or rows from your Dialogue: 0,0:44:31.28,0:44:34.44,Default,,0000,0000,0000,,overall data set. So your data set has Dialogue: 0,0:44:34.44,0:44:37.08,Default,,0000,0000,0000,,10,000 rows or data points. You're Dialogue: 0,0:44:37.08,0:44:40.16,Default,,0000,0000,0000,,removing 27, which is only 0.27% of the Dialogue: 0,0:44:40.16,0:44:42.24,Default,,0000,0000,0000,,entire dataset. And these were the Dialogue: 0,0:44:42.24,0:44:45.72,Default,,0000,0000,0000,,reasons why you removed them, okay? So if Dialogue: 0,0:44:45.72,0:44:48.16,Default,,0000,0000,0000,,you're just removing 0.27% of the Dialogue: 0,0:44:48.16,0:44:50.80,Default,,0000,0000,0000,,entire dataset, no big deal, right? Still Dialogue: 0,0:44:50.80,0:44:53.08,Default,,0000,0000,0000,,okay, but you needed to remove them Dialogue: 0,0:44:53.08,0:44:55.00,Default,,0000,0000,0000,,because these Dialogue: 0,0:44:55.00,0:44:58.04,Default,,0000,0000,0000,,27 Dialogue: 0,0:44:58.04,0:45:00.56,Default,,0000,0000,0000,,data points with errors in Dialogue: 0,0:45:00.56,0:45:02.96,Default,,0000,0000,0000,,your dataset could really affect the Dialogue: 0,0:45:02.96,0:45:05.00,Default,,0000,0000,0000,,training of your machine learning model. Dialogue: 0,0:45:05.00,0:45:08.64,Default,,0000,0000,0000,,So we need to do our data cleansing, Dialogue: 0,0:45:08.64,0:45:11.72,Default,,0000,0000,0000,,right? So we are now cleansing Dialogue: 0,0:45:11.72,0:45:15.20,Default,,0000,0000,0000,,data that is Dialogue: 0,0:45:15.20,0:45:17.52,Default,,0000,0000,0000,,incorrect or erroneous in your original Dialogue: 0,0:45:17.52,0:45:21.44,Default,,0000,0000,0000,,dataset. Okay, so then we go on to the Dialogue: 0,0:45:21.44,0:45:23.84,Default,,0000,0000,0000,,next part, which is called EDA, right?
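The arithmetic behind that "no big deal" judgment is simple enough to spell out:

```python
# 27 bad rows out of 10,000: what share of the dataset gets dropped?
total_rows = 10_000
removed = 27
pct_removed = removed / total_rows * 100
print(f"{pct_removed:.2f}% of the dataset removed")  # 0.27%
```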
So Dialogue: 0,0:45:23.84,0:45:28.88,Default,,0000,0000,0000,,EDA is where we kind of explore our data, Dialogue: 0,0:45:28.88,0:45:31.72,Default,,0000,0000,0000,,and we want to, kind of, get a visual Dialogue: 0,0:45:31.72,0:45:34.24,Default,,0000,0000,0000,,overview of our data as a whole, and also Dialogue: 0,0:45:34.24,0:45:35.88,Default,,0000,0000,0000,,take a look at the statistical Dialogue: 0,0:45:35.88,0:45:38.16,Default,,0000,0000,0000,,properties of our data. The statistical Dialogue: 0,0:45:38.16,0:45:40.48,Default,,0000,0000,0000,,distribution of the data in all the Dialogue: 0,0:45:40.48,0:45:43.08,Default,,0000,0000,0000,,various columns, the correlation between Dialogue: 0,0:45:43.08,0:45:44.64,Default,,0000,0000,0000,,the variables, between the feature Dialogue: 0,0:45:44.64,0:45:46.68,Default,,0000,0000,0000,,variables different columns, and also the Dialogue: 0,0:45:46.68,0:45:48.60,Default,,0000,0000,0000,,feature variable and the target variable. Dialogue: 0,0:45:48.60,0:45:52.04,Default,,0000,0000,0000,,So all of this is called EDA, and EDA in Dialogue: 0,0:45:52.04,0:45:54.08,Default,,0000,0000,0000,,a machine learning workflow is typically Dialogue: 0,0:45:54.08,0:45:57.16,Default,,0000,0000,0000,,done through visualization, Dialogue: 0,0:45:57.16,0:45:58.84,Default,,0000,0000,0000,,all right? So let's go back here and take Dialogue: 0,0:45:58.84,0:46:00.60,Default,,0000,0000,0000,,a look, right? So, for example, here we are Dialogue: 0,0:46:00.60,0:46:03.40,Default,,0000,0000,0000,,looking at correlation, so we plot the Dialogue: 0,0:46:03.40,0:46:05.68,Default,,0000,0000,0000,,values of all the various feature Dialogue: 0,0:46:05.68,0:46:07.60,Default,,0000,0000,0000,,variables against each other and look Dialogue: 0,0:46:07.60,0:46:10.80,Default,,0000,0000,0000,,for potential correlations and patterns Dialogue: 0,0:46:10.80,0:46:13.36,Default,,0000,0000,0000,,and so on. 
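The pair plot itself comes from a plotting library, but the correlations it visualizes in this EDA (exploratory data analysis) step can be checked numerically with pandas. A sketch on synthetic sensor-like columns; the two columns are deliberately constructed to be strongly related (here inversely), whereas the strength and sign of the real relationship would come from the actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
speed = rng.normal(1500, 100, 500)
# Make torque depend on speed plus noise so the two columns are
# strongly correlated, mimicking what the pair plot reveals.
torque = 100 - 0.04 * speed + rng.normal(0, 1, 500)

df = pd.DataFrame({"Rotational speed": speed, "Torque": torque})
corr = df.corr()
print(corr.loc["Rotational speed", "Torque"])  # close to -1
```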
And all the different shapes Dialogue: 0,0:46:13.36,0:46:17.28,Default,,0000,0000,0000,,that you see here in this pair plot, okay, Dialogue: 0,0:46:17.28,0:46:18.40,Default,,0000,0000,0000,,will have different meaning, Dialogue: 0,0:46:18.40,0:46:20.00,Default,,0000,0000,0000,,statistical meaning, and so the data Dialogue: 0,0:46:20.00,0:46:21.80,Default,,0000,0000,0000,,scientist has to, kind of, visually Dialogue: 0,0:46:21.80,0:46:23.76,Default,,0000,0000,0000,,inspect this pair plot, make some Dialogue: 0,0:46:23.76,0:46:25.56,Default,,0000,0000,0000,,interpretations of these different Dialogue: 0,0:46:25.56,0:46:27.68,Default,,0000,0000,0000,,patterns that he sees here, all right. So Dialogue: 0,0:46:27.68,0:46:30.48,Default,,0000,0000,0000,,these are some of the insights that Dialogue: 0,0:46:30.48,0:46:32.84,Default,,0000,0000,0000,,can be deduced from looking at these Dialogue: 0,0:46:32.84,0:46:34.32,Default,,0000,0000,0000,,patterns, so, for example, the torque and Dialogue: 0,0:46:34.32,0:46:36.28,Default,,0000,0000,0000,,rotational speed are highly correlated, Dialogue: 0,0:46:36.28,0:46:38.04,Default,,0000,0000,0000,,the process temperature and air Dialogue: 0,0:46:38.04,0:46:39.92,Default,,0000,0000,0000,,temperature also highly correlated, that Dialogue: 0,0:46:39.92,0:46:41.56,Default,,0000,0000,0000,,failures occur for extreme values of Dialogue: 0,0:46:41.56,0:46:44.52,Default,,0000,0000,0000,,some features, etc, etc. Then you can plot Dialogue: 0,0:46:44.52,0:46:45.96,Default,,0000,0000,0000,,certain kinds of charts. This called a Dialogue: 0,0:46:45.96,0:46:48.48,Default,,0000,0000,0000,,violin chart to, again, get new insights. 
Dialogue: 0,0:46:48.48,0:46:49.84,Default,,0000,0000,0000,,For example, regarding the torque and Dialogue: 0,0:46:49.84,0:46:51.48,Default,,0000,0000,0000,,rotational speed, we can see, again, that Dialogue: 0,0:46:51.48,0:46:53.12,Default,,0000,0000,0000,,most failures are triggered for much Dialogue: 0,0:46:53.12,0:46:55.12,Default,,0000,0000,0000,,lower or much higher values than the Dialogue: 0,0:46:55.12,0:46:57.40,Default,,0000,0000,0000,,mean when they're not failing. So all Dialogue: 0,0:46:57.40,0:47:00.72,Default,,0000,0000,0000,,these visualizations, they are there, and Dialogue: 0,0:47:00.72,0:47:02.48,Default,,0000,0000,0000,,a trained data scientist can look at Dialogue: 0,0:47:02.48,0:47:05.08,Default,,0000,0000,0000,,them, inspect them, and make some kind of Dialogue: 0,0:47:05.08,0:47:08.40,Default,,0000,0000,0000,,insightful deductions from them, okay? Dialogue: 0,0:47:08.40,0:47:11.08,Default,,0000,0000,0000,,Percentage of failure, right? The Dialogue: 0,0:47:11.08,0:47:13.64,Default,,0000,0000,0000,,correlation heat map, okay, between all Dialogue: 0,0:47:13.64,0:47:15.56,Default,,0000,0000,0000,,these different feature variables, and Dialogue: 0,0:47:15.56,0:47:16.43,Default,,0000,0000,0000,,also the target Dialogue: 0,0:47:16.43,0:47:19.60,Default,,0000,0000,0000,,variable, okay? The product types, Dialogue: 0,0:47:19.60,0:47:21.08,Default,,0000,0000,0000,,percentage of product types, percentage Dialogue: 0,0:47:21.08,0:47:23.16,Default,,0000,0000,0000,,of failure with respect to the product Dialogue: 0,0:47:23.16,0:47:25.72,Default,,0000,0000,0000,,type, so we can also kind of visualize Dialogue: 0,0:47:25.72,0:47:27.80,Default,,0000,0000,0000,,that as well. So certain products have a Dialogue: 0,0:47:27.80,0:47:29.84,Default,,0000,0000,0000,,higher ratio of failure compared to other Dialogue: 0,0:47:29.84,0:47:33.24,Default,,0000,0000,0000,,product types, etc.
Or, for example, M Dialogue: 0,0:47:33.24,0:47:35.80,Default,,0000,0000,0000,,products tend to fail more than H products, etc, Dialogue: 0,0:47:35.80,0:47:38.88,Default,,0000,0000,0000,,etc. So we can create a vast variety of Dialogue: 0,0:47:38.88,0:47:41.32,Default,,0000,0000,0000,,visualizations in the EDA stage, so you Dialogue: 0,0:47:41.32,0:47:43.96,Default,,0000,0000,0000,,can see here. And, again, the idea of this Dialogue: 0,0:47:43.96,0:47:46.36,Default,,0000,0000,0000,,visualization is just to give us some Dialogue: 0,0:47:46.36,0:47:49.68,Default,,0000,0000,0000,,insight, some preliminary insight into Dialogue: 0,0:47:49.68,0:47:52.52,Default,,0000,0000,0000,,our dataset that helps us to model it Dialogue: 0,0:47:52.52,0:47:54.12,Default,,0000,0000,0000,,more correctly. So some more insights Dialogue: 0,0:47:54.12,0:47:56.20,Default,,0000,0000,0000,,that we get into our data set from all Dialogue: 0,0:47:56.20,0:47:57.60,Default,,0000,0000,0000,,this visualization. Dialogue: 0,0:47:57.60,0:47:59.56,Default,,0000,0000,0000,,Then we can plot the distribution so we Dialogue: 0,0:47:59.56,0:48:00.72,Default,,0000,0000,0000,,can see whether it's a normal Dialogue: 0,0:48:00.72,0:48:02.79,Default,,0000,0000,0000,,distribution or some other kind of Dialogue: 0,0:48:02.79,0:48:05.64,Default,,0000,0000,0000,,distribution. We can have a box plot Dialogue: 0,0:48:05.64,0:48:07.76,Default,,0000,0000,0000,,to see whether there are any outliers in Dialogue: 0,0:48:07.76,0:48:10.40,Default,,0000,0000,0000,,our data set and so on, right? So we can Dialogue: 0,0:48:10.40,0:48:11.64,Default,,0000,0000,0000,,see from the box plots that Dialogue: 0,0:48:11.64,0:48:14.60,Default,,0000,0000,0000,,rotational speed has outliers.
So we Dialogue: 0,0:48:14.60,0:48:16.88,Default,,0000,0000,0000,,already saw outliers are basically a Dialogue: 0,0:48:16.88,0:48:18.80,Default,,0000,0000,0000,,problem that you may need to kind of Dialogue: 0,0:48:18.80,0:48:22.52,Default,,0000,0000,0000,,tackle, right? So outliers are an issue, Dialogue: 0,0:48:22.52,0:48:24.80,Default,,0000,0000,0000,,it's a part of data cleansing. And Dialogue: 0,0:48:24.80,0:48:26.96,Default,,0000,0000,0000,,so you may need to tackle this, so we may Dialogue: 0,0:48:26.96,0:48:28.88,Default,,0000,0000,0000,,have to check okay, well where are the Dialogue: 0,0:48:28.88,0:48:31.32,Default,,0000,0000,0000,,potential outliers so we can analyze Dialogue: 0,0:48:31.32,0:48:35.32,Default,,0000,0000,0000,,them from the box plot, okay? But then Dialogue: 0,0:48:35.32,0:48:37.08,Default,,0000,0000,0000,,we can say well they are outliers, but Dialogue: 0,0:48:37.08,0:48:38.80,Default,,0000,0000,0000,,maybe they're not really horrible Dialogue: 0,0:48:38.80,0:48:40.76,Default,,0000,0000,0000,,outliers so we can tolerate them or Dialogue: 0,0:48:40.76,0:48:42.88,Default,,0000,0000,0000,,maybe we want to remove them. So we can Dialogue: 0,0:48:42.88,0:48:44.92,Default,,0000,0000,0000,,see what our mean and maximum values for Dialogue: 0,0:48:44.92,0:48:46.72,Default,,0000,0000,0000,,all these with respect to product type, Dialogue: 0,0:48:46.72,0:48:49.68,Default,,0000,0000,0000,,how many of them are above or highly Dialogue: 0,0:48:49.68,0:48:51.44,Default,,0000,0000,0000,,correlated with the product type in Dialogue: 0,0:48:51.44,0:48:54.24,Default,,0000,0000,0000,,terms of the maximum and minimum, okay, Dialogue: 0,0:48:54.24,0:48:56.96,Default,,0000,0000,0000,,and then so on. 
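The box-plot reasoning about outliers usually rests on the 1.5×IQR rule; a sketch on synthetic readings, where a few extreme values are planted on purpose (the real 4.87% figure comes from the actual dataset, not this toy):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Mostly well-behaved readings plus five planted extreme values.
speed = pd.Series(np.concatenate([
    rng.normal(1500, 50, 195),
    [2500, 2600, 2700, 2800, 2900],
]))

# Classic IQR rule behind box-plot whiskers: anything beyond
# 1.5 * IQR from the quartiles counts as an outlier.
q1, q3 = speed.quantile(0.25), speed.quantile(0.75)
iqr = q3 - q1
outliers = speed[(speed < q1 - 1.5 * iqr) | (speed > q3 + 1.5 * iqr)]

share = len(outliers) / len(speed) * 100
print(f"{share:.2f}% of instances are outliers")
```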
So the insight is well we Dialogue: 0,0:48:56.96,0:48:59.60,Default,,0000,0000,0000,,got 4.8% of the instances are outliers, Dialogue: 0,0:48:59.60,0:49:02.56,Default,,0000,0000,0000,,so maybe 4.87% is not really that much, Dialogue: 0,0:49:02.56,0:49:04.92,Default,,0000,0000,0000,,the outliers are not horrible, so we just Dialogue: 0,0:49:04.92,0:49:06.96,Default,,0000,0000,0000,,leave them in the dataset. Now for a Dialogue: 0,0:49:06.96,0:49:08.52,Default,,0000,0000,0000,,different dataset, the data scientist Dialogue: 0,0:49:08.52,0:49:10.28,Default,,0000,0000,0000,,could come to a different conclusion, so Dialogue: 0,0:49:10.28,0:49:12.28,Default,,0000,0000,0000,,then they would do whatever they've Dialogue: 0,0:49:12.28,0:49:15.40,Default,,0000,0000,0000,,deemed is appropriate to, kind of, cleanse Dialogue: 0,0:49:15.40,0:49:18.08,Default,,0000,0000,0000,,the dataset. Okay, so now that we have Dialogue: 0,0:49:18.08,0:49:20.00,Default,,0000,0000,0000,,done all the EDA, the next thing we're Dialogue: 0,0:49:20.00,0:49:23.16,Default,,0000,0000,0000,,going to do is we are going to do what Dialogue: 0,0:49:23.16,0:49:26.20,Default,,0000,0000,0000,,is called feature engineering. So we are Dialogue: 0,0:49:26.20,0:49:28.76,Default,,0000,0000,0000,,going to transform our original feature Dialogue: 0,0:49:28.76,0:49:31.28,Default,,0000,0000,0000,,variables and these are our original Dialogue: 0,0:49:31.28,0:49:32.96,Default,,0000,0000,0000,,feature variables, right? These are our Dialogue: 0,0:49:32.96,0:49:35.04,Default,,0000,0000,0000,,original feature variables, and we are Dialogue: 0,0:49:35.04,0:49:37.76,Default,,0000,0000,0000,,going to transform them, all right? 
We're Dialogue: 0,0:49:37.76,0:49:40.32,Default,,0000,0000,0000,,going to transform them in some sense Dialogue: 0,0:49:40.32,0:49:43.76,Default,,0000,0000,0000,,into some other form before we feed this Dialogue: 0,0:49:43.76,0:49:45.64,Default,,0000,0000,0000,,for training into our machine learning Dialogue: 0,0:49:45.64,0:49:48.60,Default,,0000,0000,0000,,algorithm, all right? So let's say this is an Dialogue: 0,0:49:48.60,0:49:51.60,Default,,0000,0000,0000,,example of an Dialogue: 0,0:49:51.60,0:49:55.20,Default,,0000,0000,0000,,original data set, right? And these are Dialogue: 0,0:49:55.20,0:49:56.84,Default,,0000,0000,0000,,some of the examples, Dialogue: 0,0:49:56.84,0:49:58.04,Default,,0000,0000,0000,,you don't have to use all of them, Dialogue: 0,0:49:58.04,0:49:59.44,Default,,0000,0000,0000,,of what we Dialogue: 0,0:49:59.44,0:50:00.84,Default,,0000,0000,0000,,call feature engineering, where you can Dialogue: 0,0:50:00.84,0:50:03.56,Default,,0000,0000,0000,,transform the original values in Dialogue: 0,0:50:03.56,0:50:05.28,Default,,0000,0000,0000,,your feature variables to all these Dialogue: 0,0:50:05.28,0:50:07.92,Default,,0000,0000,0000,,transformed values here. So we're going to Dialogue: 0,0:50:07.92,0:50:09.68,Default,,0000,0000,0000,,pretty much do that here, so we have an Dialogue: 0,0:50:09.68,0:50:12.60,Default,,0000,0000,0000,,ordinal encoding, we do scaling of the Dialogue: 0,0:50:12.60,0:50:14.84,Default,,0000,0000,0000,,data so the dataset is scaled, we use Dialogue: 0,0:50:14.84,0:50:18.24,Default,,0000,0000,0000,,MinMax scaling, and then finally, we come Dialogue: 0,0:50:18.24,0:50:21.72,Default,,0000,0000,0000,,to do the modeling. So we have to split our Dialogue: 0,0:50:21.72,0:50:24.36,Default,,0000,0000,0000,,dataset into a training dataset and a Dialogue: 0,0:50:24.36,0:50:28.64,Default,,0000,0000,0000,,test dataset.
So coming back to here again, Dialogue: 0,0:50:28.64,0:50:32.16,Default,,0000,0000,0000,,we said that before you train your Dialogue: 0,0:50:32.16,0:50:33.80,Default,,0000,0000,0000,,model, Dialogue: 0,0:50:33.80,0:50:35.60,Default,,0000,0000,0000,,you have to take your original dataset, Dialogue: 0,0:50:35.60,0:50:37.32,Default,,0000,0000,0000,,now this is a feature-engineered dataset. Dialogue: 0,0:50:37.32,0:50:38.84,Default,,0000,0000,0000,,We're going to break it into two or Dialogue: 0,0:50:38.84,0:50:40.84,Default,,0000,0000,0000,,more subsets, okay. So one is called the Dialogue: 0,0:50:40.84,0:50:42.40,Default,,0000,0000,0000,,training dataset, which we use to feed Dialogue: 0,0:50:42.40,0:50:44.00,Default,,0000,0000,0000,,and train a machine learning model. The Dialogue: 0,0:50:44.00,0:50:45.92,Default,,0000,0000,0000,,second is the test dataset, to evaluate the Dialogue: 0,0:50:45.92,0:50:47.96,Default,,0000,0000,0000,,accuracy of the model, okay? So we've got Dialogue: 0,0:50:47.96,0:50:50.94,Default,,0000,0000,0000,,the training dataset and the test dataset, Dialogue: 0,0:50:50.94,0:50:52.72,Default,,0000,0000,0000,,and we also need Dialogue: 0,0:50:52.72,0:50:56.16,Default,,0000,0000,0000,,to sample. So from our original data set Dialogue: 0,0:50:56.16,0:50:57.40,Default,,0000,0000,0000,,we need to sample some points Dialogue: 0,0:50:57.40,0:50:58.84,Default,,0000,0000,0000,,that go into the training dataset, some Dialogue: 0,0:50:58.84,0:51:00.56,Default,,0000,0000,0000,,points that go into the test dataset. So Dialogue: 0,0:51:00.56,0:51:02.72,Default,,0000,0000,0000,,there are many ways to do sampling.
One Dialogue: 0,0:51:02.72,0:51:04.92,Default,,0000,0000,0000,,way is to do stratified sampling, where Dialogue: 0,0:51:04.92,0:51:06.72,Default,,0000,0000,0000,,we ensure the same proportion of data Dialogue: 0,0:51:06.72,0:51:09.00,Default,,0000,0000,0000,,from each stratum or class because right Dialogue: 0,0:51:09.00,0:51:10.96,Default,,0000,0000,0000,,now we have a multiclass classification Dialogue: 0,0:51:10.96,0:51:12.32,Default,,0000,0000,0000,,problem, so you want to make sure the Dialogue: 0,0:51:12.32,0:51:13.96,Default,,0000,0000,0000,,proportion of data from each stratum or Dialogue: 0,0:51:13.96,0:51:15.84,Default,,0000,0000,0000,,class is the same in the Dialogue: 0,0:51:15.84,0:51:17.92,Default,,0000,0000,0000,,training and test datasets as in the Dialogue: 0,0:51:17.92,0:51:20.12,Default,,0000,0000,0000,,original dataset, which is very useful Dialogue: 0,0:51:20.12,0:51:21.64,Default,,0000,0000,0000,,for dealing with what is called an Dialogue: 0,0:51:21.64,0:51:24.32,Default,,0000,0000,0000,,imbalanced dataset. So here we have an Dialogue: 0,0:51:24.32,0:51:25.84,Default,,0000,0000,0000,,example of what is called an imbalanced Dialogue: 0,0:51:25.84,0:51:29.52,Default,,0000,0000,0000,,dataset, in the sense that the Dialogue: 0,0:51:29.52,0:51:32.76,Default,,0000,0000,0000,,vast majority of data points in your Dialogue: 0,0:51:32.76,0:51:34.96,Default,,0000,0000,0000,,data set are going to have the Dialogue: 0,0:51:34.96,0:51:37.48,Default,,0000,0000,0000,,value of zero for their target variable Dialogue: 0,0:51:37.48,0:51:40.20,Default,,0000,0000,0000,,column. So only an extremely small Dialogue: 0,0:51:40.20,0:51:43.44,Default,,0000,0000,0000,,minority of the data points in your dataset Dialogue: 0,0:51:43.44,0:51:45.32,Default,,0000,0000,0000,,will actually have the value of one Dialogue: 0,0:51:45.32,0:51:48.72,Default,,0000,0000,0000,,for their target variable column, okay?
So Dialogue: 0,0:51:48.72,0:51:51.04,Default,,0000,0000,0000,,a situation where you have your class or Dialogue: 0,0:51:51.04,0:51:52.52,Default,,0000,0000,0000,,your target variable column where the Dialogue: 0,0:51:52.52,0:51:54.48,Default,,0000,0000,0000,,vast majority of values are from one Dialogue: 0,0:51:54.48,0:51:58.12,Default,,0000,0000,0000,,class and a tiny small minority are from Dialogue: 0,0:51:58.12,0:52:00.52,Default,,0000,0000,0000,,another class, we call this an imbalanced Dialogue: 0,0:52:00.52,0:52:02.72,Default,,0000,0000,0000,,dataset. And for an imbalanced dataset, Dialogue: 0,0:52:02.72,0:52:04.32,Default,,0000,0000,0000,,typically we will have a specific Dialogue: 0,0:52:04.32,0:52:05.92,Default,,0000,0000,0000,,technique to do the train test split Dialogue: 0,0:52:05.92,0:52:08.12,Default,,0000,0000,0000,,which is called stratified sampling, and Dialogue: 0,0:52:08.12,0:52:09.60,Default,,0000,0000,0000,,so that's what's exactly happening here. Dialogue: 0,0:52:09.60,0:52:12.00,Default,,0000,0000,0000,,We're doing a stratified split here, so Dialogue: 0,0:52:12.00,0:52:14.84,Default,,0000,0000,0000,,we are doing a train test split here, Dialogue: 0,0:52:14.84,0:52:17.52,Default,,0000,0000,0000,,and we are doing a stratified split. Dialogue: 0,0:52:17.52,0:52:20.36,Default,,0000,0000,0000,,And then now we actually develop the Dialogue: 0,0:52:20.36,0:52:23.36,Default,,0000,0000,0000,,models. So now we've got the train test Dialogue: 0,0:52:23.36,0:52:25.48,Default,,0000,0000,0000,,split, now here is where we actually Dialogue: 0,0:52:25.48,0:52:27.08,Default,,0000,0000,0000,,train the models. Dialogue: 0,0:52:27.08,0:52:29.92,Default,,0000,0000,0000,,Now in terms of classification there are Dialogue: 0,0:52:29.92,0:52:31.30,Default,,0000,0000,0000,,a whole bunch of Dialogue: 0,0:52:31.30,0:52:35.40,Default,,0000,0000,0000,,possibilities, right, that you can use. 
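The stratified train/test split described above is a single argument in scikit-learn. A sketch with invented imbalanced labels, 95% class 0 and 5% class 1; `stratify=y` keeps that ratio in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 190 "no failure" (0) vs 10 "failure" (1).
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 190 + [1] * 10)

# stratify=y preserves the 95/5 class proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())  # both 0.05
```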
Dialogue: 0,0:52:35.40,0:52:38.48,Default,,0000,0000,0000,,There are many, many different algorithms Dialogue: 0,0:52:38.48,0:52:41.00,Default,,0000,0000,0000,,that we can use to create a Dialogue: 0,0:52:41.00,0:52:42.84,Default,,0000,0000,0000,,classification model. These are Dialogue: 0,0:52:42.84,0:52:45.08,Default,,0000,0000,0000,,examples of some of the more common ones: Dialogue: 0,0:52:45.08,0:52:47.48,Default,,0000,0000,0000,,logistic regression, support vector Dialogue: 0,0:52:47.48,0:52:49.52,Default,,0000,0000,0000,,machines, decision trees, random forests, Dialogue: 0,0:52:49.52,0:52:52.72,Default,,0000,0000,0000,,bagging, balanced bagging, boosting, ensembles. So all Dialogue: 0,0:52:52.72,0:52:55.04,Default,,0000,0000,0000,,these are different algorithms which Dialogue: 0,0:52:55.04,0:52:57.76,Default,,0000,0000,0000,,will create different kinds of models Dialogue: 0,0:52:57.76,0:53:01.60,Default,,0000,0000,0000,,which will result in different accuracy Dialogue: 0,0:53:01.60,0:53:05.40,Default,,0000,0000,0000,,measures, okay? So it's the goal of the Dialogue: 0,0:53:05.40,0:53:08.92,Default,,0000,0000,0000,,data scientist to find the best model, Dialogue: 0,0:53:08.92,0:53:11.52,Default,,0000,0000,0000,,the one that gives the best accuracy Dialogue: 0,0:53:11.52,0:53:14.12,Default,,0000,0000,0000,,when trained on the Dialogue: 0,0:53:14.12,0:53:16.88,Default,,0000,0000,0000,,given dataset. So let's head back, again, Dialogue: 0,0:53:16.88,0:53:19.76,Default,,0000,0000,0000,,to our machine learning workflow. So Dialogue: 0,0:53:19.76,0:53:21.52,Default,,0000,0000,0000,,here basically what I'm doing is I'm Dialogue: 0,0:53:21.52,0:53:23.52,Default,,0000,0000,0000,,creating a whole bunch of models here, Dialogue: 0,0:53:23.52,0:53:25.52,Default,,0000,0000,0000,,all right?
So one is a random forest, one Dialogue: 0,0:53:25.52,0:53:27.16,Default,,0000,0000,0000,,is balanced bagging, one is a boosting Dialogue: 0,0:53:27.16,0:53:29.52,Default,,0000,0000,0000,,classifier, one's an ensemble classifier, Dialogue: 0,0:53:29.52,0:53:32.76,Default,,0000,0000,0000,,and using all of these, I am going to Dialogue: 0,0:53:32.76,0:53:35.32,Default,,0000,0000,0000,,basically train my models using Dialogue: 0,0:53:35.32,0:53:37.44,Default,,0000,0000,0000,,all these algorithms. And then I'm going Dialogue: 0,0:53:37.44,0:53:39.80,Default,,0000,0000,0000,,to evaluate them, okay? I'm going to Dialogue: 0,0:53:39.80,0:53:42.48,Default,,0000,0000,0000,,evaluate how good each of these models Dialogue: 0,0:53:42.48,0:53:45.76,Default,,0000,0000,0000,,is. And here you can see your Dialogue: 0,0:53:45.76,0:53:48.84,Default,,0000,0000,0000,,evaluation data, right? Okay, and this is Dialogue: 0,0:53:48.84,0:53:50.84,Default,,0000,0000,0000,,the confusion matrix, which is another Dialogue: 0,0:53:50.84,0:53:54.28,Default,,0000,0000,0000,,way of evaluating. So now we come to, Dialogue: 0,0:53:54.28,0:53:56.32,Default,,0000,0000,0000,,kind of, the key part here, which Dialogue: 0,0:53:56.32,0:53:58.52,Default,,0000,0000,0000,,is how do I distinguish between Dialogue: 0,0:53:58.52,0:54:00.08,Default,,0000,0000,0000,,all these models, right? I've got all Dialogue: 0,0:54:00.08,0:54:01.40,Default,,0000,0000,0000,,these different models which are built Dialogue: 0,0:54:01.40,0:54:03.04,Default,,0000,0000,0000,,with different algorithms which I'm Dialogue: 0,0:54:03.04,0:54:05.36,Default,,0000,0000,0000,,using to train on the same dataset, how Dialogue: 0,0:54:05.36,0:54:07.36,Default,,0000,0000,0000,,do I distinguish between all these Dialogue: 0,0:54:07.36,0:54:10.36,Default,,0000,0000,0000,,models, okay?
And for Dialogue: 0,0:54:10.36,0:54:13.88,Default,,0000,0000,0000,,that, we actually have a whole bunch of Dialogue: 0,0:54:13.88,0:54:16.20,Default,,0000,0000,0000,,common evaluation metrics for Dialogue: 0,0:54:16.20,0:54:18.32,Default,,0000,0000,0000,,classification, right? So these evaluation Dialogue: 0,0:54:18.32,0:54:22.24,Default,,0000,0000,0000,,metrics tell us how good a model is in Dialogue: 0,0:54:22.24,0:54:24.32,Default,,0000,0000,0000,,terms of its accuracy in Dialogue: 0,0:54:24.32,0:54:27.00,Default,,0000,0000,0000,,classification. So in terms of Dialogue: 0,0:54:27.00,0:54:29.44,Default,,0000,0000,0000,,accuracy, we actually have many different Dialogue: 0,0:54:29.44,0:54:31.68,Default,,0000,0000,0000,,measures, Dialogue: 0,0:54:31.68,0:54:33.44,Default,,0000,0000,0000,,right? You might think, well, accuracy is Dialogue: 0,0:54:33.44,0:54:35.40,Default,,0000,0000,0000,,just accuracy, it's Dialogue: 0,0:54:35.40,0:54:36.88,Default,,0000,0000,0000,,just either accurate or not Dialogue: 0,0:54:36.88,0:54:39.32,Default,,0000,0000,0000,,accurate, right? But actually it's not Dialogue: 0,0:54:39.32,0:54:41.36,Default,,0000,0000,0000,,that simple. There are many different Dialogue: 0,0:54:41.36,0:54:43.84,Default,,0000,0000,0000,,ways to measure the accuracy of a Dialogue: 0,0:54:43.84,0:54:45.48,Default,,0000,0000,0000,,classification model, and these are some Dialogue: 0,0:54:45.48,0:54:48.28,Default,,0000,0000,0000,,of the more common ones.
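[Editor's note] scikit-learn bundles the common classification metrics into a single text summary, the classification report, which the transcript comes back to later; here is a toy example with invented labels, not the demo's data.

```python
# Toy classification report: per-class precision, recall, F1 and
# support, computed from hand-made labels (illustrative only).
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

report = classification_report(
    y_true, y_pred, target_names=["no failure", "failure"]
)
print(report)
```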
So, for example, Dialogue: 0,0:54:48.28,0:54:51.00,Default,,0000,0000,0000,,the confusion matrix tells us how many Dialogue: 0,0:54:51.00,0:54:54.00,Default,,0000,0000,0000,,true positives, that means the value is Dialogue: 0,0:54:54.00,0:54:55.88,Default,,0000,0000,0000,,positive, the prediction is positive; how Dialogue: 0,0:54:55.88,0:54:57.52,Default,,0000,0000,0000,,many false positives, which means the Dialogue: 0,0:54:57.52,0:54:59.04,Default,,0000,0000,0000,,value is negative but the machine learning Dialogue: 0,0:54:59.04,0:55:01.84,Default,,0000,0000,0000,,model predicts positive. How many false Dialogue: 0,0:55:01.84,0:55:03.84,Default,,0000,0000,0000,,negatives, which means that the machine Dialogue: 0,0:55:03.84,0:55:05.56,Default,,0000,0000,0000,,learning model predicts negative, but Dialogue: 0,0:55:05.56,0:55:07.48,Default,,0000,0000,0000,,it's actually positive. And how many true Dialogue: 0,0:55:07.48,0:55:09.36,Default,,0000,0000,0000,,negatives there are, which means that the Dialogue: 0,0:55:09.36,0:55:11.24,Default,,0000,0000,0000,,machine learning model Dialogue: 0,0:55:11.24,0:55:12.88,Default,,0000,0000,0000,,predicts negative and the true value is Dialogue: 0,0:55:12.88,0:55:14.76,Default,,0000,0000,0000,,also negative. So this is called a Dialogue: 0,0:55:14.76,0:55:16.92,Default,,0000,0000,0000,,confusion matrix. This is one way we Dialogue: 0,0:55:16.92,0:55:19.48,Default,,0000,0000,0000,,assess or evaluate the performance of a Dialogue: 0,0:55:19.48,0:55:20.52,Default,,0000,0000,0000,,classification model, Dialogue: 0,0:55:20.52,0:55:23.32,Default,,0000,0000,0000,,okay? This is for binary Dialogue: 0,0:55:23.32,0:55:24.68,Default,,0000,0000,0000,,classification; we can also have a Dialogue: 0,0:55:24.68,0:55:26.88,Default,,0000,0000,0000,,multiclass confusion matrix, Dialogue: 0,0:55:26.88,0:55:29.00,Default,,0000,0000,0000,,and then we can also measure things like Dialogue: 0,0:55:29.00,0:55:31.72,Default,,0000,0000,0000,,accuracy.
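[Editor's note] The four cells just described can be read off with scikit-learn's `confusion_matrix`; a toy binary example with invented labels:

```python
# Toy confusion matrix: rows are true classes, columns are predictions,
# so ravel() yields (tn, fp, fn, tp) in the binary case.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]   # invented ground truth
y_pred = [1, 0, 0, 1, 0, 1]   # invented model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```

Here one true positive was missed (a false negative) and one negative was flagged (a false positive), exactly the two error types the transcript distinguishes.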
So accuracy is the true Dialogue: 0,0:55:31.72,0:55:34.08,Default,,0000,0000,0000,,positives plus the true negatives, which Dialogue: 0,0:55:34.08,0:55:35.44,Default,,0000,0000,0000,,is the total number of correct Dialogue: 0,0:55:35.44,0:55:37.84,Default,,0000,0000,0000,,predictions made by the model, divided by Dialogue: 0,0:55:37.84,0:55:39.84,Default,,0000,0000,0000,,the total number of data points in your Dialogue: 0,0:55:39.84,0:55:42.60,Default,,0000,0000,0000,,dataset. And then you have also other Dialogue: 0,0:55:42.60,0:55:43.15,Default,,0000,0000,0000,,kinds of Dialogue: 0,0:55:43.15,0:55:46.60,Default,,0000,0000,0000,,measures such as recall. And this is the Dialogue: 0,0:55:46.60,0:55:49.16,Default,,0000,0000,0000,,formula for recall, this is the formula for Dialogue: 0,0:55:49.16,0:55:51.48,Default,,0000,0000,0000,,the F1 score, okay? And then there's Dialogue: 0,0:55:51.48,0:55:55.56,Default,,0000,0000,0000,,something called the ROC curve, right? So Dialogue: 0,0:55:55.56,0:55:57.04,Default,,0000,0000,0000,,without going too much into the details of Dialogue: 0,0:55:57.04,0:55:59.00,Default,,0000,0000,0000,,what each of these entails, essentially Dialogue: 0,0:55:59.00,0:56:00.64,Default,,0000,0000,0000,,these are all different ways, these are Dialogue: 0,0:56:00.64,0:56:03.28,Default,,0000,0000,0000,,different KPIs, right? Just like if you Dialogue: 0,0:56:03.28,0:56:06.12,Default,,0000,0000,0000,,work in a company, you have different KPIs, Dialogue: 0,0:56:06.12,0:56:08.08,Default,,0000,0000,0000,,right? Certain employees have certain KPIs Dialogue: 0,0:56:08.08,0:56:11.28,Default,,0000,0000,0000,,that measure how good or how, you Dialogue: 0,0:56:11.28,0:56:13.20,Default,,0000,0000,0000,,know, efficient or how effective a Dialogue: 0,0:56:13.20,0:56:15.50,Default,,0000,0000,0000,,particular employee is, right?
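[Editor's note] The formulas just mentioned reduce to simple arithmetic on the confusion-matrix counts; a pure-Python sketch with invented counts:

```python
# Accuracy, precision, recall and F1 from confusion-matrix counts
# (counts invented for illustration).
tp, fp, fn, tn = 30, 10, 10, 50

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall = tp / (tp + fn)                      # of real positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# With these counts: accuracy = 0.8, precision = recall = f1 = 0.75
```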
So the Dialogue: 0,0:56:15.50,0:56:19.88,Default,,0000,0000,0000,,KPIs for your machine learning models Dialogue: 0,0:56:19.88,0:56:24.24,Default,,0000,0000,0000,,are the ROC curve, F1 score, recall, accuracy, Dialogue: 0,0:56:24.24,0:56:26.60,Default,,0000,0000,0000,,okay, and your confusion matrix. So Dialogue: 0,0:56:26.60,0:56:29.84,Default,,0000,0000,0000,,fundamentally, after I have built, right, Dialogue: 0,0:56:29.84,0:56:33.36,Default,,0000,0000,0000,,so here I've built my four different Dialogue: 0,0:56:33.36,0:56:35.24,Default,,0000,0000,0000,,models. So after I've built these four Dialogue: 0,0:56:35.24,0:56:37.64,Default,,0000,0000,0000,,different models, I'm going to check and Dialogue: 0,0:56:37.64,0:56:39.68,Default,,0000,0000,0000,,evaluate them using all those different Dialogue: 0,0:56:39.68,0:56:42.44,Default,,0000,0000,0000,,metrics like, for example, the F1 score, Dialogue: 0,0:56:42.44,0:56:44.84,Default,,0000,0000,0000,,the precision score, the recall score, all Dialogue: 0,0:56:44.84,0:56:47.32,Default,,0000,0000,0000,,right. So for this model, I can check out Dialogue: 0,0:56:47.32,0:56:50.04,Default,,0000,0000,0000,,the ROC score, the F1 score, the precision Dialogue: 0,0:56:50.04,0:56:52.12,Default,,0000,0000,0000,,score, the recall score. Then for this Dialogue: 0,0:56:52.12,0:56:54.80,Default,,0000,0000,0000,,model, this is the ROC score, the F1 score, Dialogue: 0,0:56:54.80,0:56:56.84,Default,,0000,0000,0000,,the precision score, the recall score. Dialogue: 0,0:56:56.84,0:56:59.68,Default,,0000,0000,0000,,Then for this model, and so on. So for Dialogue: 0,0:56:59.68,0:57:03.24,Default,,0000,0000,0000,,every single model I've created using my Dialogue: 0,0:57:03.24,0:57:05.84,Default,,0000,0000,0000,,training dataset, I will have my full set Dialogue: 0,0:57:05.84,0:57:08.00,Default,,0000,0000,0000,,of evaluation metrics that I can use to Dialogue: 0,0:57:08.00,0:57:11.84,Default,,0000,0000,0000,,evaluate how good this model is, okay?
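[Editor's note] Collecting the same set of metrics for every model can be sketched as below, assuming scikit-learn; the per-model predictions are hard-coded stand-ins for real model outputs, and the model names are made up.

```python
# Sketch: compute the same evaluation metrics for each "model".
# Predictions are hard-coded stand-ins for real classifier outputs.
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]
predictions = {
    "model_a": [0, 0, 1, 1, 1, 1],   # catches everything, one false alarm
    "model_b": [0, 0, 0, 0, 1, 1],   # no false alarms, misses one failure
}

metrics = {
    name: {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
}
```

The resulting table of numbers per model is exactly the comparison the transcript walks through model by model.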
Dialogue: 0,0:57:11.84,0:57:13.12,Default,,0000,0000,0000,,Same thing here, I've got a confusion Dialogue: 0,0:57:13.12,0:57:15.08,Default,,0000,0000,0000,,matrix here, right, so I can use that, Dialogue: 0,0:57:15.08,0:57:18.12,Default,,0000,0000,0000,,again, to evaluate between all these four Dialogue: 0,0:57:18.12,0:57:20.20,Default,,0000,0000,0000,,different models, and then I, kind of, Dialogue: 0,0:57:20.20,0:57:22.24,Default,,0000,0000,0000,,summarize it up here. So we can see from Dialogue: 0,0:57:22.24,0:57:25.44,Default,,0000,0000,0000,,this summary here that there are two top Dialogue: 0,0:57:25.44,0:57:27.60,Default,,0000,0000,0000,,models, right, which, as a Dialogue: 0,0:57:27.60,0:57:29.44,Default,,0000,0000,0000,,data scientist, I'm now Dialogue: 0,0:57:29.44,0:57:31.12,Default,,0000,0000,0000,,going to just focus on. Dialogue: 0,0:57:31.12,0:57:33.44,Default,,0000,0000,0000,,So these two models are the bagging Dialogue: 0,0:57:33.44,0:57:36.00,Default,,0000,0000,0000,,classifier and the random forest classifier. Dialogue: 0,0:57:36.00,0:57:38.48,Default,,0000,0000,0000,,They have the highest values of the F1 score Dialogue: 0,0:57:38.48,0:57:40.48,Default,,0000,0000,0000,,and the highest values of the ROC curve Dialogue: 0,0:57:40.48,0:57:42.64,Default,,0000,0000,0000,,score, okay? So we can say these are the Dialogue: 0,0:57:42.64,0:57:45.84,Default,,0000,0000,0000,,top two models in terms of accuracy, okay, Dialogue: 0,0:57:45.84,0:57:48.92,Default,,0000,0000,0000,,using the F1 evaluation metric and the Dialogue: 0,0:57:48.92,0:57:53.72,Default,,0000,0000,0000,,ROC AUC evaluation metric, okay?
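[Editor's note] The ROC AUC score used to rank the models is computed from predicted probabilities of the positive class; a toy example with invented scores, assuming scikit-learn:

```python
# Toy ROC AUC: y_score holds predicted probabilities of the positive
# class; the AUC is the area under the ROC curve built from them.
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]      # invented probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)                 # equals roc_auc_score(y_true, y_score)
```

An AUC of 1.0 would mean every failure is scored above every non-failure; 0.5 is no better than chance.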
So these Dialogue: 0,0:57:53.72,0:57:57.48,Default,,0000,0000,0000,,results are, kind of, summarized here, and Dialogue: 0,0:57:57.48,0:57:59.08,Default,,0000,0000,0000,,then we use different sampling Dialogue: 0,0:57:59.08,0:58:00.88,Default,,0000,0000,0000,,techniques, okay? So just now I talked Dialogue: 0,0:58:00.88,0:58:03.68,Default,,0000,0000,0000,,about different kinds of sampling Dialogue: 0,0:58:03.68,0:58:06.40,Default,,0000,0000,0000,,techniques, and the idea of different Dialogue: 0,0:58:06.40,0:58:08.32,Default,,0000,0000,0000,,kinds of sampling techniques is to just Dialogue: 0,0:58:08.32,0:58:11.32,Default,,0000,0000,0000,,get a different feel for different Dialogue: 0,0:58:11.32,0:58:13.72,Default,,0000,0000,0000,,distributions of the data in different Dialogue: 0,0:58:13.72,0:58:16.36,Default,,0000,0000,0000,,areas of your dataset, so that you Dialogue: 0,0:58:16.36,0:58:20.00,Default,,0000,0000,0000,,just, kind of, make sure that your Dialogue: 0,0:58:20.00,0:58:22.80,Default,,0000,0000,0000,,evaluation of accuracy is actually Dialogue: 0,0:58:22.80,0:58:27.08,Default,,0000,0000,0000,,statistically correct, right? So we can Dialogue: 0,0:58:27.08,0:58:29.60,Default,,0000,0000,0000,,do what is called oversampling and under Dialogue: 0,0:58:29.60,0:58:30.88,Default,,0000,0000,0000,,sampling, which is very useful when Dialogue: 0,0:58:30.88,0:58:32.28,Default,,0000,0000,0000,,you're working with an imbalanced data Dialogue: 0,0:58:32.28,0:58:35.04,Default,,0000,0000,0000,,set. So this is an example of doing that, and Dialogue: 0,0:58:35.04,0:58:37.24,Default,,0000,0000,0000,,then here we again check out the Dialogue: 0,0:58:37.24,0:58:38.80,Default,,0000,0000,0000,,results for all these different Dialogue: 0,0:58:38.80,0:58:41.68,Default,,0000,0000,0000,,techniques. We use the F1 score, the AUC Dialogue: 0,0:58:41.68,0:58:43.60,Default,,0000,0000,0000,,score, all right? These are the two key Dialogue: 
0,0:58:43.60,0:58:46.76,Default,,0000,0000,0000,,measures of accuracy, right? So then Dialogue: 0,0:58:46.76,0:58:47.92,Default,,0000,0000,0000,,we can check out the scores for the Dialogue: 0,0:58:47.92,0:58:50.48,Default,,0000,0000,0000,,different approaches, okay? So we can see, Dialogue: 0,0:58:50.48,0:58:53.12,Default,,0000,0000,0000,,oh well, overall the models have a lower Dialogue: 0,0:58:53.12,0:58:55.72,Default,,0000,0000,0000,,ROC AUC score, but they have a much Dialogue: 0,0:58:55.72,0:58:58.28,Default,,0000,0000,0000,,higher F1 score. The bagging classifier Dialogue: 0,0:58:58.28,0:59:00.84,Default,,0000,0000,0000,,had the highest ROC AUC score, Dialogue: 0,0:59:00.84,0:59:04.12,Default,,0000,0000,0000,,but its F1 score was too low, okay? Then, in Dialogue: 0,0:59:04.12,0:59:06.52,Default,,0000,0000,0000,,the data scientist's opinion, the random Dialogue: 0,0:59:06.52,0:59:08.52,Default,,0000,0000,0000,,forest with this particular technique of Dialogue: 0,0:59:08.52,0:59:10.76,Default,,0000,0000,0000,,sampling has an equilibrium between the Dialogue: 0,0:59:10.76,0:59:14.48,Default,,0000,0000,0000,,F1 and AUC scores. So takeaway one Dialogue: 0,0:59:14.48,0:59:16.68,Default,,0000,0000,0000,,is the macro F1 score improves Dialogue: 0,0:59:16.68,0:59:18.48,Default,,0000,0000,0000,,dramatically using the sampling Dialogue: 0,0:59:18.48,0:59:20.16,Default,,0000,0000,0000,,techniques, so these models might be better Dialogue: 0,0:59:20.16,0:59:22.44,Default,,0000,0000,0000,,compared to the balanced ones. All right, Dialogue: 0,0:59:22.44,0:59:26.28,Default,,0000,0000,0000,,so based on all this evaluation, the Dialogue: 0,0:59:26.28,0:59:27.68,Default,,0000,0000,0000,,data scientist says they're going to Dialogue: 0,0:59:27.68,0:59:29.92,Default,,0000,0000,0000,,continue to work with these two models, Dialogue: 0,0:59:29.92,0:59:31.44,Default,,0000,0000,0000,,all right, and the balanced bagging one, Dialogue: 
0,0:59:31.44,0:59:33.08,Default,,0000,0000,0000,,and then continue to make further Dialogue: 0,0:59:33.08,0:59:35.04,Default,,0000,0000,0000,,comparisons, all right? So then we Dialogue: 0,0:59:35.04,0:59:37.08,Default,,0000,0000,0000,,continue to keep refining our Dialogue: 0,0:59:37.08,0:59:38.60,Default,,0000,0000,0000,,evaluation work. Here we're going to Dialogue: 0,0:59:38.60,0:59:41.00,Default,,0000,0000,0000,,train the models one more time, so Dialogue: 0,0:59:41.00,0:59:43.04,Default,,0000,0000,0000,,we again do a train test split, and Dialogue: 0,0:59:43.04,0:59:44.80,Default,,0000,0000,0000,,then we do that for this particular Dialogue: 0,0:59:44.80,0:59:47.04,Default,,0000,0000,0000,,model, and then we Dialogue: 0,0:59:47.04,0:59:48.20,Default,,0000,0000,0000,,print out what is called a Dialogue: 0,0:59:48.20,0:59:50.96,Default,,0000,0000,0000,,classification report, and this is Dialogue: 0,0:59:50.96,0:59:53.40,Default,,0000,0000,0000,,basically a summary of all those metrics Dialogue: 0,0:59:53.40,0:59:55.36,Default,,0000,0000,0000,,that I talked about just now. So just now, Dialogue: 0,0:59:55.36,0:59:57.52,Default,,0000,0000,0000,,remember, I said there were Dialogue: 0,0:59:57.52,0:59:59.68,Default,,0000,0000,0000,,several evaluation metrics, right? So Dialogue: 0,0:59:59.68,1:00:01.48,Default,,0000,0000,0000,,we had the confusion matrix, the Dialogue: 0,1:00:01.48,1:00:04.12,Default,,0000,0000,0000,,accuracy, the precision, the recall, the Dialogue: 0,1:00:04.12,1:00:08.12,Default,,0000,0000,0000,,AUC score. So here, with the classification Dialogue: 0,1:00:08.12,1:00:09.88,Default,,0000,0000,0000,,report, I can get a summary of all of Dialogue: 0,1:00:09.88,1:00:11.76,Default,,0000,0000,0000,,that, so I can see all the values here, Dialogue: 0,1:00:11.76,1:00:14.64,Default,,0000,0000,0000,,okay, for this particular model, bagging Dialogue: 0,1:00:14.64,1:00:17.16,Default,,0000,0000,0000,,with Tomek links, and then I can do that
for Dialogue: 0,1:00:17.16,1:00:18.64,Default,,0000,0000,0000,,another model, the random forest with Dialogue: 0,1:00:18.64,1:00:20.60,Default,,0000,0000,0000,,borderline SMOTE, and then I can do that Dialogue: 0,1:00:20.60,1:00:22.20,Default,,0000,0000,0000,,for another model, which is the balanced Dialogue: 0,1:00:22.20,1:00:25.16,Default,,0000,0000,0000,,bagging. So again, we see a lot of Dialogue: 0,1:00:25.16,1:00:27.08,Default,,0000,0000,0000,,comparison between different models, Dialogue: 0,1:00:27.08,1:00:28.64,Default,,0000,0000,0000,,trying to figure out what all these Dialogue: 0,1:00:28.64,1:00:30.72,Default,,0000,0000,0000,,evaluation metrics are telling us, all Dialogue: 0,1:00:30.72,1:00:32.96,Default,,0000,0000,0000,,right? Then again, we have a confusion Dialogue: 0,1:00:32.96,1:00:35.88,Default,,0000,0000,0000,,matrix. So we generate a confusion matrix Dialogue: 0,1:00:35.88,1:00:38.88,Default,,0000,0000,0000,,for the bagging with the Tomek links Dialogue: 0,1:00:38.88,1:00:40.72,Default,,0000,0000,0000,,undersampling, for the random forest Dialogue: 0,1:00:40.72,1:00:42.68,Default,,0000,0000,0000,,with the borderline SMOTE oversampling, Dialogue: 0,1:00:42.68,1:00:44.96,Default,,0000,0000,0000,,and just balanced bagging by itself. Then Dialogue: 0,1:00:44.96,1:00:47.72,Default,,0000,0000,0000,,again, we compare between these three Dialogue: 0,1:00:47.72,1:00:50.80,Default,,0000,0000,0000,,models using the confusion matrix Dialogue: 0,1:00:50.80,1:00:52.60,Default,,0000,0000,0000,,evaluation metric, and then we can kind Dialogue: 0,1:00:52.60,1:00:55.68,Default,,0000,0000,0000,,of come to some conclusions, all right? So, Dialogue: 0,1:00:55.68,1:00:58.16,Default,,0000,0000,0000,,right, so now we look at all the data, Dialogue: 0,1:00:58.16,1:01:01.20,Default,,0000,0000,0000,,then we move on and look at another Dialogue: 0,1:01:01.20,1:01:03.16,Default,,0000,0000,0000,,kind of evaluation metric, which Dialogue: 
0,1:01:03.16,1:01:06.72,Default,,0000,0000,0000,,is the ROC score, right? So this is one of Dialogue: 0,1:01:06.72,1:01:08.68,Default,,0000,0000,0000,,the other evaluation metrics I talked Dialogue: 0,1:01:08.68,1:01:11.20,Default,,0000,0000,0000,,about. So this one is a kind of curve; Dialogue: 0,1:01:11.20,1:01:12.52,Default,,0000,0000,0000,,you look at it to see the area Dialogue: 0,1:01:12.52,1:01:14.36,Default,,0000,0000,0000,,underneath the curve. This is called Dialogue: 0,1:01:14.36,1:01:18.08,Default,,0000,0000,0000,,AUC, Dialogue: 0,1:01:18.08,1:01:19.88,Default,,0000,0000,0000,,area under the curve, all right? So the Dialogue: 0,1:01:19.88,1:01:21.84,Default,,0000,0000,0000,,area under the curve Dialogue: 0,1:01:21.84,1:01:24.32,Default,,0000,0000,0000,,score will give us some idea about the Dialogue: 0,1:01:24.32,1:01:25.60,Default,,0000,0000,0000,,threshold that we're going to use for Dialogue: 0,1:01:25.60,1:01:27.68,Default,,0000,0000,0000,,classification. So we can examine this Dialogue: 0,1:01:27.68,1:01:29.20,Default,,0000,0000,0000,,for the bagging classifier, for the Dialogue: 0,1:01:29.20,1:01:30.96,Default,,0000,0000,0000,,random forest classifier, for the balanced Dialogue: 0,1:01:30.96,1:01:33.60,Default,,0000,0000,0000,,bagging classifier, okay? Then we can also Dialogue: 0,1:01:33.60,1:01:36.20,Default,,0000,0000,0000,,do that again. Finally, we can check Dialogue: 0,1:01:36.20,1:01:37.88,Default,,0000,0000,0000,,the classification report of this Dialogue: 0,1:01:37.88,1:01:39.68,Default,,0000,0000,0000,,particular model. So we keep doing this Dialogue: 0,1:01:39.68,1:01:43.20,Default,,0000,0000,0000,,over and over again, evaluating Dialogue: 0,1:01:43.20,1:01:45.72,Default,,0000,0000,0000,,the metrics, the accuracy metrics, the Dialogue: 0,1:01:45.72,1:01:46.88,Default,,0000,0000,0000,,evaluation metrics, for all these Dialogue: 0,1:01:46.88,1:01:48.88,Default,,0000,0000,0000,,different models. So we keep
doing this Dialogue: 0,1:01:48.88,1:01:50.52,Default,,0000,0000,0000,,over and over again for different Dialogue: 0,1:01:50.52,1:01:53.44,Default,,0000,0000,0000,,thresholds for classification. And so, Dialogue: 0,1:01:53.44,1:01:56.88,Default,,0000,0000,0000,,as we keep drilling into these, we kind Dialogue: 0,1:01:56.88,1:02:00.84,Default,,0000,0000,0000,,of get more and more understanding of Dialogue: 0,1:02:00.84,1:02:02.80,Default,,0000,0000,0000,,all these different models, which one is Dialogue: 0,1:02:02.80,1:02:04.76,Default,,0000,0000,0000,,the best one that gives the best Dialogue: 0,1:02:04.76,1:02:08.52,Default,,0000,0000,0000,,performance for our dataset, okay? So Dialogue: 0,1:02:08.52,1:02:11.44,Default,,0000,0000,0000,,finally we come to this conclusion: this Dialogue: 0,1:02:11.44,1:02:13.52,Default,,0000,0000,0000,,particular model is not able to get a Dialogue: 0,1:02:13.52,1:02:15.28,Default,,0000,0000,0000,,recall on failure detection higher than Dialogue: 0,1:02:15.28,1:02:17.52,Default,,0000,0000,0000,,95.8%; on the other hand, balanced bagging Dialogue: 0,1:02:17.52,1:02:19.40,Default,,0000,0000,0000,,with a decision threshold of 0.6 is able Dialogue: 0,1:02:19.40,1:02:21.52,Default,,0000,0000,0000,,to have a better recall, and so on, Dialogue: 0,1:02:21.52,1:02:25.32,Default,,0000,0000,0000,,etc. So finally, after having done all of Dialogue: 0,1:02:25.32,1:02:27.48,Default,,0000,0000,0000,,these evaluations, Dialogue: 0,1:02:27.48,1:02:31.12,Default,,0000,0000,0000,,okay, this is the conclusion. Dialogue: 0,1:02:31.12,1:02:33.96,Default,,0000,0000,0000,,So right now we Dialogue: 0,1:02:33.96,1:02:35.28,Default,,0000,0000,0000,,have gone through all the steps of the Dialogue: 0,1:02:35.28,1:02:37.76,Default,,0000,0000,0000,,machine learning life cycle, which Dialogue: 0,1:02:37.76,1:02:40.24,Default,,0000,0000,0000,,means we right now, or the data Dialogue: 0,1:02:40.24,1:02:41.96,Default,,0000,0000,0000,,scientist right now,
has gone through all Dialogue: 0,1:02:41.96,1:02:43.00,Default,,0000,0000,0000,,these Dialogue: 0,1:02:43.00,1:02:47.08,Default,,0000,0000,0000,,steps, which is, now we have done this Dialogue: 0,1:02:47.08,1:02:48.64,Default,,0000,0000,0000,,validation. So we have done the cleaning, Dialogue: 0,1:02:48.64,1:02:50.56,Default,,0000,0000,0000,,exploration, preparation, transformation, Dialogue: 0,1:02:50.56,1:02:52.60,Default,,0000,0000,0000,,the feature engineering; we have developed Dialogue: 0,1:02:52.60,1:02:54.36,Default,,0000,0000,0000,,and trained multiple models; we have Dialogue: 0,1:02:54.36,1:02:56.48,Default,,0000,0000,0000,,evaluated all these different models. So Dialogue: 0,1:02:56.48,1:02:58.60,Default,,0000,0000,0000,,right now we have reached this stage. So Dialogue: 0,1:02:58.60,1:03:02.72,Default,,0000,0000,0000,,at this stage, we as the data scientist Dialogue: 0,1:03:02.72,1:03:05.48,Default,,0000,0000,0000,,have, kind of, completed our job. So we've Dialogue: 0,1:03:05.48,1:03:08.12,Default,,0000,0000,0000,,come to some very useful conclusions Dialogue: 0,1:03:08.12,1:03:09.64,Default,,0000,0000,0000,,which we now can share with our Dialogue: 0,1:03:09.64,1:03:13.24,Default,,0000,0000,0000,,colleagues, all right? And based on these Dialogue: 0,1:03:13.24,1:03:15.40,Default,,0000,0000,0000,,conclusions or recommendations, Dialogue: 0,1:03:15.40,1:03:17.16,Default,,0000,0000,0000,,somebody is going to choose an Dialogue: 0,1:03:17.16,1:03:19.16,Default,,0000,0000,0000,,appropriate model, and that model is Dialogue: 0,1:03:19.16,1:03:22.64,Default,,0000,0000,0000,,going to get deployed for real-time use Dialogue: 0,1:03:22.64,1:03:25.32,Default,,0000,0000,0000,,in a real-life production environment, Dialogue: 0,1:03:25.32,1:03:27.24,Default,,0000,0000,0000,,okay? And that decision is going to be Dialogue: 0,1:03:27.24,1:03:29.36,Default,,0000,0000,0000,,made based on the recommendations coming Dialogue: 0,1:03:29.36,1:03:30.88,Default,,0000,0000,0000,,from the data
scientist at the end of Dialogue: 0,1:03:30.88,1:03:33.48,Default,,0000,0000,0000,,this phase, okay? So at the end of this Dialogue: 0,1:03:33.48,1:03:35.08,Default,,0000,0000,0000,,phase, the data scientist is going to Dialogue: 0,1:03:35.08,1:03:36.88,Default,,0000,0000,0000,,come up with these conclusions. So the Dialogue: 0,1:03:36.88,1:03:41.76,Default,,0000,0000,0000,,conclusions are: okay, if the engineering Dialogue: 0,1:03:41.76,1:03:44.52,Default,,0000,0000,0000,,team, right, the Dialogue: 0,1:03:44.52,1:03:46.12,Default,,0000,0000,0000,,engineering team, if Dialogue: 0,1:03:46.12,1:03:48.72,Default,,0000,0000,0000,,they are looking for the highest Dialogue: 0,1:03:48.72,1:03:51.84,Default,,0000,0000,0000,,failure detection rate possible, then Dialogue: 0,1:03:51.84,1:03:54.48,Default,,0000,0000,0000,,they should go with this particular Dialogue: 0,1:03:54.48,1:03:56.52,Default,,0000,0000,0000,,model, okay? Dialogue: 0,1:03:56.52,1:03:58.68,Default,,0000,0000,0000,,And if they want a balance between Dialogue: 0,1:03:58.68,1:04:01.04,Default,,0000,0000,0000,,precision and recall, then they should Dialogue: 0,1:04:01.04,1:04:03.24,Default,,0000,0000,0000,,choose between the bagging model with a Dialogue: 0,1:04:03.24,1:04:05.96,Default,,0000,0000,0000,,0.4 decision threshold or the random Dialogue: 0,1:04:05.96,1:04:09.60,Default,,0000,0000,0000,,forest model with a 0.5 threshold. But if Dialogue: 0,1:04:09.60,1:04:11.88,Default,,0000,0000,0000,,they don't care so much about predicting Dialogue: 0,1:04:11.88,1:04:14.48,Default,,0000,0000,0000,,every failure, and they want the highest Dialogue: 0,1:04:14.48,1:04:16.76,Default,,0000,0000,0000,,precision possible, then they should opt Dialogue: 0,1:04:16.76,1:04:19.80,Default,,0000,0000,0000,,for the bagging Tomek links classifier Dialogue: 0,1:04:19.80,1:04:23.16,Default,,0000,0000,0000,,with a bit higher decision threshold. And Dialogue: 0,1:04:23.16,1:04:26.16,Default,,0000,0000,0000,,so
this is the key thing that the data Dialogue: 0,1:04:26.16,1:04:28.32,Default,,0000,0000,0000,,scientist is going to give, right? This is Dialogue: 0,1:04:28.32,1:04:30.76,Default,,0000,0000,0000,,the key takeaway, this is, kind of, the Dialogue: 0,1:04:30.76,1:04:32.68,Default,,0000,0000,0000,,end result of the entire machine Dialogue: 0,1:04:32.68,1:04:34.68,Default,,0000,0000,0000,,learning life cycle, right? Now the data Dialogue: 0,1:04:34.68,1:04:36.40,Default,,0000,0000,0000,,scientist is going to tell the Dialogue: 0,1:04:36.40,1:04:38.60,Default,,0000,0000,0000,,engineering team: all right, you guys, Dialogue: 0,1:04:38.60,1:04:41.16,Default,,0000,0000,0000,,which is more important for you, point A, Dialogue: 0,1:04:41.16,1:04:45.04,Default,,0000,0000,0000,,point B, or point C? Make your decision. So Dialogue: 0,1:04:45.04,1:04:47.40,Default,,0000,0000,0000,,the engineering team will then discuss Dialogue: 0,1:04:47.40,1:04:48.96,Default,,0000,0000,0000,,among themselves and say, hey, you know Dialogue: 0,1:04:48.96,1:04:52.28,Default,,0000,0000,0000,,what, what we want is to get the Dialogue: 0,1:04:52.28,1:04:54.72,Default,,0000,0000,0000,,highest failure detection possible, Dialogue: 0,1:04:54.72,1:04:58.36,Default,,0000,0000,0000,,because any kind of failure of that Dialogue: 0,1:04:58.36,1:05:00.40,Default,,0000,0000,0000,,machine or the product on the assembly Dialogue: 0,1:05:00.40,1:05:03.12,Default,,0000,0000,0000,,line is really going to screw us up big Dialogue: 0,1:05:03.12,1:05:05.64,Default,,0000,0000,0000,,time. So what we're looking for is the Dialogue: 0,1:05:05.64,1:05:08.08,Default,,0000,0000,0000,,model that will give us the highest Dialogue: 0,1:05:08.08,1:05:10.88,Default,,0000,0000,0000,,failure detection rate. We don't care Dialogue: 0,1:05:10.88,1:05:13.48,Default,,0000,0000,0000,,about precision, but we want to make Dialogue: 0,1:05:13.48,1:05:15.44,Default,,0000,0000,0000,,sure that if there's a failure, we are Dialogue: 
0,1:05:15.44,1:05:17.72,Default,,0000,0000,0000,,going to catch it, right? So that's what Dialogue: 0,1:05:17.72,1:05:19.60,Default,,0000,0000,0000,,they want, and so the data scientist will Dialogue: 0,1:05:19.60,1:05:22.20,Default,,0000,0000,0000,,say: hey, you go for the balanced bagging Dialogue: 0,1:05:22.20,1:05:24.88,Default,,0000,0000,0000,,model, okay? Then the data scientist saves Dialogue: 0,1:05:24.88,1:05:27.72,Default,,0000,0000,0000,,this, all right? And then once you have Dialogue: 0,1:05:27.72,1:05:30.00,Default,,0000,0000,0000,,saved this, you can then go right Dialogue: 0,1:05:30.00,1:05:32.32,Default,,0000,0000,0000,,ahead and deploy that. So you can go Dialogue: 0,1:05:32.32,1:05:33.52,Default,,0000,0000,0000,,right ahead and deploy that to Dialogue: 0,1:05:33.52,1:05:37.16,Default,,0000,0000,0000,,production, okay? And so if you want to Dialogue: 0,1:05:37.16,1:05:38.84,Default,,0000,0000,0000,,continue, we can actually further Dialogue: 0,1:05:38.84,1:05:41.12,Default,,0000,0000,0000,,continue this modeling problem. So just Dialogue: 0,1:05:41.12,1:05:43.48,Default,,0000,0000,0000,,now I modeled this problem as a binary Dialogue: 0,1:05:43.48,1:05:46.72,Default,,0000,0000,0000,,classification problem, that is, I Dialogue: 0,1:05:46.72,1:05:48.24,Default,,0000,0000,0000,,modeled this problem as a binary Dialogue: 0,1:05:48.24,1:05:49.52,Default,,0000,0000,0000,,classification, which means it's either Dialogue: 0,1:05:49.52,1:05:51.68,Default,,0000,0000,0000,,zero or one, either fail or not fail. But Dialogue: 0,1:05:51.68,1:05:53.60,Default,,0000,0000,0000,,we can also model it as a multiclass Dialogue: 0,1:05:53.60,1:05:55.64,Default,,0000,0000,0000,,classification problem, right? Because, Dialogue: 0,1:05:55.64,1:05:57.64,Default,,0000,0000,0000,,as I said earlier just now, for the Dialogue: 0,1:05:57.64,1:06:00.20,Default,,0000,0000,0000,,target variable column, which is, sorry, for Dialogue: 0,1:06:00.20,1:06:02.52,Default,,0000,0000,0000,,the failure
type column, you actually Dialogue: 0,1:06:02.52,1:06:04.84,Default,,0000,0000,0000,,have multiple kinds of failures, right? Dialogue: 0,1:06:04.84,1:06:07.56,Default,,0000,0000,0000,,For example, you may have a power failure, Dialogue: 0,1:06:07.56,1:06:10.00,Default,,0000,0000,0000,,you may have a tool wear failure, you Dialogue: 0,1:06:10.00,1:06:12.92,Default,,0000,0000,0000,,may have an overstrain failure. So now we Dialogue: 0,1:06:12.92,1:06:14.84,Default,,0000,0000,0000,,can model the problem slightly Dialogue: 0,1:06:14.84,1:06:17.24,Default,,0000,0000,0000,,differently. So we can model it as a Dialogue: 0,1:06:17.24,1:06:19.68,Default,,0000,0000,0000,,multiclass classification problem, and Dialogue: 0,1:06:19.68,1:06:21.16,Default,,0000,0000,0000,,then we go through the entire same Dialogue: 0,1:06:21.16,1:06:22.68,Default,,0000,0000,0000,,process that we went through just now. So Dialogue: 0,1:06:22.68,1:06:24.88,Default,,0000,0000,0000,,we create different models, we test this Dialogue: 0,1:06:24.88,1:06:26.72,Default,,0000,0000,0000,,out, but now the confusion matrix is for Dialogue: 0,1:06:26.72,1:06:30.12,Default,,0000,0000,0000,,a multiclass classification issue, right? Dialogue: 0,1:06:30.12,1:06:30.96,Default,,0000,0000,0000,,So we're going Dialogue: 0,1:06:30.96,1:06:34.04,Default,,0000,0000,0000,,to check them out, we're going to again Dialogue: 0,1:06:34.04,1:06:36.08,Default,,0000,0000,0000,,try different algorithms or models, Dialogue: 0,1:06:36.08,1:06:38.04,Default,,0000,0000,0000,,again train and test our dataset, do the Dialogue: 0,1:06:38.04,1:06:39.76,Default,,0000,0000,0000,,train test split on these Dialogue: 0,1:06:39.76,1:06:42.00,Default,,0000,0000,0000,,different models, all right? So we have, Dialogue: 0,1:06:42.00,1:06:43.40,Default,,0000,0000,0000,,for example, a random Dialogue: 0,1:06:43.40,1:06:46.16,Default,,0000,0000,0000,,forest, a balanced random forest, a grid search, Dialogue: 
0,1:06:46.16,1:06:47.72,Default,,0000,0000,0000,,then you train the models using what is Dialogue: 0,1:06:47.72,1:06:49.68,Default,,0000,0000,0000,,called hyperparameter tuning, then you Dialogue: 0,1:06:49.68,1:06:51.08,Default,,0000,0000,0000,,get the scores. All right, so you get the Dialogue: 0,1:06:51.08,1:06:53.16,Default,,0000,0000,0000,,same evaluation scores again, you check Dialogue: 0,1:06:53.16,1:06:54.60,Default,,0000,0000,0000,,out the evaluation scores, compare Dialogue: 0,1:06:54.60,1:06:57.08,Default,,0000,0000,0000,,between them, generate a confusion matrix, Dialogue: 0,1:06:57.08,1:06:59.96,Default,,0000,0000,0000,,so this is a multiclass confusion matrix, Dialogue: 0,1:06:59.96,1:07:02.40,Default,,0000,0000,0000,,and then you come to the final Dialogue: 0,1:07:02.40,1:07:05.76,Default,,0000,0000,0000,,conclusion. So now, if you are interested Dialogue: 0,1:07:05.76,1:07:09.00,Default,,0000,0000,0000,,in framing your problem domain as a Dialogue: 0,1:07:09.00,1:07:11.36,Default,,0000,0000,0000,,multiclass classification problem, all Dialogue: 0,1:07:11.36,1:07:13.84,Default,,0000,0000,0000,,right, then these are the recommendations Dialogue: 0,1:07:13.84,1:07:15.48,Default,,0000,0000,0000,,from the data scientist. So the data Dialogue: 0,1:07:15.48,1:07:17.24,Default,,0000,0000,0000,,scientist will say, you know what, I'm Dialogue: 0,1:07:17.24,1:07:19.56,Default,,0000,0000,0000,,going to pick this particular model, the Dialogue: 0,1:07:19.56,1:07:22.04,Default,,0000,0000,0000,,balanced bagging classifier, and these are Dialogue: 0,1:07:22.04,1:07:24.52,Default,,0000,0000,0000,,all the reasons that the data scientist Dialogue: 0,1:07:24.52,1:07:27.28,Default,,0000,0000,0000,,is going to give as a rationale for Dialogue: 0,1:07:27.28,1:07:29.40,Default,,0000,0000,0000,,selecting this particular Dialogue: 0,1:07:29.40,1:07:32.04,Default,,0000,0000,0000,,model. And then, once that's done, you save Dialogue: 0,1:07:32.04,1:07:35.00,Default,,0000,0000,0000,,the model, and
that's it, that's it, Dialogue: 0,1:07:35.00,1:07:38.92,Default,,0000,0000,0000,,so that's all done now. And so then the Dialogue: 0,1:07:38.92,1:07:41.04,Default,,0000,0000,0000,,model, the machine learning model, Dialogue: 0,1:07:41.04,1:07:43.72,Default,,0000,0000,0000,,now you can put it live, run it on the Dialogue: 0,1:07:43.72,1:07:45.28,Default,,0000,0000,0000,,server, and now the machine learning Dialogue: 0,1:07:45.28,1:07:47.20,Default,,0000,0000,0000,,model is ready to work, which means it's Dialogue: 0,1:07:47.20,1:07:48.92,Default,,0000,0000,0000,,ready to generate predictions, right? Dialogue: 0,1:07:48.92,1:07:50.28,Default,,0000,0000,0000,,That's the main job of the machine Dialogue: 0,1:07:50.28,1:07:52.04,Default,,0000,0000,0000,,learning model. You have picked the best Dialogue: 0,1:07:52.04,1:07:53.68,Default,,0000,0000,0000,,machine learning model with the best Dialogue: 0,1:07:53.68,1:07:55.80,Default,,0000,0000,0000,,evaluation metrics for whatever accuracy Dialogue: 0,1:07:55.80,1:07:57.76,Default,,0000,0000,0000,,goal you're trying to achieve, and Dialogue: 0,1:07:57.76,1:07:59.64,Default,,0000,0000,0000,,now you're going to run it on a server, Dialogue: 0,1:07:59.64,1:08:00.80,Default,,0000,0000,0000,,and now you're going to get all this Dialogue: 0,1:08:00.80,1:08:02.96,Default,,0000,0000,0000,,real-time data that's coming from your Dialogue: 0,1:08:02.96,1:08:04.52,Default,,0000,0000,0000,,sensors, you're going to pump that into Dialogue: 0,1:08:04.52,1:08:06.36,Default,,0000,0000,0000,,your machine learning model, your machine Dialogue: 0,1:08:06.36,1:08:07.88,Default,,0000,0000,0000,,learning model will pump out a whole Dialogue: 0,1:08:07.88,1:08:09.52,Default,,0000,0000,0000,,bunch of predictions, and we're going to Dialogue: 0,1:08:09.52,1:08:12.80,Default,,0000,0000,0000,,use those predictions in real time to Dialogue: 0,1:08:12.80,1:08:15.40,Default,,0000,0000,0000,,make real-time, real-world decision
0,1:08:15.40,1:08:17.56,Default,,0000,0000,0000,,making, right? You're going to say, okay, Dialogue: 0,1:08:17.56,1:08:19.60,Default,,0000,0000,0000,,I'm predicting that that machine is Dialogue: 0,1:08:19.60,1:08:23.20,Default,,0000,0000,0000,,going to fail on Thursday at 5:00 p.m., Dialogue: 0,1:08:23.20,1:08:25.52,Default,,0000,0000,0000,,so you better get your service folks in Dialogue: 0,1:08:25.52,1:08:28.64,Default,,0000,0000,0000,,to service it on Thursday at 2:00 p.m., or, you Dialogue: 0,1:08:28.64,1:08:31.64,Default,,0000,0000,0000,,know, whatever. So you can Dialogue: 0,1:08:31.64,1:08:33.48,Default,,0000,0000,0000,,make decisions on when you want to do Dialogue: 0,1:08:33.48,1:08:35.32,Default,,0000,0000,0000,,your maintenance, and make Dialogue: 0,1:08:35.32,1:08:37.64,Default,,0000,0000,0000,,the best decisions to optimize the cost Dialogue: 0,1:08:37.64,1:08:41.16,Default,,0000,0000,0000,,of maintenance, etc. And then, based on Dialogue: 0,1:08:41.16,1:08:42.12,Default,,0000,0000,0000,,the Dialogue: 0,1:08:42.12,1:08:45.00,Default,,0000,0000,0000,,results that are coming up from the Dialogue: 0,1:08:45.00,1:08:46.76,Default,,0000,0000,0000,,predictions, so the predictions may be Dialogue: 0,1:08:46.76,1:08:49.12,Default,,0000,0000,0000,,good, the predictions may be lousy, the Dialogue: 0,1:08:49.12,1:08:51.36,Default,,0000,0000,0000,,predictions may be average, right, so Dialogue: 0,1:08:51.36,1:08:53.72,Default,,0000,0000,0000,,we're constantly monitoring how good Dialogue: 0,1:08:53.72,1:08:55.44,Default,,0000,0000,0000,,or how useful the predictions are that are Dialogue: 0,1:08:55.44,1:08:57.76,Default,,0000,0000,0000,,generated by this real-time model Dialogue: 0,1:08:57.76,1:08:59.88,Default,,0000,0000,0000,,running on the server, and based on our Dialogue: 0,1:08:59.88,1:09:02.68,Default,,0000,0000,0000,,monitoring, we will then take some new Dialogue: 0,1:09:02.68,1:09:05.32,Default,,0000,0000,0000,,data and then repeat this entire
life Dialogue: 0,1:09:05.32,1:09:07.04,Default,,0000,0000,0000,,cycle again. So this is basically a Dialogue: 0,1:09:07.04,1:09:09.24,Default,,0000,0000,0000,,workflow that's iterative, and we are Dialogue: 0,1:09:09.24,1:09:11.12,Default,,0000,0000,0000,,constantly, or the data scientist is Dialogue: 0,1:09:11.12,1:09:13.32,Default,,0000,0000,0000,,constantly, getting in all these new data Dialogue: 0,1:09:13.32,1:09:15.28,Default,,0000,0000,0000,,points and then refining the model, Dialogue: 0,1:09:15.28,1:09:17.96,Default,,0000,0000,0000,,picking maybe a new model, deploying the Dialogue: 0,1:09:17.96,1:09:21.68,Default,,0000,0000,0000,,new model onto the server, and so on. All Dialogue: 0,1:09:21.68,1:09:23.92,Default,,0000,0000,0000,,right, and so that's it, so that is Dialogue: 0,1:09:23.92,1:09:26.40,Default,,0000,0000,0000,,basically your machine learning workflow Dialogue: 0,1:09:26.40,1:09:29.48,Default,,0000,0000,0000,,in a nutshell. Okay, so for this Dialogue: 0,1:09:29.48,1:09:32.08,Default,,0000,0000,0000,,particular approach we have used a bunch Dialogue: 0,1:09:32.08,1:09:34.56,Default,,0000,0000,0000,,of data science libraries from Python. Dialogue: 0,1:09:34.56,1:09:36.52,Default,,0000,0000,0000,,So we have used pandas, which is the most Dialogue: 0,1:09:36.52,1:09:38.56,Default,,0000,0000,0000,,basic data science library that Dialogue: 0,1:09:38.56,1:09:40.28,Default,,0000,0000,0000,,provides all the tools to work with raw Dialogue: 0,1:09:40.28,1:09:42.52,Default,,0000,0000,0000,,data. We have used NumPy, which is a Dialogue: 0,1:09:42.52,1:09:44.08,Default,,0000,0000,0000,,high-performance library for implementing Dialogue: 0,1:09:44.08,1:09:46.44,Default,,0000,0000,0000,,complex array and matrix operations. We have Dialogue: 0,1:09:46.44,1:09:49.56,Default,,0000,0000,0000,,used matplotlib and seaborn, which are used Dialogue: 0,1:09:49.56,1:09:52.44,Default,,0000,0000,0000,,for doing the EDA, the
0,1:09:52.44,1:09:55.56,Default,,0000,0000,0000,,exploratory data analysis phase of machine Dialogue: 0,1:09:55.56,1:09:57.04,Default,,0000,0000,0000,,learning, where you visualize all your Dialogue: 0,1:09:57.04,1:09:59.04,Default,,0000,0000,0000,,data. We have used scikit-learn, which is Dialogue: 0,1:09:59.04,1:10:01.28,Default,,0000,0000,0000,,the machine learning library to do all Dialogue: 0,1:10:01.28,1:10:02.92,Default,,0000,0000,0000,,your implementation for all your core Dialogue: 0,1:10:02.92,1:10:06.00,Default,,0000,0000,0000,,machine learning algorithms. We Dialogue: 0,1:10:06.00,1:10:08.00,Default,,0000,0000,0000,,have not used these because this is not a Dialogue: 0,1:10:08.00,1:10:11.04,Default,,0000,0000,0000,,deep learning problem, but if you are Dialogue: 0,1:10:11.04,1:10:12.80,Default,,0000,0000,0000,,working with a deep learning problem Dialogue: 0,1:10:12.80,1:10:15.36,Default,,0000,0000,0000,,like image classification, image Dialogue: 0,1:10:15.36,1:10:17.84,Default,,0000,0000,0000,,recognition, object detection, Dialogue: 0,1:10:17.84,1:10:20.20,Default,,0000,0000,0000,,natural language processing, or text Dialogue: 0,1:10:20.20,1:10:21.92,Default,,0000,0000,0000,,classification, well then you're going to Dialogue: 0,1:10:21.92,1:10:24.36,Default,,0000,0000,0000,,use these libraries from Python, which are Dialogue: 0,1:10:24.36,1:10:28.96,Default,,0000,0000,0000,,TensorFlow and also Dialogue: 0,1:10:28.96,1:10:32.68,Default,,0000,0000,0000,,PyTorch. And then lastly, that whole thing, that Dialogue: 0,1:10:32.68,1:10:34.72,Default,,0000,0000,0000,,whole data science project that you saw Dialogue: 0,1:10:34.72,1:10:36.80,Default,,0000,0000,0000,,just now, this entire data science Dialogue: 0,1:10:36.80,1:10:38.88,Default,,0000,0000,0000,,project, is actually developed in Dialogue: 0,1:10:38.88,1:10:41.08,Default,,0000,0000,0000,,something called a Jupyter notebook. So Dialogue: 0,1:10:41.08,1:10:44.04,Default,,0000,0000,0000,,all this Python
code, along with all the Dialogue: 0,1:10:44.04,1:10:46.36,Default,,0000,0000,0000,,observations from the data Dialogue: 0,1:10:46.36,1:10:48.68,Default,,0000,0000,0000,,scientist, for this entire data Dialogue: 0,1:10:48.68,1:10:50.44,Default,,0000,0000,0000,,science project, was actually run in Dialogue: 0,1:10:50.44,1:10:53.36,Default,,0000,0000,0000,,something called a Jupyter notebook. So Dialogue: 0,1:10:53.36,1:10:55.76,Default,,0000,0000,0000,,that is the Dialogue: 0,1:10:55.76,1:10:59.08,Default,,0000,0000,0000,,most widely used tool for interactively Dialogue: 0,1:10:59.08,1:11:02.36,Default,,0000,0000,0000,,developing and presenting data science Dialogue: 0,1:11:02.36,1:11:04.64,Default,,0000,0000,0000,,projects. Okay, so that brings me to the Dialogue: 0,1:11:04.64,1:11:07.40,Default,,0000,0000,0000,,end of this entire presentation. I hope Dialogue: 0,1:11:07.40,1:11:10.36,Default,,0000,0000,0000,,that you found it useful and that Dialogue: 0,1:11:10.36,1:11:13.20,Default,,0000,0000,0000,,you can appreciate the importance of Dialogue: 0,1:11:13.20,1:11:15.28,Default,,0000,0000,0000,,machine learning and how it can be Dialogue: 0,1:11:15.28,1:11:19.80,Default,,0000,0000,0000,,applied in a real-life use case in a Dialogue: 0,1:11:19.80,1:11:23.36,Default,,0000,0000,0000,,typical production environment. All right, Dialogue: 0,1:11:23.36,1:11:27.24,Default,,0000,0000,0000,,thank you all so much for watching.
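[Editor's note] The workflow described in this section (train-test split, grid-search hyperparameter tuning, a multiclass confusion matrix for the failure types, then saving the model and reloading it for deployment) can be sketched in a few lines of scikit-learn. This is an illustrative sketch only, not the presenter's actual notebook: the dataset below is synthetic, the four failure classes and the file name `failure_model.joblib` are made up for the example, and a plain random forest stands in for the balanced bagging classifier chosen in the talk.

```python
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Synthetic sensor readings and a multiclass target, as in the talk:
# 0 = no failure, 1 = power failure, 2 = tool wear failure, 3 = overstrain failure
X = rng.normal(size=(600, 4))
y = rng.integers(0, 4, size=600)

# The train-test split mentioned in the talk.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameter tuning via grid search over a small parameter grid.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
search.fit(X_train, y_train)

# Multiclass confusion matrix: one row/column per failure type.
cm = confusion_matrix(y_test, search.predict(X_test))
print(cm.shape)  # (4, 4)

# "Save the model" and "deploy": persist to disk, reload on the server,
# and generate a prediction for a fresh incoming sensor reading.
joblib.dump(search.best_estimator_, "failure_model.joblib")
model = joblib.load("failure_model.joblib")
new_reading = rng.normal(size=(1, 4))  # one new real-time sample
prediction = model.predict(new_reading)
```

In production, `new_reading` would come from the live sensor feed, and the monitoring-and-retraining loop the presenter describes would periodically refit this pipeline on newly collected data.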