04-800-K   AIOps: Continuous and Automated IT and AI Monitoring

Location: Africa

Units: 12

Semester Offered: Fall

Course description

This course builds on and integrates software engineering, AI/ML, and IT skills in support of automated methods for assuring the highest level of system availability and resilience.  Students will apply tools including Docker, Kubernetes, and AI-based models for anomaly detection to monitor and correct hybrid cloud applications as they experience simulated disruptions and outages.  Lab exercises will deploy these monitoring tools in continuous integration/delivery pipelines to enable proactive control, as opposed to traditional reactive and manual incident response.

The course will focus on automated monitoring of both IT components (e.g., code-based implementations of a distributed application) and AI/ML components (e.g., models and their associated pre- and post-deployment pipelines).

The course will review basic concepts of DevOps including Docker, CI/CD pipelines, and the microservices architectures used in hybrid cloud deployments. This background provides preparation for deep dives into real-time operational data gathering and the automated tools available for anomaly detection, reporting, and even self-correction.

For IT monitoring, the course will use multiple types of performance data, including structured metrics such as CPU, memory, and network utilization, as well as emerging methods for analyzing unstructured content such as textual information in application logs.
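As a rough illustration of anomaly detection on a structured metric, the sketch below flags CPU samples that deviate sharply from the preceding window. It is only a minimal example under stated assumptions: the function name `zscore_anomalies`, the window size, the threshold, and the sample values are all hypothetical, and a rolling z-score stands in for whatever detection models the course actually uses.

```python
from statistics import mean, stdev


def zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that lie more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)  # sample standard deviation of the window
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization readings (percent), one per scrape interval.
cpu_percent = [41.0, 43.0, 40.0, 42.0, 44.0, 42.0, 41.0, 95.0, 43.0, 42.0]
print(zscore_anomalies(cpu_percent))  # prints [7]
```

Only the spike at index 7 is flagged. The samples after it are not, because the spike inflates the spread of the windows that contain it, which is one known weakness of this simple approach.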

For AI monitoring, extensions of DevOps methods for AI will introduce ModelOps and its application to multiple types of quality metrics for automated model monitoring.  Model accuracy, precision, recall, and bias will be evaluated at initial deployment and tracked for post-deployment drift over time, so that violations of quality standards can be predicted.
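To make the idea of tracking model quality concrete, the following sketch computes precision and recall for a binary classifier at deployment time and again later, then checks whether either metric has dropped by more than a tolerance. It is only an illustration: the function names `precision_recall` and `has_drifted`, the labels, and the 0.05 tolerance are assumptions, not course material.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def has_drifted(baseline, current, tolerance=0.05):
    """Report drift when a metric falls more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


# Hypothetical ground truth and predictions at deployment and some weeks later.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
pred_at_deploy = [1, 0, 1, 1, 0, 1, 0, 1]
pred_later = [1, 0, 0, 1, 1, 0, 0, 1]

base_p, base_r = precision_recall(y_true, pred_at_deploy)
cur_p, cur_r = precision_recall(y_true, pred_later)
print(base_p, base_r)  # prints 0.8 1.0
print(cur_p, cur_r)    # prints 0.5 0.5
print(has_drifted(base_p, cur_p), has_drifted(base_r, cur_r))  # prints True True
```

In practice the later labels arrive with a delay, so a monitoring pipeline would typically recompute such metrics on a schedule as ground truth becomes available.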

Hands-on lab work will include tools for configuring and executing pipelines for continuous integration and delivery, deploying applications as microservices with embedded monitoring instrumentation, dashboards for collecting performance data, and multiple methods for real-time tracking of such data with automated anomaly detection and repair.

The course consists of weekly hands-on assignments as well as a final project that integrates the methods covered in the class.

Learning objectives

In this course, we will:

  • Discuss the new role of Site Reliability Engineer (SRE) and the motivation for its introduction.
  • Understand and use CI/CD tools to configure and run pipelines for a sample project.
  • Deploy applications into a microservices orchestration platform based on the de-facto standard Kubernetes runtime.
  • Understand the methods available for collecting real-time performance data for services and models.
  • Acquire techniques for identifying deviations in performance of both IT and AI/ML-based components.

Outcomes

By the end of this course, you will be better able to:

  • Understand the need for automated methods in maintaining a high level of service availability
  • Configure and deploy CI/CD pipelines 
  • Deploy and manage containerized components and models in a microservices runtime
  • Instrument components and models for real-time data collection for analysis and visualization
  • Apply AI methods for IT component anomaly detection
  • Apply AI methods for AI model drift detection

Content details

The class is taught through weekly lectures and assignments according to this general schedule:

  • Month 1: Background and practice with CI/CD, Docker, and Kubernetes
  • Month 2: Automated methods for IT data collection and anomaly detection
  • Month 3: Extensions of DevOps for ModelOps: model metric data collection and automated drift analysis

Grading is based on written assignments, a final portfolio of work, participation, and attendance.

Prerequisites

Strong background in Python programming and exposure to DevOps and cloud technologies such as Docker and microservices.

Faculty

Charles Wiecha