Evolution of Automated Operations Systems
The construction of an automated operations system is a continuous evolutionary process: its architecture and functionality iterate alongside technological advances and business needs. The following outlines its development history.
Generation Zero: Information Display System
The initial system was a simple, small-scale application consisting of a PHP frontend and Python-based Agents. Its core function was to receive custom information from various hosts, categorize it by host IP, and display it. The displayed content included hardware information, OS details, and system status. While basic, this version laid a crucial foundation for subsequent system development.
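The agent side of such a display system can be sketched in a few lines of Python. This is a minimal illustration, not the original code; the payload field names (ip, info) are assumptions:

```python
import json
import platform
import socket

def collect_host_info():
    """Gather basic OS and host details for the display frontend."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
    }

def build_report(host_ip):
    """Wrap collected info in a JSON payload keyed by the host IP,
    matching the frontend's categorize-by-IP display."""
    return json.dumps({"ip": host_ip, "info": collect_host_info()})
```

In practice each agent would POST this payload to the PHP frontend on a schedule.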
First Generation: Centralized System Based on SSH2 and Socket
This generation was the first version developed collaboratively in Java. Its core design: a Socket server received encrypted XML data delivered by Agents and stored it in a dedicated temporary table in an Oracle database. Scheduled batch jobs then merged this temporary data into the corresponding business tables, completing data integration and persistence.
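The temp-table-plus-batch-merge pattern can be sketched as follows, using SQLite in place of Oracle and unencrypted XML for brevity; the table and field names are illustrative:

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_host_report (ip TEXT, os TEXT)")
conn.execute("CREATE TABLE host (ip TEXT PRIMARY KEY, os TEXT)")

def receive(xml_payload):
    """Socket-server role: parse agent XML and stage it in the temp table."""
    root = ET.fromstring(xml_payload)
    conn.execute("INSERT INTO tmp_host_report VALUES (?, ?)",
                 (root.findtext("ip"), root.findtext("os")))

def batch_merge():
    """Scheduled-job role: move staged rows into the business table."""
    conn.execute("""INSERT OR REPLACE INTO host
                    SELECT ip, os FROM tmp_host_report""")
    conn.execute("DELETE FROM tmp_host_report")

receive("<report><ip>10.0.0.1</ip><os>Linux</os></report>")
batch_merge()
```

Staging into a temp table keeps the receive path fast; the merge job does the heavier integration work off the hot path.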
Second Generation: Distributed Architecture with Buffering and Message Queues
To address performance bottlenecks and high concurrency pressure, the second generation introduced major innovations:
- Buffering Layer: Introduced Redis, leveraging its in-memory key-value store to buffer high-concurrency write requests, significantly improving system throughput.
- Communication Framework: Replaced the original NIO Socket server with a ZeroMQ (ZMQ) message bus, effectively solving server high-load and connection management issues, making communication more efficient and decoupled.
- Task Scheduling: Integrated the Quartz framework to manage complex scheduled tasks.
- Process Management: Introduced the jBPM workflow engine to support automated orchestration of operational processes.
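The buffering layer's producer/consumer pattern can be sketched as below. An in-memory deque stands in for a Redis list (the real system would use LPUSH from producers and RPOP or BRPOP from a consumer worker); record fields are illustrative:

```python
from collections import deque

# Stand-in for a Redis list acting as a write buffer.
buffer = deque()

def producer_write(record):
    """High-concurrency writers push to the buffer instead of the database."""
    buffer.appendleft(record)

def consumer_drain(batch_size=100):
    """A background worker drains the buffer in FIFO order and
    writes to the database in batches."""
    batch = []
    while buffer and len(batch) < batch_size:
        batch.append(buffer.pop())
    return batch  # in the real system: one bulk INSERT per batch

for i in range(5):
    producer_write({"host": f"10.0.0.{i}", "cpu": i * 10})
batch = consumer_drain()
```

Decoupling writers from the database this way is what absorbs concurrency spikes: the buffer takes the burst, and the consumer writes at a steady rate.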
Third Generation (Planned): Resource Information Decoupling with LDAP
The planned next-generation architecture aims to use LDAP (Lightweight Directory Access Protocol) to store and manage basic host information (e.g., IP, groups, attributes), decoupling host information queries from the core business database. This improves query flexibility and efficiency and facilitates integration with other systems (e.g., authentication systems).
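One benefit of LDAP-backed host information is composable query filters. A sketch of building RFC 4515 search filters in plain Python follows; the attribute names (hostGroup, osType) are assumptions, not from the original system:

```python
def ldap_host_filter(group=None, os_type=None):
    """Compose an RFC 4515 LDAP search filter for host entries.
    Attribute names here are illustrative."""
    clauses = ["(objectClass=device)"]
    if group:
        clauses.append(f"(hostGroup={group})")
    if os_type:
        clauses.append(f"(osType={os_type})")
    if len(clauses) == 1:
        return clauses[0]
    # AND the clauses together in LDAP prefix notation.
    return "(&" + "".join(clauses) + ")"
```

Such a filter would be passed to an LDAP client's search call, letting other systems (for example, authentication) query host data without touching the business database.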
Functional Module Division of an Automated Operations System
A complete automated operations platform typically includes the following major functional modules:
1. Common Base Platform
- Permission System: Based on a User(Group)-Role-Permission model, enabling fine-grained control over system modules (menus), specific functions, and access URLs.
- System Data Dictionary: Centralized management of basic data like business codes and status codes.
- Notification Service: Integrates internal messaging, email sending, and real-time alerts.
- File Service: Provides unified interfaces for attachment upload, download, and management.
- Scheduled Task Component: Supports distributed, highly available task scheduling.
- Workflow Component: Supports visual process design and execution.
- System Logging: Comprehensive audit logs for user logins and key operations.
- Third-party Integration (Optional): e.g., Single Sign-On (CAS), financial software interfaces, external alerting platforms, big data analysis systems.
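The User(Group)-Role-Permission model in the permission system above can be sketched with plain sets; the role and permission names are illustrative:

```python
# Role -> set of granted permissions (menus, URLs, functions).
role_permissions = {
    "admin": {"menu:hosts", "url:/hosts/delete", "func:reboot"},
    "viewer": {"menu:hosts"},
}

# User -> set of roles (a group would map to roles the same way).
user_roles = {"alice": {"admin"}, "bob": {"viewer"}}

def has_permission(user, permission):
    """A user is allowed if any of their roles grants the permission."""
    return any(permission in role_permissions.get(role, set())
               for role in user_roles.get(user, set()))
```

Keeping permissions attached to roles rather than users is what makes the control fine-grained yet manageable: changing a role updates every user holding it.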
2. Resource Repository Management
- Infrastructure: Manages physical location information for data centers and racks.
- Network Resources: Manages IP address pools, defines network alert thresholds, monitors switch traffic.
- Asset Information: Manages physical server hardware asset information.
- Logical Hosts: Handles host registration, heartbeat detection, group management, and displays system status in charts.
- Data Pipeline:
- Maps information delivered by Agents to corresponding database records.
- Maintains a dictionary of data parsing rules.
- Caches the latest delivery information in Redis for fast queries.
- Records historical changes to resources.
- Related Scheduled Tasks: e.g., periodic synchronization, data cleanup.
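The data pipeline's rule-driven mapping and latest-delivery cache can be sketched as follows. A plain dict stands in for Redis, and the field, column, and rule names are illustrative:

```python
# Parsing-rule dictionary: agent field -> (db column, converter).
parse_rules = {
    "mem_kb": ("memory_mb", lambda v: int(v) // 1024),
    "os": ("os_name", str),
}

# Stand-in for the Redis cache of each host's latest delivery.
latest_cache = {}

def ingest(host_ip, raw):
    """Map an agent delivery to DB-ready columns per the rule
    dictionary, and cache the parsed result for fast queries."""
    record = {col: conv(raw[field])
              for field, (col, conv) in parse_rules.items()
              if field in raw}
    latest_cache[host_ip] = record
    return record
```

Keeping the parsing rules in a dictionary means new agent fields can be mapped to database columns without changing pipeline code.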
3. Remote Control and Configuration Management
- Remote Operations: Secure remote command-line or desktop control based on SSH (Linux/Unix) or RDP (Windows) protocols.
- Task Management: Supports publishing remote scheduled tasks, tracking execution status, and receiving results.
- Configuration Management:
- Defines standardized templates for service configurations.
- Manages software repositories or plugin libraries and interfaces with distribution services.
- Maintains lists of various business nodes and their associated configuration files.
- Implements automatic configuration file generation and batch distribution.
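Automatic configuration generation from a standardized template can be sketched with the stdlib string.Template; the template text and node fields are illustrative:

```python
from string import Template

# A standardized service-configuration template with placeholders.
nginx_tpl = Template("server_name $node_name;\nlisten $port;")

# The maintained list of business nodes and their parameters.
nodes = [
    {"node_name": "web-01", "port": 80},
    {"node_name": "web-02", "port": 8080},
]

def render_all(template, node_list):
    """Generate one configuration file body per business node."""
    return {n["node_name"]: template.substitute(n) for n in node_list}

configs = render_all(nginx_tpl, nodes)
# Each rendered body would then be distributed to its node in a batch,
# for example over SSH/SFTP.
```

Separating the template from the node list is what enables batch distribution: one template change regenerates every node's file consistently.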
4. Information Collection and Analysis
This module is responsible for continuously collecting performance metrics, logs, and events from hosts and network devices, performing centralized storage, processing, and analysis. Its output supports real-time monitoring, capacity planning, troubleshooting, and performance optimization, serving as vital data for operational decision-making.
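A minimal sketch of the real-time monitoring side of this module: a sliding window of samples per host with a threshold alert. The window size and threshold are illustrative assumptions:

```python
from collections import defaultdict, deque

WINDOW = 5          # samples retained per host
CPU_ALERT = 90.0    # illustrative alert threshold (percent)

# Host -> bounded deque of recent CPU samples.
samples = defaultdict(lambda: deque(maxlen=WINDOW))

def record_metric(host, cpu_pct):
    """Store a sample and flag an alert when the windowed average
    crosses the threshold."""
    window = samples[host]
    window.append(cpu_pct)
    avg = sum(window) / len(window)
    return avg >= CPU_ALERT
```

The same windowed data feeds the other uses the module serves: retained history supports capacity planning and troubleshooting, not just alerting.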
This article references an external technical blog. Original link: http://my.oschina.net/yygh/blog/119408