WHEN TWO HEALTH CHECKS SEE DIFFERENT GPU STATES: AN INVESTIGATION, FIX, AND REFLECTION ON AN ARCHITECTURE PROBLEM
This article records a GPU health-check inconsistency in which the Passive Health Check and dcgm-exporter saw different GPU states on the same node. The root cause was that the two containers each ran their own embedded nv-hostengine, and the fix was to deploy a node-level Host Engine with internalTrafficPolicy: Local so every component shared the same state view and control entry point.
UNDERSTANDING DCGM: THE GPU MANAGEMENT STACK FROM NVML TO HOSTENGINE
This article explains how NVML, NVIDIA-SMI, DCGM, HostEngine, DCGMI, and DCGM Exporter fit together, and why device-level APIs are no longer enough once GPU management moves from a single machine to a cluster-scale observability and operations model.
WHY NCCL TESTS MATTER
Many engineers treat NCCL Tests as a simple GPU communication benchmark, but the suite is more useful as a diagnostic tool for GPU cluster communication paths. By isolating communication from model compute, it exposes NCCL timeouts, RDMA misconfiguration, node-to-node connectivity issues, bandwidth degradation, and network instability much more directly.
UNDERSTANDING THE CORE MECHANISMS OF PROMETHEUS IN ONE ARTICLE
I recently joined a NeoCloud company as an Observability Engineer and had the opportunity to help build a monitoring system from scratch, which led me to study Prometheus systematically. This article is aimed at readers who have a foundational understanding of K8s but are encountering Prometheus for the first time. It explains core concepts such as the Pull model, time series storage, and the Exporter mechanism, helping you build a complete mental model before you start using it in earnest.
2025 RECAP
In the blink of an eye, 2025 has passed. It’s a wrap! 🎉
A PRACTICAL GUIDE TO YOUR JOB SEARCH
In Finland, 90% of recruitment happens quietly. Drawing on my two successful overseas internship searches, this article shows how to bypass traditional mass applications and find opportunities through internal referrals, school resources, and technical networking.
DEMYSTIFYING MACHINE LEARNING: CLASSIFICATION FRAMEWORKS, AMORTIZATION THINKING, AND BUSINESS ORIENTATION
I recently finished a machine learning course (cs-c3240) and finally have some time to reflect on it. This was my first formal exposure to machine learning, and as a newbie, I gained a lot from the course. I am very grateful for the effort put in by the professor and teaching assistants, and I would also like to thank my classmates: thanks to our peer review system, their projects and their feedback on our assignments gave me a deeper understanding of machine learning.
UNDERSTANDING THE LINK LAYER IN SIMPLE TERMS: GRASPING THE NETWORK'S FIRST STEP THROUGH "MEDIA ACCESS"
Recently, while reorganizing my notes on computer networks, I realized that the link layer, often dismissed as ‘too low-level’, actually raises many fundamental questions. Reviewing MAC protocols in particular reminded me of the confusion I felt when I first learned about CSMA/CD. It also led me to ponder why we still study these ‘technologies supposedly made obsolete by switches’ today.
TRANSPARENCY IN DISTRIBUTED SYSTEMS
Transparency in distributed systems conceals underlying complexity, which improves the user experience but makes management and troubleshooting harder. Consistency algorithms, sound architecture, and observability are needed to keep that trade-off manageable.
A COMPREHENSIVE GUIDE FOR AWS ELASTIC LOAD BALANCER
AWS Elastic Load Balancer offers ALB, NLB, and GWLB for web traffic, high-throughput connections, and security inspection, respectively. It improves system availability, scalability, and security, making it essential for traffic management, performance optimization, and multi-AZ deployments. Choosing the right load balancer helps keep a cloud architecture stable, efficient, and resilient.