Checking CPU usage

2025. 1. 7. 23:07 · Uncategorized

1. Using PyTorch Profiler

import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, ProfilerActivity

# Model definition
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# DDP initialization (launch with torchrun so that RANK, WORLD_SIZE and LOCAL_RANK are set)
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # each process drives its own GPU
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
model = SimpleModel().to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Data
inputs = torch.randn(32, 10).to(device)

# Using PyTorch Profiler
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')) as prof:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = torch.nn.functional.mse_loss(outputs, torch.randn(32, 1).to(device))
    loss.backward()
    optimizer.step()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

2. Tracking CPU usage with psutil

import psutil
import os

def monitor_cpu():
    pid = os.getpid()  # current process ID
    process = psutil.Process(pid)
    # cpu_percent(interval=1.0) blocks for one second while it measures
    print(f"CPU usage: {process.cpu_percent(interval=1.0)}%")
    print(f"Memory usage: {process.memory_info().rss / 1024 ** 2:.2f} MB")

# Call inside the training loop
for epoch in range(10):
    monitor_cpu()
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(inputs)
    loss = torch.nn.functional.mse_loss(outputs, torch.randn(32, 1).to(device))
    loss.backward()
    optimizer.step()
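
Note that process.cpu_percent(interval=1.0) sleeps for the whole interval, so the loop spends an extra second per epoch just measuring. As a rough alternative, the sketch below measures per-process CPU usage without blocking, using only the standard library instead of psutil. CpuMeter and sample are names made up for this example. Like psutil's per-process figure, the value can exceed 100% when several threads are busy, and it does not include child processes such as DataLoader workers.

```python
import time


class CpuMeter:
    """Non-blocking CPU usage meter for the current process (hypothetical helper)."""

    def __init__(self):
        self._wall = time.perf_counter()
        self._cpu = time.process_time()

    def sample(self):
        """Return CPU usage in percent since the previous call (or since creation)."""
        wall, cpu = time.perf_counter(), time.process_time()
        elapsed = wall - self._wall
        used = cpu - self._cpu
        self._wall, self._cpu = wall, cpu
        # Guard against a zero-length interval between two back-to-back calls.
        return 100.0 * used / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    meter = CpuMeter()
    sum(i * i for i in range(200_000))  # stand-in for one training step
    print(f"CPU usage: {meter.sample():.1f}%")
```

Creating the meter once before the loop and calling sample() at the end of each epoch gives the average usage over that epoch.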

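3. Examining DDP communication bottlenecks

PyTorch Profiler can also show how much time goes into DDP communication (for example all_reduce). With ProfilerActivity.CUDA enabled, the NCCL collectives that DDP launches during backward are recorded alongside the other GPU kernels, so no separate DDP option is needed.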

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
             record_shapes=True,
             with_stack=True) as prof:
    for epoch in range(10):
        optimizer.zero_grad()  # clear gradients left over from the previous step
        outputs = model(inputs)
        loss = torch.nn.functional.mse_loss(outputs, torch.randn(32, 1).to(device))
        loss.backward()
        optimizer.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
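
To see how much of that table is communication, the averaged events can be filtered by name. This is a minimal sketch that assumes NCCL collectives show up with "nccl" somewhere in the event name (for example nccl:all_reduce); the exact names may differ between PyTorch and NCCL versions. nccl_events is a hypothetical helper, and it relies only on each event having a key attribute, as the entries returned by prof.key_averages() do.

```python
from types import SimpleNamespace


def nccl_events(averages):
    """Keep only events whose name mentions NCCL (assumed naming convention)."""
    return [evt for evt in averages if "nccl" in evt.key.lower()]


if __name__ == "__main__":
    # Stand-in objects so the sketch runs without a GPU; in the profiling
    # script, pass prof.key_averages() instead.
    fake = [
        SimpleNamespace(key="aten::addmm"),
        SimpleNamespace(key="nccl:all_reduce"),
        SimpleNamespace(key="aten::mse_loss"),
    ]
    for evt in nccl_events(fake):
        print(evt.key)
```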

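NCCL timeout errors commonly occur in Distributed Data Parallel (DDP) training when a bottleneck arises in GPU-to-GPU communication or in data preparation. In particular, when the process driving one GPU competes with another GPU's process for the same CPU cores, some ranks fall behind, and the ranks waiting for them in a collective can eventually hit the NCCL timeout.

The following are ways to separate CPU resource usage between GPUs.

1. Assign separate CPU cores to each GPU process

(1) Using `set_affinity` with DDP

Pinning each GPU process to specific CPU cores (CPU affinity) keeps the processes from contending for the same cores.
Python's os.sched_setaffinity (available on Linux) assigns CPU cores to a process explicitly.

Code example: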

import os
import torch
import torch.distributed as dist

def set_affinity(local_rank):
    cpu_count = os.cpu_count()
    cpu_per_gpu = cpu_count // torch.cuda.device_count()
    start_cpu = local_rank * cpu_per_gpu
    end_cpu = start_cpu + cpu_per_gpu
    os.sched_setaffinity(0, list(range(start_cpu, end_cpu)))  # pid 0 means the current process

# DDP initialization
dist.init_process_group(backend='nccl')
# Use the node-local rank: on multi-node jobs the global rank from
# dist.get_rank() can exceed the number of GPUs on this machine.
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Set CPU affinity
set_affinity(local_rank)

# Model training code
model = ...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
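
The core arithmetic in set_affinity can be checked on its own, without GPUs. The sketch below is a variation under two assumptions: it starts from the set of CPUs the process may actually use rather than os.cpu_count() (inside a container or under taskset the two can differ), and it hands any leftover cores to the first ranks. cpus_for_rank is a hypothetical name, not a PyTorch function.

```python
def cpus_for_rank(local_rank, num_gpus, available_cpus):
    """Split available_cpus into num_gpus contiguous chunks and return one chunk."""
    cpus = sorted(available_cpus)
    base, extra = divmod(len(cpus), num_gpus)
    # The first `extra` ranks receive one additional core each.
    start = local_rank * base + min(local_rank, extra)
    size = base + (1 if local_rank < extra else 0)
    return cpus[start:start + size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, cpus_for_rank(rank, 4, range(16)))
```

On Linux, os.sched_getaffinity(0) can serve as available_cpus, and the returned list can be passed to os.sched_setaffinity(0, ...).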

(2) Controlling CPU cores with `taskset`

Prefixing the launch command with taskset restricts each GPU process to specific CPU cores.

Launch example:

taskset -c 0-3 python train.py --rank 0 &
taskset -c 4-7 python train.py --rank 1 &
taskset -c 8-11 python train.py --rank 2 &
taskset -c 12-15 python train.py --rank 3 &

Here taskset restricts the process for GPU 0 to CPU cores 0-3, the process for GPU 1 to cores 4-7, and so on.
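
Whether the restriction took effect can be confirmed from inside the process. A small sketch, assuming Linux (os.sched_getaffinity is not available on macOS or Windows); allowed_cpus is a hypothetical helper name.

```python
import os


def allowed_cpus():
    """Return the sorted CPU ids this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))  # 0 means the calling process
    return None


if __name__ == "__main__":
    print(f"pid {os.getpid()} may run on CPUs: {allowed_cpus()}")
```

Called at the top of train.py, this should report only the cores given to taskset for that process.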

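2. Separating data loader processes

In DDP, the data preparation stage (data loading, preprocessing, and so on) consumes a lot of CPU. If one process's data loader overuses certain CPU cores, it can collide with the processes driving the other GPUs.

(1) Adjusting `num_workers` and `worker_init_fn`

Set num_workers in torch.utils.data.DataLoader per GPU process, and pin the workers to CPU cores with worker_init_fn.

Code example: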

import os

import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # Pin this worker process to the CPU cores reserved for its GPU.
    # (Workers also inherit the affinity of the process that created them.)
    gpu_rank = int(os.environ["LOCAL_RANK"])  # GPU ID on this node
    num_cpus = os.cpu_count()
    cpu_per_gpu = num_cpus // torch.cuda.device_count()
    base_cpu = gpu_rank * cpu_per_gpu
    os.sched_setaffinity(0, list(range(base_cpu, base_cpu + cpu_per_gpu)))

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,
    worker_init_fn=worker_init_fn
)
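
A fixed num_workers=4 only fits if each GPU process really has that many cores to itself. One possible rule of thumb, shown as a sketch: divide the cores evenly between the GPU processes and keep one core per process for the training loop. The reserve of one core and the name workers_per_rank are assumptions for illustration, not a PyTorch recommendation.

```python
def workers_per_rank(total_cpus, num_gpus, reserved_for_main=1):
    """Suggest a DataLoader num_workers value for each GPU process."""
    cpus_per_gpu = total_cpus // num_gpus
    # Never go below 0: num_workers=0 means loading happens in the main process.
    return max(0, cpus_per_gpu - reserved_for_main)


if __name__ == "__main__":
    print(workers_per_rank(total_cpus=16, num_gpus=4))
```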

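(2) Setting `prefetch_factor`

DataLoader's prefetch_factor sets how many batches each worker loads in advance, which limits how much data the CPU worker processes prepare ahead of time.
Problem:
- One GPU process preloads too much data and collides with the other GPUs.

Solution:

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    prefetch_factor=2,  # batches loaded in advance per worker; defaults to 2 when num_workers > 0
    pin_memory=True
)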
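
Because the setting applies per worker, the number of batches held ready by one GPU process is roughly num_workers × prefetch_factor. The sketch below turns that into a memory estimate; prefetch_budget and the per-sample size are made-up values for illustration.

```python
def prefetch_budget(num_workers, prefetch_factor, batch_size, bytes_per_sample):
    """Return (batches in flight, approximate bytes held) for one DataLoader."""
    batches = num_workers * prefetch_factor
    return batches, batches * batch_size * bytes_per_sample


if __name__ == "__main__":
    # 10 float32 features per sample, matching the toy inputs used earlier.
    batches, nbytes = prefetch_budget(4, 2, 32, 10 * 4)
    print(f"{batches} batches, about {nbytes / 1024:.1f} KiB")
```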

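3. Adjusting the NCCL communication timeout

NCCL timeouts can also be caused by slow GPU-to-GPU communication. The following can mitigate this.

(1) Setting NCCL environment variables

Raise the communication timeout so that collectives that take longer do not fail. In PyTorch this limit is the timeout argument of dist.init_process_group (a datetime.timedelta), and its documented default is in minutes (10 minutes for NCCL in recent releases), not 30 seconds. NCCL_TIMEOUT is not among the environment variables documented for NCCL, so it is left out of the list below.

Set the environment variables before running:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0   # network interface to use for communication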
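
On the PyTorch side, the timeout for collectives is passed to dist.init_process_group as a datetime.timedelta. The sketch below only builds the arguments, so it runs without GPUs; ddp_init_kwargs is a hypothetical helper, and 180 seconds is an arbitrary example value rather than a recommendation.

```python
from datetime import timedelta


def ddp_init_kwargs(timeout_seconds=180, backend="nccl"):
    """Build keyword arguments for torch.distributed.init_process_group."""
    return {"backend": backend, "timeout": timedelta(seconds=timeout_seconds)}


if __name__ == "__main__":
    kwargs = ddp_init_kwargs()
    print(kwargs)
    # In the training script (launched with torchrun):
    #   import torch.distributed as dist
    #   dist.init_process_group(**kwargs)
```

A longer timeout only hides a slow rank; it is worth pairing it with the CPU and data loading checks above.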

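(2) Checking NCCL debug information

With NCCL_DEBUG=INFO, NCCL logs its communication state, which helps locate the cause of a bottleneck.

4. Dataset partitioning and loading strategy

In a DDP setup, the dataset loading strategy has to be configured appropriately for each GPU.

(1) Using DistributedSampler

Use torch.utils.data.distributed.DistributedSampler so that each GPU processes a different part of the dataset.

Code example: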

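import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# num_replicas and rank are job-wide values (world size and global rank)
sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=4, pin_memory=True)
# Call sampler.set_epoch(epoch) at the start of each epoch so the shuffle order changes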
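
To make the split concrete, the sketch below mimics in plain Python how DistributedSampler assigns indices when shuffling is off and drop_last is False: the index list is padded by wrapping around to the first indices until it divides evenly, then each rank takes every num_replicas-th index. shard_indices is a hypothetical stand-in for illustration, not the library code.

```python
import math
from itertools import cycle, islice


def shard_indices(dataset_len, num_replicas, rank):
    """Indices one rank would see (no shuffling, drop_last=False)."""
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    # Pad by wrapping around to the start so every rank gets the same count.
    padded = list(islice(cycle(range(dataset_len)), total))
    return padded[rank:total:num_replicas]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank))
```

Because of the padding, a few samples are seen twice per epoch when the dataset size is not divisible by the number of processes.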

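(2) Optimizing data transfer

Use pin_memory=True so that batches are placed in page-locked host memory, which speeds up CPU-to-GPU transfers.
When data is loaded from local disk, check data locality to avoid I/O bottlenecks.

5. Debugging and testing

To pin down the cause of a problem, debug with a small dataset and a short training loop.

Analyze per-GPU and CPU usage with PyTorch Profiler: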

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')) as prof:
    outputs = model(inputs)
    # Same toy loss as in the earlier examples
    loss = torch.nn.functional.mse_loss(outputs, torch.randn(32, 1).to(device))
    loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total"))

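Summary

CPU resource separation:
- Pin each GPU process to fixed CPU cores with os.sched_setaffinity or taskset.
- Separate data loader CPU usage with worker_init_fn.
NCCL settings:
- Increase the collective timeout (the timeout argument of dist.init_process_group).
- Use NCCL_DEBUG=INFO to find the cause of a problem.
Data loading optimization:
- Split the dataset across GPUs with DistributedSampler.
- Tune num_workers and prefetch_factor.
Debugging:
- Use PyTorch Profiler or the NCCL debug logs.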

jaeha-lee
