Feature Request: Use Aperf as a monitoring tool for EC2 hosts #389

@molenin-moodys

TL;DR

I want to replace some metrics currently produced by the CloudWatch Agent with aperf records stored in S3. I’m not a performance-optimization expert, so I may be missing some common best practices.

Aperf seems to be designed mainly for manual performance analysis, so it lacks some features I expected. I also couldn’t find certain data that’s important for my intended use cases.

Proposed usage

Today, we rely on the CloudWatch Agent for host-level performance metrics. I’m exploring whether aperf could be used instead to collect more detailed, low-level performance data and store it in S3 for later analysis. This would allow us to have fine-grained visibility into host behavior, without being limited to the 1-minute or 1-hour granularity of CloudWatch metrics.

My goal is to collect performance records for the entire lifetime of each EC2 host running in AWS ECS.

Because I don’t control when a host is terminated, my current idea is:

  • Run a cron job every minute on each host.
  • Each run:
    • Records performance data for 60 seconds using aperf.
    • Uploads the resulting record to S3.
  • This should minimize gaps in host-level performance history and make sure we don’t lose data when instances are terminated unexpectedly.
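A minimal sketch of what each cron run could do, written in Python. The `aperf record` flags (`--run-name`, `--period`, `--interval`) and the `<run-name>.tar.gz` output name are my assumptions and should be checked against `aperf record --help`; the bucket layout and the helper names are hypothetical.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_name(host_id: str, start: datetime) -> str:
    """Name a run after the host and its UTC start time so records sort chronologically."""
    return f"{host_id}-{start.strftime('%Y%m%dT%H%M%SZ')}"


def record_command(name: str, period_s: int = 60, interval_s: int = 1) -> list[str]:
    # Assumed aperf flags; verify them with `aperf record --help`.
    return [
        "aperf", "record",
        "--run-name", name,
        "--period", str(period_s),
        "--interval", str(interval_s),
    ]


def upload_command(archive: Path, bucket: str, host_id: str) -> list[str]:
    # Hypothetical key layout: one S3 prefix per host keeps its whole lifetime together.
    key = f"aperf/{host_id}/{archive.name}"
    return ["aws", "s3", "cp", str(archive), f"s3://{bucket}/{key}"]


def record_and_upload(host_id: str, bucket: str, workdir: Path) -> None:
    """Record one period with aperf, upload the archive, then delete the local copy."""
    name = run_name(host_id, datetime.now(timezone.utc))
    subprocess.run(record_command(name), cwd=workdir, check=True)
    # Assumption: aperf leaves an archive named <run-name>.tar.gz in the working directory.
    archive = workdir / f"{name}.tar.gz"
    subprocess.run(upload_command(archive, bucket, host_id), check=True)
    archive.unlink()
```

The cron entry would call something like `record_and_upload("<instance-id>", "<bucket>", Path("/var/tmp/aperf"))`. Because each run takes 60 seconds plus the upload time, consecutive runs started every minute would overlap slightly, which is one reason I care about merging overlapping records.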

Questions / concerns

  1. Is this usage pattern aligned with how aperf is intended to be used?
    If not, could you please share any concerns you may have?
  2. Are there any known limitations when using aperf in this way (for example, overhead, storage, or data quality)?
    Based on my estimates, the record size should remain reasonable.

Features I would like to have as part of aperf

  1. Absolute timestamps in records. Currently, records do not include them. With absolute timestamps, it would be possible to merge multiple records into a single larger record, even if they overlap.
  2. I would appreciate native support for using S3 as the output location for records. This would remove the need to upload them manually using awscli.
  3. If aperf record could run in the background and periodically upload partial records to S3, I would not need to rely on cron jobs.
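To illustrate the merge behavior I have in mind for overlapping records, here is a small sketch. The sample format (an epoch timestamp paired with a dictionary of metric values) is purely hypothetical and is not aperf's actual on-disk format; the point is only that absolute timestamps make de-duplicating the overlap straightforward.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample format: (epoch seconds, metric name -> value).
Sample = Tuple[float, Dict[str, float]]


def merge_records(records: Iterable[List[Sample]]) -> List[Sample]:
    """Merge several records into one, ordered by absolute timestamp.

    When two records contain a sample with the same timestamp,
    the sample from the earlier record in the input is kept.
    """
    merged: Dict[float, Dict[str, float]] = {}
    for record in records:
        for timestamp, values in record:
            merged.setdefault(timestamp, values)
    return sorted(merged.items(), key=lambda item: item[0])
```

With two one-minute records that share a boundary sample, the result would contain that sample once, and the remaining samples in time order.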

Data that I cannot find in the report and would like to have in the future

  1. Network throughput: at the moment I can only find packet counters, but I would like to see actual throughput values (for example, bytes per second).
  2. Disk throughput: we are using EC2 instances with NVMe disks, and disk performance is an important factor for us.
  3. S3 transfer throughput: we upload and download many gigabytes of data to and from S3, so it would be very helpful to have a separate graph for this traffic.
  4. Free and used disk space for each mount point.
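As a rough illustration of the first and last items, throughput can be derived from two samples of a cumulative byte counter (for example, the per-interface byte counters in `/proc/net/dev` on Linux), and disk space can be read per mount point with the Python standard library. This is only a sketch of the values I would like to see, not a proposal for how aperf should implement them; the function names are hypothetical.

```python
import shutil
from typing import Dict, Optional


def throughput_bytes_per_s(prev_bytes: int, curr_bytes: int,
                           prev_time: float, curr_time: float) -> Optional[float]:
    """Convert two samples of a cumulative byte counter into bytes per second.

    Returns None if the counter went backwards (for example, after a reset),
    because no meaningful rate can be computed for that interval.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_bytes < prev_bytes:
        return None
    return (curr_bytes - prev_bytes) / elapsed


def disk_space(mount_point: str) -> Dict[str, int]:
    """Report total, used, and free bytes for the filesystem at mount_point."""
    usage = shutil.disk_usage(mount_point)
    return {"total": usage.total, "used": usage.used, "free": usage.free}
```

For example, a counter that grows from 1,000,000 to 7,000,000 bytes over 60 seconds corresponds to 100,000 bytes per second.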

I’d really appreciate your feedback on this request and any guidance on possible implementations. I’m happy to contribute and help implement these features through PRs.
