Without this information, there is a knowledge gap in understanding how many models to run on a GPU. By gathering the hot and cold storage requirements, you can use them to inform the scheduling of models to gain several benefits:

- Maximized model throughput: Ensure that the models placed on each GPU do not sum to above a certain threshold of available memory and GPU utilization, such as 100%. This maximizes throughput for your hardware.
- Optimized hardware usage: Examine GPU memory requirements to run more models on less hardware. Rather than optimizing for throughput, you can use this data to determine the maximum number of models that can be loaded per GPU, reducing the hardware needed, or weigh the trade-off with throughput.
- Increased reliability: Eliminate out-of-memory errors by knowing that the models you load on a GPU will not exceed its capabilities.

In addition, there are two critical non-scheduling benefits:

- Efficient models: Compare and contrast different models, using compute requirements as an additional data point for how well a model performs. This can help produce more lightweight models and reduce the amount of memory required for your inference needs.
- Better hardware sizing: Determine the exact amount of hardware needed to run your models, using the memory requirements.

In short, understanding the compute requirements of inference models provides a host of benefits, from model creation and hardware sizing to reliable, efficient running of models. Here's a look at Model Analyzer to see how it can contribute to a maximum-performance inference solution.

## Getting the Model Analyzer Docker container

You must install some software, such as Docker, before using the inference server container. For more information, see the Installing Docker and NVIDIA Docker section in NVIDIA Docker: GPU Server Application Deployment Made Easy.

Model Analyzer runs as a Helm chart, Docker container, or standalone command-line interface. For this tutorial, you build the Docker container from the source in the triton-inference-server/model_analyzer GitHub repo.

To run the container for your models, make sure that ports 8000, 8001, and 8002 are available. Then, run the following command, replacing the capitalized arguments:

```
docker run -v /var/run/docker.sock:/var/run/docker.sock \
  -v /ABSOLUTE/PATH/TO/MODELS:/ABSOLUTE/PATH/TO/MODELS \
  -v /ABSOLUTE/PATH/TO/EXPORT/DIRECTORY:/results --net=host \
```

Here's an example command for reference:

```
docker run -v /var/run/docker.sock:/var/run/docker.sock \
  -v /home/user/models:/home/user/models \
  -v /home/user/results:/results --net=host \
```

When the container completes, the metrics are exported to your chosen directory for every model, batch size, and concurrency value. The information is gathered by collecting metrics on your system as it runs, so it is ideal to run it on an isolated GPU or on a system running only Model Analyzer.
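To get a quick view of the exported results, you can load them into a short script. The sketch below is illustrative only: the file name `metrics-model-inference.csv` and the column names are assumptions about the export format, not something confirmed by this post, so adjust them to match the files Model Analyzer actually writes for your run.

```python
# Minimal sketch: summarize exported Model Analyzer metrics with pandas.
# The file name and column names below are assumptions; check the CSV
# headers produced by your Model Analyzer version and adjust accordingly.
import pandas as pd

metrics = pd.read_csv("/home/user/results/metrics-model-inference.csv")

# Peak GPU memory observed for each model across all batch sizes and
# concurrency values: a conservative per-model memory budget.
peak_memory = metrics.groupby("Model")["Max GPU Memory Usage (MB)"].max()

# Best throughput per model, and the batch size / concurrency that achieved it.
best = metrics.loc[metrics.groupby("Model")["Throughput (infer/sec)"].idxmax()]

print(peak_memory)
print(best[["Model", "Batch", "Concurrency", "Throughput (infer/sec)"]])
```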
## Using the compute requirements for optimization

Here's how you can use these metrics to optimize your system performance. We discuss two case studies using medical inference models:

- The first case study explores minimizing hardware for a system that runs intermittently, such as a low-budget medical provider who needs to run many models on minimal hardware.
- The second case study explores maximizing throughput for these same models using the least hardware necessary to do so, such as a large emergency room running many models on a consistent basis.

Both case studies walk through these steps manually, so we end with a discussion of the next steps for incorporating model metadata into automated scheduling.
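As a preview of what automated scheduling could look like, the sketch below uses per-model peak memory numbers, like those gathered earlier, to pack models onto GPUs with a simple first-fit-decreasing heuristic and to estimate how many GPUs are needed. The model names, memory figures, and the per-GPU memory budget are made-up illustration values, not Model Analyzer output; a real scheduler would also weigh GPU utilization and throughput targets.

```python
# Minimal sketch: memory-aware placement of models onto GPUs using a
# first-fit-decreasing bin-packing heuristic. All numbers are illustrative.
from math import ceil

GPU_MEMORY_MB = 16_000          # assumed memory budget per GPU
HEADROOM = 0.9                  # keep each GPU under 90% of its memory

# Hypothetical per-model peak memory (MB), e.g. taken from Model Analyzer runs.
model_memory = {
    "chest_xray": 3_200,
    "covid19_xray": 4_800,
    "segmentation_ct": 6_500,
    "classification_breast": 2_100,
}

budget = GPU_MEMORY_MB * HEADROOM
gpus = []  # each entry: {"free": remaining MB, "models": [names]}

# Place the largest models first; open a new GPU only when nothing fits.
for name, mem in sorted(model_memory.items(), key=lambda kv: -kv[1]):
    target = next((g for g in gpus if g["free"] >= mem), None)
    if target is None:
        target = {"free": budget, "models": []}
        gpus.append(target)
    target["models"].append(name)
    target["free"] -= mem

print(f"GPUs needed: {len(gpus)} "
      f"(naive lower bound: {ceil(sum(model_memory.values()) / budget)})")
for i, g in enumerate(gpus):
    print(f"GPU {i}: {g['models']} ({budget - g['free']:.0f} MB used)")
```

First-fit-decreasing is only one reasonable heuristic; the point is that once per-model memory requirements are known, placement decisions that would otherwise rely on trial and error become simple arithmetic.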