
5 IMPORTANT FACTORS FOR SIZING YOUR ORACLE DATA SCIENCE ENVIRONMENT


Most data science experiments begin with a few data files on a laptop and then apply algorithms and models to predict an outcome. Although this approach requires little computing at first, data scientists quickly find that their workloads demand more computational power than a local CPU or GPU can provide. As a result, they look for a computing solution that can scale rapidly. The Oracle Cloud Infrastructure (OCI) Data Science platform fits this category, addressing the need for flexible sizing and performance.

When a data scientist considers using cloud resources, a few questions naturally arise:

  • Is cloud computing capable of scaling up to meet the workload?
  • Can I achieve performance comparable to or better than my laptop or on-premises machines?
  • Is this a cost-effective move?

This article walks through the major criteria that drive the answers to these questions, so data scientists can use them as a guideline to plan their move and get the most out of the OCI platform.

Workload

The first stage in the process is to understand your workload and its peculiarities. A data science program’s workload is the amount of computing work performed on a given computational resource, such as a laptop, an on-premises server, or a cluster of workstations. It covers memory-, I/O-, and CPU- or GPU-intensive tasks. Workload is calculated by multiplying the number of CPU or GPU cores by the number of computation hours.
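
For illustration, here is a minimal Python sketch of that calculation, using hypothetical figures for a single job:

```python
# Core-hour estimate for a single job (hypothetical figures).
cpu_cores = 8          # cores allocated to the job
runtime_hours = 6.5    # wall-clock hours the job runs
utilization = 0.85     # average fraction of those cores actually busy

core_hours = cpu_cores * runtime_hours
effective_core_hours = core_hours * utilization

print(f"Raw core-hours:       {core_hours:.1f}")
print(f"Effective core-hours: {effective_core_hours:.1f}")
```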

As a result, you must be aware of the following critical elements that influence workloads:

  • Maximum compute hours: the duration of your longest-running job.
  • Peak compute hours: the duration of your longest-running task at 80–90% CPU or GPU usage.
  • Average CPU or GPU utilization: what proportion of the CPU or GPU is used on average.
  • Average workload hours per day or month: how much of the time your machine is active rather than idle.

The first two criteria characterize your most demanding task and provide an estimate of the number of compute cores and the amount of storage needed to meet your performance goals. The remaining two concern resource idle time and its cost implications for your performance target.

If the maximum compute hours exceed what your performance target allows, you need additional compute cores, and running those workloads in OCI for higher performance may be beneficial. If your average monthly workload hours are low and your machines sit idle most of the time, OCI Data Science’s on-demand compute infrastructure can save you money that would otherwise be spent on idle capacity.
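
As a rough illustration, the following sketch compares the monthly cost of an always-on machine with on-demand compute billed only for active hours. The rates and hours are hypothetical placeholders, not actual OCI prices:

```python
# Hypothetical monthly cost comparison: always-on machine vs. on-demand compute.
# The hourly rates below are illustrative placeholders, not actual OCI prices.
hours_per_month = 730
active_hours_per_month = 120   # time the machine is actually busy
always_on_rate = 0.90          # amortized cost per hour of an always-on server
on_demand_rate = 1.20          # per-hour rate of a comparable on-demand shape

always_on_cost = hours_per_month * always_on_rate
on_demand_cost = active_hours_per_month * on_demand_rate

print(f"Always-on monthly cost: ${always_on_cost:,.2f}")
print(f"On-demand monthly cost: ${on_demand_cost:,.2f}")
```

If the on-demand figure comes out lower, idle time is where the savings are.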

Installation, configuration, data ingestion, and preparation can all be factored into your workload. These manual chores are required in every data science initiative. They usually don’t require much computing power and can be completed with a small amount of CPU time. For example, you might install and configure Anaconda on a low-end Compute shape and then move to a shape with more CPU or GPU cores as you run your algorithms and iterations.
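
One way to make that move repeatable is to export the Conda environment on the small shape and rebuild it on the larger one. The sketch below shells out to the conda CLI; the environment name ds-train is a hypothetical example:

```python
# Export a Conda environment on a small shape and recreate it on a larger one.
# Assumes the conda CLI is installed; "ds-train" is a hypothetical environment name.
import subprocess

# On the small shape: capture the installed packages into a spec file.
with open("environment.yml", "w") as spec:
    subprocess.run(
        ["conda", "env", "export", "--name", "ds-train"],
        stdout=spec,
        check=True,
    )

# On the larger CPU or GPU shape: rebuild the same environment from the spec.
subprocess.run(["conda", "env", "create", "-f", "environment.yml"], check=True)
```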

Installing software repeatedly, managing configurations, and dealing with version and library incompatibilities all take time away from productive work. Using a pre-built environment that behaves consistently across a team reduces this time to a fraction of what it would otherwise be.

CPU and GPU

The CPU or GPU strongly influences the performance of your Data Science application. To estimate the appropriate compute requirements, you need to grasp a few important factors.

Cores, threads, and clock speed are the most important characteristics of a CPU. Cores are the individual processing units packaged inside the CPU chip. Threads define how many tasks a CPU core can work on at the same time, a capability known as simultaneous multithreading or hyperthreading. The number of threads is usually double the number of cores.

A processor’s clock speed, measured in GHz, indicates how many cycles it can execute per second. When two processors have the same number of cores, the one with the higher clock speed performs better. Clock speed is also why a 10-core Intel i9-10900 can perform comparably to a 16-core AMD Ryzen 9 5950X on some workloads: the Intel part runs at a higher peak clock speed.
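
To gather these CPU characteristics for your current machine, a small Python sketch such as the following can help; it assumes the third-party psutil package is installed:

```python
# Report physical cores, hardware threads, and clock speed of the local CPU.
# Assumes the third-party psutil package is installed (pip install psutil).
import os
import psutil

physical_cores = psutil.cpu_count(logical=False)  # individual cores on the chip
logical_threads = os.cpu_count()                  # hardware threads, often 2x cores
freq = psutil.cpu_freq()                          # may be None on some platforms

print(f"Physical cores:  {physical_cores}")
print(f"Logical threads: {logical_threads}")
if freq is not None:
    print(f"Clock speed:     {freq.current:.0f} MHz (max {freq.max:.0f} MHz)")
```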

A GPU is characterized by its stream processors or CUDA cores, tensor cores, VRAM, and memory bandwidth. AMD’s stream processors and NVIDIA’s CUDA cores each perform a single, precise calculation per clock cycle, whereas tensor cores compute an entire matrix operation per GPU clock cycle and are roughly 5–8 times faster. The amount of VRAM built into the GPU card and its memory bandwidth determine how quickly the GPU can access and use that memory. Running the nvidia-smi tool on an NVIDIA GPU shows real-time GPU and VRAM usage while your task is running.
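
For example, a minimal Python sketch that polls nvidia-smi for utilization and VRAM figures might look like this (it assumes an NVIDIA driver with nvidia-smi on the PATH):

```python
# Poll GPU utilization and VRAM usage by calling nvidia-smi.
# Assumes an NVIDIA GPU with the nvidia-smi tool available on the PATH.
import subprocess

result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.strip().splitlines():
    gpu_util, mem_used, mem_total = (v.strip() for v in line.split(","))
    print(f"GPU utilization: {gpu_util}%  VRAM: {mem_used} / {mem_total} MiB")
```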

With these factors in mind, you can explore and decide on the best CPU or GPU shape for your task. For example, MLPerf is an open-source benchmark suite that compares CPU and GPU performance for both machine learning training and inference workloads. Published deep learning benchmarks for CPUs and NVIDIA GPUs can also provide useful guidance.

Once you’ve determined the best processor for your job, matching that setup to OCI Compute shapes gives you predictable performance on an OCI Data Science shape. However, because workload characteristics vary greatly, testing your own workloads is the most reliable way to determine what works best for you.

Environment

Your Data Science environment also has a big influence on Compute shape sizing and on achieving the desired price-performance ratio. A high-capacity CPU or GPU isn’t necessary for installing or setting up an environment, or for data analysis and preparation tasks. Similarly, if you’re just getting started with algorithm testing or model training, a few CPU or GPU cores may be plenty.

As your training workload grows, you’ll need more cores and memory. The Compute shape is also influenced by how a trained model performs inference on production data. Predictions on large batches of data and high-frequency concurrent queries, for example, require scaling CPUs and GPUs in terms of cores and nodes. Latency is a critical aspect of inferencing, and choosing the right number of cores or nodes for a Compute shape is essential to strike the right balance.
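
As a back-of-the-envelope illustration, the following sketch estimates cores and nodes from a hypothetical request rate and per-request latency; the figures are placeholders, not measurements:

```python
# Back-of-the-envelope inference sizing (hypothetical figures).
import math

requests_per_second = 400    # peak prediction traffic
seconds_per_request = 0.050  # CPU time one core spends on a single prediction
cores_per_node = 16          # cores on the chosen Compute shape

# Each core can serve roughly 1 / seconds_per_request requests per second,
# so peak traffic times per-request cost gives the cores needed.
required_cores = math.ceil(requests_per_second * seconds_per_request)
required_nodes = math.ceil(required_cores / cores_per_node)

print(f"Cores needed at peak: {required_cores}")
print(f"Nodes of {cores_per_node} cores each: {required_nodes}")
```

In practice, latency measured under real load should replace the placeholder figures before a shape is chosen.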

Data science workloads frequently run not just on various Compute shapes, but also in various Conda environments. CUDA-accelerated PyTorch or TensorFlow workloads need GPU-based environments, whereas standard pandas, scikit-learn, or Oracle Accelerated Data Science environments can handle data exploration and preparation.

As a result, splitting environments according to the type of work performed can be beneficial. This approach means working with a distinct Conda environment per job. You need the ability to switch between environments quickly and to scale compute cores without disturbing your installed environment base. To meet price-performance and migration expectations, you’ll need to strike the right balance between an on-premises environment and an on-demand OCI environment.

An existing, siloed data science setup may be fine at first, but as a data science team grows, a consistent environment-management approach and collaborative code sharing become essential. Consider the implications and the critical elements, such as performance, growth, collaboration, and cost, when assessing the overall infrastructure.

Data residency and volume

About 70% of each data science effort is spent on data ingestion, cleaning, and preparation to make the data suitable for data science processing. When huge batches or streams of data need to be handled, this workload can grow sharply, requiring a larger, auto-scalable compute resource. When the essential data is spread across on-premises systems and several clouds, the process becomes even harder: consolidation, deduplication, and consistency of data become significant challenges.

In such cases, copying the essential data to a common platform, such as a data lake, and then processing it with data transformation tools located closer to the data is often the simplest and most cost-effective solution. This architecture calls for an auto-scalable storage solution, and on-demand options such as OCI Lakehouse can help.

Cleaning and preparation methods differ between structured and unstructured data, as well as between smaller batches and streaming datasets. For example, SQL-based cleaning operations inside an OCI Autonomous Database or Apache Spark-based in-memory batch jobs can effectively clean huge volumes of relational, structured data. Because a bigger data collection requires more storage, access, and processing time, lakehouse sizing becomes important.
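
As an illustration of the Spark-based path, here is a minimal PySpark sketch of a batch deduplication and cleaning step; the object storage paths and column names are hypothetical:

```python
# Minimal PySpark batch-cleaning sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-clean").getOrCreate()

# Read the raw structured data from object storage.
raw = spark.read.parquet("oci://raw-bucket@namespace/orders/")

cleaned = (
    raw.dropDuplicates(["order_id"])      # deduplicate on a business key
       .na.drop(subset=["customer_id"])   # drop rows missing required fields
)

# Write the curated result back for downstream model training.
cleaned.write.mode("overwrite").parquet("oci://curated-bucket@namespace/orders/")
```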

Moving data science models closer to the data, on the other hand, is a critical step that lets a data scientist focus on the most important activities, such as model development, selection, and prediction explainability, rather than transferring enormous amounts of data. When data is scattered, data scientists may have to run workloads in silos; when data is consolidated, they can run workloads together on a single platform like OCI. The principles are the same in both approaches, but the latter is easier to administer and apply, and it lets you scale compute and storage effectively for data science outcomes.

Architecture and integration

The architecture of your data science tools and components describes how they integrate and interact with one another. If your data science infrastructure is coupled to on-premises solutions or a particular cloud vendor, evaluate the tools, their production usage, and their integration points to determine how they can be deployed or scaled as part of a data science implementation. The less critical the integration points and the fewer the dependencies on them, the easier it is to migrate, scale, or burst into the cloud.

If your environment is imaged and your infrastructure is scripted, for example with Terraform, deploying on OCI and taking advantage of the performance, flexibility, and cost savings it provides can be more beneficial. Alternatively, moving large, infrequent training workloads to the cloud while keeping production inference workloads on-premises can improve overall performance and save money. When properly automated and monitored, this hybrid, cross-cloud architecture with right-sized Compute shapes can have a significant influence on overall business performance and profitability.

Conclusion

By evaluating all of these aspects, you can examine your existing or planned Data Science environment and decide whether moving your data science workloads entirely or partially to OCI makes sense. Once you have an idea of the scope, use the Oracle Cloud Estimator to see how your analysis compares to a small, medium, or large implementation. After that, explore the reference architectures for OCI Data Science.

You may also tailor and right-size your Data Science environment to meet your specific requirements. Oracle Cloud Infrastructure Data Science is dedicated to assisting you in making the best decision possible.

Here at CourseMonster, we know how hard it may be to find the right time and funds for training. We provide effective training programs that enable you to select the training option that best meets the demands of your company.

For more information, please get in touch with one of our course advisers today or contact us at training@coursemonster.com