Many analyst firms observe that capacity management, an essential process for any large IT enterprise, is often highly complex, and in today's accelerated business climate it is frequently not implemented effectively. Shifting priorities, growing complexity, and elastic cloud infrastructure have made the traditional capacity management model less effective. Supported by new technologies and driven by innovative IT leaders, a new capacity management model is emerging. This new model treats the use of IT resources in business-meaningful terms, uses automation and analytics to manage complexity, and reduces manual effort. In this article, we discuss how the complex monitoring, analysis, and forecasting involved in capacity management can be reduced to one indicator of service health (current performance) and one indicator of service risk (future performance), making capacity easier to manage and more visible to all stakeholders.

Strategic Advantages of Capacity Management

Capacity management balances cost and risk

In simple terms, IT capacity management is the basis for balancing the cost and performance of business services, with the allocation and configuration of infrastructure as the fulcrum. If your company's infrastructure is misconfigured or insufficient to support business needs, slow response times and outages can occur, costing the business millions. A typical way to avoid this is to overprovision your infrastructure: estimate the capacity you need and double it. It is estimated that up to 50% of cloud infrastructure sits unused, and the problem is even more pronounced in physical storage. Overprovisioning wastes a great deal of hardware, software licensing, and management cost. The trick is to right-size your enterprise's infrastructure to meet current needs and to know exactly when and where to add capacity.
To effectively optimize business services, the capacity management process consists of four main steps: data collection, data analysis, prediction, and delivery of actionable information.
What makes IT so challenging is that the dynamic development of technology, changing business requirements, and demand growth all add complexity, leaving the IT environment in constant flux. Time has always been of the essence for performance issues, yet IT staff are scattered across tasks and projects, leaving less time to ensure service delivery. Finally, capacity management expertise is increasingly scarce. According to the industry analyst firm Research In Action, by 2020 a lack of skills in capacity and performance management will be a major constraint or risk to growth for 75% of enterprises. Perhaps because of these challenges, many technology leaders consider capacity management a competitive advantage, and this will only intensify in the coming years. Research In Action predicts that by 2020, 35% of enterprises will use capacity management tools to gain a competitive advantage (up from 20% today). Competitive advantages brought by effective capacity management:
Manage complexity with automation

In recent years, most IT organizations that have deployed capacity management successfully have used analytics and automation. The advantages of this approach are speed and accuracy, even in very complex environments, but it takes considerable time and the right tools and processes to implement effectively. To understand this approach, let's explore each of the core processes described earlier:
Data collection

Performance data must be collected at a level of granularity sufficient for the business transactions involved. For example, real-time transactions and online shopping require finer granularity than batch processing. Remember, the collection tools you use must provide detailed, timely data in an automated and highly scalable manner to ensure the success of the project.

Data analysis

Traditionally, this analysis has been performed by capacity management experts examining data "manually" through simple tools such as spreadsheets, or by building and maintaining custom tools and queries. Such manual analysis takes a great deal of time and expertise, and taps resources that are already thin in many organizations. Automation is a big part of the solution, although few viable solutions exist in this area: historically, many "automated" solutions still required significant setup time and remained limited in the useful information they provided. However, technology can now solve these analytical problems in a more practical and efficient way.

Prediction

To predict performance accurately, we must recognize that computer systems do not behave linearly. If they did, prediction would be as simple as a linear trend. In reality, queues form. Queueing is what happens when a CPU, controller, or other device has more work arriving than it can handle. Requests then have to wait in line, just like waiting to check out at a store. When lines are short or empty, response time grows in proportion to the work added: add some more work, some applications, or some infrastructure, and there is simply more work to be processed. But once queueing sets in, latency suddenly balloons. This is the dreaded "knee in the curve," beyond which response times grow steeply: wait time exceeds service time, and responsiveness suffers badly. Too often, IT assumes latency will stay linear, then scrambles frantically when it does not.
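The knee behavior described above can be sketched with the simplest queueing result, the single-server (M/M/1) response-time formula R = S / (1 - ρ), where S is service time and ρ is utilization. This is an illustrative textbook model, not the article's proprietary algorithm, and the 10 ms service time is an assumed value:

```python
def response_time(service_time_s: float, utilization: float) -> float:
    """M/M/1 queueing approximation: R = S / (1 - rho).

    Response time stays near the service time at low utilization,
    then grows steeply as utilization approaches 1.0 (the 'knee').
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)

# Assumed 10 ms service time; watch the knee appear past ~80% utilization.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.0%}: response {response_time(0.010, rho) * 1000:.0f} ms")
```

Going from 50% to 99% utilization multiplies response time fifty-fold (20 ms to 1000 ms here), which is why a linear trend line badly underestimates latency near saturation.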
To avoid the knee, many IT organizations follow a strategy of never heavily loading their systems, which means overprovisioning: safe but wasteful. They pay too much to stay clear of the knee. To avoid it without overprovisioning, you must know where the knee will occur, which requires understanding how IT components interact to perform the work. A variety of techniques are used to predict performance with varying accuracy, from Excel spreadsheets to linear trending to simulation modeling to analytical modeling. Until recently, however, these solutions required substantial expertise, specialization, and time. Fortunately, forecasts can now be produced automatically and in a very timely manner.

Providing actionable information

The result of executing the three areas above effectively should be actionable information and reports with visualizations. Since IT decisions often affect the entire business, this information must also be presented in terms that make sense to non-IT stakeholders: for example, business metrics such as sales, SLAs, or uptime rather than IT metrics such as memory or I/O. It is not uncommon for IT departments to spend hundreds or thousands of hours creating reports for various stakeholders. Wherever possible, reporting should also be automated, freeing IT staff to focus on proactive problem solving and innovation.

Case Study: How JN Data Manages Complexity

Identifying and understanding what matters across the enterprise in real time helps Henrik Tonnisen, capacity manager at JN Data, deliver market-leading service, resource efficiency, and transparency to key clients, including Denmark's third-largest bank, Jyske Bank, and Denmark's largest mortgage company, Nykredit.
Tonnisen does this by fusing technical data from tens of thousands of servers into dynamic, self-service reports that meet the needs of each business stakeholder, transforming discussions from complex technical metrics into actionable business information. Tonnisen says his team received rave reviews from stakeholders after launching its new self-service reporting dashboard.

A new paradigm

Automation and analytics have proven effective against the challenges of modern capacity management. Until recently, however, these solutions required significant time and expertise to implement well. A new paradigm is sweeping the industry: automated health and risk scoring that identifies current and future performance problems, along with their future timing and severity. This is a game changer: it saves time, requires less expertise, and makes capacity management simpler and more accessible to all IT organizations. Sophisticated algorithms run behind the scenes to compute simple, easy-to-understand health and risk scores for each service. Watchlists can be defined to focus attention on the services you are accountable for, making it easy to determine what action to take, whether resolving a current issue or expanding capacity to head off a future problem. You no longer need to spend countless hours combing through data; the automated algorithms do it for you.

Why implement health and risk scoring?

Health and risk scores address two main functional areas of the capacity management process:
How are health and risk scores calculated?

Health score

The health score is calculated by gaining insight into each system that makes up the service. An analytical queueing network model computes actual CPU and I/O performance, which is compared to each system's theoretical maximum performance. Memory is evaluated on current utilization and on deviations from normal memory-management activity levels. Disk space usage is evaluated from current free capacity and historical behavior patterns. The analysis results are aggregated and normalized into an easily interpreted health score from 0 to 100, with 0-44 indicating poor health, 45-54 a warning, and 55-100 good health.

Risk score

The risk score is determined by running a capacity planning algorithm that predicts how the service will perform in the future. The algorithm forecasts the impact of the service's growth rate on the systems that make up the service. Analytical queueing network models calculate future CPU and disk I/O performance and compare it to each system's theoretical maximum; these models produce predictions that account for the nonlinear behavior of computing systems discussed earlier. Disk space usage at the end of the forecast period is predicted by evaluating activity patterns. From these calculations, a risk score representing the severity of the predicted risk is generated and normalized to a range of 0 to 100, with 0-44 representing low risk, 45-54 a warning, and 55-100 high risk. In addition to the risk score, the date on which poor performance or an outage will occur is also predicted: risks are identified by looking for one-time events and recurring behaviors in the forecast results, and the number of days until the risk materializes is calculated.
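As a rough sketch of the scheme described above, the fragment below aggregates per-resource sub-scores into a normalized 0-100 score with the article's bands, and projects a days-until-risk figure from a compound growth rate. The equal weighting, the 0.0-1.0 sub-score scale, and the compound-growth model are assumptions for illustration; the article does not disclose the actual algorithms:

```python
import math

def band(score, kind="health"):
    """Map a 0-100 score onto the article's bands (same cutoffs for both)."""
    if kind == "health":
        return "poor" if score <= 44 else "warning" if score <= 54 else "good"
    return "low" if score <= 44 else "warning" if score <= 54 else "high"

def health_score(cpu, io, memory, disk):
    """Aggregate per-resource sub-scores (0.0 = worst, 1.0 = best) into 0-100.
    Equal weighting is an illustrative assumption."""
    return round(100 * (cpu + io + memory + disk) / 4)

def days_until_risk(current_util, daily_growth, knee=0.80):
    """Days until utilization crosses the knee under compound growth:
    solves current * (1 + g)**d >= knee for d."""
    if current_util >= knee:
        return 0
    if daily_growth <= 0:
        return None  # flat or shrinking workload never reaches the knee
    return math.ceil(math.log(knee / current_util) / math.log(1 + daily_growth))

score = health_score(cpu=0.9, io=0.8, memory=1.0, disk=0.9)
print(score, band(score))           # 90 good
print(days_until_risk(0.50, 0.01))  # 48 days at 1%/day growth
```

The point of the normalization is that every stakeholder reads the same 0-100 scale and the same bands, regardless of which underlying resource drives the score.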
Simplicity is king

With all this work happening automatically behind the scenes, capacity management becomes much simpler and more accessible to all IT organizations. Organizations no longer need to hire large numbers of data scientists, staff time is saved, and forecasting no longer requires in-house experts. IT staff and service managers can view a single indicator of health and risk and know where to focus their attention.

Accuracy Matters

The accuracy of the algorithms and calculations is critical. So how accurate are they?
All of these methods adapt to variations in workload, configuration, and environment. Combined with sophisticated algorithms, the end result is among the most accurate health and risk calculations in the industry, typically 95% accurate.

Evaluate your business's options

A variety of capacity management solutions on the market today cater to different enterprise environments and needs. To evaluate them effectively, it helps to compare features and approaches and understand how they will affect your enterprise's capacity management efforts. To determine the health of IT and business services, the following approaches are typically used, with the highlighted items representing the approach taken in the new model:
To determine the risk of IT and business services, the following approaches are typically used, with the highlighted items representing the approach taken in the new model:
Options such as standard threshold comparison and event detection are easier to set up but far less precise. Allocation comparison and prediction suit virtual environments but cannot drive resource efficiency, because they consider only what is allocated rather than what is actually used. Queueing theory requires intelligent configuration and fine-grained data, but yields far more accurate determinations of service health and risk. When choosing an enterprise capacity management solution, consider the following factors:
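To make the precision difference concrete, here is a toy comparison between a static threshold check and a queueing-based check against a response-time SLA. The 85% threshold, 50 ms service time, 200 ms SLA, and the M/M/1 approximation are all assumed values for illustration, not details from any particular product:

```python
def threshold_check(cpu_util, limit=0.85):
    """Static threshold: alert only when utilization crosses a fixed limit."""
    return "alert" if cpu_util >= limit else "ok"

def queueing_check(cpu_util, service_time_s, sla_s):
    """Queueing-based: alert when the M/M/1 estimate R = S / (1 - rho)
    predicts a response time that violates the service's SLA."""
    predicted = service_time_s / (1.0 - cpu_util)
    return "alert" if predicted > sla_s else "ok"

# At 80% utilization the static threshold sees no problem...
print(threshold_check(0.80))               # ok
# ...but a 50 ms service time already means 250 ms, breaching a 200 ms SLA.
print(queueing_check(0.80, 0.050, 0.200))  # alert
```

The same utilization can be perfectly safe for one service and an SLA breach for another; only an approach that models how work queues on the resources can tell the two apart.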
These factors will shape the potential return on your capacity management investment and help determine the type of solution your organization should pursue.