Recently I stumbled across an interesting article about Microsoft, open source and ITIL in which ossim was mentioned. (The article can also be found by googling for “ossim itil microsoft” in case the link breaks.)
I’ve never been very keen on learning ITIL (although I’ve heard about it everywhere during the last year), but this really caught my attention. In that paper ossim gets referenced only in the “security management” section. I think that’s mainly because ossim was hard to install, set up and understand when that article was written, so I thought I’d give it another try from my point of view, taking the included tools into account for the different ITIL sections.
So, the goal of this article is to extend and improve on that other article, giving some thought to how I’d approach all those ITIL recommendations from an OSSIM point of view.
- Service Support
- Service Delivery
Note: The definitions after each topic have been quoted from the MS article since they’re small and concise.
The following diagram illustrates a sample support request handled according to ITIL (thanks Gabi, although there are some typos ;-)):
(Image removed, broken link, I’m very sorry. DK.)
Incident Management
Solving incidents and restoring services quickly.
The incident manager (http://www.ossim.net/dokuwiki/doku.php?id=user_manual:incidents [no longer available]) is the obvious choice for this activity from an OSSIM point of view, with a couple of details and exceptions mentioned below.
Five points are mentioned as important for incident management:
- Detect and Record an incident
http://www.ossim.net/dokuwiki/doku.php?id=user_manual:incidents#incidents_incidents [no longer available] (the main incident manager)
http://www.ossim.net/dokuwiki/doku.php?id=user_manual:incidents#incidents_types [no longer available] (incident types)
- Initial Support (This requires more manual intervention, although automated URLs with information could be sent to the users involved in the incident)
- Investigate and Resolve
http://www.ossim.net/dokuwiki/doku.php?id=user_manual:events [no longer available] (With forensics, real-time viewers, vulnerability databases and everything logged in a central location, there are plenty of tools for doing this)
- Track, Monitor and Communicate
http://www.ossim.net/dokuwiki/doku.php?id=user_manual:dashboard#dashboard_executive_panel [no longer available] (Specific metrics / dashboards could be designed for this)
Problem Management
Solving root cause problems to prevent future incidents.
- Problem and Error Control (Well, this is what a SIEM is used for 90% of the time, isn’t it? Linking to the root description for overview.)
- Proactive Management
http://www.ossim.net/dokuwiki/doku.php?id=user_manual:events#events_anomalies [no longer available] (Identifying problems and errors before they occur. Anomalies can be a very valuable tool for this)
http://www.ossim.net/dokuwiki/doku.php?id=user_manual:reports [no longer available]
Configuration Management
Maintaining all necessary information about services, service components, and relationships.
At first I was confused and didn’t see how ossim could fit into this. The following tasks are mentioned as being important for this part. As I don’t see how they fit, I don’t link them to any specific pages.
- Configuration Control
- Status Accounting
- Verification and Audit
But after this they point out a series of important things that a piece of software would have to do in order to help with this, namely:
- Discover devices on a network
- Determine the host OS
- Determine OS version and patch levels
- Determine which applications are installed
- Detect any changes to the configuration
Well, this started to sound familiar. OCS (http://www.ossim.net/dokuwiki/doku.php?id=user_manual:reports#reports_ocs_inventory [no longer available]), the Nmap network inventory (http://www.ossim.net/dokuwiki/doku.php?id=user_manual:tools#tools_net_scan [no longer available]), the whole policy area (http://www.ossim.net/dokuwiki/doku.php?id=user_manual:policy [no longer available]), p0f, pads and arpwatch all fit perfectly in this section.
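To make the “detect any changes to the configuration” point a bit more concrete, here is a minimal Python sketch that compares two inventory snapshots. It is only an illustration: the snapshot format (host mapped to attribute/value pairs), the `diff_inventory` name and the sample hosts are my own assumptions, not anything OCS, Nmap or ossim actually exports, so real tool output would have to be parsed into this shape first.

```python
from typing import Dict, List

# Hypothetical snapshot format: host -> {attribute: value}.
# This is an assumption for illustration, not an OCS or ossim data format.
Snapshot = Dict[str, Dict[str, str]]


def diff_inventory(old: Snapshot, new: Snapshot) -> List[str]:
    """Return human-readable changes between two inventory snapshots."""
    changes: List[str] = []
    for host in sorted(set(old) | set(new)):
        if host not in old:
            changes.append(f"{host}: new host discovered")
            continue
        if host not in new:
            changes.append(f"{host}: host no longer seen")
            continue
        before, after = old[host], new[host]
        # Compare every attribute known in either snapshot.
        for key in sorted(set(before) | set(after)):
            if before.get(key) != after.get(key):
                changes.append(
                    f"{host}: {key} changed from "
                    f"{before.get(key)!r} to {after.get(key)!r}"
                )
    return changes


# Made-up sample data.
yesterday = {
    "10.0.0.5": {"os": "Debian 4.0", "apache": "2.2.3"},
    "10.0.0.9": {"os": "Windows XP SP2"},
}
today = {
    "10.0.0.5": {"os": "Debian 4.0", "apache": "2.2.8"},
    "10.0.0.7": {"os": "FreeBSD 6.2"},
}

for line in diff_inventory(yesterday, today):
    print(line)
```

Each reported change could then be turned into an incident or an anomaly entry, which is roughly the role the tools listed above play.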
Change Management
Controlling the implementation of changes in the infrastructure.
This is an area where ossim can be greatly improved. OCS already includes some “Install software updates and configuration changes” functionality but it’s not fully integrated.
The rest of it is covered by the incident manager (http://www.ossim.net/dokuwiki/doku.php?id=user_manual:incidents [no longer available]), reports, executive panels and OCS.
- Filtering Changes
- Implementing Changes
- Review and Close
Release Management
Controlling the rollout of new releases in the infrastructure.
Again, there are things missing for this to be fully covered by ossim. Integrating Zenoss would be an option, although with tight OCS integration and some additional development this should be easy to accomplish without additional dependencies. Maybe Webmin too.
- Build and Configure
- Test and accept
- Schedule and Plan (The incident manager would be suitable, although not perfect, for this)
- Communicate and Prepare (We’d use the incident manager’s subscription feature for this)
- Distribute and Install (As mentioned earlier, OCS can be used for remote software installation)
Service Level Management
Defining and implementing clear agreements for service delivery between an IT organization and its customers.
This on the other hand is something which is fully covered by ossim. Having implemented lots of metrics and measurements it is very easy to:
- Define SLAs, OLAs and UCs (Also executive panel metrics, service level, vulnmeter can be used for this)
- Define a service catalog (Typifying and tagging incidents)
- Status accounting
Financial Management for IT Services
Ensuring the proper management, maintenance, and financial operation of IT.
This is more of a human task than an ossim one. I guess metrics could be enforced if the data related to all of this is stored somewhere but I’d need to investigate it some more.
Capacity Management
Optimizing capacity to meet service requirements at an acceptable cost.
This is a very interesting area of expertise, where some parts are covered by ossim and others not so much. The article mentions Zenoss and Hyperic HQ as tools that meet the needs for this, and I guess ossim as it stands also meets many of them:
- Monitoring (Including trends and forecasting due to heavy RRD usage: Nagios, Ntop, OCS, etc…)
- Analysis (Most of ossim can be used for this)
- Demand Management (Most of ossim can be used for this)
- Modeling (Policy establishes a baseline and anomalies provide information on how this has changed over time)
- Planning (Most of ossim can be used for this)
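As a rough idea of what “trends and forecasting” means for capacity planning, the following Python sketch fits a straight line to evenly spaced usage samples and estimates how long it takes until a limit is reached. Treat it as an assumption-laden illustration: it uses a plain least-squares fit rather than anything RRDtool does internally, and the function names and the disk usage figures are made up.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through samples taken at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def samples_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Sampling intervals left until the fitted trend reaches `limit`.

    Returns None when usage is flat or shrinking, and 0.0 when the
    trend has already reached the limit.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    x_at_limit = (limit - intercept) / slope
    return max(0.0, x_at_limit - (n_last := len(values) - 1))


# Made-up daily disk usage in percent.
disk_usage = [50, 52, 54, 56, 58]
remaining = samples_until_limit(disk_usage, limit=90)
print(f"days until 90% full: {remaining}")
```

A straight line is of course the crudest possible model; it is only meant to show how monitored data can feed the Modeling and Planning tasks above.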
Availability Management
Ensuring the availability of IT resources to meet agreed upon service levels.
This obviously is fully covered by monitors and executive panels with metrics:
- Define requirements (This involves policy, executive panel metrics, business processes, check the main descriptions)
- Availability Planning (Nagios, Ntop, OCS, etc…)
- Monitor Availability (Nagios, Ntop, OCS, etc…)
- Monitor obligations (This involves policy, executive panel metrics, business processes, check the main descriptions)
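The arithmetic behind “monitor availability” against an agreed service level is simple enough to sketch in Python. The 99.9% target, the 30-day period and the 50 minutes of downtime below are invented numbers, and the function names are hypothetical; in practice the downtime figure would come from a monitor such as Nagios.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def allowed_downtime(period_minutes: float, target_percent: float) -> float:
    """Downtime budget (in minutes) implied by an availability target."""
    return period_minutes * (100.0 - target_percent) / 100.0


# Made-up example: a 30-day month, 50 minutes of downtime, 99.9% target.
MONTH = 30 * 24 * 60
observed = availability_percent(MONTH, downtime_minutes=50)
budget = allowed_downtime(MONTH, target_percent=99.9)

print(f"availability: {observed:.3f}%")
print(f"downtime budget: {budget:.1f} min")
print("SLA met" if observed >= 99.9 else "SLA missed")
```

In this example the 50 minutes of downtime exceed the budget of roughly 43 minutes, so the service level would count as missed for that month.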
IT Service Continuity Management
Defining and maintaining appropriate Disaster Recovery plans for IT.
This is also a very manual and off-ossim task, although some parts could help with it (monitoring, SLAs, etc…).
Security Management
Ensuring the proper access to services as defined by agreements and industry best practices.
This is where the article mentions ossim, although without fully expanding on what can be covered using it. Four main tasks are required for this:
- Coordinate Security Management (Using the incident manager)
- Implement Controls (This is covered by most of ossim)
- Evaluate and Audit Controls (Also this)
- Maintain and Monitor (Using the incident manager, executive panels, reports)
This article is more of an exercise in what could be done than a step-by-step guide on how to implement it. Obviously that step-by-step guide is now on my todo list, but that requires much more than the couple of hours I’ve spent writing this up.
Anyway, I hope this article gave a quick overview of how ossim can be applied to ITIL.
Remember the graph at the top and the numbers? Let’s sum up where each task would fit in ossim.
- The new incident creates a log read by the agent, a response is issued by policy and a new incident inserted.
- Using the incident manager we tag and typify the incident.
- Next we analyze events, alarms, monitors and reports checking what could be wrong. We notice the service level and metrics at executive panel decreasing.
- Through the incident manager we subscribe the people who can fix the issue to the incident, and describe the actions needed in order to fix it.
- Another executive panel reflects to the customer what the current issue means to their business in terms of money.
- Again we track everything using the incident manager and issue a patch rollout using OCS’s automatic software installation feature.
- We change our inventory information based on the new automatic changes and keep track of them using the incident manager. The anomalies panel gets automatically updated once the new versions get detected.
- Nagios will reflect the downtime and affect the overall service level due to this issue.
- Policy, executive panels and directives get updated if needed after this new situation.
- The closed incident gets reported to the customer and will show up on incident reports.
And to finish, I’d like to quote a friend’s comment about ITIL: ITIL is a series of things everybody with a little bit of common sense would do in an enterprise if he had the time, the people or the money to do it. So if it’s not being done, I bet it will take more than a series of best practices to change that ;-)