On March 18, 2019, at approximately 11:50 PST, SurePrep users began to experience significant system slowness, combined with 500-level errors thrown by SPbinder that caused it to close forcefully.
The root cause was central database CPU usage running consistently at 99% across all cores. The high CPU usage accounted for the overall system slowness, and indirectly caused the 500 errors through a specific query timeout related to SPbinder workpaper synchronization.
Subsequent research revealed that a specific query in a database script (stored procedure) was the source of the high CPU usage. That script had been updated overnight, in the very early morning of March 18th, as part of a required hotfix; the update was entirely server-side.
The script had previously gone through the standard review process in the Development, Stage, UAT and Production environments. That review process, however, focused primarily on the functional correctness of the results.
During the event, customers experienced slowness in multiple areas, from FileRoom usage to opening binders. Many without-leads binders were unusable: 500-level errors appeared soon after opening, forcing the binder to close. In short, the overall ability to use SurePrep systems was degraded.
The SurePrep architecture is designed to be highly scalable at all levels, including the database systems. As volume increases, additional database systems are deployed to handle the heavy lifting required for binder-specific processing. At the center of this architecture is the central database system, which acts as a “control tower”, providing binder-level directory services, document lookup, user management and overall workflow control.
The root cause of the issue was a newly updated database script containing a high-frequency, non-optimized query that heavily taxed processor resources on the central database system.
Adding to the problem was a document-related, binder-specific query that runs every time a without-leads binder is opened. This query was especially sensitive to server load and consequently timed out extremely frequently while without-leads binders were being opened, resulting in 500-level errors and the forceful closing of those binders.
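The report does not include the queries or application code involved, but the failure mode described here (a load-sensitive query timing out and surfacing directly as a 500-level error) follows a common pattern. As a minimal illustrative sketch, assuming hypothetical names like `QueryTimeout` and `run_with_retry` rather than anything in SurePrep's actual codebase, a timeout-prone call can be retried with exponential backoff instead of immediately failing the request:

```python
import time

class QueryTimeout(Exception):
    """Stand-in for a database driver's query-timeout error."""

def run_with_retry(query_fn, attempts=3, base_delay=0.01):
    """Run a query, retrying on timeout with exponential backoff
    rather than surfacing every timeout as a 500-level error."""
    for attempt in range(attempts):
        try:
            return query_fn()
        except QueryTimeout:
            if attempt == attempts - 1:
                raise  # exhausted retries; let the caller handle it
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky query: times out twice under load, then succeeds.
calls = {"n": 0}
def flaky_sync_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise QueryTimeout("workpaper sync query timed out")
    return "synced"

print(run_with_retry(flaky_sync_query))  # prints "synced"
```

A retry wrapper like this only masks transient contention; as the report notes, the durable fix was to address the query's inefficiency and its sensitivity to high-CPU conditions.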
Resolution and Prevention
The immediate resolution centered on identifying and resolving the issue impacting the central database. Once the script in question was identified, an examination of the delta between the previous and updated versions informed the change required to mitigate the performance issues. That change was performed, and the database system's resource usage, specifically CPU, returned to normal levels.
An additional resolution involved identifying the script that was triggering the 500-level errors. As mentioned above, this procedure was especially sensitive to high-CPU conditions; the cause of that sensitivity was identified and resolved.
A longer-term resolution relates to process. Two areas of focus have been identified for improvement:
1. The approval of both the scope of and the necessity for a hotfix, which currently goes through multiple levels of review, will be expanded to include sign-off at the Executive Management level.
2. The technical review process for hotfixes will add one more layer of review, along with a higher degree of documented focus not just on functional results but also on performance-related elements.
The SurePrep Support Team and the SurePrep IT Team became aware of the issue at roughly the same time, just before noon on Monday, March 18th. Those teams immediately informed the CITO, who immediately mobilized a team of technical resources. The group set up a “war room” in the Irvine, CA office and worked the problem through resolution.
Times below are PST, and approximate:
11:50 – Support and IT Teams become aware of the issue
11:55 – Senior Technical resources get involved
12:10 – Senior Technical resources gather in the “war room” and begin to work the issue; parallel investigations into recent changes, infrastructure, and the 500-level errors begin
13:00 – Infrastructure-related causes are ruled out
14:00 – Methodical analysis of recent changes reveals newly-introduced inefficiencies
14:30 – Cause of 500-level error identified
15:00 – Fix proposed for primary issue; validation begins
16:00 – Fix implemented; system operation returns to normal
23:00 – Update implemented to prevent future 500-level errors during periods of high CPU utilization