I am writing this article with the thread http://forums.oracle.com/forums/thread.jspa?messageID=3205216� in mind. We have a very large workflow application written in Oracle Application Express, supporting some 3000+ users. This application was written with care, for obvious reasons. Well for one, it’s written in Application Express, a rapid development tool which is not designed for an application of this magnitude (80+ hits per second, 15 hours a day)! Every condition, expression, process is held in pre-compiled Oracle packages/procedure/functions in order to minimize latching and thus improve performance.
All is well until the application hits peak periods, which is when our CPU was 100% busy all the time. It remained a mystery for about two years until a few days ago. For two years I’ve searched high and low for an answer, because our application has been optimized from top to bottom. We knew what was causing the problem, it is the dreaded PL/SQL block below:
declare
rc__ number;
simple_list__ owa_util.vc_arr;
complex_list__ owa_util.vc_arr;
begin
owa.init_cgi_env(:n__,:nm__,:v__);
htp.HTBUF_LEN := 255;
null;
null;
simple_list__(1) := 'sys.%';
simple_list__(2) := 'dbms\_%';
simple_list__(3) := 'utl\_%';
simple_list__(4) := 'owa\_%';
simple_list__(5) := 'owa.%';
simple_list__(6) := 'htp.%';
simple_list__(7) := 'htf.%';
simple_list__(8) := 'wpg_docload.%';
if ((owa_match.match_pattern('f', simple_list__, complex_list__, true))) then
rc__ := 2;
else
null;
null;
f(p=>:p);
if (wpg_docload.is_file_download) then
rc__ := 1;
wpg_docload.get_download_file(:doc_info);
null;
null;
null;
commit;
else
rc__ := 0;
null;
null;
null;
commit;
owa.get_page(:data__,:ndata__);
end if;
end if;
:rc__ := rc__;
end;
The dreaded PL/SQL block above was called about 7,000 times in an hour and consumes 58% of the CPU load in our system (see stats below)!
CPU Elapsed CPU per % Total
Time (s) Time (s) Executions Exec (s) DB Time SQL Id
---------- ---------- ------------ ----------- ------- -------------
4,014 19,928 6,686 0.60 58.2 c34r978mgkrmf
For two years I could not find the answer. I’ve seen similar problems experienced by other people but it was dismissed as an expected behaviour by Oracle (see “Performance degradation is the expected behavior” http://forums.oracle.com/forums/thread.jspa?messageID=3205216�). Our DBA hassled me again last week advising that I enhance our application in order to reduce the number of calls to the dreaded SQL. I told him that I did not know when the block of code was called and advised that the only way I could even have a remote chance in finding it, I would need access to HTTP logs as well as database logs. Even Oracle Ace sspadafo said “That looks like the anonymous block built by modplsql to handle the form POST”.
Equipped with renewed vigour to find a solution the problem, our DBA went away and “reversed engineed” the problem so that a solution can be found once and for all! He got back to me a few days later with this email:
Ariel,I did a bit reverse engineer on this internal ApEx API. The core of this API is part of OWA. The related OWA is inside package wpg_docload. The source code of wpg_docload is at $ORACLE_HOME/rdbms/admin/wpgdocs.sql.
Whenever the end-users download a file (images, word docus, etc.), the wpg_docload is called. Obviously we call it too frequently and causes a big performance problem.
For example, between 14:00pm and 15:00pm yesterday, we fetched more than 3000 times for each of the files below:
[oracle@puma-c-app-00 logs]$ grep access.gif junk2 | wc -l
3777
[oracle@puma-c-app-00 logs]$ grep bug.gif junk2 | wc -l
3731
[oracle@puma-c-app-00 logs]$ grep change.gif junk2 | wc -l
3360
[oracle@puma-c-app-00 logs]$ grep dmt1.jpg junk2 | wc -l
3595
[oracle@puma-c-app-00 logs]$ grep dmt2.jpg junk2 | wc -l
4006
[oracle@puma-c-app-00 logs]$ grep dmt3.jpg junk2 | wc -l
3813
[oracle@puma-c-app-00 logs]$ grep dot.gif junk2 | wc -l
3140
[oracle@puma-c-app-00 logs]$ grep htmldb_remix.js junk2 | wc -l
3729
[oracle@puma-c-app-00 logs]$ grep theme.css junk2 | wc -l
3952
[oracle@puma-c-app-00 logs]$ grep whats_new.gif junk2 | wc -l
3538
We have 2 AppServers, so the real fetches are twice. On average, they are accessed 20 times per second!
Because only one CPU can access a particular RAM address at a given time, all other CPUs have to wait. This kind of issues can't be resolved by big/better hardware. The more powerful of the host, the worse it gets. For 4 CPUs, 3 have to wait. For 8 CPUs, 7 have to wait, For 16 CPU, 15 have to wait. While the CPUs are waiting, they are spinning (controlled by a hidden parameter _spin_count in Oracle). Spinning means using 100% CPU doing nothing! This is why your host is always busy.
The solution is to move the above images from database tier to middle tier.
When the image is fetched from database, the PC can't cache it because the PC does not know if the file has changed since last fetch. (If you read the source code I mentioned above, you will see actually Oracle is fully aware of this problem in 1999/2000)
When the image is fetched from Apache, it will be cached in the PC's RAM until the file on Apache server is changed.
After we move the above images into Apache, we will never need to run the culprit API to fetch these files, the workload on database tier will be reduced dramatically. As the images are cached in the PC's RAM, the workload on middle tier will be reduced largely as well.
What do you think?
Cheers,
Han
The revelation hit me like a lightning bolt! I’ve been looking in the wrong place! All this time I’ve been targetting expressions, conditions, computations, processes etc. I never thought to look in Shared Components -> Templates to optimize the application! Who would? Right? Templates control the look and feel of the application (stylesheets, images, themes, etc.), it should contain minimal database calls, right?
Not so with Oracle Application Express. Well, every image you store in Shared Components -> Images are only accessed through a database call specified below:
wwv_flow_file_mgr.get_file?p_security_group_id=923024072426648&p_fname=[FILE_NAME]
You reference these images through the page template through a substitution string eg.:
<script src="#COMPANY_IMAGES#htmldb remix. js" type="text/javascript"></script>
<img src="#COMPANY_IMAGES#sample_picture.jpg" alt="sample picture />
When the image or js is fetched from the database, the client can’t cache it because the client computer does not know if the file has changed since the last fetch. Therefore if you got hundreds of these images on your page, it would mean that you would call the “dreaded code” hundreds of times!
After this information was revealed to me by our brilliant DBA, the fix was simple. Move all images and js source from the database to a web server. Change all references #COMPANY_IMAGES# (sometimes referred to as #WORKSPACE_IMAGES# on some installation) in the page templates.
The performance improvement was almost instant! Within an hour the CPU usage on our system dropped by 40%! Here are the stats:
CPU Usage |
Date and Time (Today) |
Percentage reduction |
683970966 |
09/02/05 00 |
11.32 |
703065671 |
09/02/05 01 |
10.93 |
709172691 |
09/02/05 02 |
11.36 |
711003880 |
09/02/05 03 |
11.37 |
369238 |
09/02/05 04 |
19.83 |
6961506 |
09/02/05 05 |
25.88 |
9324080 |
09/02/05 06 |
19.41 |
36286117 |
09/02/05 07 |
25.12 |
66275947 |
09/02/05 08 |
31.29 |
95548315 |
09/02/05 09 |
41.38 |
|
|
|
Total CPU(Include waits) |
|
|
2030852964 |
09/02/05 00 |
25.43 |
2104019110 |
09/02/05 01 |
24.63 |
2112896205 |
09/02/05 02 |
24.76 |
2118178749 |
09/02/05 03 |
24.72 |
2029337 |
09/02/05 04 |
1.24 |
12410075 |
09/02/05 05 |
31.12 |
19783820 |
09/02/05 06 |
21.76 |
88298905 |
09/02/05 07 |
20.7 |
166585318 |
09/02/05 08 |
29.21 |
238824609 |
09/02/05 09 |
41.35 |
Unbelievable, isn’t it?
Here’s an extract of the email conversation that took place between me and our DBA for your reference:
From: Ariel
Sent: Thursday, 5 February 2009 10:58
To: Han
Subject: RE: Culprit, again
Well done Han, it looks like you nailed it!
It would be interesting to see what Oracle has to say about this issue, considering that some Oracle experts doesn't have a clue when the "dreaded code" is executed, and didn't offer any viable solution to the problem.
http://forums.oracle.com/forums/thread.jspa?messageID=3205216?
eg. Oracle Ace sspadafo said "That looks like the anonymous block built by modplsql to handle the form POST".
Oracle also said "Performance degradation is the expected behavior" and closed the following service request as a non-bug:
https://metalink.oracle.com/CSP/main/article?cmd=show&type=BUG&id=4755226
I think Oracle should reconsider addressing this as a bug considering that it has the ability to run a massive system to the ground.
Hehehe.. You should work for Oracle and make the database better!
They'll probably hire you if you post your analysis below to them.
Regards,
Ariel
_____________________________________________
From: Han
Sent: Thursday, 5 February 2009 10:31
To: Ariel
Subject: RE: Culprit, again
CPU Usage Date and time (Yesterday) CPU Usage Date and Time (Today) Percentage reduction
771304282 09/02/04 00 683970966 09/02/05 00 11.32
789369951 09/02/04 01 703065671 09/02/05 01 10.93
800036699 09/02/04 02 709172691 09/02/05 02 11.36
802199588 09/02/04 03 711003880 09/02/05 03 11.37
460544 09/02/04 04 369238 09/02/05 04 19.83
9392224 09/02/04 05 6961506 09/02/05 05 25.88
11569372 09/02/04 06 9324080 09/02/05 06 19.41
48457545 09/02/04 07 36286117 09/02/05 07 25.12
96454957 09/02/04 08 66275947 09/02/05 08 31.29
162988493 09/02/04 09 95548315 09/02/05 09 41.38
Total CPU(Include waits) Total CPU(Include waits)
2723572445 09/02/04 00 2030852964 09/02/05 00 25.43
2791755271 09/02/04 01 2104019110 09/02/05 01 24.63
2808149247 09/02/04 02 2112896205 09/02/05 02 24.76
2813795083 09/02/04 03 2118178749 09/02/05 03 24.72
2054772 09/02/04 04 2029337 09/02/05 04 1.24
18015757 09/02/04 05 12410075 09/02/05 05 31.12
25285828 09/02/04 06 19783820 09/02/05 06 21.76
111341642 09/02/04 07 88298905 09/02/05 07 20.7
235337742 09/02/04 08 166585318 09/02/05 08 29.21
407219600 09/02/04 09 238824609 09/02/05 09 41.35
_____________________________________________
From: Han
Sent: Wednesday, 4 February 2009 3:15 PM
To: Ariel
Subject: RE: Culprit, again
Ariel,
I did a bit reverse engineer on this internal ApEx API. The core of this API is part of OWA. The related OWA is inside package wpg_docload. The source code of wpg_docload is at $ORACLE_HOME/rdbms/admin/wpgdocs.sql.
Whenever the end-users download a file (images, word docus, etc.), the wpg_docload is called. Obviously we call it too frequently and causes a big performance problem.
For example, between 14:00pm and 15:00pm yesterday, we fetched more than 3000 times for each of the files below:
[oracle@puma-c-app-00 logs]$ grep access.gif junk2 | wc -l
3777
[oracle@puma-c-app-00 logs]$ grep bug.gif junk2 | wc -l
3731
[oracle@puma-c-app-00 logs]$ grep change.gif junk2 | wc -l
3360
[oracle@puma-c-app-00 logs]$ grep dmt1.jpg junk2 | wc -l
3595
[oracle@puma-c-app-00 logs]$ grep dmt2.jpg junk2 | wc -l
4006
[oracle@puma-c-app-00 logs]$ grep dmt3.jpg junk2 | wc -l
3813
[oracle@puma-c-app-00 logs]$ grep dot.gif junk2 | wc -l
3140
[oracle@puma-c-app-00 logs]$ grep htmldb_remix.js junk2 | wc -l
3729
[oracle@puma-c-app-00 logs]$ grep theme.css junk2 | wc -l
3952
[oracle@puma-c-app-00 logs]$ grep whats_new.gif junk2 | wc -l
3538
We have 2 AppServers, so the real fetches are twice. On average, they are accessed 20 times per second!
Because only one CPU can access a particular RAM address at a given time, all other CPUs have to wait. This kind of issues can't be resolved by big/better hardware. The more powerful of the host, the worse it gets. For 4 CPUs, 3 have to wait. For 8 CPUs, 7 have to wait, For 16 CPU, 15 have to wait. While the CPUs are waiting, they are spinning (controlled by a hidden parameter _spin_count in Oracle). Spinning means using 100% CPU doing nothing! This is why your host is always busy.
The solution is to move the above images from database tier to middle tier.
When the image is fetched from database, the PC can't cache it because the PC does not know if the file has changed since last fetch. (If you read the source code I mentioned above, you will see actually Oracle is fully aware of this problem in 1999/2000)
When the image is fetched from Apache, it will be cached in the PC's RAM until the file on Apache server is changed.
After we move the above images into Apache, we will never need to run the culprit API to fetch these files, the workload on database tier will be reduced dramatically. As the images are cached in the PC's RAM, the workload on middle tier will be reduced largely as well.
What do you think?
Cheers,
Han
_____________________________________________
From: Han
Sent: Tuesday, 3 February 2009 11:34 AM
To: Ariel
Subject: RE: Culprit, again
Ariel,
Yes, it's ApEx internal API, but I don't think we can blame Oracle entirely. Now we know this API is bad, we shouldn't call it.
CPU Elapsed CPU per % Total
Time (s) Time (s) Executions Exec (s) DB Time SQL Id
---------- ---------- ------------ ----------- ------- -------------
4,014 19,928 6,686 0.60 58.2 c34r978mgkrmf
This SQL was called 6,686 times in an hour. Do we really need to call it so many times?
This issue was reported in 2005 as "Severe Loss of Service" by others, Oracle closed it "not a bug". See metalink bug#4755226 for details.
This issue was occurred again in 11g ApEx 3.1.2, http://forums.oracle.com/forums/thread.jspa?messageID=3205216?
Yes, it's Oracle's API, but Oracle will not do anything about it, the only choice we have is to avoid calling it.
Cheers,
Han
_____________________________________________
From: Ariel
Sent: Tuesday, 3 February 2009 10:58 AM
To: Han
Subject: RE: Culprit, again
Han,
Well, it's Oracle ApEx's own API's that's chewing up CPU. I'm hoping that their experts would have found the problem by now and enhanced the process. Oracle 10g's PL/SQL gateway is so full of patches, that 58% of the CPU load is spent checking for potential threats because requests aren't checked on the Application Server but passed to the database directly and executed by the PL/SQL gateway. During busy period, it just doesn't cope.
When is the code executed? I don't know.. I'm guessing it's every time because every request can potentially contain malicious code injected within the URL.
Oracle 11g is a single tier system. It's mechanics would be completely different to Oracle 10g. I am only hoping.
Regards,
Ariel
_____________________________________________
From: Han
Sent: Tuesday, 3 February 2009 10:45
To: Ariel
Subject: RE: Culprit, again
Ariel,
I think we need to further isolate the issue. "This piece of SQL gets executed everytime a request is made.". We need to find out what kind of "a request" it's. As I know, not all requests will run this SQL. There are many other applications on PUMA don't run it, even DMT right now is not running it that frequently. Can you give me an example of the URL you refer to please?
Why do you think 11g ApEx 3.1 helps here? If it's network traffic, I can image, but it's CPU.
Regards,
Han
_____________________________________________
From: Ariel
Sent: Tuesday, 3 February 2009 10:34 AM
To: Han
Subject: RE: Culprit, again
Han,
This piece of SQL gets executed everytime a request is made.
The code checks for exclusion strings in the URL and halts the request if excluded strings exists in the URL.
The only way we can reduce the number of times this code gets executed is to reduce the traffic during this period (2.00pm to 3:00), or upgrade ApEx to Oracle 11g ApEx 3.1, with it's embedded Oracle Application Server, which may be less prone to these issues because the App Server is in the database.
2:00 to 3:00 pm is the busiest time of the day, because Day shift staff from 3 workcentres (Coffs, Sydney and Ballarat) are frantically booking activities off to DMT before they finish off for the day.
Regards,
Ariel
_____________________________________________
From: Han
Sent: Monday, 2 February 2009 15:49
To: Ariel
Subject: Culprit, again
Ariel,
The old issue come back again.
I checked the performance issue between 2:00pm and 3:00pm today when CPU was 100% busy all the time. The #1 SQL is:
CPU Elapsed CPU per % Total
Time (s) Time (s) Executions Exec (s) DB Time SQL Id
---------- ---------- ------------ ----------- ------- -------------
4,014 19,928 6,686 0.60 58.2 c34r978mgkrmf
So 58% of CPU usage is used by this SQL. Then I find out the SQL text:
select sql_fulltext from v$sqlstats where sql_id = 'c34r978mgkrmf'
declare
rc__ number;
simple_list__ owa_util.vc_arr;
complex_list__ owa_util.vc_arr;
begin
owa.init_cgi_env(:n__,:nm__,:v__);
htp.HTBUF_LEN := 255;
null;
null;
simple_list__(1) := 'sys.%';
simple_list__(2) := 'dbms\_%';
simple_list__(3) := 'utl\_%';
simple_list__(4) := 'owa\_%';
simple_list__(5) := 'owa.%';
simple_list__(6) := 'htp.%';
simple_list__(7) := 'htf.%';
simple_list__(8) := 'wpg_docload.%';
if ((owa_match.match_pattern('f', simple_list__, complex_list__, true))) then
rc__ := 2;
else
null;
null;
f(p=>:p);
if (wpg_docload.is_file_download) then
rc__ := 1;
wpg_docload.get_download_file(:doc_info);
null;
null;
null;
commit;
else
rc__ := 0;
null;
null;
null;
commit;
owa.get_page(:data__,:ndata__);
end if;
end if;
:rc__ := rc__;
end;
Now, it looks familiar, it rings a bell. I looked around, it's actually the same culprit we identified 2 years ago (see below email).
Any chance you can limit the times of running on this piece please?
Cheers,
Han
_____________________________________________
From: Han
Sent: Monday, 27 November 2006 9:24 AM
To: Ariel
Subject: RE:
Thanks.
Any chance you can limit the times of running on this piece please:
declare
rc__ number;
simple_list__ owa_util.vc_arr;
complex_list__ owa_util.vc_arr;
begin
owa.init_cgi_env(:n__,:nm__,:v__);
htp.HTBUF_LEN := 255;
null;
null;
simple_list__(1) := 'sys.%';
simple_list__(2) := 'dbms\_%';
simple_list__(3) := 'utl\_%';
simple_list__(4) := 'owa\_%';
simple_list__(5) := 'owa.%';
simple_list__(6) := 'htp.%';
simple_list__(7) := 'htf.%';
simple_list__(8) := 'wpg_docload.%';
if ((owa_match.match_pattern('f', simple_list__, complex_list__, true))) then
rc__ := 2;
else
null;
null;
f(p=>:p);
if (wpg_docload.is_file_download) then
rc__ := 1;
wpg_docload.get_download_file(:doc_info);
null;
null;
null;
commit;
else
rc__ := 0;
null;
null;
null;
commit;
owa.get_page(:data__,:ndata__);
end if;
end if;
:rc__ := rc__;
end;
_____________________________________________
From: Ariel
Sent: Monday, 27 November 2006 9:20 AM
To: Han
Subject: RE:
Done, the process has been modified.
Regards,
Ariel
_____________________________________________
From: Han
Sent: Monday, 27 November 2006 9:15 AM
To: Ariel
Subject: RE:
Yes, it causes performance issue, please try to run it only when necessary.
_____________________________________________
From: Ariel
Sent: Monday, 27 November 2006 9:13 AM
To: Han
Subject: RE:
Yes, it's a computation that happens everytime a page loads. Is it causing a problem? I can modify it to only run once during login.
select 1
from pay_rate
where pay_rate_id = (select pay_rate from dmt_users where user_id = :DMT_USER_ID)
and to_number(to_char(sysdate - nvl(:DMT_TIME,0),'HH24')) between start_hour and end_hour
Regards,
Ariel
_____________________________________________
From: Han
Sent: Monday, 27 November 2006 9:03 AM
To: Ariel
Subject: RE:
Does this SQL ring a bell to you please?
select 1 from pay_rate where pay_rate_id = (select pay_rate from dmt_users where user_id = :DMT_USER_ID) and to_number(to_char(sysdate - nvl(:DMT_TIME,0),'HH24')) between start_hour and end_hour
Many sessions are running the same SQL.
_____________________________________________
From: Ariel
Sent: Monday, 27 November 2006 8:29 AM
To: Han
Subject:
Importance: High
Han/John,
DMT on PUMAE is running very slowly at the moment and it appears to be getting worse. Could you please investigate at you earliest possible convenience.
Thank you,
Regards,
Ariel