Exadata, Missing Metric

Understand metrics for Exadata Storage Server is important to understand how all the software features are being used and all the details from that. Here I will discuss one case where the FC_IO_BY_R_SEC metric can show not precise values. And I will discuss one missing metric that can save a lot.
If you have doubts about metrics, you can check my post about metrics, it was an introduction, but cover some aspects of how to read and use it. You can check my other post where I show how to use metric DB_FC_IO_BY_SEC to identify database problems that can be hidden when checking only from the database side.


Metrics collection

For this post, all metrics were collected from one storage server, but the behavior occurs in all cells from Exadata, and in all versions. And they were collected using the same timestamp be accurate. So, they represent one minute, but again, the same behavior that I show here occurs every time. When reading the metrics, please look at the metric name and the collect moment.
The idea for this post came when I was investigating storage usage from one database to see if we are hitting the limits for Exadata Storage. I needed to check disk, flash cache, and other details. For the behavior I show below, I already worked with Oracle about (SR, BUG, ER).



I already discussed about this metric in a previous post, but its report (according to the official doc) the “The number of megabytes of I/O per second for this database to flash cache”. Look the usage for the database DBPR1_EXA:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name = 'DB_FC_IO_BY_SEC' and metricObjectName like 'DBPR1_EXA'

         DB_FC_IO_BY_SEC         DBPR1_EXA       850 MB/sec      2019-12-13T15:42:03+01:00



As you can see, at 15:42:03 of 2019-12-13 the storage server reported that during the last minute this database made around 850 MB/s of reading from flash cache.
And this was the highest database consumption (that used more than 0.5 MB/s):


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'DB_.*BY.*SEC' and metricvalue > 0.5

         DB_FC_IO_BY_SEC         ORADB01P                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADB02P                7 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADB01V                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADB03P                5 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORAD01P                 6 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         ORADBPR_EXA             2 MB/sec        2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         DBPR1_EXA               850 MB/sec      2019-12-13T15:42:03+01:00

         DB_FC_IO_BY_SEC         DBP01                   104 MB/sec      2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB01P                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB02P                7 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB01V                1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADB03P                5 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORAD01P                 6 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         ORADBPR_EXA             2 MB/sec        2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         DBPR1_EXA               850 MB/sec      2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         DBP01                   105 MB/sec      2019-12-13T15:42:03+01:00

         DB_FD_IO_BY_SEC         _OTHER_DATABASE_        1 MB/sec        2019-12-13T15:42:03+01:00

         DB_FL_IO_BY_SEC         DBP01                   1.150 MB/sec    2019-12-13T15:42:03+01:00

         DB_IO_BY_SEC            ORAD01P                 2 MB/sec        2019-12-13T15:42:03+01:00

         DB_IO_BY_SEC            ORADB1P_EXA             1 MB/sec        2019-12-13T15:42:03+01:00

         DB_IO_BY_SEC            DBP01                   2 MB/sec        2019-12-13T15:42:03+01:00



If I sum all the database usage from flash cache, it will be around 950 MB/s.




Since DB_FC_IO_BY_SEC reported just a small usage, I went check the metric for whole flash cache. So, I went to FC_IO_BY_R_SEC. This metric report (according to the docs): “The number of megabytes read per second from flash cache”. And since we are reading the metrics for flash cache direct, it report the number for all databases.
But, look the numbers:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name = 'FC_IO_BY_R_SEC';

         FC_IO_BY_R_SEC  FLASHCACHE      19.343 MB/sec   2019-12-13T15:42:03+01:00

So, at the same time moment that one database metric reported around 850 MB/s from FC reading, the metrics from FC reported around 19MB/s. Something it is not correct.
And even if I check the minute before and after (because maybe was a glimpse of time collection) the numbers are not quite different:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_R_SEC';

         FC_IO_BY_R_SEC  FLASHCACHE      27.519 MB/sec   2019-12-13T15:41:03+01:00

         FC_IO_BY_R_SEC  FLASHCACHE      19.343 MB/sec   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SEC  FLASHCACHE      39.600 MB/sec   2019-12-13T15:43:03+01:00



Others flash metrics

So, since the numbers appears to be complete wrong I searched for others metrics that report flash usage (disks and flash cache as example).




I started to search with the lowest level possible, going directly to the celldisk metrics. For that I used CD_IO_BY_R_LG_SEC, CD_IO_BY_R_SM_SEC (for reads), CD_IO_BY_W_LG_SEC, and CD_IO_BY_W_SM_SEC (for writes). Basically, it reports: “The rate which is the number of megabytes read in large blocks per second from a cell disk”.
Filtering just from flash disks type I had this numbers:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'CD_IO_BY.*SEC.*' and metricobjectname like 'FD_.*' and metricvalue > 0

         CD_IO_BY_R_LG_SEC       FD_00_exastradm01       179 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_LG_SEC       FD_01_exastradm01       296 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_LG_SEC       FD_02_exastradm01       200 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_LG_SEC       FD_03_exastradm01       250 MB/sec      2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_00_exastradm01       3.161 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_01_exastradm01       3.152 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_02_exastradm01       2.990 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_R_SM_SEC       FD_03_exastradm01       3.741 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_00_exastradm01       0.859 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_01_exastradm01       1.125 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_02_exastradm01       1.028 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_LG_SEC       FD_03_exastradm01       0.801 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_00_exastradm01       0.982 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_01_exastradm01       0.998 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_02_exastradm01       1.006 MB/sec    2019-12-13T15:42:03+01:00

         CD_IO_BY_W_SM_SEC       FD_03_exastradm01       0.937 MB/sec    2019-12-13T15:42:03+01:00



As you can see, just picking up the CD_IO_BY_R_LG_SEC I got around 925 MB/s, a close number from the same that I got from the database metric reading the flash. Another hint from that says to me that reads are requesting more than 128KB for each access since they were counted as large (LG).




Since I was checking for bad flash report, I checked the flash log metrics but they reported close values from cell disk metrics:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'FL_IO.*_SEC'

         FL_IO_DB_BY_W_SEC       FLASHLOG        1.974 MB/sec    2019-12-13T15:42:03+01:00

         FL_IO_FL_BY_W_SEC       FLASHLOG        2.673 MB/sec    2019-12-13T15:42:03+01:00



So, the metrics for disk usage, was OK, reporting correct values.




As you can see, the metrics from database usage from flash cache are reporting the same values that were possible to check from the closest as from hardware report (celldisk). So, maybe could be a software misreading in-memory values.
Because of that, I checked from other metrics that can report form features that interact with flash cache. So, I checked from Smart I/O:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'SIO_.*SEC.*'

         SIO_IO_EL_OF_SEC        SMARTIO         1,223 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_OF_RE_SEC        SMARTIO         34.688 MB/sec   2019-12-13T15:42:03+01:00

         SIO_IO_PA_TH_SEC        SMARTIO         0.000 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_FC_HD_SEC     SMARTIO         0.174 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_FC_SEC        SMARTIO         843 MB/sec      2019-12-13T15:42:03+01:00

         SIO_IO_RD_HD_SEC        SMARTIO         0.101 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_RQ_FC_HD_SEC  SMARTIO         0.183 IO/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RD_RQ_FC_SEC     SMARTIO         850 IO/sec      2019-12-13T15:42:03+01:00

         SIO_IO_RD_RQ_HD_SEC     SMARTIO         0.000 IO/sec    2019-12-13T15:42:03+01:00

         SIO_IO_RV_OF_SEC        SMARTIO         3.392 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_SI_SV_SEC        SMARTIO         362 MB/sec      2019-12-13T15:42:03+01:00

         SIO_IO_WR_FC_SEC        SMARTIO         0.008 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_WR_HD_SEC        SMARTIO         0.000 MB/sec    2019-12-13T15:42:03+01:00

         SIO_IO_WR_RQ_FC_SEC     SMARTIO         0.017 IO/sec    2019-12-13T15:42:03+01:00

         SIO_IO_WR_RQ_HD_SEC     SMARTIO         0.000 IO/sec    2019-12-13T15:42:03+01:00



And as you can see the SIO_IO_RD_FC_SEC (that means “The number of megabytes per second read from flash cache by smart I/O”) reported almost the same that value (843 MB/s) from the database and cell disk.


The Missing Metric


As you can see above, the metrics close from HW reported almost the same that was reported by DB. And from features usage, almost the same too. To be more clear is that:



So, it is not miss reading from memory since the Smart I/O report correctly. Returning the review for flash cache, all metrics are:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:42:00+01:00' and collectionTime < '2019-12-13T15:43:00+01:00' and name like 'FC_IO_BY.*' and metricvalue > 0;

         FC_IO_BYKEEP_W                          FLASHCACHE      519 MB                          2019-12-13T15:42:03+01:00

         FC_IO_BY_DISK_WRITE                     FLASHCACHE      111,626,264 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_DISK_WRITE_SEC                 FLASHCACHE      2.052 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R                              FLASHCACHE      1,475,538,050 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_R_ACTIVE_SECONDARY             FLASHCACHE      1,132,686 MB                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_ACTIVE_SECONDARY_MISS        FLASHCACHE      237,442 MB                      2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DISK_WRITER                  FLASHCACHE      114,172,096 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DISK_WRITER_SEC              FLASHCACHE      2.057 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DW                           FLASHCACHE      13,071,021,100 MB               2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS                         FLASHCACHE      42,381,380 MB                   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS_DW                      FLASHCACHE      133,613,408 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS_SEC                     FLASHCACHE      0.438 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SEC                          FLASHCACHE      19.343 MB/sec                   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SKIP                         FLASHCACHE      97,002,568 MB                   2019-12-13T15:42:03+01:00

         FC_IO_BY_W                              FLASHCACHE      1,761,639,940 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_W_DISK_WRITER                  FLASHCACHE      111,615,088 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_DISK_WRITER_SEC              FLASHCACHE      2.052 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_FIRST                        FLASHCACHE      183,171,872 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_FIRST_SEC                    FLASHCACHE      2.452 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_OVERWRITE                    FLASHCACHE      1,475,454,720 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_W_OVERWRITE_SEC                FLASHCACHE      1.346 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_POPULATE                     FLASHCACHE      103,121,912 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_POPULATE_SEC                 FLASHCACHE      0.381 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SEC                          FLASHCACHE      4.179 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP                         FLASHCACHE      1,039,399,810 MB                2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP_LG                      FLASHCACHE      605,535,040 MB                  2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP_LG_SEC                  FLASHCACHE      2.695 MB/sec                    2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP_SEC                     FLASHCACHE      6.589 MB/sec                    2019-12-13T15:42:03+01:00



As you can above it is not the case for FC_IO_BY_R_SKIP (“The number of megabytes read from disks for I/O requests that bypass flash cache”) and FC_IO_BY_R_MISS_SEC (“The number of megabytes read from disks per second because not all requested data was in flash cache”). And even if we consider some minutes before and after:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_R_SKIP';

         FC_IO_BY_R_SKIP         FLASHCACHE      97,002,568 MB   2019-12-13T15:41:03+01:00

         FC_IO_BY_R_SKIP         FLASHCACHE      97,002,568 MB   2019-12-13T15:42:03+01:00

         FC_IO_BY_R_SKIP         FLASHCACHE      97,002,568 MB   2019-12-13T15:43:03+01:00

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_W_SKIP';

         FC_IO_BY_W_SKIP         FLASHCACHE      1,039,399,420 MB        2019-12-13T15:41:03+01:00

         FC_IO_BY_W_SKIP         FLASHCACHE      1,039,399,810 MB        2019-12-13T15:42:03+01:00

         FC_IO_BY_W_SKIP         FLASHCACHE      1,039,401,150 MB        2019-12-13T15:43:03+01:00

CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name like 'FC_IO_BY_.*_MISS.*SEC' and  metricvalue > 0

         FC_IO_BY_R_MISS_SEC     FLASHCACHE      0.702 MB/sec    2019-12-13T15:41:03+01:00

         FC_IO_BY_R_MISS_SEC     FLASHCACHE      0.438 MB/sec    2019-12-13T15:42:03+01:00

         FC_IO_BY_R_MISS_SEC     FLASHCACHE      0.521 MB/sec    2019-12-13T15:43:03+01:00



But if we look closely the metrics from FC above, we can see the high value for FC_IO_BY_R_DW (“The number of megabytes of DW data read from flash cache”). And if we check some minutes we can see the usage:


CellCLI> list metrichistory where collectionTime > '2019-12-13T15:41:00+01:00' and collectionTime < '2019-12-13T15:44:00+01:00' and name = 'FC_IO_BY_R_DW';

         FC_IO_BY_R_DW   FLASHCACHE      13,070,966,800 MB       2019-12-13T15:41:03+01:00

         FC_IO_BY_R_DW   FLASHCACHE      13,071,021,100 MB       2019-12-13T15:42:03+01:00

         FC_IO_BY_R_DW   FLASHCACHE      13,071,055,900 MB       2019-12-13T15:43:03+01:00



From that, we can see that at 15:41 the usage was 13,070,966,800 MB and at 15:42 the usage was 13,071,021,100 MB. And if we do the diff, between this minute the difference was 54300 MB. And if we divide by each sec we have 54300/60 = 905 MB/s.
So, the missing values were found, the flash cache read usage was counted as DW. But if you followed everything, we wad legitim reads from flash cache that was not counted at FC_IO_BY_R_SEC. So, the missing metric, in this case, it is FC_IO_BY_R_DW_SEC.




The reason why the read was counted at FC_IO_BY_R_DW and not by the generic FC_IO_BY_R_SEC is not clear. Maybe because the block size of tablespace for the database is different than 8K, maybe the requests were large (more than 128 KB are we saw in cell disk metric). It is not clear why. Or maybe FC_IO_BY_R_SEC just report what was really read by flash cache from the database. It can occur that documentation is not clear and some details why this behavior occurs need to be clarified.
But we can see that FC_IO_BY_R_SEC don’t reflect the current usage in some cases. If you want to verify the FC usage, the metric FC_IO_BY_R_SEC can, sometimes, report wrong values. If the other metrics for FC are correct (like SKIP or MISS) was impossible to check at this point. One option is check directly at celldisk for flash disk usage to verify if you reached the limits for flash usage or check directly from database usage.

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community.”