Saturday, 19 January 2013

Histograms statistics not accurate with Auto Sample Size

I was reading that AUTO sampling in 11g generates statistics that are as accurate as 100% sampling but takes less time than 100% sampling. I decided to look at the quality of these statistics. Are they good for the optimizer ?

SQL> create table sales (order_key number) ;

Table created.


Insert few records.

SQL> begin
     for i in 1..10000
     loop
     insert into sales values(10);
     insert into sales values(20);
     insert into sales values(30);
     insert into sales values(40);
     insert into sales values(50);
     end loop;
     end ;
      /

I inserted five distinct values (10,20,30,40,50) each a 10,000 time.

Now take statistics with AUTO_SAMPLE_SIZE

SQL> exec dbms_stats.gather_table_stats(user,'SALES',method_opt=>'for all columns size 254');

This command completed in just 30 seconds. WOW !!

Let us look at the quality statistics now.

SQL> select num_rows,sample_size from user_tables where table_name='SALES';
   
  NUM_ROWS SAMPLE_SIZE
  ----------  -----------
   50000       50000

SQL> select num_distinct from user_tab_col_statistics where table_name='SALES';

  NUM_DISTINCT
  --------------------
        5

It tells me that the sample size used was 100% and we see the correct number of num_rows and num_distinct. 

These statistics are 100% accurate. Takes very little time. What is wrong then ? 

Now let us look at the sample used to generate column statistics (histograms)

When I use method_opt => 'for all columns size 254' , I instruct Oracle to create a histogram on every column

SQL> select sample_size from user_tab_col_statistics where table_name='SALES'
           and column_name='ORDER_KEY' ;

     SAMPLE_SIZE
     -------------------
        5450

OK. So Oracle used different samples to collect different statistics. It looks like NUM_ROWS and NUM_DISTINCT was derived from 100% sample but histogram was computed by sampling only 5450 rows ! I ran the same tests with 10 million rows, column sample size was still close to ~ 5550. Is it something that is hard-coded into Oracle's code ?

We create histograms to make optimizer aware of the real data distribution. Of course, quality of this particular statistics might be very very bad given this small sample size.

So, why present the optimizer with histograms when it is not even close to real ? 

No comments:

Post a Comment

19c Multitenant RAC- Default service disables parallelism

We recently upgraded one of our RAC databases from 11.2 to 19.10 on Exadata. And to everyone's surprise, almost everything on the system...