
This is the third article in my 12c Adaptive Dynamic Sampling series. In the previous parts we saw how ADS works for a single table (including the GROUP BY clause) and for joins. As you know, join cardinality is calculated (Jonathan Lewis explains this in his book) as

Join Selectivity =
        ((num_rows(t1) - num_nulls(t1.c1)) / num_rows(t1)) *
        ((num_rows(t2) - num_nulls(t2.c2)) / num_rows(t2)) /
        greater(num_distinct(t1.c1), num_distinct(t2.c2))

Join Cardinality =
        Join Selectivity *
        filtered cardinality(t1) * filtered cardinality(t2)

So to estimate join cardinality you need to estimate three factors: the cardinality of each of the two tables involved in the join, and the join selectivity. Can this mechanism benefit from dynamic sampling? Estimating the cardinality of both tables can obviously benefit from DS, but what about join selectivity? The answer is yes. Using dynamic sampling, the database can estimate or calculate column statistics: the number of distinct values and the number of null values. That information is enough to estimate join cardinality. In the previous articles we used the t2 and t3 tables. In this part we will use the same tables, but with larger sizes than before. To increase the segment sizes I ran several insert into ... select from statements, so our case is:

SEGMENT_NAME       MB (user_segments)
------------       -------
T2                 3110.00
T3                 3589.00

TABLE_NAME     NUM_ROWS       BLOCKS (user_tables)
----------     --------       ------
T3             14208000       231391
T2             11810048       204150

TABLE_NAME  COLUMN_NAME  NUM_DISTINCT  HISTOGRAM (user_tab_col_statistics)
----------  -----------  ------------  ---------
T2          OBJECT_NAME  53744         NONE
T3          TABLE_NAME   9004          NONE

TABLE_NAME         STALE_STATS (user_tab_statistics)
------------       -----------
T2                 YES
T3                 YES
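
Plugging these dictionary statistics into the formula above shows what the optimizer would estimate from the dictionary alone. A quick sanity check of my own (assuming no NULLs in either join column, so both NULL factors equal 1):

-- join selectivity = 1 / greatest(num_distinct values), no NULL adjustment needed
select round(11810048 * 14208000 / greatest(53744, 9004)) as join_card
from dual;
-- => roughly 3.12e9 rows with the original num_distinct = 53744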

And we will examine the following query:

select  
     count(*)
from
      t2
     ,t3
where
     --t2.owner=t3.owner and
     t2.object_name=t3.table_name
 

Now I have decided to set the NDV of t2.object_name to 15000; its actual value is 53744. This means the number of distinct values in the dictionary will be less than the actual number of distinct values. The statistics are stale, but num_distinct = 53744 still reflects reality, because repeated insert into ... select from statements do not change the number of distinct values of the object_name column. But why did I decide to change the column statistic? You will see shortly. Let me make the change:

DECLARE
  l_distcnt  NUMBER;
  l_density  NUMBER;
  l_nullcnt  NUMBER;
  l_srec     DBMS_STATS.StatRec;
  l_avgclen  NUMBER;
BEGIN
  -- read the current column statistics from the dictionary
  DBMS_STATS.get_column_stats (
    ownname => 'sh',
    tabname => 't2',
    colname => 'object_name',
    distcnt => l_distcnt,
    density => l_density,
    nullcnt => l_nullcnt,
    srec    => l_srec,
    avgclen => l_avgclen);

  -- override NDV and density (density = 1/NDV when there is no histogram)
  l_distcnt := 15000;
  l_density := 1/15000;

  -- write the modified statistics back to the dictionary
  DBMS_STATS.set_column_stats (
    ownname => 'sh',
    tabname => 't2',
    colname => 'object_name',
    distcnt => l_distcnt,
    density => l_density,
    nullcnt => l_nullcnt,
    srec    => l_srec,
    avgclen => l_avgclen);
END;
/
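
After running the block, the change can be verified straight from the dictionary (the same user_tab_col_statistics view quoted above):

select num_distinct, density
from user_tab_col_statistics
where table_name = 'T2' and column_name = 'OBJECT_NAME';
-- should now show NUM_DISTINCT = 15000 and DENSITY = 1/15000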

First I want to note that here Oracle completely ignores the dynamic sampling statistics when computing the cardinality estimates. For the Q3 query, the 10046 trace file shows the following (the same is done for both tables, but I provide only the statement for the T3 table):

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
  optimizer_features_enable(default) no_parallel  */ sum(vsize(C1))/count(*) ,
   substrb(dump(max(substrb(C2,1,32)), 16,0,32), 1,120) ,
  substrb(dump(min(substrb(C3,1,32)), 16,0,32), 1,120) , SUM(C4),
  COUNT(DISTINCT C5)
FROM
 (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3")  */ "T3"."TABLE_NAME"
  AS C1, "T3"."TABLE_NAME" AS C2, "T3"."TABLE_NAME" AS C3, CASE WHEN
  ("T3"."TABLE_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T3"."TABLE_NAME" AS
  C5 FROM "T3" SAMPLE BLOCK(0.174483, 8) SEED(1)  "T3") innerQuery

Sometimes it decides to increase the sample size (generally doubling it), like below:

SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring
  optimizer_features_enable(default) no_parallel  */ sum(vsize(C1))/count(*) ,
   substrb(dump(max(substrb(C2,1,32)), 16,0,32), 1,120) ,
  substrb(dump(min(substrb(C3,1,32)), 16,0,32), 1,120) , SUM(C4),
  COUNT(DISTINCT C5)
FROM
 (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3")  */ "T3"."TABLE_NAME"
  AS C1, "T3"."TABLE_NAME" AS C2, "T3"."TABLE_NAME" AS C3, CASE WHEN
  ("T3"."TABLE_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T3"."TABLE_NAME" AS
  C5 FROM "T3" SAMPLE BLOCK(0.348966, 8) SEED(2)  "T3") innerQuery

What do these DS_SVC statements mean? They are used to estimate/calculate statistics for the column involved in the join (the same approach can also be applied to a filter predicate). This is a third method of estimating join cardinality. Here COUNT(DISTINCT C5) indicates the number of distinct values of the join (or filter) column, and SUM(C4) is the number of null values. I removed all columns from that SQL except the NDV, this time for the t2 table:

SQL> SELECT
  2    COUNT(DISTINCT C5) as num_dist
  3  FROM
  4   (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T2")  */ "T2"."OBJECT_NAME"
  5    AS C1, "T2"."OBJECT_NAME" AS C2, "T2"."OBJECT_NAME" AS C3, CASE WHEN
  6    ("T2"."OBJECT_NAME" IS NULL) THEN 1 ELSE 0 END AS C4, "T2"."OBJECT_NAME" AS
  7    C5 FROM "T2" SAMPLE BLOCK(0.402778, 8) SEED(2)  "T2") innerQuery;

  NUM_DIST
----------
     37492
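
For comparison, the true NDV can be obtained with a full (and, on a ~3 GB segment, expensive) scan; per the discussion above it should return 53744:

select count(distinct object_name) as true_ndv
from t2;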

Recall that I updated the NDV of that column to 15000, and the plan was:

--------------------------------------------
| Id  | Operation           | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |
|   1 |  SORT AGGREGATE     |      |     1 |
|*  2 |   HASH JOIN         |      |    11G|
|   3 |    TABLE ACCESS FULL| T2   |    11M|
|   4 |    TABLE ACCESS FULL| T3   |    14M|
--------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T2"."OBJECT_NAME"="T3"."TABLE_NAME")
Note
-----
   - Dynamic statistics used: dynamic sampling (level=AUTO)

And from the 10053 trace file:

Best NL cost: 2037555914706.766602
          resc: 2037555914706.766602  resc_io: 2030576990663.999756  resc_cpu: 75477063522522944
          resp: 2037555914706.766602  resp_io: 2030576990663.999756  resc_cpu: 75477063522522944
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = JOIN
Join Card:  11186477465.600000 = outer (11810048.000000) * inner (14208000.000000) * sel (6.6667e-05)
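
The selectivity 6.6667e-05 is exactly 1/15000, i.e. the modified dictionary NDV, and the arithmetic is easy to reproduce:

select 1/15000 as sel,
       11810048 * 14208000 / 15000 as join_card
from dual;
-- sel = 6.6667e-05, join_card = 11186477465.6, matching the trace line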

As you can see, the optimizer did not use the number of distinct values computed via sampling; instead it used num_distinct from the dictionary. The optimizer has this mechanism, but it does not use it as expected: the dictionary statistics are STALE, and the sampled number of distinct values (37492) is greater than dba_tab_col_statistics.num_distinct (15000). In other words, at this moment DS gives us more accurate information than the dictionary, but the optimizer ignores that fact. This should not happen.
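
Had the optimizer plugged the sampled NDV into the same formula (my assumption; the trace gives no sign that it ever does), the estimate would have been noticeably lower and closer to reality:

select round(11810048 * 14208000 / 37492) as hypothetical_card
from dual;
-- ~4.48e9 rows instead of the ~11.19e9 produced with num_distinct = 15000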

Conclusion

As you know, ADS helps when your tables are small; otherwise it may not help and can even hurt. Depending on the size of the tables involved in the join and on the predicate type (join/filter), dynamic sampling can be completely ignored. Also, in some cases the optimizer tries to estimate column statistics for join selectivity, but it still does not take advantage of them. I hope some of these opportunities will be improved or fixed in future releases. In addition, ADS increases statement parse time, so it can produce additional concurrency overhead (latches/mutexes) in an OLTP environment.