Tables - Random Sampling
A valuable data extraction technique is to present a random subset of records from a table. Selecting the first 1000 rows of a table is hardly random and may not be indicative of the cross section of data within a large table. Selecting records that match one or more predefined key values is even less useful.
Selecting records from the entire table based on a random row identifier within each Oracle data block will produce a more appropriate representation of the data. The following example demonstrates the SQL required to select the tenth row from each Oracle block.
SELECT . . .
FROM emp_history
WHERE SUBSTR(ROWID,6,4) = '0010'
This technique can be taken one step further to introduce a level of randomness into the actual row selection. Tying the record identifier to the current time allows us to repeat the exercise with varying results.
SELECT . . .
FROM emp_history
WHERE SUBSTR(ROWID,10,4) = TO_CHAR(SYSDATE,'"00"ss')
Note that as of Oracle 8, you must use the DBMS_ROWID package to translate the new extended rowid into the old format. For example:
SELECT 'INSERT INTO temp_size_table'||&&cr||
'SELECT '||&&qt||segment_name||&&qt||&&cr||
',COUNT( DISTINCT( SUBSTR( ROWID,1,8))) blocks'||&&cr||
'FROM &&owner..'||segment_name, ';'
FROM dba_segments
WHERE segment_type ='TABLE' AND owner = upper('&owner');
becomes
SELECT 'INSERT INTO temp_size_table'||&&cr||
'SELECT '||&&qt||segment_name||&&qt||&&cr||
',COUNT( DISTINCT(
SUBSTR( dbms_rowid.rowid_to_restricted(ROWID,1),1,8))) blocks'||&&cr||
'FROM &&owner..'||segment_name, ';'
FROM dba_segments
WHERE segment_type ='TABLE' AND owner = upper('&owner');
In Oracle 8i there is also a sample clause for use with SQL SELECT to do randomized sampling which simplifies the process further. In the example below, the BLOCK keyword instructs Oracle to perform random block sampling instead of random row sampling:
SELECT COUNT(*) * 100 FROM emp SAMPLE BLOCK (1);
Data sampling is a very important issue for very large databases. The accuracy of your results can be directly dependant on the precision of your data samples.