Maybe when you die you come before a big bearded man on a big throne and you say “Is this heaven?” And he says “Heaven? You just came from there.” — Kirk Douglas

You are here: Home / How to do Percentile Ranking in Oracle

How to do Percentile Ranking in Oracle

June 26, 2007

Recently I had to provide a script to convert a dataset of raw assessment scores into an Oracle table with the scores ordered by percentile rank. This is a common request so I figured a short, non-technical post on percentile ranking might be helpful to a lot of people.

There are three main ways to calculate percentile ranks in Oracle 9i or later:

(a) Calculate them manually with Joins and Sub-queries;
(b) Use the CUME_DIST function and format the results as a percentage;
(c) Use the PERCENT_RANK function.

In this post I’m going to focus on options (b) and (c) because the only people who would reinvent the wheel and use (a) are SQL programmers who don’t know (b) and (c) already exist.

CUME_DIST and PERCENT_RANK are built-in Oracle mathematical functions that allow you to rank a value based on its relative standing within a set of values. For example, if you hear someone say that a 1600 SAT score was in the 99th percentile (meaning 99% of all the other scores in that administration of the test were lower) a ranking formula is what tells you so.

The two functions take different approaches to determining a percentile rank and you should understand the basic difference.

CUM_DIST determines a percentile rank by calculating the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition.

PERCENT_RANK determines a percentile rank by setting the lowest value (that is, the first row returned by the query) equal to 0 and assigning all the remaining rows with this formula:

(n-1)/(m-1) where n is the nth row in a partition of m records.

There are other differences between the two functions but a detailed review is way beyond the promised scope of this post. However, you will do well to remember four points:

  • The two functions are similar, but (unlike many books lead you to believe) they are not identical. That’s why they return different answers when you use them side-by-side in the same query.
  • CUME_DIST returns a position of a row and PERCENT_RANK returns a rank of a row.
  • CUME_DIST always excludes 0 and PERCENT_RANK always includes it.
  • In most cases where your goal is to rank records by percentile, PERCENT_RANK is the function you want to use.

Both functions can be used in two forms: aggregate or analytic. Use the aggregate form when you want to find the percentile rank of one particular record in the database according to some criteria you specify. Use the analytic form when you want to find the percentile ranks of a group of records in the database. You can spot the aggregate form because the SELECT statement will contain a WITHIN GROUP outer table join. The analytic form will use a PARTITION BY clause instead.

The aggregate form uses this syntax:

PERCENT_RANK (expression) WITHIN GROUP
(ORDER BY order_by_clause [ASC|DESC] [NULLS FIRST|LAST] );

Here’s a simple example:

SELECT PERCENT_RANK (100000000, 1000000) WITHIN GROUP (ORDER BY total_gross, star_salary) "Percentile Rank" from movies WHERE movie_year IN ('2006');

The above SQL statement will return (from a table called Movies) the percentile rank of a particular 2006 movie that grossed $100 million and paid its starring actor $1 million.

The analytic form uses this syntax:

PERCENT_RANK () OVER
([PARTITION BY query_partition_clause] ORDER BY order_by_clause);

Here’s an example of that:

SELECT movie_name, movie_year, movie_type, total_gross,
PERCENT_RANK () OVER (PARTITION BY movie_type
ORDER BY total_gross DESC) "Percentile Rank"
FROM movies
WHERE movie_year IN ('2006');

The above SQL statement will return a table listing the percentile rankings of all movies released in 2006 according to their total box office gross.

As a closing note, remember that when using the aggregate form the number and datatypes of expressions inside the first parenthesis must match the number and datatypes of expressions inside the second parenthesis.

Now go rank some data.

Posted in Data Analysis, Jamel Cato, Oracle Tagged CUME_DIST, Data Analysis, Oracle, Oracle script, PERCENT_RANK
Next →
Cato's Life of the Mind is the personal site of Jamel Cato, a writer and database developer who can usually be found between the pages of a good book when he's not laughing with his kids. | More about me

Categories

  • Art
  • Books
  • Business Intelligence
  • Data Analysis
  • Excel
  • Gadgets
  • Life on the Web
  • Microsoft Access
  • Movies
  • Music
  • Oracle
  • Pleasure Reading
  • Reporting Tools
  • SSIS

Copyright © 2012 Cato's Life of the Mind.

Powered by WordPress, Trending, and Cato's Life of the Mind.

Hide

Home  | Contact Me  |  Sitemap  |  Legal

© 2012   Jamel Cato | All Rights Reserved