<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title> &#187; Data Analysis</title>
	<atom:link href="http://jamelcato.com/category/data-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://jamelcato.com</link>
	<description>The Personal Site of Jamel Cato</description>
	<lastBuildDate>Sun, 11 Apr 2010 04:17:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=4386</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>An Easy Way to Master INDEX/MATCH Formulas</title>
		<link>http://jamelcato.com/an-easy-way-to-master-indexmatch-formulas-in-excel/</link>
		<comments>http://jamelcato.com/an-easy-way-to-master-indexmatch-formulas-in-excel/#comments</comments>
		<pubDate>Mon, 01 Sep 2008 21:15:47 +0000</pubDate>
		<dc:creator>jamel</dc:creator>
				<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Excel]]></category>
		<category><![CDATA[Jamel Cato]]></category>
		<category><![CDATA[Advanced Excel Techniques]]></category>
		<category><![CDATA[INDEX()]]></category>
		<category><![CDATA[INDEX/MATCH]]></category>
		<category><![CDATA[MATCH()]]></category>

		<guid isPermaLink="false">http://jamelcato.com/an-easy-way-to-master-indexmatch-formulas/</guid>
		<description><![CDATA[At least once a month I use an INDEX/MATCH formula to match and merge patient data from multiple Excel files. I wrote this post because when I first sought to learn the technique I found the other tutorials on the web either lacking or hard-to-follow.

If you’re reading this, chances are you have strong Excel skills [...]]]></description>
			<content:encoded><![CDATA[<p>At least once a month I use an INDEX/MATCH formula to match and merge patient data from multiple Excel files. I wrote this post because when I first sought to learn the technique I found the other tutorials on the web either lacking or hard-to-follow.</p>
<p><span id="more-32"></span></p>
<p>If you’re reading this, chances are you have strong Excel skills and already know what INDEX/MATCH formulas do. For the rest of you, here’s a short introduction:</p>
<p>INDEX/MATCH formulas, created by combining Excel’s built-in INDEX function and its built-in MATCH function into a single compound formula, are ideal when you need to:</p>
<ul>
<li>Merge data from one Excel list into another Excel list by matching records from the two lists; or</li>
<li>Use a common field from two Excel lists to look up a second (or third or fourth) field by matching records from the two lists.</li>
</ul>
<p>For instance, suppose you had two Excel worksheets for the same group of customers. The first worksheet contains columns for Customer ID and Email Address. The second worksheet contains columns for Customer ID, Phone Number and Age. With Customer ID as the common column, you could use an INDEX/MATCH formula to add each customer’s phone number and age to the email worksheet.</p>
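<p>For SQL experts, you can think of INDEX/MATCH formulas as a way to use Excel to do joins. Strictly speaking, the result is closer to a left outer join than an inner join, because rows with no match stay in the list and show #N/A.</p>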
<blockquote><p><em><span style="text-decoration: underline;">Quick Sidebar</span></em></p>
<p>At this point, someone is undoubtedly thinking: I could do the same thing faster in Microsoft Access with a lookup query in Design View or in Crystal Reports with the link tab of the Database Expert. You are probably correct, but this post is intended for everyday users who only have or know Microsoft Excel or situations where setting up an Access DB or a new Crystal Report is just not warranted. But I digress.</p></blockquote>
<p>A standard INDEX/MATCH formula is written like this:</p>
<p align="center"><code>Index( value_array, Match( lookup_value, lookup_array, match_type ), column_number )</code></p>
<p>The MATCH portion returns a <em>position</em> in a list. The INDEX portion returns a <em>value</em> in a cell. So combining them allows you to look up a value in a cell based on the position of an item in a list. (What the formula actually does is use a MATCH function as the second argument of an INDEX function.)</p>
<p><span style="text-decoration: underline;">Here’s the Trick</span></p>
<p>Instead of trying to digest all of the above, just rewrite the formula in the following way and replace the double-bracketed portions with your actual data or cell references.</p>
<p align="center"><code> =INDEX([[find this kind of value]],MATCH([[for this cell within the **first** list]], [[with a match within this **second** list]],0))</code></p>
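<p>To make the two steps concrete, here is a minimal Python sketch of what the formula does for the customer example. It is an illustration rather than how Excel is implemented; the sample data and the names <code>match_exact</code>, <code>index</code> and <code>index_match</code> are hypothetical.</p>
<pre>
```python
# Hypothetical data mirroring the two-worksheet customer example.
email_sheet_ids = ["C003", "C001", "C002"]                  # first list: IDs to look up
phone_sheet_ids = ["C001", "C002", "C003"]                  # second list: lookup_array
phone_sheet_phones = ["555-0101", "555-0102", "555-0103"]   # value_array


def match_exact(lookup_value, lookup_array):
    """Like MATCH(lookup_value, lookup_array, 0): 1-based position, or None for #N/A."""
    for position, candidate in enumerate(lookup_array, start=1):
        if candidate == lookup_value:
            return position
    return None


def index(value_array, position):
    """Like INDEX(value_array, position): the value at a 1-based position."""
    return value_array[position - 1]


def index_match(value_array, lookup_value, lookup_array):
    """MATCH runs first and hands its position to INDEX."""
    position = match_exact(lookup_value, lookup_array)
    return None if position is None else index(value_array, position)


# One lookup per row of the email worksheet, like filling the formula down a column.
merged = [index_match(phone_sheet_phones, cid, phone_sheet_ids) for cid in email_sheet_ids]
print(merged)  # ['555-0103', '555-0101', '555-0102']
```
</pre>
<p>An ID that does not appear in the second list comes back as <code>None</code>, which plays the role of #N/A in the worksheet.</p>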
<p>A few parting notes that might be additionally helpful:</p>
<ul>
<li>The MATCH portion of the formula is processed before the INDEX portion.</li>
<li>If you plan to use AutoFill to copy the formula down a column, ensure that the lookup array is either a named range or an absolute reference to a range.</li>
<li>You can refer to an entire column (for example, B:B) as the lookup array for the MATCH function, but an exact cell range is usually the safer choice because it keeps header cells and stray entries out of the search.</li>
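<li>The 0 at the end of the MATCH portion is one of three possible choices (1, 0, -1). 0 means find an exact match. 1 means find the largest value that is less than or equal to the lookup value, and it requires the lookup array to be sorted in ascending order. -1 means find the smallest value that is greater than or equal to the lookup value, and it requires the lookup array to be sorted in descending order. The argument is optional, but if you omit it, it defaults to 1, which is almost never what you want when matching records, so always type the 0.</li>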
</ul>
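<p>The three match types are easier to compare side by side. The following Python sketch mimics their documented behaviour on sorted lists; the function names and the sample cutoffs are hypothetical, and it assumes the lists are already sorted the way each match type expects.</p>
<pre>
```python
import bisect


def match_exact(lookup_value, lookup_array):
    """match_type 0: 1-based position of the first equal item, else None (#N/A)."""
    if lookup_value in lookup_array:
        return lookup_array.index(lookup_value) + 1
    return None


def match_ascending(lookup_value, ascending_array):
    """match_type 1: position of the largest item not above lookup_value.

    Assumes ascending_array is sorted in ascending order.
    """
    position = bisect.bisect_right(ascending_array, lookup_value)
    return position or None  # 0 means every item is larger, so no match


def match_descending(lookup_value, descending_array):
    """match_type -1: position of the smallest item not below lookup_value.

    Assumes descending_array is sorted in descending order.
    """
    position = sum(1 for item in descending_array if item >= lookup_value)
    return position or None  # 0 means every item is smaller, so no match


cutoffs = [0, 50, 100]
print(match_exact(75, cutoffs))            # None: 75 is not in the list
print(match_ascending(75, cutoffs))        # 2: the item 50
print(match_descending(75, [100, 50, 0]))  # 1: the item 100
```
</pre>
<p>Only the exact version refuses to return a neighbouring value, which is why it is the one to use for merging records by ID.</p>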
]]></content:encoded>
			<wfw:commentRss>http://jamelcato.com/an-easy-way-to-master-indexmatch-formulas-in-excel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Missing Data Techniques for Dummies</title>
		<link>http://jamelcato.com/missing-data-techniques-for-dummies/</link>
		<comments>http://jamelcato.com/missing-data-techniques-for-dummies/#comments</comments>
		<pubDate>Tue, 03 Jul 2007 13:56:46 +0000</pubDate>
		<dc:creator>jamel</dc:creator>
				<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Jamel Cato]]></category>
		<category><![CDATA[Dealing with missing data]]></category>
		<category><![CDATA[Mean Imputation]]></category>
		<category><![CDATA[Missing Data]]></category>
		<category><![CDATA[Missing Data Techniques]]></category>
		<category><![CDATA[Multiple Imputation]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://jamelcato.com/missing-data-techniques-for-dummies/</guid>
		<description><![CDATA[This is not another article explaining various missing data techniques. This is a post about how to use them without getting an advanced degree in statistics or charming that nice young Data Analyst into doing it for you.

If you Google &#8220;missing data&#8221; you will be barraged with complicated statistical techniques. That&#8217;s par for the course, [...]]]></description>
			<content:encoded><![CDATA[<p>This is not another article explaining various missing data techniques. This is a post about how to use them without getting an advanced degree in statistics or charming that nice young Data Analyst into doing it for you.</p>
<p><span id="more-9"></span></p>
<p>If you Google &#8220;missing data&#8221; you will be barraged with complicated statistical techniques. That&#8217;s par for the course, but none of these sites ever seem to tell you how to actually implement these techniques in the real world with real data. So I&#8217;ll give it a try.</p>
<p><span style="text-decoration: underline;"><br />
<strong>Not that it&#8217;s best, but this is the approach that I use:</strong></span></p>
<ul>
<li>If less than 5% of data points are missing, I use plain old <em>Listwise Deletion</em>.</li>
<li>If between 5% and 10% of data points are missing, I use <em>Mean Imputation</em>. Yes, I know it artificially shrinks the variance and understates standard errors. But since such a small amount of data is missing, I can live with it.</li>
<li>If more than 10% of data points are missing and I&#8217;m confident the missing data are MAR or MCAR, then I use <em>Multiple Imputation</em>.</li>
<li>If more than 10% of data points are missing and I believe the missing data are NMAR, then I pull out the big guns and use <em>Heckman Selection Modeling</em>.</li>
<li>If the missing variable is Race, I simply assign the record to the &#8220;Other&#8221; category and forget about it.</li>
<li>If the data are longitudinal and it&#8217;s not the first wave, then I just use the mean of the subject&#8217;s previous observations and forget about it.</li>
</ul>
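<p>The rules of thumb in this list can be written down as a small decision function. This is only a sketch: the function name, its parameters and the returned labels are hypothetical, and two details are assumptions the list does not settle, namely that the two special cases are checked first and that exactly 10% is treated like more than 10%.</p>
<pre>
```python
def choose_technique(fraction_missing, mechanism="MAR", variable=None, later_wave=False):
    """Return the technique suggested by the rules of thumb.

    fraction_missing: share of data points missing, from 0.0 to 1.0.
    mechanism: "MCAR", "MAR" or "NMAR".
    variable, later_wave: flags for the two special cases, checked first
    (that ordering is an assumption of this sketch).
    """
    if variable == "race":
        return "assign to Other"
    if later_wave:
        return "mean of previous observations"
    if fraction_missing < 0.05:
        return "listwise deletion"
    if fraction_missing < 0.10:
        return "mean imputation"
    # Assumption: exactly 10% is handled like "more than 10%".
    if mechanism in ("MAR", "MCAR"):
        return "multiple imputation"
    return "Heckman selection modeling"


print(choose_technique(0.03))                      # listwise deletion
print(choose_technique(0.07))                      # mean imputation
print(choose_technique(0.25, mechanism="MCAR"))    # multiple imputation
print(choose_technique(0.25, mechanism="NMAR"))    # Heckman selection modeling
```
</pre>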
<p><span style="text-decoration: underline;"><br />
<strong>And here&#8217;s how I go about it:</strong></span></p>
<p>For Listwise Deletion, I import the data into an Excel worksheet, then use AutoFilter to select the records with blank variables and delete those rows.</p>
<p>For Mean Imputation, I use Excel&#8217;s AVERAGE function to calculate the mean and then use an IF formula to insert that mean into all missing records. If your dataset has tens of thousands of rows, this can be excruciatingly slow in Excel, so you should know that in those cases the mean can be calculated faster in SAS with PROC MEANS.</p>
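<p>For readers without SAS, the same AVERAGE-plus-IF idea can be sketched in a few lines of Python. The function name <code>impute_mean</code> and the sample scores are hypothetical, and missing values are assumed to be stored as <code>None</code>.</p>
<pre>
```python
from statistics import mean, variance


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # the AVERAGE step
    return [fill if v is None else v for v in values]  # the IF step


scores = [4.0, None, 6.0, 8.0, None]
observed_scores = [v for v in scores if v is not None]
imputed_scores = impute_mean(scores)

print(imputed_scores)             # [4.0, 6.0, 6.0, 8.0, 6.0]
print(variance(observed_scores))  # 4.0
print(variance(imputed_scores))   # 2.0
```
</pre>
<p>The mean stays at 6.0, but the sample variance drops from 4.0 to 2.0, which is the price of this technique.</p>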
<p>For Multiple Imputation, I use SAS&#8217;s PROC MI to run the imputations and PROC MIANALYZE to calculate the summary statistics on the regressors. I always run five sets of imputations because studies by very smart people have shown this is enough. Excel gurus know that the summary statistics can be done in Excel by filtering on the <em>_Imputation_</em> column (including the underscores) and separating each imputation into a separate worksheet. <em>_Imputation_</em> is the index variable that PROC MI adds to the output dataset to number each imputation.</p>
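<p>The combining step is based on Rubin&#8217;s rules: average the estimates from the imputed datasets, then add the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. Here is a minimal Python sketch of that arithmetic for a single regression coefficient. The function name and the five estimates are made up for illustration, and the sketch leaves out the degrees of freedom and confidence limits.</p>
<pre>
```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets with Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average squared standard error
    between = variance(estimates)                  # spread of estimates across imputations
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance ** 0.5


# Hypothetical results from five imputations of the same model.
estimates = [1.0, 2.0, 3.0, 2.0, 2.0]
std_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(pooled_estimate, round(pooled_se, 3))
```
</pre>
<p>The pooled standard error is larger than any single imputation&#8217;s 0.5 because it also carries the uncertainty about the missing values.</p>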
<p>For Heckman Selection Modeling, I use Stata&#8217;s HECKMAN command, because it lets you choose between maximum likelihood and two-stage estimation. I always use the two-stage option, although there&#8217;s probably some complicated rule somewhere in the manual explaining when to use which. SAS lovers should know that SAS (version 9 and later) can do Heckman estimation with PROC QLIM, but keep in mind it&#8217;s limited to the maximum likelihood version of the model. Either way, people who don&#8217;t know about these tools will be dazzled with your skills. And it sounds fabulous when the footnotes of your survey say something like, &#8220;Missing values were imputed using two-stage Heckman Correction Estimation.&#8221;</p>
<p><span style="text-decoration: underline;"><br />
<strong>For what it&#8217;s worth:</strong></span></p>
<p>If somebody asks me why I use Multiple Imputation over alternate methods, I just say, &#8220;If it&#8217;s good enough for the Census Bureau, then it&#8217;s good enough for me.&#8221; Then I walk away and go to Starbucks.</p>
<p>If somebody asks me why I use Heckman Selection Modeling, I just say, &#8220;If it&#8217;s good enough for the Nobel Prize Committee, then it&#8217;s good enough for me.&#8221; Then I turn my iPod back on and spin back towards my computer screen.</p>
<p>I&#8217;ve heard that Stata&#8217;s ICE command is better than SAS&#8217;s PROC MI, but I&#8217;m so accustomed to using SAS and Excel for this that I&#8217;ve never tried it.</p>
<p>I know I won&#8217;t win any goodwill points from the American Statistical Association for saying this, but unless your missing data could affect something really important like, say, nuclear missile targeting, I wouldn&#8217;t lose sleep over the technique you choose because <strong><em>every one of them</em> </strong>has legitimate weaknesses and the probability that no one really cares is 99.99%.</p>
<p>Jamel Cato<br />
The Blue Collar Data Analyst<br />
2007</p>
]]></content:encoded>
			<wfw:commentRss>http://jamelcato.com/missing-data-techniques-for-dummies/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to do Percentile Ranking in Oracle</title>
		<link>http://jamelcato.com/how-to-do-percentile-ranking-in-oracle/</link>
		<comments>http://jamelcato.com/how-to-do-percentile-ranking-in-oracle/#comments</comments>
		<pubDate>Tue, 26 Jun 2007 15:58:32 +0000</pubDate>
		<dc:creator>jamel</dc:creator>
				<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Jamel Cato]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[CUME_DIST]]></category>
		<category><![CDATA[Oracle script]]></category>
		<category><![CDATA[PERCENT_RANK]]></category>

		<guid isPermaLink="false">http://jamelcato.com/how-to-do-percentile-ranking-in-oracle/</guid>
		<description><![CDATA[Recently I had to provide a script to convert a dataset of raw assessment scores into an Oracle table with the scores ordered by percentile rank. This is a common request so I figured a short, non-technical post on percentile ranking might be helpful to a lot of people.
There are three main ways to calculate [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I had to provide a script to convert a dataset of raw assessment scores into an Oracle table with the scores ordered by percentile rank. This is a common request so I figured a short, non-technical post on percentile ranking might be helpful to a lot of people.</p>
<p>There are three main ways to calculate percentile ranks in Oracle 9i or later:</p>
<p>(a) Calculate them manually with Joins and Sub-queries;<br />
(b) Use the CUME_DIST function and format the results as a percentage;<br />
(c) Use the PERCENT_RANK function.</p>
<p>In this post I&#8217;m going to focus on options (b) and (c) because the only people who would reinvent the wheel and use (a) are SQL programmers who don&#8217;t know (b) and (c) already exist.</p>
<p><span id="more-5"></span></p>
<p>CUME_DIST and PERCENT_RANK are built-in Oracle mathematical functions that allow you to rank a value based on its relative standing within a set of values. For example, if you hear someone say that a 1600 SAT score was in the 99th percentile (meaning 99% of all the other scores in that administration of the test were lower) a ranking formula is what tells you so.</p>
<p>The two functions take different approaches to determining a percentile rank and you should understand the basic difference.</p>
<p>CUME_DIST determines a percentile rank by calculating the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition.</p>
<p>PERCENT_RANK determines a percentile rank by setting the lowest value (that is, the first row returned by the query) equal to 0 and assigning all the remaining rows with this formula:</p>
<p align="center">(n-1)/(m-1) where n is the rank of the row within a partition of m records (tied rows share the same rank).</p>
<p>There are other differences between the two functions but a detailed review is way beyond the promised scope of this post. However, you will do well to remember four points:</p>
<ul>
<li>The two functions are similar, but (contrary to what many books lead you to believe) they are not identical. That&#8217;s why they return different answers when you use them side-by-side in the same query.</li>
<li>CUME_DIST returns the cumulative <em><strong>position</strong></em> of a row (the share of rows at or below it) and PERCENT_RANK returns the relative <em><strong>rank</strong></em> of a row (its rank rescaled to run from 0 to 1).</li>
<li>CUME_DIST never returns 0, while PERCENT_RANK always returns 0 for the first row.</li>
<li>In most cases where your goal is to rank records by percentile, PERCENT_RANK is the function you want to use.</li>
</ul>
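<p>A small worked example makes the difference visible. The Python sketch below applies the two definitions given above to a hypothetical list of five scores, ranked in ascending order; the function names simply echo the Oracle ones and are not Oracle code.</p>
<pre>
```python
def cume_dist(values):
    """Share of rows with a value less than or equal to each row's value."""
    total = len(values)
    return [sum(1 for other in values if other <= v) / total for v in values]


def percent_rank(values):
    """(rank - 1) / (rows - 1), where tied values share the lowest rank."""
    total = len(values)
    if total == 1:
        return [0.0]
    # rank - 1 equals the number of rows with a strictly smaller value
    return [sum(1 for other in values if other < v) / (total - 1) for v in values]


scores = [50, 60, 60, 80, 100]  # hypothetical assessment scores
print(cume_dist(scores))     # [0.2, 0.6, 0.6, 0.8, 1.0]
print(percent_rank(scores))  # [0.0, 0.25, 0.25, 0.75, 1.0]
```
</pre>
<p>The lowest score gets 0 from PERCENT_RANK but 0.2 from CUME_DIST, and the tied scores of 60 show how the two functions treat ties differently.</p>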
<p>Both functions can be used in two forms: aggregate or analytic. Use the aggregate form when you want to find the percentile rank of <em><strong>one particular record</strong></em> (real or hypothetical) according to some criteria you specify. Use the analytic form when you want to find the percentile ranks of <em><strong>a group of records</strong></em> in the database. You can spot the aggregate form because the SELECT statement will contain a WITHIN GROUP clause. The analytic form will use an OVER clause instead, usually with PARTITION BY inside it.</p>
<p>The aggregate form uses this syntax:</p>
<p><code>PERCENT_RANK (expr [, expr ...]) WITHIN GROUP<br />
(ORDER BY expr [ASC|DESC] [NULLS FIRST|LAST] [, expr ...])</code></p>
<p>Here&#8217;s a simple example:</p>
<p><code>SELECT PERCENT_RANK (100000000, 1000000) WITHIN GROUP (ORDER BY total_gross, star_salary) "Percentile Rank" FROM movies WHERE movie_year IN ('2006');</code></p>
<p>The above SQL statement will return (from a table called Movies) the percentile rank that a hypothetical movie grossing $100 million and paying its starring actor $1 million would have among the movies of 2006.</p>
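<p>As I understand the Oracle documentation, the aggregate form ranks the supplied values as if they were an extra row inserted into the group, then divides that rank minus 1 by the number of real rows. The Python sketch below follows that reading. The function name and the four movies are hypothetical, and each tuple is (total_gross, star_salary) so that Python compares the columns in the same order as the ORDER BY list.</p>
<pre>
```python
def hypothetical_percent_rank(hypothetical_row, rows):
    """Rank of a what-if row among existing rows, minus 1, divided by the row count.

    Assumption: ascending order on every column, no NULLs.
    """
    # rank - 1 equals the number of existing rows that sort before the what-if row
    rows_before = sum(1 for row in rows if row < hypothetical_row)
    return rows_before / len(rows)


# Hypothetical 2006 movies as (total_gross, star_salary).
movies_2006 = [
    (50_000_000, 2_000_000),
    (100_000_000, 500_000),
    (150_000_000, 3_000_000),
    (200_000_000, 1_000_000),
]

print(hypothetical_percent_rank((100_000_000, 1_000_000), movies_2006))  # 0.5
```
</pre>
<p>Two of the four movies sort before the what-if movie (one on gross, one on the salary tie-breaker), so it lands at 0.5.</p>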
<p>The analytic form uses this syntax:</p>
<p><code>PERCENT_RANK () OVER<br />
([PARTITION BY expr [, expr ...]] ORDER BY expr [ASC|DESC] [, expr ...])</code></p>
<p>Here&#8217;s an example of that:</p>
<p><code>SELECT movie_name, movie_year, movie_type, total_gross,<br />
PERCENT_RANK () OVER (PARTITION BY movie_type<br />
ORDER BY total_gross DESC) "Percentile Rank"<br />
FROM movies<br />
WHERE movie_year IN ('2006');</code></p>
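<p>The above SQL statement will return a table listing the percentile rankings of all movies released in 2006 according to their total box office gross, calculated separately within each movie type. Because the ORDER BY is descending, the top-grossing movie of each type gets 0; drop DESC if you want the top grosser to get 1.</p>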
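<p>To see what PARTITION BY and a descending ORDER BY do together, here is a Python sketch of the same calculation on a handful of hypothetical movies. The function name and the data are made up; the point is that the ranking restarts for every movie type and that the highest gross in each type receives 0.</p>
<pre>
```python
from collections import defaultdict

# Hypothetical rows: (movie_name, movie_type, total_gross in millions).
movies = [
    ("A", "Comedy", 120), ("B", "Comedy", 80), ("C", "Comedy", 40),
    ("D", "Drama", 200), ("E", "Drama", 50),
]


def percent_rank_desc_by_type(rows):
    """PERCENT_RANK() OVER (PARTITION BY movie_type ORDER BY total_gross DESC)."""
    # Collect the gross figures of each partition (movie type).
    partitions = defaultdict(list)
    for _name, movie_type, gross in rows:
        partitions[movie_type].append(gross)

    ranks = {}
    for name, movie_type, gross in rows:
        peers = partitions[movie_type]
        # In descending order, rank - 1 is the number of peers with a higher gross.
        higher = sum(1 for g in peers if g > gross)
        ranks[name] = higher / (len(peers) - 1) if len(peers) > 1 else 0.0
    return ranks


print(percent_rank_desc_by_type(movies))
# {'A': 0.0, 'B': 0.5, 'C': 1.0, 'D': 0.0, 'E': 1.0}
```
</pre>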
<p>As a closing note, remember that when using the aggregate form the number and datatypes of the expressions inside the first set of parentheses (the function arguments) must match the number and datatypes of the expressions inside the second set of parentheses (the ORDER BY list).</p>
<p>Now go rank some data.</p>
]]></content:encoded>
			<wfw:commentRss>http://jamelcato.com/how-to-do-percentile-ranking-in-oracle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
