Analytics Roundup: Open knowledge resources
By Zach Gemignani
July 6, 2007
Find more about:
data
- Comprehensive Knowledge Archive Network
- CKAN is a registry of open knowledge packages and projects... the place to search for open knowledge resources as well as register your own—be that a set of Shakespeare's works, a global population density database, the voting records of MPs, and more.
ANDs, ORs, and IFs: Comparing big lists in Excel
By Chris Gemignani
May 1, 2007
Find more about:
data
excel
One problem we face when manipulating large amounts of data in Excel is checking to see if two lists of the same length contain the same items. For instance, we might be given a list of products that a company has for sale this month, running to thousands of items, then the following month, we get another list of products for sale and we need to see if there has been any change between those two lists. This isn’t too hard to deal with when you only have a hundred or so items, but it gets a little thorny when your list runs to tens of thousands.
What we do is line the two lists up, side by side, in sorted order.

Use the simple “A1=B1” formula to compare pairs of items in the lists.

If the pairs are the same, this will be true, otherwise they’ll be false.

Copy this formula down for all your rows. Then use the AND function and give it the entire range of comparison formulas.

This will only be true if every single one of the values in your list are exactly matches. If even one comparison is false, this big AND statement will evaluate to false.
This is a quick and dirty approach. For tougher problems, we use a slightly more complicated formula in the comparison where we evaluate it to 1 if the value is true, 0 if the value is false. This gives us more flexibility to combine comparisons, but that’s a topic for another post.
19 comments | Show all comments only the last 5 are shown
Tre said:
Are you serious? This is an extremely rudimentary tutorial for this blog.
EK said:
I find these quick snapshot overviews very useful. After 20 years in analytics I am never too old to learn or be reminded of tips and techniques
derek said:
I appreciated it, although I prefer to use an array function to do the same thing. jeez, I didn't even know you could do that with Excel Boolean functions, I thought they only took parameters separated by commas, like CONCATENATE.
If only CONCATENATE did the same thing, it would actually have a use. Seriously, does anyone know what you can do with CONCATENATE(a,b,c...) that you can't do with a&b&c...?
(the first version of thos comment died without error message and was lost, so that I thought it had just gone somewhere for moderation; this may just be a problem at my end, or it may have been my failure to type "juice")
derek said:
I appreciated it, although I prefer to use an array function to do the same thing. jeez, I didn't even know you could do that with Excel Boolean functions, I thought they only took parameters separated by commas, like CONCATENATE.
If only CONCATENATE did the same thing, it would actually have a use. Seriously, does anyone know what you can do with CONCATENATE(a,b,c...) that you can't do with a&b&c...?
(earlier versions of this comment died without error message and were lost, even though I typed "juice"; it may just be a problem at my end)
derek said:
Wonderful. The first attempt disappears permanently, even though I waited for it, so the second attempt only temporarily fails to appear, so that I double-post with the third attempt :-(
James B. said:
Great new design. I love it.
I noticed that the links to state data are not working on this page:
http://juiceanalytics.com/writing/2006/03/census-data-in-google-earth/
Specifically, I'm looking for the Pennsylvania data.
Adam said:
I often use a similar technique to make sure that all the items in one list are in both. The problem is that if I just do =a2=b2, if I insert cells to add items from one list to the other, Excel automatically adjusts the references in the boolean cells and they get all screwed up. The solution I've found is to use "OFFSET" -- even if you insert cells, the references stay static, so the boolean variable is looking in the right place.
Scott H. said:
I have this problem all the time as well. Any suggestions on how to line up the rows so that they then match? I usually have to do this by hand .. inserting cells and moving the text in order to manually line the rows up.
Jason said:
Oh man, I'm the newest employee and my cube was recently moved across from the GUI jock (frequent audible sighs, grumbles and guffaws, smokers cough/cackle). And I've been learning how to run various sales reports from various databases on various days to send to various people.
At first I was happy to be learning something new, and I was feeling more like a productive member of the team, so thanks for clueing me in/bursting my bubble. But I've just finished my first year of my MBA, so there may be hope.
David said:
Juice- yeah, I've been doing this for years, but it is still news to some folks. There are some neat tricks with ones and zeroes, and even pivot tables to find missing items.
Derek- CONCATENATE() and the ampersand have very similar functions... the only benefit that I have seen to the CONCATENATE command is that it makes it easier to just click away at cells, as opposed to interrupting with an ampersand.
Jason- It's all over for you, man.
(just kidding)
David G. said:
Great tip, but it's by no means fool-proof. If just one pair doesn't match it could cause the rest of your list to come up as false - which makes it difficult to figure out which items are new.
Example:
1001 1001
1002 1002
1003 1004
1004 1005
... ...
1999 2000
2000 3000
Your entire list will be off, but you won't be able to provide a listing of what has changed - which is what I would expect the point of this exercise to be.
Wouldn't it be better to insert a column and use a combination of INDEX and MATCH to give a true comparison list that you could sort on?
ponyfizz said:
If it's just numbers, I would subtract one from the other and sort by the formula column.
All of the non-matching numbers would be grouped (at the top or bottom of the list). Simple
Rolen said:
I usually use the function, countif to handle this..
Countif(B:B, C3)
Chris Gemignani said:
3. Sorry to hear about the commenting woes. We'll fix comments in the future to appear faster. If only CONCATENATE worked on ranges!
Robbin Steif said:
To the person who wrote, this is too easy a post for Juice: you probably have a strong command of the topic, but not everyone does. And like someone else wrote: it's nice to be reminded of tricks again.
To the person who complained that then you have an error somewhere and you can find it: all you need to do is sort all the rows by the column with the proof in it, so that you can see where the problem is. You can sort Ascending, since F comes before T (how's that for easy advice?)
And finally, you can use a similar trick to dedupe email addresses, or other lists where the info will be identical. I am sure you Juicers are all over this one. But since the data are on two lines, you have to do it like this =a1=b1 And copy all the way down. Then you copy the results and repaste them right where they were, but PASTE SPECIAL - Values. Then you sort on the IF column
mike harding said:
well in the world of statistical genetics, excel doesn't have enough room in one tab. it would be useful to know how to do this in R or SAS.
AW said:
Mike,
In SAS, you'd want to use a sort-sort-merge or some inner joins.
R is on my schedule to learn in the next 2 months; so I got nothing for you there...
Brenda said:
Looks like it works for text string comparison as well!
Will said:
I think the use of MATCH is a much better option. Using the data as you have displayed it, in cell D3, you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),1,0), and then in cell D4 on downwards you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),D3+1,D3) where xxxxxx is the final row in your set. In cell D2, you have =MAX(D3:Dxxxxx). This will then tell you how many new datapoints you have.
Add a comment
Analytics Roundup: Late April edition
By Chris Gemignani
April 27, 2007
Find more about:
census
data
juicesite
population
powerpoint
presentation
presentations
pubapp
statistics
tools
- Population Estimates Data Sets
- US census data
- Are you generic? / Wilson Miner Live
- Wilson Miners post re: Django's generic views
Analytics Roundup
By Chris Gemignani
January 25, 2007
Find more about:
data
ibm
visualization
- Many Eyes: IBM's collaborative visualization
- Fernando Vargas' vision comes to light.


0 comments | Add a comment
said: