
Thread: EXCEL help needed

  1. #1
    Join Date
    Dec 2002
    Location
    SLC
    Posts
    428

    EXCEL help needed

    Here's my dilemma:

    I need to calculate the area of several land-use classifications based on property ownership records. It seems as though the database I am using has many duplicate records. I need to find the duplicates and delete them. I could, of course, go through each record individually, but there are 16,962 records and there should only be about 14,000. Handing it off to an intern is not an option at this point. Does anyone know if Excel can find the duplicate records? Any help would be appreciated.

  2. #2
    Join Date
    Sep 2004
    Location
    People's Republic of Boulder
    Posts
    796
    Well, this is still a bit manual, but... find a column with repeatable values, such as name, area, or maybe tract number. Then sort the data by that column so all the repeats end up together. Now you scroll through, easily spot the repeats, and delete them. Yeah, it's slow, and with 16k records it would suck, but this way should only take a few hours.

    We had Excel files with 25k records in them this summer and had to find bad data, and this is how we did it. I strongly recommend strong coffee and loud house music to help you focus on this tedious task.

  3. #3
    Join Date
    Feb 2004
    Location
    In the fields, under the yoke
    Posts
    3,342
    As far as I know, Excel doesn't specifically have any functionality that will delete duplicate records, but here's how I deal with it.

    1) Find the column/variable that identifies each line of data (let's call it "account number"). Note: if you have two or three fields that together identify an observation (for example, name and address), use a concatenate function to create the ID variable (concatenate(name,address)).
    2) Sort by that ID so duplicates sit next to each other, then create a new column that is essentially a counter: write a formula that checks whether the account number matches the account number above it. If it doesn't, counter = 1; otherwise counter = last counter value + 1 (see the sketch below).
    3) Apply an autofilter to the data, filter on your counter field, and select the 1s.
    4) Copy and paste that data into a new spreadsheet, and voila, duplicates eliminated.

    You'll probably want to spot-check it when you're done to make sure everything worked.
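    To make steps 1 and 2 concrete, here's a minimal sketch assuming name and address sit in columns A and B. The first formula builds the ID in C2, the second builds the counter in D2; fill the ID down, sort by column C so duplicates sit together, then fill the counter down:
    Code:
    =CONCATENATE(A2,B2)
    =IF(C2=C1,D1+1,1)
    With the data sorted by column C, every first occurrence gets a counter of 1 and every repeat counts up from there, so filtering column D for 1 keeps exactly one copy of each record.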

  4. #4
    Join Date
    May 2002
    Location
    Norte del río
    Posts
    2,215
    Do what Telechuck said, then follow it up by inserting a column (column A) next to the one you sorted by (column B). Then in row 2 of column A, put in the formula =IF(B2=B1,"1",""). Copy this formula all the way down the sheet and press the F9 key. All the duplicates should now have a "1" in column A. Copy column A and paste special with values. Sort by column A, then delete all the rows with a 1 (they should all be grouped together).
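    For example, with sorted tract numbers 1001, 1001, 1002 in rows 2-4 of column B (hypothetical values), the filled-down formula evaluates like this:
    Code:
    A2: =IF(B2=B1,"1","")  ->  ""    (first occurrence of 1001)
    A3: =IF(B3=B2,"1","")  ->  "1"   (duplicate of 1001)
    A4: =IF(B4=B3,"1","")  ->  ""    (first occurrence of 1002)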

    I'm bored.


    edit: or do what stump said.....
    Last edited by Hayduke; 09-22-2004 at 10:02 AM. Reason: Stump beat me to it....

  5. #5
    Join Date
    Nov 2003
    Location
    Joisey
    Posts
    2,614
    Quote Originally Posted by telechuck
    Well, this is still a bit manual, but... find a column with repeatable values, such as name, area, or maybe tract number. Then sort the data by that column so all the repeats end up together. Now you scroll through, easily spot the repeats, and delete them. Yeah, it's slow, and with 16k records it would suck, but this way should only take a few hours.

    We had Excel files with 25k records in them this summer and had to find bad data, and this is how we did it. I strongly recommend strong coffee and loud house music to help you focus on this tedious task.
    To make this suggested process a little faster: as telechuck suggests, sort by the column that you'll use to compare records, then in an adjacent blank column add a conditional statement like the following:
    Code:
    =IF(A2=A1,TRUE,"")
    This bit of code would go in cell B2 if comparing values in column A. Any duplicate value in column A would then show TRUE in column B.

    You could just manually delete each row that is TRUE.

    Alternatively, if you wanted to be a little more efficient... After getting the TRUE calculations, highlight the entire column, copy it to your clipboard, then paste just the values (paste special - values) back into the column (calculations are then removed). Re-sort the spreadsheet by column B and do a bulk delete of all the TRUE rows.

    EDIT: Or do what Hayduke said...
    Last edited by spanky; 09-22-2004 at 10:08 AM.

  6. #6
    Join Date
    Sep 2004
    Location
    champlain valley
    Posts
    5,826
    Use Access. If you don't know Access, then find someone at your office who does. If that is not an option, then PM me and we can work something out.

  7. #7
    Join Date
    Dec 2002
    Location
    SLC
    Posts
    428
    Thanks guys, but I figured out a better way. Highlight the column you want to filter, go to Data > Filter > Advanced Filter, and check "Unique records only". It eliminates any duplicates.

    edit: regarding Access: it's a database for some mapping software. I'm not sure if I'll lose any information converting it back and forth (DBF to Access file), but I'll play around with it to see.
    Last edited by cololi; 09-22-2004 at 11:11 AM.

  8. #8
    Join Date
    Sep 2004
    Location
    champlain valley
    Posts
    5,826
    Make a copy of your original data first, then give it a whack.

  9. #9
    Join Date
    Nov 2003
    Location
    Joisey
    Posts
    2,614
    Quote Originally Posted by cololi
    Thanks guys, but I figured out a better way. Highlight the column you want to filter, go to Data > Filter > Advanced Filter, and check "Unique records only". It eliminates any duplicates.

    edit: regarding Access: it's a database for some mapping software. I'm not sure if I'll lose any information converting it back and forth (DBF to Access file), but I'll play around with it to see.
    That doesn't eliminate the duplicates; it just hides the duplicate rows. They're still in the spreadsheet.

  10. #10
    Join Date
    Oct 2003
    Location
    bozone montuckey
    Posts
    4,337
    Save it as a CSV, then write a quick Perl or PHP script to parse the file and write each line to a new CSV only if the line isn't a duplicate.

    Not sure if it's the easiest way to do it, but it's what I'd do anyway.

    <edit> If it is actually a database instead of an Excel file, it's even easier and can be done with a SQL query (essentially a SELECT DISTINCT).
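    For illustration, here's a minimal sketch of that kind of script in Python rather than Perl or PHP. The file names are made up, and it treats the entire line as the duplicate key:
    Code:
    # Read the exported CSV and write each line to a new CSV
    # only if an identical line hasn't been seen before.
    import csv

    seen = set()
    with open("parcels.csv", newline="") as src, \
         open("parcels_dedup.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            key = tuple(row)          # whole record is the duplicate key
            if key not in seen:
                seen.add(key)
                writer.writerow(row)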
    Last edited by fez; 09-22-2004 at 12:40 PM.
    "They who can give up essential liberty to obtain a little temporary safety, deserve neither liberty nor safety."
    Ben Franklin

  11. #11
    Join Date
    Oct 2003
    Location
    SLC
    Posts
    696
    The lesson I learned in my MIS class was to never use a spreadsheet to store a database. It was a good lesson. Access will be way faster and more efficient for these types of things.
