I have a question


How do I check for duplicates in a database using another database?

On 28.07.2012 14:17, Lynne Stevens wrote:


How do I check for duplicates in a database using another database?

SELECT "A".*
FROM "A" JOIN "B" ON "A"."ID" = "B"."ID"

returns all rows from table A that have a matching ID in table B.
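
The same condition can also be turned around to act on table B; a sketch, still assuming "ID" is the shared key column:

SELECT "B".*
FROM "B" JOIN "A" ON "B"."ID" = "A"."ID"

lists the rows of B that duplicate A, and

DELETE FROM "B"
WHERE EXISTS (SELECT 1 FROM "A" WHERE "A"."ID" = "B"."ID")

would remove those rows from B while leaving A untouched.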


That sounds like they will be merged! They are to remain separate, and the "B" file is to be used for a mailing after it has the duplicates removed. The mailing part I have figured out, including which fields/columns to use!

Hi :)
Database programs are front-ends that read the data stored in a database back-end.

Trying to simplify things a bit, so this is not entirely accurate ...
With Writer, Calc etc. you save a file in a certain format, and then a variety of programs can open that file. With a database, that file is itself called a "database", which is all very confusing. The format the file is stored in is referred to as the database's "back-end", and the program you use to read the database is called the front-end.

So, is the question "How do I use a different back-end to find duplicates in my back-end?", or is it "How do I use some program other than LibreOffice to find duplicates in my back-end?" Obviously the answer to the second question is best dealt with by the documentation or forums set up by whichever company's front-end you are using.

Regards from
Tom :)

Hi Lynne,

How do I check for duplicates in a database using another database?

By "another data base", do you mean :

(a) another ODB file?

(b) another type of database manager, e.g. Access vs Calc, or MySQL vs Access, or even Access vs Access?

(c) another table in the same ODB file, or in the same database schema managed by a single database manager?

(d) any or all of the above combined?

As you can see, the range of possibilities shows that this seemingly simple question is not so simple after all.

Andreas has answered (c) for you, i.e. where both tables are in the same
database schema.

There is no simple way to compare the contents of two HSQLDB/ODB files, other than extracting the contents from one, importing them into the other as a temporary or new table, and then comparing the tables from within the single ODB file using a JOIN query.
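
As a rough sketch of that route, assuming the extracted contents have been imported into the ODB as a table named "Imported" and are compared against an existing table "Original" on a shared "ID" column (all three names hypothetical):

SELECT "Original".*
FROM "Original" JOIN "Imported"
    ON "Original"."ID" = "Imported"."ID"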

As for (b), LO Base does not, as far as I know, allow you to address two separate database schemas from within a single query. I heard tell once that it might be possible with Access tables being read by LO if an ODBC connection was being used, and that this would only work on Windows. Not having tried it, I can't say whether it is true or not.

If what you are aiming for is indeed cross-schema queries in order to be
able to compare, this has been a requested (and never implemented)
feature in OpenOffice.org for many years.

Alex

Comment inline:

Lynne Stevens wrote:


That sounds like they will be merged! They are to remain separate, and the "B" file is to be used for a mailing after it has the duplicates removed. The mailing part I have figured out, including which fields/columns to use!

You need to be more specific in your description. You began with two databases that you want to check for duplicates. It has now become clear that you are working with two tables. And then you mentioned fields. What fields exist in these two tables? What determines whether a row in the "B" table is a duplicate, based upon what is contained in table "A"?

The simple answer to your question: the same way that you would do it by hand. The only problem with this answer is that it does not include any specifics.

If you would write down, step by step, how you would do this by hand, then we would know what has to be done, and we could suggest how to do each step. You might even see how to do it yourself.

--Dan



Hi Lynne, others

First, do you want to do this here or on the Ubuntu forums? :)

If this were a relational database with two tables (having the same
column structure), then it would be a simple MINUS operation:

SELECT * FROM TABLE1
MINUS
SELECT * FROM TABLE2

That statement will return all records in TABLE1 that are not also in
TABLE2.

However, you have here 2 CSV files, and one of those files IIRC has a
column with a different name.

I don't think you told folks here how you are getting the data.

Andreas may have a good way to do that completely in Calc (he is much
better at Calc than I ever will be).

However, if this is something you will need to do on a continuing basis
(with new CSV files arriving from time to time), then I would set this up
under Base and attach the two CSV files as TEXT tables. With that done,
you can then use the MINUS command, as sketched below.

It's a little bit of work to set up the connection, but once done you can
reuse it just by overwriting the source files with new data as needed.
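
A minimal sketch of what the TEXT-table link might look like in HSQLDB, assuming just two illustrative columns (the real tables would need all 126 columns declared, the CSV files must sit where HSQLDB can reach them, and comma-separated fields are the HSQLDB default):

CREATE TEXT TABLE FILE_A (REC_NO INTEGER, STATUS VARCHAR(10));
SET TABLE FILE_A SOURCE "fileA.csv";

CREATE TEXT TABLE FILE_B (REC_NO INTEGER, STATUS VARCHAR(10));
SET TABLE FILE_B SOURCE "fileB.csv";

-- rows of B that are not also in A:
SELECT * FROM FILE_B
MINUS
SELECT * FROM FILE_A;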

//drew


I really do not care what format it is in, as long as it does the job and, when I am done with the file, I can save it back to the same format it was in!

As it is not my file, and Thomas is a Windows user, NOT a computer-literate person, and would scream if he had to do any thinking about how to open it!


There are 2 files/databases. Each one has the same number of columns, 126, and each has the same information, except in column R, which has a single letter denoting its status as to what kind of sale it is, like bank sale, for sale by owner, for sale by a real-estate agency, and so on!


Oh yeah, both files are CSV, and that is the way Thomas downloads them, but as he is not too computer literate he most likely does not know he can save them in different ways. No telling. I asked one time, and he said that is how it gets downloaded.


Ok, Thomas downloads 2 databases (CSV format). He wants to remove duplicates from database B using database A, and THEN database A gets deleted when done.

On 28.07.2012 22:03, Lynne Stevens wrote:


Ok, Thomas downloads 2 databases (CSV format). He wants to remove
duplicates from database B using database A, and THEN database A gets
deleted when done.

OK, that is

$ cat fileA.csv fileB.csv | sort | uniq > fileC.csv
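
If the requirement is strictly "the rows of fileB.csv that do not also appear in fileA.csv", rather than a merged and de-duplicated union, comm is one standard tool for it; a sketch using bash process substitution (untested against the real files):

comm -13 <(sort fileA.csv) <(sort fileB.csv) > fileB.cleaned.csv

comm -13 suppresses the lines unique to the first file and the lines common to both, leaving only the lines unique to fileB.csv.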

Not sure who asked the original question, nor what OS you're on, but...

If you're using Linux and the files are in fact text files (CSV) as indicated, then I would think something to the effect of:

cat file1 file2 > file3
sort file3 | uniq -u > uniqueRecords.txt

Might do part of the trick. I say might because it's been too many years since I hung up my Unix/Linux hat and I've forgotten most of the good stuff. Sorry I can't be more helpful.

The above translates into English as: merge file1 and file2 into file3, sort file3, and then find those lines that are NOT duplicated (i.e. unique, which is what uniq -u does) and store them in "uniqueRecords.txt". uniq has another option, -d, to return the opposite, i.e. those lines that ARE duplicated.

so then you'd do:
cat file3 | sort | uniq -d > duplicateRecords.txt

Now eliminate any remaining repeats from duplicateRecords.txt (uniq -d already emits a single copy of each duplicated line, so this is just a safety step):
cat duplicateRecords.txt | uniq > almostDone.txt

Finally, merge the first set of unique records with the "unique duplicates".
cat uniqueRecords.txt almostDone.txt >> done.csv

Here's the same thing without all the comments:
cat file1 file2 > file3
sort file3 | uniq -u > uniqueRecords.txt
cat file3 | sort | uniq -d > duplicateRecords.txt
cat duplicateRecords.txt | uniq > almostDone.txt
cat uniqueRecords.txt almostDone.txt >> done.csv

Not having a Unix/Linux system handy I can't test it, but you should see results like this if I got it right:
Assuming file1 is composed of:
x,y,z
a,b,c
d,e,f
and file2:
a,b,c
g,h,i
j,k,l

Results after each step above should be as follows:

1.
d,e,f
g,h,i
j,k,l
x,y,z

2.
a,b,c

3.
a,b,c

4.
d,e,f
g,h,i
j,k,l
x,y,z
a,b,c

(a,b,c comes last in step 4 because it is appended from almostDone.txt.)

Another possible tool to look into is AWK. If you know it, it could be even simpler than the script above.
diff and/or one of its variants might be another potential tool to get the job done.
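
For what it's worth, the classic awk pattern for "lines of the second file that are not in the first" is a one-liner; a sketch only, assuming exact whole-line matches (the file names are illustrative):

awk 'NR==FNR { seen[$0]; next } !($0 in seen)' fileA.csv fileB.csv > fileB.cleaned.csv

While awk reads fileA.csv (where NR==FNR), it records each whole line in the seen array; while it reads fileB.csv, it prints only the lines that were never recorded.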


Hey, that sounds good to me. I will have Thomas bring over a set of the files
and I will use it. It will make Thomas jump up and down, happy!

And then:
$ rm fileA*

Kind regards

Johnny Rosenberg
ジョニー・ローゼンバーグ



Lynne,

What I think you are doing is loading two tables from CSV files into the *same* database. With databases one finds there is a fairly precise terminology in use. Once you have the data loaded, one can then query it. The tables do not need to have the same structure, only to be related in some logical manner.

If I understand the project, you download a CSV file with addresses, many of which duplicate data in a table already in the database. The next step is to find only those addresses that are not duplicates, so they can be appended to the existing table.

If you upload the new data into a table (Temp) you can then write a query to find new/updated entries in the existing table (Addresses).

Roughly, the query would look something like this:

SELECT t.columnname, ... (the relevant columns from Temp, each listed as t.columnname)
FROM Temp AS t, Addresses AS a (using an alias, you do not need to type the full table name)
WHERE t.columnname != a.columnname (more than one criterion can be used, such as customer name, address, city, phone number, etc., to identify the duplicates)

For the WHERE clause I would look for some unique id used in the data sets, such as a customer number.
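
One caution about the rough shape above: listing both tables in the FROM clause pairs every Temp row with every Addresses row, so a plain != match will return nearly everything. An anti-join expresses "in Temp but not in Addresses" more safely; a sketch, assuming a hypothetical unique column CUST_ID shared by both tables:

SELECT t.*
FROM Temp AS t
WHERE NOT EXISTS (
    SELECT 1 FROM Addresses AS a
    WHERE a.CUST_ID = t.CUST_ID
)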

Keep in mind that you do in fact lose the duplicates completely. I would think you want to keep one of the duplicates.

So for example if the merged files give you 4 records total:

John King
John King
Peter Rabbit
Bugs Bunny

you end up with 2 records, Peter Rabbit and Bugs Bunny, losing John King completely. The solution I presented returns all 3 unique records, losing just one of the repeated names instead of both, which to me makes more sense. Obviously you know your application better than I do; I just want to make sure you're not surprised when 25% (or whatever percentage) of your database goes missing.


Since it has the recorder's office number (column A) on the line of all the John Kings: if the number is the same, the row will be deleted, but if there are two different Johns with different recorder's office numbers, it will stay, which is good. And column R has a letter which says whether it is the same kind of sale. There are 2 columns which decide which one goes and which stays: the recorder's office number (column A) and the status (column R) do all the work; the rest are the same stuff.
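
If that rule is "a row of B is a duplicate when both its recorder's office number and its status letter match a row of A", then, once the two files are attached as tables, the removal might look like the sketch below. REC_NO and STATUS are hypothetical names standing in for column A and column R:

DELETE FROM FILE_B
WHERE EXISTS (
    SELECT 1 FROM FILE_A
    WHERE FILE_A.REC_NO = FILE_B.REC_NO
      AND FILE_A.STATUS = FILE_B.STATUS
)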