ref: http://osdir.com/ml/db.postgresql.sql/2002-04/msg00232.html
This is a collection of many performance tips that we've gathered together at Affymetrix, and I thought it would be useful to post them to the PostgreSQL news group.
The single most helpful trick has been the "Partial index trick" at the bottom and the use of temp tables. Most of these tricks came from either this news group, or from my colleagues in the bioinformatics department, so I'd like to thank and acknowledge both groups.
I'd like to thank Tom Lane, who clearly has been working very hard on the Optimizer, and all the other people who have worked on Postgres. Your efforts have been invaluable to us. Keep up the good work!
We are currently working on a Transcriptome project, which is a follow-on to the human genome project, in which we systematically look across all parts of the genome to see what is expressed in the form of RNA. It is publicly funded by the National Cancer Institute and the data is made publicly available at: http://www.netaffx.com/transcriptome/
We currently have about 100GB of data and will soon grow to a multi-terabyte system. We have tables of up to 1 billion rows and have been able to get ~1 million row queries to run in about 5 min. We've been very pleased with postgres. After a couple of major outages in our server room, it came back up flawlessly each time. So it has been an invaluable asset for this project. We run 7.2 on Red Hat on a 2-processor machine with SAN, and we have a 128-node linux cluster which will make analysis runs against the database.
Our main request is continued enhancement of the optimizer for these heavy types of queries. Improved use of indexes, ability to control execution plans explicitly, ability to use indexes for data retrieval without touching the table in certain cases, and other such features would be very useful. I'm also curious to hear about whether there is any good clustering system for making a parallel postgres installation, and if others have experience with creating such large databases.
We've been very happy and impressed with the constant improvements to the system. Thank You!
This page is a long detailed list of performance tips for doing heavy duty queries.- Indexes 1. Indexes are critical. Create exactly the combined (multi-field) indexes that are being joined in a particular join. The order of fields in the index and in the join must match exactly.
- Indexes 2. Multi-Field Indexes. Having indexes on individual columns as well as combinations of 2,3,and 4 columns can help. Sometimes is uses the 3 version, and sometimes it uses one 2 and one singlet index. This can be helpful, especially when seq scan is turned off and you are using limit.
- Indexes 3. Remember that multiple-field indexes must have the fields in the correct order as they are accessed in the query. An index can only be used to the extent allowed by the keys. An index over (A B C) can be used to find (A B), but not (B C).
- Vacuum. Always vacuum analyze the table(s) after creating indices (or loading/deleting data).
- Limit and Order by. May have to use order by and/or limit to use the indexes. May need to use order by with limit. Sometimes order by increases speed by causing use of an index. Sometimes it decreases speed because a Sort step is required. A where condition that is sufficiently restrictive may also cause an index to be used.
- Join Order. Order of fields, joins, and order by fields has a big impact.
- Casting 1. May have to explicitly cast things. For instance where x=3 must become (where x=cast(3 as smallint)). This can make a huge difference.
- Casting 2. Simply adding abs(destype)=(cast 111 as smallint) to my query and turning seq scans off seems to change the query execution plan. Writing this as (destype=111 or destype=-111) makes the cost over 7 times higher!!
- Seq Scans 1. Can you disable seq scans? Yes, you can type "set enable_seqscan=no;" at the psql prompt and disable it. Do not be surprised if this does not work though. You can also disable merges, joins, nested loops, and sorts. Try this and attempt to enable the correct combination that you want it to use.
- Seq Scans 2. In general you would like it to use an index, but don't be afraid to try the seq scans if cost is say < 150,000 and see if it it finishes in a few minutes. For large joins with no where clause, Postgres always uses seq scans. Try to add a where clause, even a non-restrictive one, and use an index. However, remember that postgres must go get the table data too, so this can be more costly. Postgres cannot read data solely from an index (some commercial databases can).
- Seq Scans 3. Sometimes it is true that seq scans are faster. It tries to use the analyzed statistics to decide which is better. But don't always trust it, try it both ways. This is why analyzing your table will produce different execution plans at after analysis -- The analysis step will update the stats of the table. The change in estimated costs might cause a different plan to be chosen.
- Explain Output. Reading the Explain output can be confusing. In general, the numbers are a range. If you are trying to just get some rows back, you'd like the left most number to be 0. This means that the right-most number will probably not happen, because you will not really have to search the entire table. The right-most number is an upper bound. The numbers sum as you go up. What you don't want is a large number for both the min and max. Sometimes a cost of about 100,000 takes about 3 minutes. Sometimes this is not accurate. Sometimes I was able to to see a lower seq scan cost, but when I disable seq scans and used indexes, the actual performance was faster. In general the cost is in milliseconds. Use Explain Analyze which will run through they query and produce actual times.
- SQL tricks. Remember the standard SQL tricks which I will not cover here (get a good thick SQL book). For example using Like, etc. can be slow. Remember that if there is no data in your table for a given where clause, it must scan the entire result just to tell you "no results found" so know your data in advance.
- Nested loops are probably the most expensive operation.
- Having several merges and sorts can be way better than having a single nestloop in your query.
- Explicit Joins. For more than 2 joined tables, consider using explicit joins (see:http://www.ca.postgresql.org/users-lounge/docs/7.1/postgres/explicit-joins.html)
- News Groups. Try the postgres news groups: http://www.us.postgresql.org/users-lounge/index.html
- Hardware/Configuration changes. I won't go into a lot of detail here as this page is more about the query optimizer, but you can look at how much your CPU and memory is being taxed, and try running postmaster with various flags to increase speed and memory. However, if your query plan is not coming out right this will have little impact.
- Identities. You can try typing "and a.id=a.id" and this will actually help encourage the query planner to use an index. In one example, select with x=x and y=y order by x worked best (order by y too made it worse!).
- Temp tables. You may want to explicitly control the query by breaking it into several steps, with intermediate tables being created along the way. You can make these true temp tables, which will go away when you log out, or you may want to keep them around. You might want to create a procedure or script that automates/hides this process.
- Views. Views sometimes say that they are adding a step to the query planner, but it does not seem to impact query speed. But if you add more clauses to the view this may change the query plan in a bad way, which is confusing to the user.
- Stored Procedures. Try writing a stored procedure to more explicitly control the query execution. If you do this break out SQL into many small cursors instead of 1 large cursor, otherwise you will run up against the same problems.
- External programs. As above, breaking out a query into a series of small, explicit nested loops in a C, Perl, or other client program, may actually improve performance (especially if you want a subset of results/tables).
- Monitor Query Progress. Alan Williams provided a good trick to monitor the progress of a long running query. If you add to the query a sequence (select nextval('sq_test'),...) then you can use select currval('sq_test') to see how far the query has progressed.
- Partial Indices. You can use this feature to force use of an index!!! (it is also useful as a true partial index). Assume table1 below has no rows where field1=0. By doing the actions below, it stores the clause field1<>0 in pg_index and when it sees that predicate, it always uses the partial index. In this case we are using it as a full index to trick it. Example:
create index i on table1(field1) where field1 <> 0;
select * from table1 where field1<>0;
Shane Brubaker
BioInformatics Engineer
Affymetric, Inc.
No comments:
Post a Comment