Databases and PHP go together like cake and ice cream, Trinidad and Tobago, green eggs and ham - you get the picture.
After all, what’s the Web about? Making vast stores of information available to a more or less wide public, that’s what. Not that there aren’t small brochureware sites galore, but the bigger and more frequently updated the data source, the more comparative value is provided by the Web over other media.
Perhaps the single greatest advantage of PHP over similar products is the unsurpassed choice and ease of database connectivity it offers. As detailed in the “Choosing a Database for PHP” section of this blog.
PHP supports native connections to a number of the most popular databases, open source and commercial alike. Almost any database that will open its API to the public seems to be included eventually. For any unsupported databases, there’s generic ODBC support.
What Is a Database?
A database is a separate application that stores a collection of data. Each database has one or more distinct APIs for creating, accessing, managing, searching, and replicating the data it holds. Other kinds of data stores can be used, such as files on the filesystem or large hash tables in memory, but when professionals talk about databases they mean a standalone application such as Oracle or SQL Server or Sleepycat.
Why a Database for PHP?
If you’re going to the trouble to use PHP at all, you’re likely to need a database sooner or later - probably sooner. Even for something small, like a personal Weblog, you want to think hard about the advantages of using a database instead of static pages or included text files.
Maintainability and scalability
Having PHP assemble your pages on the fly from a template and a database is an addictive experience. Once you enjoy it, you’ll never go back to managing a static HTML site of any size. For the effort of programming one page, you can produce an infinite number of uniform pages. Change one, and you’ve changed them all.
There are now Web sites with hundreds of thousands of separate pages - you can rest assured that no one is maintaining them all by hand. If you have a Web concept that may eventually grow to more than a few dozen pages, you should think about moving to a database sooner rather than later.
Because a database is an application rather than a part of the operating system, you can easily transfer its structure and contents from one machine to another or (in certain cases) even from one platform to another. This is especially valuable for contractors, who may develop a project without being able to control the environment in which it will eventually be deployed - they can deliver a package of PHP plus a MySQL database schema dump in one tarball or zipfile. There are even well-known PHP programs, such as vBulletin, that keep most of their code in a database to make it easier to distribute.
Avoiding awkward programming
Certain things can be done with PHP but probably shouldn’t, because they entail ugly or risky programming moves.
Say you happen to be the commander of the starship Enterprise and are keeping a Captain’s log. Each episode is contained in a text file identified by its unique stardate, which is plugged into a template by PHP - but hey, you’re a busy spaceman with whole galaxies to explore; you don’t always have time to write in your log every day. You want to put automatically generated Next and Previous links on each page for those who wish to read in straight chronological order. It’s pretty easy to use PHP to find the previous stardated entry, but any attempt to locate the next entry can quickly become an infinite loop - because it’s easier to prove something does exist than that it doesn’t. On the other hand, if you put your log data in a database, the whole job becomes trivial. The database will tell you which is the latest entry at any given moment.
There are other types of programming tasks that a database is highly optimized to do, and given the option, you should take advantage of these chores. For instance, you should never sort data sets on the PHP side - you should learn to write your queries so they come back pre-sorted.
Although it’s possible to search multiple text files for strings (especially on Unix platforms) it’s not something most Web developers will want to do often. After you search a few hundred files, the task becomes slow and hard to manage. Databases exist to make searching easy. With a single command, you can find anything from one ID number to a large text block to a JPEG-format image.
In some cases, information attains value only when put into a searchable database. For instance, relatively few people would want to read a long text list of movie directors and their films, but many might occasionally want to search a database of that information. You could argue that it’s the searchability, as much as the information itself, that creates the value here.
A database adds another layer of security if used with its own password or passwords.
Say you use PHP to maintain a company’s customer files, filled with information about the prices each customer paid for your product and complaints that customers made. This information would be gold for your competitors, and embarrassing if it leaked to anyone - but you need to put it on the Internet so it can be accessed by your worldwide army of salespeople. If you have PHP write each new customer record to a text file, you must give the HTTP daemon user (usually Nobody or Everybody) write access to your most sensitive directory. This is not a good idea. By having PHP write to a database instead, you can maintain read-only directory permissions and also ask for a second password before the database can be altered.
Or take the case of a content site with a large number of visitors, a smaller number of writers, and a handful of editors. You can easily set database permission levels for each group so that visitors can just look at the database content (as formatted in Web pages), writers can browse and change only their own entries, and editors can browse/change/delete anything in the site.
So far, we’ve been considering only so-called two-tier sites: PHP takes raw data from some kind of storage system and turns it into HTML. However, one of the intentions of PHP is to become the “glue” in three-tier or n-tier development. If you have anything more complex than the simplest two-tier architecture, you really need a database.
An n-tier architecture is an arbitrary number of software subsystems linked by a Web site on the front end and one or more databases on the back end. One fairly common n-tier architecture is that of a big e-commerce site, which has shopping carts linking up to order-taking systems linking up to supply-chain management routines - plus product databases, customer databases, credit-card debiting, FAQ-o-matics, recommendation engines, Web-log analysis tools, caching proxies, phone-center knowledge bases, and who knows what else lurking behind the scenes. Under these conditions, you need the advanced database capabilities that we’ll describe in the “Advanced Features to Look For” section of this blog.
Potential downside: Performance
You may be concerned about performance. It’s true that a database-driven site on the same hardware and software will always be slower than a static site. It’s true that some databases are faster than others. However, the real question is whether the performance hit will even be detectable by you. If we’re talking about adding milliseconds to your latency (and we usually are), who cares? Some of the concerns about performance you read on the Internet are so overblown that they verge on the absurd. It’s also true that the minute you need to search any large volume of text in a database versus flat files, you’ve gained back any performance hit a thousand times over.
Once you need to go into the database for even one piece of information per page, it’s almost always cost-effective to put all your data in the database. Most of the overhead of a database query is front-loaded in establishing a connection. If you have to do that to get a name or title, downloading a couple thousand words of text is almost free.
In the end, only performance testing will establish whether a database is too slow for your particular setup and task - everything else is just talk. Many totally database-driven Web sites easily achieve subseconds of latency, which should be good enough for most purposes.
Choosing a Database for PHP
Although databases (even relational ones) have been around for a long time, they were quite expensive or limited in functionality until very recently. Therefore, even a lot of experienced programmers never had to learn much about choosing a database for a particular need. For that reason, we feel it’s worthwhile to review the basic factors that go into making such a decision.
You may not have a choice
Realistically, you may not have much of a choice. Many people are specifically looking for the fastest way to put their legacy database online rather than enjoying the luxury of deciding on a scripting language first and choosing a database later.
Furthermore, decisions about OS, Web server, and programming languages can make some of your decision for you. A custom Java application on a “Big Iron” Unix platform is just not going to mesh very smoothly with Microsoft SQL Server. (In theory it’s possible, but in practice you’d have to be a glutton for punishment - although the other way around isn’t so bad.) The bigger the system gets, the more constrained one’s choices are likely to be by previous decisions.
The good news is that PHP is committed to supporting many databases and other back end servers. It can help you knit up the loose ends of an architecture that has grown organically over time, as so many have. Some functions in PHP exist solely to aid in porting your data over to a more modern database.
Flat-file, relational, object-relational
Databases are kind of like kitchen equipment: The simpler the tool, the more skilled the operator needs to be to achieve a great result. Expert chefs can produce gourmet fare using nothing but a very sharp knife and a few old pots and pans, whereas amateurs must whip out the Cuisinart and the Calphalon to produce similar results.
So it is with databases. It can get almost laughable to read people’s arguments about the purported failings of this or that database, knowing that the skill of the individual user is reflected by this piece of software more than by almost any other. Suffice it to say that many technical masterpieces live in the simplest hash tables, while untold botched messes are simmering along on the latest and greatest object-oriented Java-enabled DBMS.
You can use three main types of databases with PHP: flat-file, relational, and object-relational.
Flat-file or hashing databases, such as Gnu DBM and Berkeley DB (aka Sleepycat DB2), are mostly used by or within other programs such as e-mail servers. They provide the lightest- weight and fastest means of storing and searching for data such as username/password pairs or dated e-mail messages. Old-school C programmers usually have the most experience with this type of database.
These databases do not themselves create a representation of more complex relationships between data points. Instead, this is done by the accessing client program. Although the results can be extremely impressive, it all depends on your skill as a programmer.
The relational variety is now the most common type of database. People have different ideas of what constitutes a relational database, and we don’t want to get pulled down into that particular definitional tar pit. Therefore, we’re going to arbitrarily say that databases that speak fluent SQL are relational. Most of the popular databases commonly used with PHP are relational.
But there’s relational, and then there’s relational. Certain very popular commercial databases, such as Filemaker Pro and Microsoft Access, were not designed to be used on the back end of a production Web site. Although they have a certain level of ODBC support, and therefore PHP can get data from them, they were mostly designed for ease of use rather than speed. Even worse, most users of these products refuse to avail themselves of what relational features there are, preferring to repeat text information in each entry rather than creating a separate table representing a relationship. Finally, these databases generally lack threading, locking, and other production features. People out there must be trying to use Microsoft Access with PHP, because they post to PHP mailing lists and forums, but evidently not for public sites with significant traffic. (We do, however, know developers who use Access or FileMaker Pro as development tools on their laptops so they can program on the airplane, and there are always porting and other projects using legacy data from these semi-relational databases.) You may well find that the best use of these types databases will be to prototype their eventual Web counterparts. For all the failings of Access, many developers claim it has nice data/relationship visualization features.
Finally, there are object-oriented and object-relational databases, new and still developing models of data access. The object-oriented database is intended to work more smoothly with object-oriented programming languages, whereas the object-relational is a hybrid used for data types (such as astronomical and genetic data) that are not well served by ordinary relational databases. However, PHP itself does not require an object programming style of its users or the programs with which it communicates. And despite a slew of new object features in PHP5, the developers still recognize the fundamental strength of the language lies in its simpler procedural roots. That’s not to say that you can’t use PHP with some of these, but the absolute necessity of doing so is questionable.
ODBC/JDBC versus native API
There are two generic standard APIs for database access: Open Database Connectivity (ODBC), and Java Database Connectivity (JDBC). ODBC is closely associated with Microsoft, and JDBC is even more closely associated with Sun Microsystems. Nevertheless, other companies have implemented these standards in their own products, with the addition of specific drivers for each client program.
ODBC and JDBC are more or less mutually exclusive. Something called the ODBC-JDBC bridge is used to allow Java programs to access ODBC databases, but it is very slow. There are also proprietary drivers that do the same job more quickly.
There are also databases that clients can access through their own APIs rather than ODBC or JDBC. This is invariably faster because there are fewer layers in the stack. Most open source databases fall into this category. Some of these also have ODBC or JDBC drivers. So for instance, PHP can access MySQL with a native API, whereas a Java subsystem can use the same database via JDBC. Before you commit to any multiple-access scheme, be very sure the drivers you need are available, affordable, and maintainable.
Although ODBC is slower than native APIs, it has the advantage of being an open standard. Therefore, PHP code written with the ODBC commands will mostly work with any ODBC- compliant database. This feature is very handy if you must start a project with a database that you know will not scale, such as Microsoft Access, and later switch to a more industrial- strength database. Although both are good products in their niches, it can be a lot of work to switch from a lightweight database like Mini SQL (aka mSQL) to an enterprise-ready server suite like IBM’s DB2. (Again, a good programmer who is given the time and resources can make any application relatively easy to port, while an inexperienced or rushed developer may not be able to do so.).
Advanced Features to Look For
This section mentions specific SQL database features with which you may not yet be familiar. We’re hoping you will instantly have a gut feeling, even from so brief a description, if you truly need one or more of these features.
Databases vary enormously in their user interface tools. Choices range from the starkest command-line interactions to massive Java-powered development toolkits. You pay for what you get, both in cash and in performance. Look for the lightest interface that meets your needs, because a GUI can add substantially to overhead costs.
One lower-cost alternative to the built-in GUI is a Web interface. These are often custom developed, but there are also third-party products that may meet your needs. For instance, MySQL has several freely available Web-based interfaces, which can be found by searching at Freshmeat or SourceForge. The most popular of these is probably PHPMyAdmin, which is available at the PHPMyAdmin Web site.
A subquery or subselect is an embedded
SELECT statement, like this:
SELECT * FROM table1 WHERE id IN (SELECT id FROM table2);
There are ways to work around a lack of subselects, and not everyone needs them. However, they can save some time if you consistently need to make large selects, inserts, and deletes.
SELECT INTO is a handy feature if you need to move data from one table to another frequently. The syntax can vary a little. One method is:
SELECT INTO table2(col2, col3, col7) lastname, firstname, state FROM tablet WHERE col5 = NULL;
Another way to get the same result is:
INSERT INTO table2(col2, col3, col7) SELECT lastname, firstname, state FROM tablet WHERE col5 = NULL;
A join is a way of searching for something across tables by using shared values to match up the tables. The simplest form is:
SELECT * FROM tablet,table2 WHERE tablet.id=table2.id;
This yields the complete contents of whichever rows in the two tables share ID numbers. More specific and extensive types of joins exist, including left or right, straight or cross, inner and outer, and self, but you may not need them.
Joins are very handy and timesaving, sometimes well-nigh essential, but in practice few need the more esoteric forms, so don’t reject a database out of hand for lacking a right outer join.
Threading and locking
Threading and locking are very important for multiple-tier sites and two-tier sites that have many contributors. They prevent two database calls from bumping into each other, so to speak, by giving editorial control to only a single transaction at a time.
An example that clearly illustrates the value of threading and locking is a Web site that sells tickets to popular rock concerts (assigned rather than “festival” seating). Obviously, you would not want two people to be able to purchase the same seat at the same event due to a database error. The database needs some way to recognize unique requests and let only one user (or thread) make changes at any given moment, while others are locked out until the first transaction is complete.
Unless you’re sure your project (a Web log, for instance) will have only one user at a time, be careful of committing to a nonthreaded database.
This term refers to a database design that seeks to maximize data integrity. The transactional paradigm relies on commits and rollbacks. Transactions that are concluded successfully will be committed to the database. Those that are not successfully concluded will not be saved, or the database will be rolled back to its previous condition.
Generally, transactions become more useful in situations where you want an all-or-nothing commit on a group of inserts. An e-commerce system might use rollbacks in situations where a customer’s credit card is declined, choosing not to record the customer information, the purchase order, the inventory change, and so on. Rollbacks are also useful in the case of data corruption, as when a database server experiences a hardware failure incident.
An alternative data-integrity design is called atomic. Proponents of the atomic paradigm claim it is much faster and just as safe, but transactions can be easier for a large number of programmers to work with because it puts more of the logic in the database layer.
Procedures and triggers
Procedures are stored, precompiled queries or routines on the database server. A common procedure would be one that selects out all the e-mail addresses of customers who made purchases on a particular day. If you use the same select statements over and over, procedures can package them in a handy and fast way for you.
Triggers are procedures that occur when some tripwire event is registered by the database. Depending on the database, you could write a trigger to send an account-statement e-mail to customers or associates of your site, and set it to go off at midnight every Sunday. Another handy use would be to send an e-mail to the database administrator every time an error is registered. Relatively few databases use triggers because they take a good deal of programmatic power and lots of extra cycles to track potential events.
Indexes are a way to speed up searches of large data sets. You can think of it like this: Say you need to find a particular customer’s file in a large stack of files tossed haphazardly on your desk (not that you would ever do this in real life, of course). Or you could look for the file in a set of filing cabinets sorted by some alphabetical scheme. Obviously the filing-cabinet system would be faster, because it would presort the documents into smaller buckets. In a nutshell, that is what indexes do.
If you have a million users, it will be very slow for a database to find one by last name because the program will have to look at each entry in the Lastname field and compare it to the string you’re searching for. An index placed on that column will make it possible for the database to search only part of the data set, and this will result in a faster search.
However, indexes should not be used by everyone. For one thing, they typically slow down writes while speeding up reads. They don’t necessarily speed up every type of query, and your data set may not be big enough to show an appreciable speed difference. Indexes are a scalability feature - they’ll help you a lot if you have half a billion entries, but they very well may not help you at all if you have 500.
Foreign keys and integrity constraints
The relational structure of a database is often implicit in the ways fields of one table refer to row IDs of another, but your database won’t necessarily do anything helpful to make sure that structure is respected as changes are made. One way your database can help is via cascading deletes - automatically deleting rows that depend on other rows being deleted (this is sometimes implemented as a trigger). For example, if you delete a hospital patient record, you might want all the orphaned rows in the corresponding table of patient visits to automatically be deleted too. Alternatively, a database system can simply not permit the deletion of parent rows unless potential orphans are deleted first. Whether this kind of a constraint is a lifesaver or just an annoying restriction depends on how crucial it is that the relational structure be completely reliable and consistent, and how frequently you need to do these kinds of dangerous operations. Most of these features can be implemented, although less cleanly and efficiently, with a combination of traditional keys and the client code you use to manipulate the data.
As your data store expands, you will need to think about scaling up. For a certain amount of time, one can just move the database server to faster machines with more processors and bigger disks - but sooner or later a growing database will need to be replicated on more than one server.
To do this, there must be some means of automatically keeping the different servers synched up. This usually involves a journaling system, and often a master-slave relationship between database servers. One database is designated the master, and all new data is inserted into it. A journal keeps track of these changes in chronological order. All the other servers are slaves, which serve up data rather than taking it in. They periodically read the master journal and make the same changes in themselves.
The next step up is some kind of failover mechanism, by which a slave can become the master database server if the master goes down. Think carefully about how bombproof your data needs to be, as this type of safety is expensive.
If you’ve never chosen a database before, the large choice of PHP-supported products can be dizzying at first. Table 1 will give you a first-glance introduction to the various databases most easily available to PHP users, with notes on drivers and licensing.
Table 1: PHP-Supported Databases
FF= Flat-file; R = Relational; O-R = Object-relational; U = Unix; W = Windows, C = Commercial; OS = Open Source; Sh = Shareware
Database Abstraction (or Not)
Database abstraction - writing wrapper functions or classes instead of using the bare PHP commands - is one of those quasireligious topics in programming. Some excellent programmers swear by it and make good arguments for it. Others, just as experienced and articulate, think it sounds better than it usually works out to be. We know of top PHP teams that have members in both camps.
The truth of the matter is that it’s more of an issue with PHP than with almost any other programming language because PHP is so strong in multiple database connectivity. Enterprise Java installations are often all about choosing between two or three very similar database products with JDBC connectivity, so Java developers don’t argue nearly as much about the merits of database portability. Similarly, ODBC is the standard option for ASP developers. The issue of database abstraction basically arises just because PHP hooks up to so many database products with fast native APIs.
The arguments for database abstraction basically boil down to this: You can swap databases without changing a lot of code, and sometimes it saves you some keystrokes. The arguments against database abstraction basically boil down to this: If you have to swap databases, you’ve probably already screwed up big-time, and the practice limits you to the most basic, common SQL functions used in a fixed pattern instead of letting you code in the way which takes maximal advantage of your particular database’s feature set.
For instance, one of Oracle’s coolest features is the capability to take an entire result set and dump it into a single-dimensional numerical PHP array. None of the other major databases used with PHP have this feature - they can fetch only a single row at once. To implement it yourself in a portable way could add considerable overhead to your query, because you’d have to fetch each row and rewrite it to a value in some second array. Of course, you can do this kind of thing - the question is, should you?
Not to say that database switching doesn’t happen - we’ve personally had to do it more than once. However, the experienced mind quails at the contention that you should be able to plug in different databases whenever the spirit moves you. Not to belabor the obvious, but it’s a non-trivial task to change databases! It’s not something most developers will want to contemplate unless the situation is absolutely dire. In every case in which we were personally involved, the database change was part of a complete site rewrite and involved a tremendous amount of pain for everyone involved - all-nighters, frantic attempts to flip through 1200-page database manuals, endless data transfer attempts which choked countless numbers of times. All of which is to suggest that database portability might be a good idea in theory, but in practice it usually means a bad, bad architectural mistake was made at some prior point. In the early days of the Web, bad architectural decisions were a matter of course because no one could predict which technologies and products would last. As the rate of change has slowed somewhat, and clear leaders in various product categories have emerged, this type of rescue operation should become less necessary.
As always, our advice is to put your own needs first and be skeptical of unsolicited advice. Remember that many commercial and open source PHP packages that incorporate database abstraction may have a mission different from yours - often they are focused on getting the largest installed userbase and are willing to accept some non-optimal database code to achieve it. You might just need to focus on how to keep one particular site running most smoothly. Heavily discount any advice from those who have not actually had to switch databases. When people talk about their experiences swapping databases, find out how big the change was in programming terms - migrating from MS SQL Server to MySQL is less of a stretch than switching from MySQL to Oracle. Then make a decision about what’s right for you in your particular situation.
There's one particular situation in which you may have to accept database abstraction even if it's not the right technical decision. Our experience has been that many inexperienced clients or bosses want the option of switching from some perfectly functional database to Oracle or DB2 in the undefined future. This is a delicate social-engineering concern, because they have an investment in the idea that their online business will grow massively and become a real enterprise. It is pointless to tell them that by the time they need Oracle or DB2, the sun will have consumed all its fuel and the Himalayas will be eroded into a flat plain. Smile and implement database wrapper functions.
Our Focus: MySQL
We, like most every other team who has ever written a book about PHP, love MySQL and use it in all the upcoming examples in Part II MySQL is quite likely the fastest, cheapest, simplest, most reliable database that also has most of the features you’d want and - this is the real differentiator - comes in more or less equally good Unix and Windows implementations.
Despite our love of, and faith in, MySQL, we do recognize that it will occasionally fall short of your needs. Later on in this book, we cover, admittedly in somewhat less depth, two alternatives. In my future blogs, we’ll spend some time with PostgreSQL - an open source database that aspires to the object-relational design described above. For most purposes, using PostgreSQL is akin to driving in a thumbtack with a sledgehammer, but if you simply must have some of these features, it is a strong (and free) implementation that gets the job done for a lot of people.
In my future blog we will spend some time with Oracle. You are probably already aware that Oracle is a commercial product, and you may wonder why PHP has such substantial support for such a philosophically different product. Whether by technical merit, sheer marketing genius, or mass hypnosis - Oracle is ubiquitous in the commercial setting; and PHP’s excellent support for it has opened the door for open source in a lot of organizations that might not otherwise have seen the light.
The great advantage of the Web is its capability to make large quantities of information publicly available quickly and cheaply. This functionality has been tremendously enhanced by the recent increase in availability of inexpensive, reliable databases.
Since many experienced programmers may never have had to choose a database before, we describe some of the basic points that should be taken into account in the decision-making process. These include the basic database design (flat-file, relational, or object-relational), the API or driver, and the ease of future porting. Optional features, such as transactions or a graphical interface, may also figure into the choice of database. PHP supports many databases of a variety of types, so you have an excellent chance of finding exactly the feature set you need.