Hi Garvey,
You may want to contact that company and ask what they charge for 1 or 2 chips. They quote $2.40 and $3.85 in quantities of 100K, but you won't be using anywhere near that many, so the small-quantity price might be too costly.
Does it matter who says the phrase? Should only certain people gain entrance, or can anyone who says the command get in?
Assuming the IC is too expensive, here is what you could do.
To handle different amplitudes, you could record the signal to memory and then scale it so that its peak sits at the maximum (i.e. if they said it quietly, it would still use the full range).
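Here is a rough Python sketch of that scaling step, just to make the idea concrete. The function name normalize_peak and the full_scale parameter are my own inventions for illustration, and it assumes the recording is already in memory as a list of signed sample values centred on zero.

```python
def normalize_peak(samples, full_scale=1.0):
    """Scale samples so the largest absolute value equals full_scale.

    Assumes samples are signed and centred on zero (hypothetical format).
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        # Pure silence: there is nothing to scale, so return zeros.
        return [0.0] * len(samples)
    gain = full_scale / peak
    return [s * gain for s in samples]


# A quiet recording ends up using the full range.
quiet = [1, -4, 2]
loud = normalize_peak(quiet)
```

Note that scaling to the peak also scales up any background noise, so a very quiet recording may still compare poorly.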
You may also have difficulty with people talking slower or faster. To solve this, you would need to "shrink" or "stretch" the signal so that it has the same length as the stored one. To do that, you will need some sort of indicator to tell the person when to speak (like a tone). That gives you a known starting point, so you can line up the signal to be compared with the one in memory. I don't know how you will determine the end of what they are saying.
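As a sketch of the "shrink" or "stretch" step, the Python below resamples a recording to a fixed number of samples using straight-line interpolation between neighbouring samples. The name resample_to_length is hypothetical, linear interpolation is just one simple choice, and it assumes you have already trimmed the recording to the spoken part (which, as I said, is the bit I don't have an answer for).

```python
def resample_to_length(samples, target_len):
    """Stretch or shrink samples to exactly target_len points.

    Uses linear interpolation between neighbouring samples (an assumed,
    simple method; nothing here detects where the speech starts or ends).
    """
    n = len(samples)
    if n == 0 or target_len <= 0:
        return []
    if n == 1 or target_len == 1:
        return [float(samples[0])] * target_len
    # Distance in input samples between consecutive output samples.
    step = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = int(pos)              # input sample at or before pos
        hi = min(lo + 1, n - 1)    # next input sample, clamped at the end
        frac = pos - lo            # how far pos lies between lo and hi
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


stretched = resample_to_length([0, 10], 5)        # 2 samples -> 5
shrunk = resample_to_length([0, 1, 2, 3, 4], 3)   # 5 samples -> 3
```

Keep in mind that stretching the raw waveform this way also shifts its frequencies, so it only compensates for modest differences in speaking speed.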
Following all this processing, you then perform a correlation between the pattern signal and the signal just entered into the system. You do this by multiplying the two signals together sample by sample (which is why they should be the same length), and then adding all the products together. This step is what typically kills the system, because all those multiplies are processor intensive. Following the correlation, you simply need to "tune" the system (i.e. if the correlation is over X then open the door, else quit). The lower you make X, the more likely the door will open for any noise. The higher you make X, the more likely the door will not open at all, even for the right phrase.
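The multiply-and-add and the threshold test could look something like the Python below. The function names and the default threshold of 0.8 are made up for illustration; you would have to find X by experiment. The second function divides the sum by the signal energies so the score lands between -1 and 1, which is my own addition to make X easier to pick and not something the basic method requires.

```python
import math


def correlate(pattern, candidate):
    """Multiply the signals sample by sample and add up the products."""
    if len(pattern) != len(candidate):
        raise ValueError("signals must be the same length")
    return sum(p * c for p, c in zip(pattern, candidate))


def normalized_correlation(pattern, candidate):
    """Correlation divided by signal energy, giving a score in [-1, 1].

    The normalization is an assumption added here, not part of the
    plain multiply-and-add described above.
    """
    energy = math.sqrt(correlate(pattern, pattern) *
                       correlate(candidate, candidate))
    if energy == 0:
        return 0.0  # at least one signal is pure silence
    return correlate(pattern, candidate) / energy


def should_open(pattern, candidate, threshold=0.8):
    """Open the door only if the score is over the tuned threshold X."""
    return normalized_correlation(pattern, candidate) > threshold


stored = [1, 2, 3]
same = should_open(stored, [1, 2, 3])         # identical signal
inverted = should_open(stored, [-1, -2, -3])  # inverted signal
```

On a small microcontroller you would do the same loop in integer arithmetic, since the sum of products is the expensive part.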
I believe this method will have problems discriminating between different people, though. You could use fuzzy logic or a neural network algorithm to get more specific.
Hope some of that helps.