Siri works by first encoding the sounds of your speech into a compact digital format, then using a combination of on-device and cloud-based language models to comprehend the speech, deciding whether each request can be handled locally or needs the network. It recognizes which letters make up the speech, estimates the words, and builds a list of possible interpretations. Finally, it acts on the most confident interpretation by determining the intent and carrying out the function on the iPhone, asking you to confirm if the speech is ambiguous.
2. Initial Sounds
First, the sounds of your speech are
immediately encoded into a compact
digital form that preserves their
information.
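The encoding step can be sketched in code. This is a minimal illustration of pulse-code modulation (sampling a waveform at fixed intervals and quantizing each sample to a 16-bit integer), not Apple's actual codec; the sine wave stands in for a moment of speech, and the sample rate and duration are chosen for illustration.

```python
import math

SAMPLE_RATE = 16000  # samples per second (16 kHz is common for speech)
DURATION = 0.01      # seconds of audio to encode

def encode_pcm(freq_hz: float) -> list[int]:
    """Sample a sine wave and quantize each sample to a signed 16-bit value."""
    samples = []
    for n in range(int(SAMPLE_RATE * DURATION)):
        t = n / SAMPLE_RATE
        amplitude = math.sin(2 * math.pi * freq_hz * t)  # -1.0 .. 1.0
        samples.append(int(amplitude * 32767))           # fit 16-bit range
    return samples

pcm = encode_pcm(440.0)
print(len(pcm), min(pcm), max(pcm))
```

Ten milliseconds of sound becomes just 160 small integers, which is what makes the digital form compact enough to process or transmit.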
3. Language Comprehenders
The iPhone sends the encoded signal
over the Internet to a cloud-based
model made up of a series of
language comprehenders.
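The "series of comprehenders" idea can be sketched as a pipeline: the request payload passes through a chain of stages, each refining the previous stage's output. The stage names and pretend outputs below are illustrative, not Apple's actual architecture.

```python
def acoustic_stage(payload: dict) -> dict:
    payload["phonemes"] = ["h", "e", "l", "o"]  # pretend sound-unit decode
    return payload

def lexical_stage(payload: dict) -> dict:
    payload["words"] = ["hello"]                # phonemes -> words
    return payload

def semantic_stage(payload: dict) -> dict:
    payload["intent"] = "greeting"              # words -> meaning
    return payload

def comprehend(audio_id: str) -> dict:
    """Run the payload through each comprehender in order."""
    payload = {"audio": audio_id}
    for stage in (acoustic_stage, lexical_stage, semantic_stage):
        payload = stage(payload)                # each stage builds on the last
    return payload

result = comprehend("utterance-001")
print(result["intent"])
```

The point of the chain is that no single model does everything: each comprehender handles one layer of the problem and hands its result to the next.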
4. Speech Evaluation
Simultaneously, a local model evaluates
the speech on the device. Together with
the cloud models, the device then decides
whether it needs the network to proceed
or can handle the request locally. For
example, a network request could be
sending a text or searching the web,
while a local request could be playing a
song or setting an alarm. If the request
is deemed local, the cloud is no
longer used.
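The routing decision above can be sketched as a simple dispatcher. The request categories come from the text (texting and web search need the network; playing a song or setting an alarm stay on-device); the function itself is an invented illustration, not Apple's real logic.

```python
# Requests the device can satisfy on its own vs. ones needing the network.
LOCAL_REQUESTS = {"play song", "set alarm"}
NETWORK_REQUESTS = {"send text", "search web"}

def route(request: str) -> str:
    """Decide whether a request stays on-device or goes to the cloud."""
    if request in LOCAL_REQUESTS:
        return "local"    # the cloud is no longer consulted
    if request in NETWORK_REQUESTS:
        return "network"
    return "network"      # unknown requests fall back to the cloud

print(route("set alarm"))  # local
print(route("send text"))  # network
```

Falling back to the network for unrecognized requests is an assumption here, but it matches the general pattern: the device only keeps a request local when it is sure it can finish the job itself.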
5. Language to Letters
Using both local and server models, the
device recognizes which letters make up
which parts of the speech. Now that the
speech has been converted into letters
(vowels and consonants), a language
model can estimate the words that
comprise it. The system then builds a
list of possible interpretations of what
your speech might mean.
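The letters-to-words step can be sketched with a toy language model: score candidate word sequences against a vocabulary of word likelihoods, producing the kind of ranked interpretation list the text describes. The vocabulary and probabilities below are invented for illustration.

```python
# Invented per-word likelihoods; a real language model learns these.
VOCAB = {"set": 0.9, "an": 0.8, "alarm": 0.95, "a": 0.5, "larm": 0.1}

def score(words: list[str]) -> float:
    """Multiply per-word likelihoods; unknown words score near zero."""
    total = 1.0
    for w in words:
        total *= VOCAB.get(w, 0.01)
    return total

# Two ways to segment the same letters into words.
candidates = [
    ["set", "an", "alarm"],
    ["set", "a", "larm"],
]
ranked = sorted(candidates, key=score, reverse=True)
print(ranked[0])  # the most likely reading of the letters
```

The same letters can be split into words more than one way; the language model's job is to rank the plausible readings so the next step can act on the best one.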
6. Final Action
After this, it is all downhill. From the
list of possible interpretations, the
most confident result is used. The
computer determines the intent of the
speech and performs the function on the
iPhone. If your speech is too ambiguous
at any point in the process, the
computer will defer to you to confirm
that its determined intent is correct.
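The final step can be sketched as picking the highest-confidence interpretation, and deferring to the user when the top result is not clearly ahead of the runner-up. The confidence margin and the interpretation scores are assumptions made for this sketch.

```python
CONFIRM_MARGIN = 0.2  # assumed: how far ahead the top result must be

def choose(interpretations: list[tuple[str, float]]) -> str:
    """Pick an intent, or ask for confirmation when results are too close."""
    ranked = sorted(interpretations, key=lambda p: p[1], reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else ("", 0.0)
    if best[1] - second[1] < CONFIRM_MARGIN:
        return f"confirm: did you mean '{best[0]}'?"  # defer to the user
    return f"execute: {best[0]}"                      # act on the iPhone

print(choose([("set alarm", 0.9), ("send text", 0.3)]))
print(choose([("call Mom", 0.55), ("call Tom", 0.50)]))
```

A clear winner is executed immediately; two near-ties (like "call Mom" vs. "call Tom") trigger the confirmation behavior described above rather than a guess.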