Deep learning AI models are already challenging; multimodal models increase the difficulty considerably. With Textpert’s AiME technology, we take it a step further by combining vastly different modalities: video, a 3D spatial modality; audio, a 1D pressure signal; and natural language, words with obscure spatial qualities.
To be fair, we Textperts appreciate a good challenge… but we are not masochists.
It turns out this multimodal approach is essential to understanding mental health from behavioral cues, just as therapists do.
Although we omitted ablation studies from our initial paper on AiME for brevity, we can confirm what others have reported. For example, an AVEC 2019 paper reports a 20% increase in predictive power when modalities are fused rather than used independently, an increase similar to what we observed internally.
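For readers new to the term, "fusing" modalities can be as simple as joining each modality's features into a single vector before a prediction is made. The sketch below shows only that idea. It is not AiME's architecture: the embedding sizes, the random stand-in features, and the `fuse` helper are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings for a batch of 4 clips.
# The sizes (128, 64, 32) are illustrative, not taken from AiME.
video_feat = rng.normal(size=(4, 128))
audio_feat = rng.normal(size=(4, 64))
text_feat = rng.normal(size=(4, 32))


def fuse(*features):
    """Feature-level fusion: concatenate embeddings along the feature axis."""
    return np.concatenate(features, axis=1)


# One joint representation per clip, which a single downstream model can use.
fused = fuse(video_feat, audio_feat, text_feat)
print(fused.shape)  # (4, 224)
```

With independent modalities, each of the three arrays would feed its own predictor; with fusion, one predictor sees all 224 features at once and can pick up on interactions between them.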