Hearing is a key modality on which many human perceptual processes rely. Together with vision, these two modalities offer a 360-degree, highly sensitive, quickly adaptive, and remarkably precise system for perceiving the environment. In an exploratory robotics context, the concept of audiovisual objects is highly relevant for a robot, since it enables the robot both to better understand its environment and to interact with it. But how should the robot handle the cases where an object is out of sight, or where it does not emit sound, that is, the cases of missing information? The proposed Multimodal Fusion and Inference (MFI) system takes advantage of (i) multimodal information and (ii) the ability to move in the environment to implement a low-level attentional algorithm that enables a mobile robot to understand its environment in terms of audiovisual objects. When a modality is missing, the proposed algorithm infers the missing data, thus providing full information to the robot's higher cognitive stages. The MFI system is based on an online, unsupervised learning algorithm using a modified self-organizing map. Furthermore, the MFI exploits the ability to turn the robot's head towards objects, thus benefiting from active perception to autonomously reinforce what the system is learning. Results exhibit promising performance in closed-loop scenarios involving sound and image classifiers.
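The idea of inferring a missing modality with a self-organizing map can be illustrated with a minimal sketch: the best-matching unit is found using only the observed dimensions of the input, and the weights of that unit then provide an estimate for the missing dimensions. Everything below (class name, map size, masking scheme, the omission of the neighborhood update) is an illustrative assumption, not the paper's actual algorithm.

```python
import numpy as np

class MaskedSOM:
    """Hypothetical sketch of SOM-based missing-data inference.

    Training uses full audiovisual vectors; at inference time a boolean
    mask marks which dimensions were actually observed. The neighborhood
    update of a full SOM is omitted for brevity (only the winner moves).
    """

    def __init__(self, units=64, dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.random((units, dim))

    def bmu(self, x, mask):
        # Best-matching unit: distance over observed dimensions only.
        d = ((self.weights[:, mask] - x[mask]) ** 2).sum(axis=1)
        return int(np.argmin(d))

    def train_step(self, x, mask, lr=0.1):
        # Online update: pull the winner's observed dimensions towards x.
        b = self.bmu(x, mask)
        self.weights[b, mask] += lr * (x[mask] - self.weights[b, mask])

    def infer(self, x, mask):
        # Complete the input: keep observed values, fill missing ones
        # from the winner's stored prototype.
        b = self.bmu(x, mask)
        out = x.copy()
        out[~mask] = self.weights[b, ~mask]
        return out
```

In use, the map would first be trained on complete audiovisual feature vectors; later, an audio-only observation (visual dimensions masked out) is completed with the visual part of the closest prototype, giving downstream stages a full vector.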