Perception and action are fundamental tasks for autonomous robots. Traditionally, they rely on theoretical models built by the system's designer. But can a naive agent learn by itself the structure of its interaction with the environment, without any a priori information? This knowledge must be extracted by analyzing the only information the agent has access to: its high-dimensional sensorimotor flow. Recent work based on the theory of sensorimotor contingencies allows a simulated agent to extract the dimensionality of geometrical space without any model of itself or of the environment. In this paper, these results are validated using a more sophisticated auditory modality. The question of multimodal fusion is then addressed by equipping the agent with vision. Finally, preliminary experimental results on a real robotic platform are presented.