Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech data and then used for a range of downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework affects performance on downstream tasks. For instance, models pre-trained with targets that capture prosody learn representations suited to speaker-related tasks, while those pre-trained with targets that capture phonetics learn representations suited to content-related tasks. Moreover, prediction targets can differ in the level of detail they capture. Models pre-trained with targets that encode fine-grained acoustic features perform better on tasks like denoising, while those pre-trained with targets focused on higher-level abstractions are more effective for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work investigates these design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.