TEXT2AV – Automated Text to Audio and Video Conversion

The paper aims to develop a machine learning-based system that automatically converts text to audio or text to video as per the user's request. Reading a large body of text can be difficult, so the TTS model makes it easier by converting the text into audio, with the output delivered by a lip-synced avatar in multiple languages to make the interaction more attractive and human-like. The TTS model is built on WaveRNN (Waveform Recurrent Neural Network), an auto-regressive model that predicts future samples from the present ones. The system identifies the keywords in the input text and uses diffusion models to generate high-quality video content, while a GAN (Generative Adversarial Network) is used to generate the video frames. Frame interpolation is used to synthesize intermediate frames between two adjacent frames, for example to generate slow-motion video. The WebVid-20M, ImageNet, and Hugging Face datasets are used for text-to-video, and the LibriTTS corpus and a lip-sync dataset are used for text-to-audio. The system provides a user-friendly and automated platform that takes text as input and produces either high-quality audio or high-resolution video quickly and efficiently.


Introduction
Nowadays, the increasing demand for multimedia has led to a growing need for audio and video content. Multimedia content is something that we can listen to or watch anywhere at any time. Text, graphics, sound, animation, pictures, and video are all multimedia elements. Everyone watches TV shows, news, movies, and more on a daily basis, which is video content, while listening to music, audiobooks, recordings, and podcasts falls under audio content. The rapid development of technology has increased the usage of smart devices such as mobile phones, tablets, and laptops. Multimedia content is used in different areas for different purposes: educational institutions produced lecture videos for teaching during the pandemic, companies use advertisements as a powerful tool to attract customers, and social media users create videos to gain recognition. Everyone searches for shortcuts to complete their tasks more easily and save time. Audio content is a type of multimedia created with sound. Understanding information presented only in written form is time-consuming. For example, when a teacher explains a topic orally in a classroom, everyone understands it more easily than when they read it. Reading a long storybook takes a lot of time, which may make the reader stop, whereas audiobook technology keeps them interested enough to finish. So, listening to text rather than reading it helps to grasp and analyze the content quickly.
Audio content can also be consumed while multitasking: it can be listened to while traveling, eating, or exercising. There is a wide range of text-to-speech software available these days, such as Amazon Alexa, Apple's Siri, and Google's speech engine. Billions of videos have been uploaded to the internet and can be accessed by everyone. Video is another type of multimedia that presents content in a realistic way with different colors, pictures, and backgrounds. We get many videos to choose from for the text we type, but they may contain extra content beyond what we need. For example, if we search for "cat eating", we get many cat videos showing different actions but might not find exactly what we wanted. This happens routinely on YouTube because those videos are created and uploaded by other people: the results are not generated for the query, they are simply existing videos that match the keywords. As a result, one may end up with mismatched outputs, unwanted content, long search times, and so on. Hence, we developed a model that generates audio and video directly from the given text input.

Literature Survey
Doyeon Kim and Donggyu Joo [1], [17] suggested text-to-image-to-video generation using a deep neural network framework called Generative Adversarial Networks (GAN). The objective was to generate high-resolution video with a GAN through a two-stage learning process. The methodologies used in this paper were a generator and an image discriminator, which are part of the adversarial network architecture. Data was collected from three datasets: KTH Action, MUG, and Kinetics. The Kinetics dataset consists of 500,000 video clips of human actions, whereas the MUG dataset consists of various facial clips. The accuracy obtained for the generated videos on Kinetics is 77.8%.
Yitong Li and his team [2], [18] suggested video generation from text using Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN). The objective of this paper was to generate high-quality videos from the given text. The methodologies used were a gist generator and Conditional Variational Auto-Encoders (CVAE). Data was obtained from the YouTube-8M dataset or by downloading videos with matching textual keywords from YouTube. YouTube-8M is a large dataset containing millions of YouTube videos covering areas such as games, arts, animals, and vehicles. The accuracies achieved using DT2V, PT2V, GT2V, and T2V were 10.1%, 13.4%, 19.2%, and 48.8% respectively. Amir Mazaheri and Mubarak Shah [3] suggested video generation from text using latent path construction for temporal modelling. To create RGB frames from each latent representation and gradually boost the resolution, they stacked up pooling blocks. The objective of this paper was to generate more realistic videos using deep learning. The methodologies used were CNNs, conditional batch normalization (CBN), and Latent Linear Interpolation (LLI). Data was obtained from datasets such as Actor and Action (A2D), UCF101, and robotic videos. The accuracy is moderate, as the proposed method can handle datasets with a higher skip frame rate. Chitwan Saharia, William Chan, and Jonathan Ho [7] proposed Imagen Video: high-definition video generation with diffusion models, a system for creating text-conditional videos built on a cascade of video diffusion models. Using a base video generation model and a series of interleaved spatial and temporal video super-resolution models, Imagen Video creates high-definition videos from a text prompt. They discuss design choices such as selecting fully convolutional temporal and spatial super-resolution models at specific resolutions and selecting the v-parameterization of diffusion models as they scale the system up to a high-definition text-to-video model. To train their models, they combined an internal dataset of 14 million video-text pairs and 60 million image-text pairs with the publicly accessible LAION-400M image-text dataset.
Adam Polyak, Thomas Hayes, Xi Yin, and Uriel Singer [6] proposed Make-A-Video: Text-to-Video Generation Without Text-Video Data, a method for transferring the enormous recent advances in Text-to-Image (T2I) generation directly to Text-to-Video (T2V). Their guiding principle is straightforward: learn how the world looks and is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video offers three benefits. Since there is no need for paired text-video data, it speeds up T2V model training. In addition, the generated videos inherit the diversity (variety in aesthetics, fantastical depictions, etc.) of current image generation models. They created a straightforward but effective method for adding cutting-edge spatio-temporal modules to T2I models. They first approximated the full temporal U-Net and attention tensors in space and time. Second, they designed a spatio-temporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models, which can enable various applications besides T2V.
Levon Khachatryan, Andranik Movsisyan, and Vahram Tadevosyan [5] proposed that text-to-image diffusion models are zero-shot video generators: a cheaper and easier, zero-shot text-to-video synthesis in which no training or fine-tuning is performed on a video dataset. They used only a pre-trained text-to-image diffusion model, without any further optimization or fine-tuning. Their method's central idea is to enhance a pre-trained text-to-image model (such as Stable Diffusion) with temporally consistent generation. The approach capitalizes on the excellent image-generation capabilities of pre-trained text-to-image models and extends their applicability to the video domain without requiring further training. To enforce temporal consistency, they present two novel and simple modifications: first, they use cross-frame attention of each frame on the first frame to maintain the context, appearance, and identity of the foreground object throughout the entire sequence; and second, they enrich the latent codes of generated frames with motion information to keep the overall scene and background consistent over time. This produces high-quality, temporally consistent videos.
Chenfei Wu and his research team [4] suggested GODIVA, which generates open-domain videos from natural descriptions. The objective of this paper is to generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism. The methodologies used were VQ-VAE, three-dimensional sparse attention, and a latent video transformer. The HowTo100M dataset served as the foundation for GODIVA's pre-training, while the MSR-VTT dataset served as the basis for evaluation and the MNIST dataset served as the training ground. The performance in the RM metric greatly increased to 98.34.

Proposed method
Problem Statement: Nowadays, the increasing demand for multimedia has led to a growing need for audio and video content. Multimedia content is something that we can listen to or watch anywhere at any time. The rapid development of technology has increased the usage of smart devices like mobile phones, tablets, and laptops, and multimedia content is used in different areas for different purposes. So, there is a need for efficient ways of converting text to audio. Most of the existing approaches are time-consuming and produce low-quality audio content. The problem is that there is a need for an automated model that can convert text to high-quality audio content. The system utilizes a text-to-speech engine together with spectrogram and waveform generation to analyze the text and generate high-quality audio (a minimal spectrogram sketch is given after the objectives below). Our lip sync and face model technology combines a text-to-voice engine with computer vision and deep learning techniques to produce natural-sounding speech that is synced with a 3D/2D representation of a human face. Traditional methods for generating a video are not efficient and result in low video quality, so there is also a need for an automated model that can convert text to high-resolution video content. The objectives of the paper are as follows:
• Reading a large amount of text can be a problem for individuals, as it can be uninteresting and will not hold the reader's attention.
• Audio and video content can be understood and consumed faster than textual content.
• So, by converting text to audio or video, the same content can be delivered quickly.
• To provide a user-friendly platform to generate high-quality audio and video content quickly and efficiently.
• To provide the audio in different languages and emotions to make it more attractive and interactive.
• To offer different languages to choose from so that it can be used by different types of users.
• To produce the audio output through an avatar with lip sync to make it look more attractive and human-like in its interactions.
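As a hedged illustration of the spectrogram step mentioned above, the following sketch computes a log-mel spectrogram from a synthesized waveform; the file name and parameter values (sample rate, FFT size, number of mel bands) are illustrative assumptions, not the exact settings of our system.

```python
# Minimal sketch: mel-spectrogram extraction for a TTS front end (assumed parameters).
import librosa
import numpy as np

def audio_to_log_melspectrogram(wav_path: str, sr: int = 22050, n_mels: int = 80):
    """Load a synthesized waveform and convert it to a log-mel spectrogram."""
    y, sr = librosa.load(wav_path, sr=sr)            # waveform, resampled to sr
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)   # log scale for the model/vocoder
    return log_mel                                    # shape: (n_mels, frames)

# Example usage with a hypothetical file produced by the TTS engine
# spec = audio_to_log_melspectrogram("file.wav")
```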

Architecture diagram
Taking the input in the form of text, a GAN is trained and tested using super spatial resolution, which can be a computationally intensive task, as it requires handling large and complex images. However, with the right techniques and hardware, GANs can generate high-quality images that reveal fine detail, leading to new discoveries and advancements. The GAN is trained with a generator network learning to generate high-resolution images and a discriminator network learning to differentiate between original and fake images. Testing the GAN involves generating new samples using the trained generator network and evaluating the quality of the generated samples. Spatio-temporal models, such as those used for action recognition or video processing, require sequences of images as input to capture motion and temporal dynamics; the specific pre-processing techniques used for these models depend on the task and the characteristics of the dataset.
For the audio path, the input is again taken in the form of text. A text-to-speech (TTS) model is a type of neural network that generates synthesized speech from written text: it takes a text input and produces an audio output that sounds like human speech. The performance of a TTS model depends on factors such as the size and quality of the dataset, the architecture of the model, and the evaluation metrics used, and improvements in these factors lead to more natural and expressive speech. A lip sync model, built on a sequence-to-sequence (seq2seq) architecture, can be used to produce lip movements that are timed to an audio recording: given an audio recording as input, the model generates a series of lip movements synchronized with the audio. A face model based on a Generative Adversarial Network (GAN) can be used to produce realistic images of human faces; GANs are made up of two adversarially trained neural networks, a generator and a discriminator.
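To make the generator/discriminator relationship described above concrete, the following is a minimal sketch of a GAN pair for image frames in PyTorch; the layer sizes and the 64x64 frame resolution are illustrative assumptions, not the configuration used in our experiments.

```python
# Minimal GAN sketch (assumed sizes): the generator maps noise to a 64x64 RGB frame,
# the discriminator maps a frame to a real/fake probability.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * 64 * 64), nn.Tanh(),   # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),           # probability that the frame is real
        )

    def forward(self, x):
        return self.net(x)
```

In a full system the fully connected layers would typically be replaced by convolutional and super-resolution blocks, but the adversarial pairing stays the same.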

Data collection
A publicly available research dataset is used for this proposed work. Data collection is a crucial step in any work that involves analysing information or making decisions based on data; the process involves gathering information from different sources.

Data preparation
Data pre-processing is the process of cleaning and transforming raw data into a form suitable for analysis. It is a crucial step in any data science work because the quality of the input data has a huge impact on the accuracy and effectiveness of the model. Some of the data pre-processing techniques are:
• Data cleaning: removing or fixing invalid, incorrectly formatted, or missing values within the dataset.
• Data transformation: converting raw data into the required format so that it can be used for analysis.
• Data integration: processing data from multiple sources and combining them into a single source.
• Data reduction: reducing the size of a dataset by removing irrelevant data.
• Feature engineering: creating new features from the existing data to improve the performance of a model.
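As a hedged example of the data cleaning step for the text inputs, the sketch below normalizes and filters raw caption strings; the specific rules (lower-casing, dropping empty or over-long captions) are illustrative assumptions rather than our exact pipeline.

```python
# Minimal sketch of text data cleaning for caption/text inputs (assumed rules).
import re
from typing import Optional

def clean_caption(text: str, max_words: int = 30) -> Optional[str]:
    """Normalize one caption; return None if it should be dropped."""
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)               # collapse repeated whitespace
    text = re.sub(r"[^a-z0-9 .,!?']", "", text)    # drop stray symbols
    if not text or len(text.split()) > max_words:  # remove empty / over-long rows
        return None
    return text

raw = ["  A Cat Using  TV Remote ", "", "###corrupted###"]
cleaned = [c for c in map(clean_caption, raw) if c is not None]
print(cleaned)  # ['a cat using tv remote']
```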

Feature extraction, model construction and training
Feature extraction is a technique for selecting and transforming raw data into a set of relevant features, also known as predictors. The goal of feature extraction is to reduce the dimensionality of the data while retaining its most important and informative characteristics. In machine learning, feature extraction is used to pre-process data before feeding it into the learning algorithm, to improve the efficiency of the algorithm. The model is developed using either a supervised or unsupervised learning methodology built around a GAN. Training and testing with a GAN (Generative Adversarial Network) involves several steps. A GAN consists of two neural networks, the generator and the discriminator. The GAN is trained with the generator network learning to generate high-resolution images and the discriminator network learning to differentiate between original and fake images. Testing the GAN involves generating new samples using the trained generator network and evaluating the quality of the generated samples. The general steps in training and testing are: pre-processing, where the data is normalized, scaled, and transformed into the form required by the GAN; model design; and the adversarial training loop itself, sketched below.
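The following is a minimal sketch of the adversarial training loop for the Generator and Discriminator defined in the architecture sketch above; the optimizer settings, learning rate, and data loader are illustrative assumptions.

```python
# Minimal GAN training-loop sketch (assumed hyper-parameters), reusing the
# Generator/Discriminator classes from the earlier sketch.
import torch
import torch.nn as nn

def train_gan(generator, discriminator, dataloader, epochs: int = 10, z_dim: int = 100):
    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    for epoch in range(epochs):
        for real_images in dataloader:                  # real frames, shape (B, 3, 64, 64)
            batch = real_images.size(0)
            real_labels = torch.ones(batch, 1)
            fake_labels = torch.zeros(batch, 1)

            # 1) Update discriminator: real frames -> 1, generated frames -> 0
            z = torch.randn(batch, z_dim)
            fake_images = generator(z).detach()
            loss_d = bce(discriminator(real_images), real_labels) + \
                     bce(discriminator(fake_images), fake_labels)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # 2) Update generator: try to make the discriminator output 1 on fakes
            z = torch.randn(batch, z_dim)
            loss_g = bce(discriminator(generator(z)), real_labels)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return generator, discriminator
```

Testing then consists of sampling new noise vectors, passing them through the trained generator, and scoring the outputs with the evaluation metrics described below.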

Model evaluation
The proposed model is evaluated using statistical measures as well as a qualitative comparison with other existing approaches.

Results and discussions
Compared to other text-to-audio and text-to-video conversion techniques in use today, the proposed approach provides the best results in terms of audio and video quality, accuracy, and speed. The performance of the system is improved by using deep learning and computer vision techniques together with models such as Wav2Lip, GANs, diffusion models, and spatio-temporal models. Experimental results show that the proposed scheme outperforms traditional methods as well as other machine learning-based methods that use simpler models. For example, the system produces audio and video clips that include lip sync, natural-sounding speech, and facial expressions resembling human speech. The algorithm is also more accurate than other methods at finding key points in the text and generating the corresponding content. Compared to existing systems, the proposed system increases speed and efficiency as well as the quality of the audio and video content. These results show that the proposed method is an effective approach for text-to-audio and text-to-video conversion, with applications in many fields such as entertainment. In our tests, we evaluate the performance on the proposed datasets and compare against other existing methods for text-to-audio and text-to-video generation. We use a variety of metrics to evaluate the performance of the audio and video generators, including Mean Opinion Score (MOS), Signal-to-Noise Ratio (SNR), and Word Error Rate (WER). Our results show that our proposal outperforms other available methods in terms of MOS, SNR, and WER. Our method uses lip sync and face modelling techniques to create speech that is visually synchronized with the facial structure; lip sync and face modelling performance has been greatly improved using Wav2Lip and GAN algorithms, resulting in more accurate and realistic facial animation.
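As a hedged illustration of how one of the reported metrics can be computed, the sketch below implements Word Error Rate (WER) as a word-level edit distance between a reference transcript and a recognized transcript of the generated audio; it is a generic implementation, not the exact evaluation script used for the reported numbers.

```python
# Minimal Word Error Rate (WER) sketch: Levenshtein distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("a cat is using the tv remote", "a cat using tv remote"))  # 2/7 ≈ 0.29
```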

Text to audio
Parameter 1: Input: choose language, enter text for text-to-speech. Input language: Telugu, input text: [Telugu text]. Output: file.mp3. When the user enters input text in any language, in this case a requested text in Telugu, an mp3 file of the specified text in the specified language is generated.
Parameter 2: Input: choose language, enter text for text-to-speech. Input language: Hindi, input text: तुम कैसे जीवित हो. Output: file.mp3. Similarly, when the user enters input text in Hindi, an mp3 file of the specified text in the specified language is generated. The user can download the generated audio file by saving it to their PC or can listen to it directly. This makes the system engaging, allowing the user to listen immediately or save the audio for future use.
The proposed scheme combines several advanced functions for text-to-audio and text-to-video conversion and has many advantages over traditional methods. First, the sound output is realistic and lifelike, and users can customize it through the TTS engine, spectrogram generation, and waveform generation. Second, creating video content that syncs with the audio output makes the result interesting and lively; this is done with Wav2Lip, a lip synchronization and face modelling technique that uses GANs and computer vision. Third, even when the source resolution is low, the video output is of good quality and suitable for many applications, thanks to diffusion models, spatio-temporal models, spatial super-resolution, and frame interpolation. Overall, the scheme provides a simple and effective framework for creating high-quality audio and video content from text. Benefits include the ability to handle low-resolution inputs, the ability to produce both audio and video, and a streamlined content-creation process. The following points indicate the quality of our model. Efficiency: the scheme uses cutting-edge deep learning algorithms and techniques such as Wav2Lip, GANs, computer vision, and diffusion models to quickly and efficiently generate high-quality audio and video content. Natural speech output: by combining the text-to-speech engine with lip sync and face modelling, the system can produce speech content that mimics human speech, giving the final product an organic feel. System adaptability:
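As a hedged illustration of the multilingual text-to-speech step shown in the parameters above, the following sketch uses the gTTS library to synthesize an mp3 for a chosen language code; gTTS is one possible engine and is an assumption here, not necessarily the exact engine used in our system.

```python
# Minimal sketch: multilingual text-to-speech to an mp3 file (gTTS assumed as the engine).
from gtts import gTTS

def synthesize(text: str, lang: str = "en", out_path: str = "file.mp3") -> str:
    """Convert `text` in language `lang` (e.g. 'te' for Telugu, 'hi' for Hindi) to an mp3."""
    tts = gTTS(text=text, lang=lang)
    tts.save(out_path)        # writes the synthesized speech to disk
    return out_path

# Example usage mirroring Parameter 2 (Hindi input)
# synthesize("तुम कैसे जीवित हो", lang="hi", out_path="file.mp3")
```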

Conclusion and future enhancements
Seeing the increasing demand for multimedia content like audio and video, we put forth a new model to generate audio and video just from the given text. Text-to-audio generation is the process of converting written text into an audio format such as speech. In recent times this technology has been used for voice assistants, audiobooks, Amazon Polly, speech texters, and more. To differentiate our work, we made improvements by including AI avatars, which make the listening process interesting and attractive. The system produces plain audio for simple use and can also produce audio content resembling human interaction with the help of the installed avatars and their corresponding lip sync. We trained the model on the proposed datasets to produce better audio output than other approaches, and we also trained the model to drive the avatar's lip movements from the input text, which makes it look similar to human interaction.
Our proposed model for video generation works well, but in the future we can also train on and include other video datasets to obtain higher accuracy. Video datasets contain a huge number of images and GIFs, which increases the size of the datasets, so we would have to work on a much more powerful workstation.

Fig. 4. Frame clips from Michaeljackson.mp4.
It feels boring to listen to a plain audio file, so to make it more interesting and interactive, after the user enters the specified text the model generates an audio file. Now the user can listen to it directly or download it for later use.

Text to video
Parameter 1: Input: Enter your text. Output: file.mp4. Input 1: A Cat Using TV Remote.

Fig. 7. Frame clip from Apandaeatingbambooonarock.mp4.
The aim of video frame interpolation is to synthesize several frames in between two adjacent frames. It is used here to generate slow-motion video, to recover frames, and to increase the video frame rate in video streaming. The basic steps involved in using frame interpolation in a video generation model are:
• Input the initial frame(s) of the video sequence to the model.
• Use the model to predict the motion and appearance of objects in the next frame of the sequence; the model can use techniques such as motion vectors, optical flow, or recurrent neural networks to generate the intermediate frames and estimate the motion.
• Repeat the process to generate additional frames until the desired video length is reached.
• Evaluate the quality of the generated video sequence using metrics such as PSNR, SSIM, or human perceptual evaluation.
• Iterate and refine the model if necessary to improve the quality of the generated video.
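As a hedged illustration of the simplest possible form of the interpolation step described above, the following sketch blends two adjacent frames linearly with OpenCV to produce intermediate frames; real systems would typically use optical flow or a learned interpolation model, so this is only a toy baseline.

```python
# Minimal frame-interpolation sketch: linear blending between two adjacent frames.
# A learned or flow-based interpolator would replace the blend step in practice.
import cv2
import numpy as np

def interpolate_frames(frame_a: np.ndarray, frame_b: np.ndarray, n_intermediate: int = 3):
    """Return n_intermediate blended frames between frame_a and frame_b."""
    frames = []
    for k in range(1, n_intermediate + 1):
        alpha = k / (n_intermediate + 1)          # position between the two frames
        blended = cv2.addWeighted(frame_a, 1.0 - alpha, frame_b, alpha, 0.0)
        frames.append(blended)
    return frames

# Example usage with two hypothetical frames extracted from a generated clip
# a = cv2.imread("frame_000.png"); b = cv2.imread("frame_001.png")
# mids = interpolate_frames(a, b, n_intermediate=3)  # 3 extra frames -> 4x frame rate
```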