Feature Engineering — Bollywood Movies

Anand Saran
Aug 11, 2021

The What, Why and How of Feature Engineering

Artificial intelligence and machine learning have pervaded every industry, yielding substantial returns to those who invest in them. As machine learning technologies grow more powerful and proliferate, companies see it as imperative to implement them to gain a competitive edge. While machine learning involves training computers to perform tasks without explicit instruction, putting together coherent data to train an ML model successfully is itself a challenge. Collecting, cleaning and engineering data is the most cumbersome part of the machine learning process.

All machine learning algorithms use data as the input to calibrate and generate output. Data is initially in its crudest form and requires enhancement before being fed to the algorithm. This input data comprises features: measurable properties of a process, often represented as structured columns. The process of extracting relevant features from the data to train ML algorithms is called feature engineering. Feature engineering is vital to data science because it produces reliable and accurate data, and algorithms are only as good as the data fed to them.

What Is Feature Engineering?

What Is A Feature?

In machine learning, a feature is an individual property or characteristic of a process under study. To effectively train and calibrate algorithms, choosing appropriate features is a crucial step.

Features are the fundamental elements of datasets. The quality of features in the data set bears a strong influence on the quality of the output derived from machine learning algorithms.

Feature engineering spans across diverse applications. In speech recognition for instance, features for recognizing phonemes can include noise ratios, length of sounds, relative power, filter matches and many others. In building an algorithm to classify spam and legitimate mails, some of the features include presence of particular topics, length, presence of URLs, structure of the URL, number of exclamation points, number of misspellings, information extracted from the header and so on.

So, What Is Feature Engineering?

Feature engineering involves leveraging data mining techniques, along with domain knowledge, to extract features from raw data. It is used to improve the performance of machine learning algorithms and is often considered applied machine learning.

Features are also referred to as ‘variables’ or ‘attributes’ as they affect the output of a process.

Feature engineering involves several processes. Feature selection, construction, transformation, and extraction are some key aspects of feature engineering. Let’s understand what each process involves:

  • Feature selection involves choosing a set of features from a large collection. Selecting the important features and reducing the size of the feature set makes computation in machine learning and data analytic algorithms more feasible. Feature selection also improves the quality of the output obtained from algorithms.
  • Feature transformation involves creating features using existing data by the use of mathematical operations. For example, to ascertain the body type of a person a feature called BMI (Body Mass Index) is needed. If the data set captures the person’s weight and height, BMI can be derived using a mathematical formula.
  • Feature construction is the process of developing new features, beyond those generated in feature transformation, that serve as appropriate variables of the process under study.
  • Feature extraction is a process of reducing the dimensionality of a data set. Feature extraction involves combining the existing features into new ones thereby reducing the number of features in the data set. This reduces the amount of data into manageable sizes for algorithms to process, without distorting the original relationships or relevant information.
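The feature transformation idea above (deriving BMI from weight and height) can be sketched in pandas; the dataset and column names here are illustrative:

```python
import pandas as pd

# Illustrative dataset: weight (kg) and height (m) are the raw features
people = pd.DataFrame({
    "weight_kg": [70.0, 85.0, 54.0],
    "height_m":  [1.75, 1.80, 1.62],
})

# Feature transformation: derive BMI from the existing columns
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2
print(people)
```

The derived `bmi` column can then be used as a feature in its own right, even though it was never captured directly.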

Why is Feature Engineering Required?

The intention of feature engineering is to achieve two primary goals:

  1. Preparing an input data set that is compatible with and best fits the machine learning algorithm.
  2. Improving the performance of machine learning models

According to a survey in Forbes, data scientists spend 80% of their time on data preparation. The importance of feature engineering is realized through its time-efficient approach to preparing data that brings consistent output.

When feature engineering processes are executed well, the resulting dataset will be optimal and contain all the essential factors that bear an impact on the business problem. These datasets in turn result in best possible predictive models and most beneficial insights.

Common Feature Engineering Techniques Used

  • Imputation

One of the most common problems in machine learning is the absence of values in datasets. Missing values can arise from numerous issues like human error, privacy concerns, and interruptions in the flow of data, among many others. Irrespective of the cause, the absence of values affects the performance of machine learning algorithms.

Rows with missing values are sometimes dropped by machine learning platforms, and some platforms do not accept datasets with missing data at all. Dropping rows decreases the performance of the algorithm due to the reduced data size. With imputation, values that are coherent with the existing values are introduced into the dataset instead. Although there are many imputation methods, replacing missing values with the column median (for numerical data) or the most frequently occurring value (for categorical data) is a common approach.
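A minimal sketch of median imputation in pandas (the column name and values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative column with missing entries
scores = pd.DataFrame({"views": [120.0, np.nan, 95.0, np.nan, 110.0]})

# Median imputation: fill missing values with the column median
median_views = scores["views"].median()
scores["views"] = scores["views"].fillna(median_views)
print(scores)
```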

  • One-Hot Encoding

This is one of the most common encoding methods used in feature engineering. One-hot encoding represents a categorical variable as a set of binary columns (0s and 1s), one per distinct category: a row gets a 1 in the column corresponding to its category and 0 in all the others. This changes the feature values to a numerical format that is much easier for algorithms to work with, without compromising the information in the variable or imposing an artificial ordering on the categories.
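A minimal sketch of one-hot encoding in pandas, using a hypothetical Genre column:

```python
import pandas as pd

# Hypothetical categorical column
movies = pd.DataFrame({"Genre": ["Action", "Comedy", "Action", "Drama"]})

# One-hot encoding: one binary indicator column per distinct category
encoded = pd.get_dummies(movies["Genre"], prefix="Genre")
print(encoded)
```

Each row has exactly one 1 across the indicator columns, marking its category.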

  • Grouping Operations

In machine learning algorithms, an instance is represented as a row and features are represented as columns. Many datasets do not fit this simple arrangement, because a single instance may span multiple rows. To handle such cases, data is grouped so that every instance is represented by only one row. The intention of grouping operations is to arrive at an aggregation that establishes the most viable relationship with the features.
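A sketch of a grouping operation in pandas, assuming a hypothetical table with several rows per movie (e.g. weekly collections) that is aggregated to one row per movie:

```python
import pandas as pd

# Hypothetical table with several rows per movie
weekly = pd.DataFrame({
    "MovieName": ["A", "A", "B", "B", "B"],
    "WeeklyCollection": [10.0, 6.0, 4.0, 3.0, 2.0],
})

# Grouping operation: aggregate to exactly one row per movie
per_movie = weekly.groupby("MovieName", as_index=False).agg(
    TotalCollection=("WeeklyCollection", "sum"),
    Weeks=("WeeklyCollection", "count"),
)
print(per_movie)
```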

  • Log Transformation

Skewness is a measure of asymmetry in a dataset, defined as the extent to which a given distribution of data varies from a normal distribution. Skewed data affects the prediction models in ML algorithms. To resolve this, log transformations are used to reduce the skewness of the data: the less skewed the distributions are, the better algorithms can interpret the patterns in them.
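A sketch of a log transformation on an illustrative right-skewed series of view counts; `log1p` is used here so that zeros are handled safely:

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed series (view counts often look like this)
views = pd.Series([1_000, 5_000, 20_000, 100_000, 5_000_000], dtype=float)

# log1p compresses the long right tail (and handles zeros safely)
log_views = np.log1p(views)
print(views.skew(), log_views.skew())
```

The skewness of the transformed series is much closer to zero than that of the raw counts.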

  • Bag of Words

Bag of Words (BoW) is a counting algorithm that evaluates the number of repetitions of a word in a document. This algorithm is useful in identifying similarities and differences in documents for applications like search and document classification.
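The counting described above can be sketched in plain Python (the documents are illustrative; libraries such as scikit-learn provide this as `CountVectorizer`):

```python
from collections import Counter

# Two illustrative documents
docs = ["great movie great cast", "boring movie"]

# Build a shared vocabulary, then count occurrences of each word per document
vocab = sorted({word for doc in docs for word in doc.split()})
bow = [[Counter(doc.split())[word] for word in vocab] for doc in docs]
print(vocab)
print(bow)
```

Each document becomes a fixed-length count vector over the shared vocabulary, which makes documents directly comparable.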

  • Feature Hashing

Feature hashing is an important technique used to scale-up machine learning algorithms by vectorizing features. The technique of feature hashing is commonly used in document classification and sentiment analysis where tokens are converted into integers. Hash values are derived by applying hash functions to features that are used as indices to map data.
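A minimal sketch of the hashing trick in plain Python: each token is hashed to one of a fixed number of buckets, and the bucket counts form the feature vector. The bucket count and tokens are illustrative:

```python
import hashlib

def hashed_features(tokens, n_buckets=8):
    """Map tokens to a fixed-length count vector via the hashing trick."""
    vec = [0] * n_buckets
    for token in tokens:
        # A stable hash (unlike the built-in hash(), which is salted per process)
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += 1
    return vec

vec = hashed_features(["great", "movie", "great"])
print(vec)
```

The vector length is fixed regardless of vocabulary size, which is what makes the technique scale; the trade-off is that distinct tokens can collide in the same bucket.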

Automated Feature Engineering

Automated feature engineering is a newer technique that is becoming a standard part of the machine learning workflow. The traditional approach is time-consuming and error-prone, is specific to the problem at hand, and has to change with every new dataset. Automated feature engineering extracts useful and meaningful features using a framework that can be applied to any problem. This increases the efficiency of data scientists by letting them spend more time on other elements of machine learning, and it enables citizen data scientists to do feature engineering using a framework-based approach.

Conclusion

Despite being in its nascent stages, feature engineering has immense usefulness for data scientists to prepare data in a hassle-free and accelerated manner.

Feature engineering is an essential process in data science to reap utmost benefits from the data available. Various methods of feature engineering aim at arriving at a coherent set of data that is comprehensible and easy to process for machine learning algorithms to obtain accurate and reliable results.

Features affect the quality of output from machine learning algorithms and feature engineering aims at improving the features that go into training the algorithms.

If you’d like to learn more about this topic, please feel free to get in touch with one of our AI and data science experts for a personalized consultation.

Problem at hand

The data files bollywood.csv and bollywood - 2.csv contain the box office collection and social media promotion information about the movies released in the 2013–2015 period.

Following is the information about the columns (names can vary):

  • Serial Id of movie
  • Release date of movie
  • Release time of movie like Long Weekend (3), Festive Season (1), Holiday Season (2), and Normal (4)
  • Genre of movie like Romance, Thriller, Action, comedy etc.
  • Budget of movie in crores
  • Box office collection of movie in crores
  • YouTube views of movies trailers
  • YouTube likes for the movie trailers
  • YouTube dislikes for the movie trailers
import pandas as pd
from pandas import DataFrame as df
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import numpy as np
import warnings
warnings.filterwarnings("ignore")

Importing bollywood Dataset and looking at the data:

bollywood = pd.read_csv("bollywood.csv")

bollywood.head()

Importing bollywood2 Dataset and looking at the data:

bollywood2 = pd.read_csv("bollywood - 2.csv")

bollywood2.head()

Looking at both datasets, we can see that the column names differ even though they signify the same things. Renaming the columns in both datasets to match, so that they can be analysed further and used together.

bollywood.columns

Index(['No', 'RelDate', 'MovieName', 'ReleaseTime', 'Genre', 'Budget',
       'BoxOfficeCollection', 'YoutubeViews', 'YoutubeLikes',
       'YoutubeDislikes', 'YoutubeLikeDislikes'],
      dtype='object')
bollywood2.columns = ['No', 'RelDate', 'MovieName', 'ReleaseTime', 'Genre', 'Budget',
'BoxOfficeCollection', 'YoutubeViews', 'YoutubeLikes',
'YoutubeDislikes']

bollywood2.head()

Now these datasets have the same columns. Understanding the elements of each of the datasets:

bollywood.isnull().sum()

No                     0
RelDate                0
MovieName              0
ReleaseTime            0
Genre                  0
Budget                 0
BoxOfficeCollection    0
YoutubeViews           0
YoutubeLikes           1
YoutubeDislikes        3
YoutubeLikeDislikes    0
dtype: int64

We see that there are null values in YoutubeLikes (one) and YoutubeDislikes (three).

bollywood2.isnull().sum()

No                     0
RelDate                0
MovieName              0
ReleaseTime            0
Genre                  0
Budget                 0
BoxOfficeCollection    0
YoutubeViews           0
YoutubeLikes           0
YoutubeDislikes        0
dtype: int64

There are no null values in bollywood2, so no additional validation is required.

bollywood.dtypes

No                       int64
RelDate                 object
MovieName               object
ReleaseTime              int64
Genre                   object
Budget                  object
BoxOfficeCollection     object
YoutubeViews             int64
YoutubeLikes           float64
YoutubeDislikes        float64
YoutubeLikeDislikes      int64
dtype: object

bollywood2.dtypes

No                       int64
RelDate                 object
MovieName               object
ReleaseTime             object
Genre                   object
Budget                   int64
BoxOfficeCollection    float64
YoutubeViews             int64
YoutubeLikes             int64
YoutubeDislikes          int64
dtype: object

Changing the columns to appropriate data types across both the datasets

Ideally:

  1. RelDate needs to be a date field.
  2. MovieName and Genre are string values.
  3. Budget and BoxOfficeCollection are floats.
  4. YoutubeViews, YoutubeLikes, YoutubeDislikes, and YoutubeLikeDislikes are all integers.
  5. ReleaseTime is effectively an enumerated type.

Working with each of the fields to understand if there are any issues:

Working with Release Date Field:

Attempting to convert the date field into standard values.

Running the below code results in an exception.

#bollywood['FormattedRelDate'] = pd.to_datetime(bollywood['RelDate'])

Running it raises an exception, which indicates that the string formatting is wrong for some elements in the dataset. These have to be corrected across the dataset.

Note: The code is intentionally commented to be able to run all cells.
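One general pandas pattern for locating such unparseable rows without raising, sketched here on illustrative values rather than the actual file, is to coerce parse errors to NaT and filter:

```python
import pandas as pd

# Illustrative sample mimicking the issue above (not the actual file)
dates = pd.Series(["Jan-10-2014", "Sqp-19-2014", "Feb-21-2013"])

# errors='coerce' turns unparseable strings into NaT instead of raising,
# so the offending rows can be inspected directly
parsed = pd.to_datetime(dates, errors="coerce")
bad_rows = dates[parsed.isna()]
print(bad_rows)
```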

Further understanding the data by splitting the dates to year, month and dates.

bollywood['MonthOfRel'] = bollywood['RelDate'].str[:3]
bollywood['DateOfRel'] = bollywood['RelDate'].str[4:6]
bollywood['YearOfRel'] = bollywood['RelDate'].str[-4:]
bollywood.groupby(['DateOfRel'])['No'].count()

DateOfRel
09    1
1-    2
10    1
11    3
12    1
13    1
14    2
15    4
18    6
19    2
2-    2
20    1
21    2
22    2
23    1
24    2
25    2
27    1
28    1
30    1
4-    3
6-    2
7-    3
8-    4
9-    1
Name: No, dtype: int64

Inspecting the day values, there are no invalid characters (entries like '9-' simply come from slicing single-digit days), so no further filtering is needed here.

bollywood.groupby(['YearOfRel'])['No'].count()

YearOfRel
-204     1
2013    19
2014    31
Name: No, dtype: int64

Looking at the above output, we see one incorrectly entered year: '204', which should have been 2014, as the data covers 2013–2015. Identifying the specific sample:

bollywood[bollywood.YearOfRel.isin(['-204'])]

Correcting the dataset by replacing the value by a correct value.

bollywood.RelDate = bollywood.RelDate.replace(to_replace=r'204', value='2014', regex=True)

bollywood.groupby(['MonthOfRel'])['No'].count()

MonthOfRel
Apr 1
Aug 1
Feb 9
Jan 10
Jul 6
Jum 1
Jun 3
Mar 7
May 5
Nov 2
Oct 2
Sep 2
Sqp 1
juu 1
Name: No, dtype: int64

Looking at the above output, we see three incorrectly entered month values: 'Jum', 'Sqp', and 'juu'.

There are two ways of resolving this data issue:

  1. Assuming the typos based on the character sequence and the misplaced character, e.g. 'Jum' is 'Jun' with 'm' entered instead of 'n'; similarly, 'Sqp' is 'Sep'. However, we cannot conclude this way whether 'juu' is 'Jun' or 'Jul'.
  2. Alternatively, we can look up the details of the movie to get the exact information, as this data is publicly available.

Since 'Sqp' and 'Jum' can be corrected through (1), we use that. For 'juu', I am using the internet as the reference to fill in the data element.

bollywood[bollywood.MonthOfRel.isin(['juu'])]

This confirms that Filmistaan was released in June. Making the appropriate corrections across the bollywood dataset:

bollywood[bollywood.MonthOfRel.isin(['Jum'])]
bollywood[bollywood.MonthOfRel.isin(['Sqp'])]
bollywood.RelDate = bollywood.RelDate.replace(to_replace=r'juu', value='Jun', regex=True)
bollywood.RelDate = bollywood.RelDate.replace(to_replace=r'Jum', value='Jun', regex=True)
bollywood.RelDate = bollywood.RelDate.replace(to_replace=r'Sqp', value='Sep', regex=True)
#bollywood.RelDate = bollywood.RelDate.replace("Sqp-19-2014", "Sep-19-2014")
#bollywood.head()

With these replacements done, creating another column to store the formatted date value, so that the original can be referred to in case of any issues we detect later.

bollywood['FormattedRelDate'] = pd.to_datetime(bollywood['RelDate'])
bollywood.head()

Dropping Columns created for validations

bollywood = bollywood.drop(['MonthOfRel', 'DateOfRel', 'YearOfRel'], axis = 1)
bollywood.head(1)

Applying the same checks on the bollywood2 dataset:

Attempting to format the date field directly into a standard format:

bollywood2['FormattedRelDate'] = pd.to_datetime(bollywood2['RelDate'])
bollywood2.head(1)

We see that the formatting is successful and there are no issues here.

Working with Genre Field:

Looking at the distribution by grouping them by Genres

bollywood.groupby(['Genre'])['No'].count()

Genre
 Drama           1
 DramaGenre     11
Action           1
ActionGenre      9
ComedyGenre     12
RomanceGenre     9
Thriller         5
Thriller         3
Name: No, dtype: int64

bollywood2.groupby(['Genre'])['No'].count()

Genre
 Drama      23
Action      14
Comedy      24
Romance     17
Thriller    21
Name: No, dtype: int64

The following are the issues that we see:

  1. Drama has a space before the text in both datasets.
  2. Thriller and Action have a trailing space in the bollywood dataset.
  3. ActionGenre, ComedyGenre and RomanceGenre have 'Genre' appended to the end.
  4. DramaGenre has a leading space and 'Genre' appended to the end.

Removing these irregularities:

bollywood.Genre = bollywood.Genre.str.strip()
bollywood.Genre = bollywood.Genre.replace("ActionGenre", "Action")
bollywood.Genre = bollywood.Genre.replace("ComedyGenre", "Comedy")
bollywood.Genre = bollywood.Genre.replace("RomanceGenre", "Romance")
bollywood.Genre = bollywood.Genre.replace("DramaGenre", "Drama")
bollywood2.Genre = bollywood2.Genre.str.strip()

Confirming that this has removed the irregularities:

bollywood.groupby(['Genre'])['No'].count()

Genre
Action      10
Comedy      12
Drama       12
Romance      9
Thriller     8
Name: No, dtype: int64

bollywood2.groupby(['Genre'])['No'].count()

Genre
Action      14
Comedy      24
Drama       23
Romance     17
Thriller    21
Name: No, dtype: int64

Looking at Budget and BoxOfficeCollection:

#bollywood.groupby(['Budget'])['No'].count()    #uncomment and run this
#bollywood2.groupby(['Budget'])['No'].count()   #uncomment and run this

Looking at Budget, we see the following issues:

  1. The Budget column in bollywood2 has no issues but holds integer values; it has to be converted to floats.
  2. Modifying the bollywood dataset to match this result set.
bollywood2['Budget'] = pd.to_numeric(bollywood2['Budget'], downcast='float')
bollywood['Budget'] = bollywood['Budget'].replace(to_replace=r'Cr', value='', regex=True)
bollywood['Budget'] = pd.to_numeric(bollywood['Budget'], downcast='float')
#bollywood.groupby(['Budget'])['No'].count()                #uncomment and run this
#bollywood.groupby(['BoxOfficeCollection'])['No'].count()   #uncomment and run this
#bollywood2.groupby(['BoxOfficeCollection'])['No'].count()  #uncomment and run this

There are basically two issues with BoxOfficeCollection:

  1. The data type is not consistent; BoxOfficeCollection in bollywood2 should be converted to float.
  2. The bollywood dataset has 'Cr' appended in some cases, which keeps the data from being in a standard format.

bollywood2['BoxOfficeCollection'] = pd.to_numeric(bollywood2['BoxOfficeCollection'], downcast='float')
bollywood['BoxOfficeCollection'] = bollywood['BoxOfficeCollection'].replace(to_replace=r'Cr', value='', regex=True)
bollywood['BoxOfficeCollection'] = pd.to_numeric(bollywood['BoxOfficeCollection'], downcast='float')

Working with YoutubeLikes and YoutubeDislikes:

YoutubeLikes:

  1. We know there is a null value in YoutubeLikes only in bollywood dataset. The other dataset is fine.
  2. Getting the row where the null value exists:
bollywood[bollywood.YoutubeLikes.isin([np.nan])]

YoutubeLikeDislikes contains the sum of likes and dislikes for a row, so the NaN in YoutubeLikes can be obtained as YoutubeLikeDislikes - YoutubeDislikes.

bollywood['YoutubeLikes'] = bollywood['YoutubeLikes'].fillna(bollywood['YoutubeLikeDislikes']- bollywood['YoutubeDislikes'])
#confirming the null value in YoutubeLikes is replaced
bollywood.isnull().sum()
No 0
RelDate 0
MovieName 0
ReleaseTime 0
Genre 0
Budget 0
BoxOfficeCollection 0
YoutubeViews 0
YoutubeLikes 0
YoutubeDislikes 3
YoutubeLikeDislikes 0
FormattedRelDate 0
dtype: int64

YoutubeDislikes: We know there are three null values in YoutubeDislikes in the bollywood dataset.

Identifying the rows where there are null values:

bollywood[bollywood.YoutubeDislikes.isin([np.nan])]
bollywood['YoutubeDislikes'] = bollywood['YoutubeDislikes'].fillna(bollywood['YoutubeLikeDislikes'] - bollywood['YoutubeLikes'])
#confirming the null values in YoutubeDislikes are replaced
bollywood.isnull().sum()
No 0
RelDate 0
MovieName 0
ReleaseTime 0
Genre 0
Budget 0
BoxOfficeCollection 0
YoutubeViews 0
YoutubeLikes 0
YoutubeDislikes 0
YoutubeLikeDislikes 0
FormattedRelDate 0
dtype: int64

Identifying any other issues on the Datasets:

bollywood.describe()
bollywood2.describe()

We see that YoutubeLikeDislikes is missing in the bollywood2 dataset. Adding that column:

bollywood2['YoutubeLikeDislikes'] = bollywood2['YoutubeDislikes']+ bollywood2['YoutubeLikes']

Working with ReleaseTime:

bollywood.groupby(['ReleaseTime'])['No'].count()

ReleaseTime
1      3
2     10
3      4
4     33
44     1
Name: No, dtype: int64

bollywood2.groupby(['ReleaseTime'])['No'].count()

ReleaseTime
FS    14
HS     8
LW    11
N     66
Name: No, dtype: int64

There are essentially two issues here:

  1. An incorrect value in the bollywood dataset, which has a ReleaseTime of 44.
  2. The ReleaseTime values are coded differently in the two datasets.

bollywood.ReleaseTime = bollywood.ReleaseTime.replace(44, 4)  # Replacing 44 with 4

bollywood.groupby(['ReleaseTime'])['No'].count()

ReleaseTime
1     3
2    10
3     4
4    34
Name: No, dtype: int64

From the problem statement we understand that:

  1. The release times are Long Weekend (3), Festive Season (1), Holiday Season (2), and Normal (4).
  2. Comparing these with the string codes in bollywood2, we can infer that LW = Long Weekend, FS = Festive Season, HS = Holiday Season and N = Normal.
  3. The data is now converted to the release-time names described in (1).

Replacing the coded values in both datasets with the full release-time names so they match:

bollywood.ReleaseTime = bollywood.ReleaseTime.replace(1,"Festive Season")
bollywood.ReleaseTime = bollywood.ReleaseTime.replace(2,"Holiday Season")
bollywood.ReleaseTime = bollywood.ReleaseTime.replace(3,"Long Weekend")
bollywood.ReleaseTime = bollywood.ReleaseTime.replace(4,"Normal")
bollywood2.ReleaseTime = bollywood2.ReleaseTime.replace('FS',"Festive Season")
bollywood2.ReleaseTime = bollywood2.ReleaseTime.replace('HS',"Holiday Season")
bollywood2.ReleaseTime = bollywood2.ReleaseTime.replace('LW',"Long Weekend")
bollywood2.ReleaseTime = bollywood2.ReleaseTime.replace('N',"Normal")
bollywood2.groupby(['ReleaseTime'])['No'].count()
#Confirming that both the datasets have the same value sets for ReleaseTime
ReleaseTime
Festive Season 14
Holiday Season 8
Long Weekend 11
Normal 66
Name: No, dtype: int64

Looking at Youtube Views:

#bollywood.groupby(['YoutubeViews'])['No'].count()    #uncomment to run; commented to reduce the size of the notebook
#bollywood2.groupby(['YoutubeViews'])['No'].count()   #uncomment to run; commented to reduce the size of the notebook

Looking at YoutubeViews in both datasets, we can see that there are no data quality issues.

Merging the two datasets:

merged_bollywood = [bollywood, bollywood2]
dataset = pd.concat(merged_bollywood)
dataset.shape
(150, 12)

Validating the data on the merged data set by looking for nulls:

dataset.isnull().sum()

No                     0
RelDate                0
MovieName              0
ReleaseTime            0
Genre                  0
Budget                 0
BoxOfficeCollection    0
YoutubeViews           0
YoutubeLikes           0
YoutubeDislikes        0
YoutubeLikeDislikes    0
FormattedRelDate       0
dtype: int64

There are no null values. Also we see there are 150 rows.

Checking for Duplicate Values in the dataset

Movie Names cannot be exact duplicates

dataset.shape

(150, 12)

Grouping the movies by name:

distinct_movies = dataset.groupby(['MovieName'])['Budget'].count().reset_index().rename(columns={"Budget": "Occurrences"})

distinct_movies.shape

(149, 2)

We see a difference of one, which means one movie name is duplicated. Sorting by occurrences in descending order to identify the duplicate record:

distinct_movies.sort_values(by=['Occurrences'], inplace=True, ascending=False)

distinct_movies.head(1)

We can see that there is a duplicate entry in the movie names. Exploring the other attributes of this movie:

dataset[dataset.MovieName.isin(['Lekar Hum Deewana Dil'])]

We see that all of the other attributes match, thus removing one of the occurrences.

dataset = dataset.drop_duplicates(subset=['MovieName'])

Summary of Issues:

The following are the issues that have been identified and rectified during the data preparation process:

  1. Release date: three dates have incorrect month names and one year is incorrectly entered; these are replaced with the correct values. A formatted date column is also created for different analytical uses, while the original field is left unaltered.
  2. Genre: Drama has a leading space in both datasets; Thriller and Action have a trailing space in the bollywood dataset; ActionGenre, ComedyGenre and RomanceGenre have 'Genre' appended to the end; DramaGenre has both a leading space and 'Genre' appended. Leading and trailing spaces have been stripped and the appended 'Genre' removed wherever it occurs.
  3. Budget: the data type is not consistent across the datasets, and the bollywood dataset has 'Cr' appended in some cases, keeping the data from being in a standard format.
  4. BoxOfficeCollection: the same two issues as Budget.
  5. ReleaseTime: one value (44) is incorrectly entered in the bollywood dataset and is manually corrected, as it is only a single value; the coded values are not consistent between the two datasets and have been replaced with the standard values from the problem statement.
  6. YoutubeLikes: one null value in the bollywood dataset. Since the YoutubeLikeDislikes column contains the sum of likes and dislikes, the null is replaced with YoutubeLikeDislikes - YoutubeDislikes for that row.
  7. YoutubeDislikes: three null values in the bollywood dataset, similarly obtained as YoutubeLikeDislikes - YoutubeLikes.
  8. Once both datasets are cleaned and merged, one movie has a duplicate entry; one of the two rows is removed.

Meta Information of the Dataset

dataset.info()  # MetaInfo

<class 'pandas.core.frame.DataFrame'>
Int64Index: 149 entries, 0 to 98
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   No                   149 non-null    int64
 1   RelDate              149 non-null    object
 2   MovieName            149 non-null    object
 3   ReleaseTime          149 non-null    object
 4   Genre                149 non-null    object
 5   Budget               149 non-null    float32
 6   BoxOfficeCollection  149 non-null    float32
 7   YoutubeViews         149 non-null    int64
 8   YoutubeLikes         149 non-null    float64
 9   YoutubeDislikes      149 non-null    float64
 10  YoutubeLikeDislikes  149 non-null    int64
 11  FormattedRelDate     149 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float32(2), float64(2), int64(3), object(4)
memory usage: 14.0+ KB

There are a total of 149 entries in the dataset.

Movies in each Genre per ReleaseTime

Genre_ReleaseTime = dataset.groupby(['Genre','ReleaseTime'])['Budget'].count().reset_index().rename(columns={"Budget":"Occurrences"})
Genre_ReleaseTime.sort_values(by=['Genre'], inplace=True, ascending=False)
#Genre_ReleaseTime
pivot = Genre_ReleaseTime.pivot_table(index=['Genre'], columns = 'ReleaseTime', values=['Occurrences'], aggfunc='sum')
pivot

The above is the distribution of movies released in each genre for each release time.

Month of the year where maximum movie releases are seen

dataset['yearOfRelease'], dataset['monthOfRelease'] = dataset['FormattedRelDate'].dt.year, dataset['FormattedRelDate'].dt.month

Month_Year = dataset.groupby(['yearOfRelease','monthOfRelease'])['Budget'].count().reset_index().rename(columns={"Budget":"Occurrences"})
Month_Year_max = Month_Year.loc[Month_Year.groupby(["yearOfRelease"])["Occurrences"].idxmax()]
Month_Year_max.sort_values(by=['Occurrences'], inplace=True, ascending=False)
Month_Year_max

We can see that May 2014 has the most releases, followed by July 2013 with 9 releases and January 2015 with 5 releases.

Month of the year where we see high budgeted movies

High budget movies are those with budgets of 25 crores or more. Creating a column that holds this classification:

dataset['HighBudgetMovies'] = np.where((dataset['Budget'] >= 25), 'High Budget', 'Low Budget')

dataset.head(1)

Understanding the general spread of High Budget Movies across all years.

Month_Year_Budget = dataset.groupby(['yearOfRelease','monthOfRelease','HighBudgetMovies'])['Budget'].count().reset_index().rename(columns={"Budget":"Occurrences"})
Month_Year_Budget_HighBudget = Month_Year_Budget[Month_Year_Budget.HighBudgetMovies.isin(['High Budget'])]
Month_Year_Budget_HighBudget.sort_values(by=['yearOfRelease','Occurrences'], inplace=True, ascending=False)
Month_Year_Budget_HighBudget['ConcatVal'] = Month_Year_Budget_HighBudget["yearOfRelease"].astype(str)+":"+Month_Year_Budget_HighBudget["Occurrences"].astype(str)

Month_Year_Budget_HighBudget

We see that the year 2014 has three months that have the same results. Retaining all the results:

Month_Year_Budget_HighBudget_max = Month_Year_Budget_HighBudget.loc[Month_Year_Budget_HighBudget.groupby(["yearOfRelease"])["Occurrences"].idxmax()]
Month_Year_Budget_HighBudget_max.sort_values(by=['Occurrences'], inplace=True, ascending=False)
Summarized_Budget = Month_Year_Budget_HighBudget_max[['yearOfRelease','Occurrences']]
Summarized_Budget['ConcatVal'] = Summarized_Budget["yearOfRelease"].astype(str)+":"+Summarized_Budget["Occurrences"].astype(str)
Summarized_Budget
budget_result = pd.merge(Summarized_Budget, Month_Year_Budget_HighBudget, on=['ConcatVal'])
budget_result[['yearOfRelease_x','monthOfRelease','Occurrences_y']]

We can see that July 2013 has the most high budget releases with 5, followed by February, March and April of 2014 with 4 releases each, and January 2015 with 3 releases.

Top 10 movies with max ROI:

Calculation for ROI:

ROI = (Box Office Collection - Budget) / Budget

dataset['ROI'] = (dataset['BoxOfficeCollection'] - dataset['Budget'])/dataset['Budget']

dataset.head(1)
HighROI  = dataset[['MovieName','FormattedRelDate','ROI']]
HighROI.sort_values(by=['ROI'], inplace=True, ascending=False)
HighROI.head(10)

Link to the files can be found here.


Anand Saran

Data Science Enthusiast working in the field of Product Management