Data Migration and Analytics
​
Objective:
Ingest on-premises data, transform it with Azure Data Factory and Databricks, and analyse and report on it with Azure Synapse and Power BI.
In this project the aim is to build an end-to-end data pipeline and perform ETL on data held in on-premises databases, generating insights and reports with the power of distributed computing on a cloud provider. A large volume of structured data lives in an on-premises database, and it may be too large to transform as required with the compute power available on-premises. Setting up new infrastructure is not always feasible either: it requires planning, time and money, which increases lead time, and the infrastructure may become outdated or unnecessary in the near future. Moving to the cloud is therefore the easiest way to achieve the desired results without much overhead.
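Below is a minimal sketch of the transformation step as a Databricks (PySpark) notebook cell: it reads the raw data that Data Factory has copied into the data lake, applies a simple transformation, and writes a curated copy back for Synapse and Power BI. The storage account, container, folder and column names are assumptions for illustration, not the actual pipeline values.

# Minimal Databricks (PySpark) sketch: read raw data copied to the data lake by
# Data Factory, apply a simple transformation, and write the result back.
# Storage account, container, folder and column names below are assumptions.
from pyspark.sql import functions as F

raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@examplestorageacct.dfs.core.windows.net/sales/"

# `spark` is provided by the Databricks runtime
df = spark.read.format("parquet").load(raw_path)

# Example transformation: standardise a column name and stamp the load date
cleaned = (
    df.withColumnRenamed("CustomerID", "customer_id")
      .withColumn("load_date", F.current_date())
)

# Write the curated data back to the lake for Synapse / Power BI to consume
cleaned.write.mode("overwrite").format("parquet").save(curated_path)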
​
Tools Used: SQL Server, Azure Data Factory, Azure Data Lake Storage Gen2,
Azure Databricks, Azure Synapse, Power BI
​
Project Link to view the project.
Data Streaming and Analytics
Objective:
Stream data in real time using cloud services, and store and analyse it as it arrives.
In this project the aim is to simulate a real-time data streaming scenario where data is generated continuously and needs to be transferred and processed instantly. This finds application in domains where timely, up-to-date information is crucial, such as the Internet of Things (IoT), financial services, social media analysis, online gaming and streaming, traffic management and logistics, cybersecurity and network monitoring, healthcare and remote monitoring, and energy and utilities. The high-level architecture diagram of the project is as below. The data source here is a CSV file containing stock market data; in a real scenario the data would be fetched using an API. The Kafka producer and consumer run at the same time to simulate the real scenario, and the Kafka server and ZooKeeper instance are both deployed and running on an EC2 instance. The data is streamed to an S3 bucket, a Glue crawler catalogs it into a Glue table, and the data can be queried in real time using Athena.
Tools used: Python, Kafka, AWS, EC2, S3, Glue Crawler, Glue Table, IAM, AWS CLI, Athena
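Below is a minimal sketch of the producer and consumer, assuming the kafka-python and boto3 libraries; the broker address, topic name, bucket name and sleep interval are placeholders rather than the project's actual values.

# Minimal sketch: the producer replays rows of the stock CSV into Kafka, and
# the consumer writes each message to S3 where the Glue crawler picks it up.
# Broker address, topic, bucket and file names are placeholder assumptions.
import json
import time

import boto3
import pandas as pd
from kafka import KafkaConsumer, KafkaProducer

BROKER = "ec2-xx-xx-xx-xx.compute.amazonaws.com:9092"  # placeholder EC2 broker
TOPIC = "stock_ticks"
BUCKET = "example-stock-stream-bucket"

def produce(csv_path: str) -> None:
    producer = KafkaProducer(
        bootstrap_servers=[BROKER],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    df = pd.read_csv(csv_path)
    for _, row in df.iterrows():
        producer.send(TOPIC, value=row.to_dict())  # one record per message
        time.sleep(1)                              # simulate a live feed

def consume() -> None:
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=[BROKER],
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    s3 = boto3.client("s3")
    for i, message in enumerate(consumer):
        s3.put_object(
            Bucket=BUCKET,
            Key=f"stock_data/record_{i}.json",
            Body=json.dumps(message.value),
        )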
Project Link to view the project.
Salary Prediction
Objective:
In this project, the focus is on text mining. The data set for this exercise includes job descriptions and salaries. I used this data set to see whether we can predict the salary of a job posting (i.e., the Salary column in the data set) from its job description. This is important because such a model can make a salary recommendation as soon as a job description is entered into a system, which is useful for prospective job seekers. It does not guarantee that a person will be offered the salary the model predicts, but it gives an idea of what the salary for that position should be.
Tools Used: Jupyter Notebook, Python, MS Excel
​
1. The first model (SGD Regressor) performs best because its RMSE is lower than that of the other model (Random Forest Regressor).
2. The baseline RMSE is 33047.83186817777.
3. Yes, the model performs better than the baseline, since its RMSE is lower than the baseline RMSE.
4. Yes, the model does exhibit overfitting. I tried changing parameter values (max_depth, n_estimators, learning_rate) to bring the training-set RMSE closer to the test-set RMSE, but this did not offer much scope for reducing the overfitting.
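Below is a minimal sketch of the SGD Regressor approach, assuming the descriptions live in a column named Description and the target in the Salary column; the actual file name, column names and preprocessing in the notebook may differ.

# Minimal sketch: TF-IDF features from the job description text, an SGD
# regressor, and RMSE on a held-out test set. File and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("job_postings.csv")  # placeholder file name
X_train, X_test, y_train, y_test = train_test_split(
    df["Description"], df["Salary"], test_size=0.3, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=5000),
    SGDRegressor(max_iter=1000, random_state=42),
)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {rmse:.2f}")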
​
Salary Prediction link to view the project
Web Scraping
Objective:
The objective of this project was to scrape Cars.com. Web scraping is an important technique for extracting useful data from websites.
​
Tools used: Python, Jupyter Notebook, Microsoft Excel
Cars.com was used: a car brand was selected, a zip code was entered, and cars within 20 miles were returned. 20 results were displayed per page and there were 10 pages. Details like name, mileage, dealer name, rating, number of reviews and price were extracted and stored in an Excel file.
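Below is a minimal sketch of the scraping loop, assuming the requests and BeautifulSoup libraries; the query parameters and CSS class names are placeholder assumptions, since Cars.com changes its markup over time, and the real notebook may select elements differently.

# Minimal sketch: page through the listing results, pull a few fields from each
# card and save them to Excel. Query parameters and class names are assumptions
# and would need adjusting to the live site's markup.
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.cars.com/shopping/results/"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # many sites reject the default agent

rows = []
for page in range(1, 11):  # 10 pages of 20 results each
    params = {"makes[]": "toyota", "zip": "10001", "maximum_distance": 20,
              "page_size": 20, "page": page}
    resp = requests.get(BASE_URL, params=params, headers=HEADERS, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    for card in soup.select("div.vehicle-card"):   # class name is an assumption
        name = card.select_one("h2.title")
        price = card.select_one("span.primary-price")
        mileage = card.select_one("div.mileage")
        rows.append({
            "Name": name.get_text(strip=True) if name else None,
            "Price": price.get_text(strip=True) if price else None,
            "Mileage": mileage.get_text(strip=True) if mileage else None,
        })

pd.DataFrame(rows).to_excel("cars_listings.xlsx", index=False)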
​
Project Link to view the project.
Unemployment in the US
Objective:
Unemployment is a very significant marker of a country's economy. I chose this topic because unemployment creates an undue burden on individuals and their families and ultimately affects the country's overall economy. Through this project I try to understand what factors play a crucial role in driving unemployment, and whether there are nuances that often go unstated but appear as trends when data is aggregated over a period. For my project I decided to examine the relationship between the party in power and the employment rate under it, and the relationship between demographics and unemployment, including factors like education and race.
Tools used: Tableau, Tableau Prep Builder, MS Excel
I focused on visualization for this project. I downloaded several datasets from the Bureau of Labor Statistics and merged them using Tableau Prep Builder and MS Excel, formatting the data thoroughly so it could be used in Tableau. I then looked at how different demographics were affected. One thing was clear: higher education was consistently associated with higher pay. Unemployment was higher among men than among women, and a large section of people could not find jobs suited to their skill set. There was also a comparison based on the party in power, but it covered only one term, so the results need to be compared across a broader timespan.
Unemployment report link to view the project.
Fetch Weather Data using API and store it for future analysis
Objective:
The objective of this project is to fetch data using the OpenWeatherMap API and store it in an Excel file for future use.
​
Tools used: Python, PyCharm, Microsoft Excel
The main aim of this project is to gather data using APIs and store it. Gathering data from APIs is a prevalent practice in real-world applications: APIs (Application Programming Interfaces) allow different software systems to communicate and exchange information, and the stored data can later be used for analysis or warehoused in databases or data lakes. I used Python libraries and a free OpenWeatherMap account to retrieve data such as city name, date, temperature, humidity, wind speed and pressure. The Python editor used was PyCharm. HTTP errors and user-input errors were handled, and the data was written to an Excel file.
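Below is a minimal sketch of the fetch-and-store step, assuming the requests and openpyxl libraries; the API key and output file name are placeholders, and the exact fields and error handling in the project may differ.

# Minimal sketch: call the OpenWeatherMap current-weather endpoint for a city
# and write the result to an Excel workbook. API key and file name are placeholders.
from datetime import datetime

import requests
from openpyxl import Workbook

API_KEY = "YOUR_API_KEY"  # placeholder; a real key comes from the free account
URL = "https://api.openweathermap.org/data/2.5/weather"

def fetch_weather(city: str) -> dict:
    resp = requests.get(
        URL, params={"q": city, "appid": API_KEY, "units": "metric"}, timeout=30
    )
    resp.raise_for_status()  # surfaces HTTP errors such as an unknown city
    data = resp.json()
    return {
        "city": data["name"],
        "date": datetime.utcnow().strftime("%Y-%m-%d"),
        "temperature_c": data["main"]["temp"],
        "humidity": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"],
        "pressure": data["main"]["pressure"],
    }

if __name__ == "__main__":
    record = fetch_weather(input("Enter a city name: "))
    wb = Workbook()
    ws = wb.active
    ws.append(list(record.keys()))
    ws.append(list(record.values()))
    wb.save("weather_data.xlsx")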