DataEase: Simplifying Data Preparation for Machine Learning Tasks
Abstract
DataEase is an all-in-one dataset preparation tool built for both new and seasoned data scientists, streamlining the earliest stages of a machine learning project.
It makes data cleansing, advanced preprocessing, and integration simpler and more manageable for everyone, without requiring extensive technical knowledge, and it minimizes the time and effort needed to transform raw data into a form that is ready for analysis and model training.
With a focus on reliability, scalability, and maintainability, DataEase is dedicated to improving the most time-consuming stage of the ML workflow, dataset preparation, which ultimately promotes better ML model development.
Author
Name: Manvi Narang
Student number: 47324804
Functionality

- User registration with email verification.
- Secure login with password reset and recovery.
- Automated cleaning features for handling missing data, outliers, and duplicate entries.
- Advanced preprocessing functions such as normalization, encoding, and dataset splitting, applied with a single click.
- Hassle-free integration with multiple data sources, including CSV files, SQL databases, and web scraping tools.
- A user-friendly GUI that lets users manage their data preparation efficiently, both during and after the process.
- Flexible export options supporting various formats and direct integration with ML libraries.
- Real-time feedback on data quality, with suggestions for improvement.
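The automated cleaning features listed above can be sketched with pandas. This is a minimal illustration, not DataEase's actual implementation: the function name `auto_clean`, the median-fill policy for missing numeric values, and the 1.5 × IQR outlier fences are all assumed choices.

```python
import pandas as pd

def auto_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of an automated cleaning pass: duplicates, missing data, outliers."""
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Fill missing numeric values with the column median; drop rows still missing data.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    df = df.dropna()
    # Clip numeric outliers to the 1.5 * IQR fences of each column.
    for col in num_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df
```

In the full tool, each of these steps would be an individually selectable option rather than one fixed pipeline.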
Scope
The primary aim of the MVP is to demonstrate the implementation of DataEase’s key features.
- Dataset upload: Users can upload datasets directly in CSV format or connect to external SQL databases.
- Preprocessing tools: Users can choose preprocessing tasks (basic to intermediate) through a user-friendly interface with drop-down menus.
- Real-time feedback: Offer real-time feedback on data quality once a task is selected, including automated, actionable suggestions before preprocessing begins.
- Progress monitoring: The GUI shows a progress bar for each task and raises alerts whenever user input is required.
- Review and export: Present users with details of the preprocessing performed, and allow flexible export to different formats or integration with selected ML libraries for further analysis.
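The preprocessing tools in the scope above could look like the following pandas sketch; the function name `preprocess`, min-max normalization, one-hot encoding via `get_dummies`, and the random split fraction are assumptions made for illustration.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, target: str, test_frac: float = 0.2, seed: int = 0):
    """Normalize numeric features, one-hot encode categoricals, and split the dataset."""
    features = df.drop(columns=[target])
    # Min-max normalization of numeric columns (assumes non-constant columns).
    num = features.select_dtypes(include="number").columns
    features[num] = (features[num] - features[num].min()) / (features[num].max() - features[num].min())
    # One-hot encode the remaining (categorical) columns.
    features = pd.get_dummies(features)
    # Random train/test split.
    test = features.sample(frac=test_frac, random_state=seed)
    train = features.drop(index=test.index)
    return train, test, df[target]
```

Each step here would map to one entry in the interface's drop-down of preprocessing tasks.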
Quality Attributes
- Scalability: DataEase is designed to grow with the user’s needs, handling everything from small project datasets to larger and more complex ones.
- Maintainability: The design enforces a clear separation of concerns between data collection, cleaning, preprocessing, and management, enabling simple update releases and bug fixes as well as isolated development and testing.
- Reliability: DataEase is built to handle failures and recover from them without data loss, delivering consistent performance across all datasets while preserving the integrity and accuracy of the original data.
To keep DataEase adaptive and useful, we prioritize scalability, maintainability, and reliability. Together these ensure that DataEase can handle increasing data volumes and complexity, adapt to users’ evolving needs, and accept new updates and bug fixes over time. Consistent and accurate data preparation is the topmost priority for DataEase, as it is what fosters user trust.
If trade-offs were necessary, reliability would take top priority, since the entire goal and functionality of the project rest on it, followed by maintainability and then scalability.
Evaluation
1. Scalability: Scalability can be evaluated as follows:
- Perform load testing with datasets of increasing size (up to 1 TB) to measure system performance.
- Monitor system resource utilization (CPU, memory) at different load levels.
- Simulate peak-load conditions to assess the system’s maximum handling capacity and identify the point at which it stops scaling, so the underlying bottlenecks can be diagnosed.
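A load-testing harness for these checks might look like the following minimal sketch; `prepare` is a hypothetical stand-in for a DataEase preparation pass, and the sizes are scaled far down from the 1 TB target so the example runs quickly.

```python
import random
import time

def prepare(rows):
    """Hypothetical stand-in for a preparation pass: dedupe and min-max normalize."""
    unique = sorted(set(rows))
    lo, hi = unique[0], unique[-1]
    return [(v - lo) / (hi - lo) for v in unique]

# Load test: record elapsed wall-clock time as the dataset size grows.
random.seed(0)
timings = {}
for n in (10_000, 100_000, 1_000_000):
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()
    prepare(data)
    timings[n] = time.perf_counter() - start
```

Plotting `timings` against dataset size would reveal whether processing time grows roughly linearly or degrades sharply past some threshold, which is the point worth investigating.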
2. Maintainability: Maintainability can be evaluated as follows:
- Measure the elapsed time to implement changes (new updates, bug fixes) to gauge the system’s adaptability and responsiveness.
- Track Mean Time to Repair (MTTR) for insight into how quickly issues are resolved.
- Conduct regular code reviews.
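The MTTR metric above is straightforward to compute from an incident log; the timestamps below are hypothetical.

```python
from datetime import datetime

# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 45)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 14, 30)),
    (datetime(2024, 5, 7, 22, 0), datetime(2024, 5, 8, 0, 15)),
]

# MTTR = total repair time / number of incidents, here in minutes.
repair_minutes = [(restored - detected).total_seconds() / 60 for detected, restored in incidents]
mttr = sum(repair_minutes) / len(repair_minutes)
# mttr == 70.0 minutes
```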
3. Reliability: Evaluating reliability would involve:
- Establishing benchmarks for key reliability metrics such as system uptime, error rate, recovery time, and data integrity.
- Simulating controlled failure scenarios such as network disruptions, resource limitations, and dependency failures.
- Executing chaos experiments in a controlled environment and checking performance against the established benchmarks.
- Analyzing vulnerabilities, making improvements, and repeating until the system meets or exceeds the standards.
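One controlled failure scenario, a transient network disruption during a write, can be simulated in a unit test; `FlakyStore` and `save_with_retry` are illustrative names for this sketch, not part of DataEase.

```python
import hashlib

class FlakyStore:
    """Simulated storage dependency that fails transiently a fixed number of times."""
    def __init__(self, fail_times: int):
        self.fail_times = fail_times

    def write(self, payload: bytes) -> str:
        if self.fail_times > 0:
            self.fail_times -= 1
            raise ConnectionError("simulated network disruption")
        return hashlib.sha256(payload).hexdigest()

def save_with_retry(store: FlakyStore, payload: bytes, attempts: int = 5) -> str:
    """Retry on transient failure, then verify data integrity via checksum."""
    for _ in range(attempts):
        try:
            stored = store.write(payload)
            # Data-integrity check: stored checksum must match the source data.
            assert stored == hashlib.sha256(payload).hexdigest()
            return stored
        except ConnectionError:
            continue
    raise RuntimeError("dependency unavailable after retries")
```

Running such scenarios repeatedly and comparing recovery time and data integrity against the established benchmarks is the substance of the chaos experiments described above.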