Dr. Jean-Claude Franchitti

Labs

1.

Search online to learn what a virtual machine (VM) is. Suppose you are setting up a VM on Microsoft Azure to run data science experiments. Research the best way to get all the tooling you need without having to find and install each tool yourself.

2.

Select three commercial or open-source DBMSs that use different data models. Install a trial version of each DBMS and illustrate its use with a simple tutorial example. Document your work and evaluate the benefits and drawbacks of each system based on your experience.
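As a warm-up before installing anything, the following minimal sketch contrasts three data models using only the Python standard library. It is an illustration, not a substitute for the DBMSs you select: SQLite stands in for the relational model, while plain JSON documents and a Python dictionary merely imitate a document store and a key-value store. The table, field names, and sample records are made up for this example.

```python
import json
import sqlite3

# Relational model: a fixed schema, rows in a table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
relational_result = conn.execute(
    "SELECT name FROM customer WHERE city = ?", ("London",)
).fetchall()
conn.close()

# Document model (imitated): self-describing records whose fields may differ.
documents = [
    json.loads('{"id": 1, "name": "Ada", "city": "London", "tags": ["vip"]}'),
    json.loads('{"id": 2, "name": "Grace", "address": {"city": "New York"}}'),
]
document_result = [d["name"] for d in documents if d.get("city") == "London"]

# Key-value model (imitated): opaque values that are looked up by key only.
kv_store = {f"customer:{d['id']}": json.dumps(d) for d in documents}
kv_result = json.loads(kv_store["customer:2"])["name"]

print(relational_result, document_result, kv_result)
```

Note how the second document nests its city under `address`, so the simple filter misses it. Schema flexibility of this kind is one of the trade-offs worth recording in your evaluation.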

3.

Explore MySQL and experiment with MySQL Workbench to build a simple website using Django. Refer to the instructions and tutorial for more information.
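A Django project is typically pointed at MySQL through the `DATABASES` setting in `settings.py`. The sketch below shows one possible way to build that setting from environment variables. The variable names (`LAB_DB_NAME`, `LAB_DB_USER`, and so on) and the default values are assumptions made for this example, and a MySQL driver such as `mysqlclient` is assumed to be installed separately.

```python
import os


def mysql_database_settings(env=os.environ):
    """Build a Django DATABASES dict for MySQL.

    The LAB_DB_* variable names and the defaults are hypothetical;
    adapt them to your own setup.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",  # Django's built-in MySQL backend
            "NAME": env.get("LAB_DB_NAME", "lab_site"),
            "USER": env.get("LAB_DB_USER", "lab_user"),
            "PASSWORD": env.get("LAB_DB_PASSWORD", ""),
            "HOST": env.get("LAB_DB_HOST", "127.0.0.1"),
            "PORT": env.get("LAB_DB_PORT", "3306"),
        }
    }


# In settings.py this would replace the default SQLite configuration.
DATABASES = mysql_database_settings()
```

Reading the credentials from the environment keeps passwords out of version control. The schema named by `NAME` can be created beforehand in MySQL Workbench, after which `python manage.py migrate` creates Django's tables in it.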

4.

Build a simple Django application that implements a social media website and stores its data in a cloud-based data management service. (Hint: You can use this article from Medium that contains some guidance.)

5.

Explore how to use AWS service areas when designing solutions for data lake use cases. Data are initially stored in a raw state, and some use cases will consume the raw data as is. More often, solutions require varying degrees of data preparation, driven by a collection of query usage profiles that correspond to actual use cases. Depending on the solution, data may be refined and staged to promote modularity and reuse. The goal is to avoid overprocessing the dataset, because it is intended for multiple downstream purposes, such as Amazon Redshift for relational analytics, Amazon Elasticsearch Service (now Amazon OpenSearch Service) for text search, or an optimized distributed file system for low-cost active archive storage that can be queried with an MPP SQL engine.
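One way to picture the raw, refined, and staged stages described above is as separate key prefixes (zones) in object storage, each partitioned by date. The sketch below is only an illustration of that idea: the zone names, the `year=/month=/day=` partition layout, and the `object_key` helper are assumptions made for this example, not an AWS API or a prescribed design.

```python
from datetime import date

# Hypothetical zone names, mirroring the stages named in the exercise.
ZONES = ("raw", "refined", "staged")


def object_key(zone, dataset, day, filename):
    """Return a date-partitioned object key for a file in the given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"{zone}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


# The same day's data lives in every zone it has been promoted to.
raw_key = object_key("raw", "clicks", date(2024, 1, 5), "part-0.json")
refined_key = object_key("refined", "clicks", date(2024, 1, 5), "part-0.parquet")
print(raw_key)
print(refined_key)
```

Keeping the raw copy untouched and writing refined or staged copies under their own prefixes lets each downstream consumer read the least-processed version that meets its needs.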

6.

Investigate how to put together an end-to-end data management infrastructure for a recommender application being built by a start-up. The application is expected to collect hundreds of gigabytes of both structured (customer profiles, temperatures, prices, and transaction records) and unstructured (customers’ posts/comments and image files) data from users daily. Predictive models will need to be retrained with new data weekly and make recommendations instantaneously on demand. Data collection, storage, and analytics capacity would have to be extremely scalable. The questions at hand are: How can you design a scalable data science process and productionize the models? What are the tools needed to get the job done? You will need to explain how to set up a data pipeline,

7.

Leverage the types of choices suggested in the associated diagram, decide between on-premises and cloud services, choose a cloud service provider if applicable (in particular, investigate the cloud service provider’s ML/DL capabilities and build your solution to avoid cloud vendor lock-in), and develop robust cloud management practices.

8.

Search the Internet for available informatics platforms and experiment with one or more of them.