Join Charles Kelly for an in-depth discussion in this video, NumPy, data science, and IMQAV, part of pandas for Data Science.
- [Narrator] IMQAV is an acronym for ingest, model, query, analyze, visualize. It provides an overview of data science and an understanding of pandas' role within data science. IMQAV can refer to the way that teams or departments are organized. It can be used as the architecture for a system to provide an overview of the way that tools and components within a system are organized. Furthermore, IMQAV can be applied to a set of tasks.
In this video, I will discuss some of the software tools within the IMQAV framework. I apologize if I leave out your favorite tool or system from the list below. Note that the taxonomy in these lists is not absolute. In many instances, a tool begins within one category within this taxonomy and then grows to include multiple categories from the taxonomy. In some cases, data are generated very rapidly by a large number of devices. For example, consider a scenario where a road has many robotic vehicles.
Each vehicle has dozens or perhaps hundreds of sensors and each sensor is generating large amounts of data. If the system attempted to store all of the generated data into a database that models the system, some of the data might be lost because the modeling database might not be able to write data as rapidly as it is being generated. In this scenario, an ingestion system can be used to rapidly store the data. The modeling database can extract data from the ingestion system in a time frame that ensures that data is not lost.
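The buffering idea in this scenario can be sketched with nothing more than the Python standard library. This is a minimal illustration, not how any particular ingestion product works: the in-memory `queue.Queue` stands in for the ingestion system, the `stored` list stands in for the modeling database, and the function name `run_ingestion_demo` and its parameters are hypothetical.

```python
import queue
import threading


def run_ingestion_demo(num_sensors=4, readings_per_sensor=50):
    """Simulate sensors writing to a buffer that a separate writer drains."""
    buffer = queue.Queue()  # stands in for the ingestion system (assumption)
    stored = []             # stands in for the modeling database (assumption)

    def sensor(sensor_id):
        # Producers enqueue readings without waiting for the database.
        for i in range(readings_per_sensor):
            buffer.put((sensor_id, i))

    def database_writer():
        # The consumer pulls readings from the buffer at its own pace.
        while True:
            item = buffer.get()
            if item is None:  # sentinel: no more data will arrive
                break
            stored.append(item)

    writer = threading.Thread(target=database_writer)
    writer.start()

    producers = [
        threading.Thread(target=sensor, args=(s,)) for s in range(num_sensors)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    buffer.put(None)  # tell the writer that all sensors are done
    writer.join()
    return stored


stored = run_ingestion_demo()
print(len(stored))  # 4 sensors x 50 readings = 200
```

Because the sensors only ever write to the buffer, a slow writer delays storage but does not drop readings. A real ingestion system adds durability and scaling that this sketch leaves out.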
The following are some software tools that support ingestion: Kafka, RabbitMQ, Fluentd, Sqoop, and Kinesis. Modeling is a set of data architecture techniques to create data storage that is appropriate for a particular domain. The major categories include relational, key-value, columnar, document, and graph-oriented databases. Query refers to extracting data from storage and modifying that data to accommodate anomalies such as missing data.
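As a small illustration of accommodating missing data after it has been extracted, here is a sketch using pandas. The table and its column names (`sensor`, `reading`) are made up for the example, and filling gaps with the column mean is just one possible choice, not a recommendation from the course.

```python
import pandas as pd

# Hypothetical extracted data with two missing readings.
raw = pd.DataFrame({
    "sensor": ["s1", "s2", "s3", "s4"],
    "reading": [1.0, None, 3.0, None],
})

# Count the missing values in the column.
missing_count = int(raw["reading"].isna().sum())

# Option 1: drop the rows that have no reading.
dropped = raw.dropna(subset=["reading"])

# Option 2: fill the gaps with the mean of the observed readings.
filled = raw.fillna({"reading": raw["reading"].mean()})

print(missing_count)
print(dropped)
print(filled)
```

Dropping rows keeps only observed values, while filling keeps every row at the cost of inventing values. Which one is appropriate depends on the analysis that follows.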
The major categories of query include batch, batch SQL, and streaming. Additionally, some of the database systems listed above have their own query languages. In some cases, such as Cassandra, the query language is similar to SQL. In other cases, such as Neo4j, the query language is distinct from SQL. Analyze is a broad category that includes techniques from computer science, mathematical modeling, artificial intelligence, statistics, and other disciplines.
The list below includes analysis techniques and software libraries that support these techniques. Some of the categories of analysis software include statistics, optimization, mathematical modeling, and machine learning. Historically, the techniques that are collectively referred to as machine learning were referred to as statistical learning. Later, a similar set of techniques was developed within the framework of neural networks. I separated the categories into libraries that support batch machine learning and interactive machine learning.
Visualize refers to transforming data into visually attractive and informative formats. Most likely you'll see these visualization techniques in the form of reports. The following are some popular visualization tools and libraries. This course includes a plotting chapter which provides an overview of matplotlib and the plotting functions for pandas Series and pandas DataFrame. The pandas library provides data structures, data analysis tools, and visualization tools.
As such, it spans the analyze and visualize components of IMQAV. Pandas also includes tools for selecting and displaying data. Some of these mimic SQL, the database query language. As such, pandas also spans the query component of IMQAV. No matter how you classify pandas, it is a great tool for data science.
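To make the SQL comparison concrete, here is a short sketch of two pandas selections next to the SQL they roughly correspond to. The `readings` table and its columns are invented for illustration. The SQL in the comments is only an approximate equivalent.

```python
import pandas as pd

# Hypothetical sensor readings; one value is missing.
readings = pd.DataFrame({
    "vehicle": ["a", "a", "b", "b", "c"],
    "sensor": ["speed", "temp", "speed", "temp", "speed"],
    "value": [30.0, 71.5, 42.0, None, 38.5],
})

# Roughly: SELECT vehicle, value FROM readings
#          WHERE sensor = 'speed' AND value > 35
fast = readings.loc[
    (readings["sensor"] == "speed") & (readings["value"] > 35),
    ["vehicle", "value"],
]

# Roughly: SELECT sensor, AVG(value) FROM readings GROUP BY sensor
mean_by_sensor = readings.groupby("sensor")["value"].mean()

print(fast)
print(mean_by_sensor)
```

Boolean masks inside `.loc` play the role of the `WHERE` clause, and `groupby` followed by an aggregation plays the role of `GROUP BY`. By default, `mean` skips missing values, much as SQL's `AVG` ignores `NULL`.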
Watch this course to gain an overview of pandas. Charles Kelly helps you get started with time series, data frames, panels, plotting, and visualization. All you need is a copy of the free and interactive Jupyter Notebook app to practice and follow along.
- Using the Markdown language and Jupyter Notebook
- Creating objects
- Selecting objects
- Using operations
- Merging data
- Creating series
- Creating data frames
- Creating panels
- Annotating plots and data frame plots