Join Charles Kelly for an in-depth discussion in this video, "NumPy, data science, IMQAV," part of NumPy Data Science Essential Training.
- [Instructor] The IMQAV framework provides an overview of data science and an understanding of NumPy's role within data science. IMQAV is an acronym for ingest, model, query, analyze, and visualize. IMQAV can be applied to an organization. In this case, it can refer to the ways that teams or departments are organized. IMQAV can be used as an architecture. In this case, it can provide an overview of the way that tools and components within a system are organized.
IMQAV can be applied to a set of tasks. In this case, it can refer to the skills needed to complete each unit of work within a set of tasks. The lists below provide the names of software tools that support each component within IMQAV. In advance, I apologize if I leave out your favorite tool or system from the list below. Note that the taxonomy in these lists is not absolute. In many instances, a tool begins within one category within this taxonomy, and then grows to include multiple categories.
I hope that you find the taxonomy and the lists useful. Ingestion is a set of software engineering techniques for handling high volumes of data that arrive rapidly, often via streaming. Ingestion is sometimes necessary when there is a mismatch between the volume of data and/or the rate at which it is generated, which is sometimes referred to as the velocity of data, and the time needed to write large amounts of data into systems that support modeling.
The following are some of the tools that support ingestion: Kafka, RabbitMQ, Fluentd, Sqoop, and Kinesis. Modeling is a set of data architecture techniques to create data storage that is appropriate for a particular domain. The following is a listing of database systems used to support modeling. The major categories for database systems are relational, key-value, columnar, document-oriented, and graph-oriented.
Within each of these categories, there are some popular choices. Within the relational category there's MySQL, Postgres, and RDS. Within key-value, we have Redis, Riak, and DynamoDB. Within columnar we have Cassandra, HBase, and Redshift. Within document-oriented databases we have MongoDB, Elasticsearch, and Couchbase. And finally within the graph database category, we have Neo4j, OrientDB, and ArangoDB.
Query refers to extracting data from storage and modifying that data to accommodate anomalies, such as missing data. The major categories for query include: batch, batch SQL, and streaming. Within batch we have MapReduce, Spark, and Elastic MapReduce. Within batch SQL we have Hive, Presto, and Drill. And within streaming query we have Storm, Spark Streaming, and Samza. Analyze is a broad category that includes techniques from computer science, mathematical modeling, artificial intelligence, statistics, and other disciplines.
The list below includes analysis techniques and software libraries that support these techniques. NumPy falls most naturally in the analyze component of IMQAV. Some of the categories of analysis software include: statistics, optimization, mathematical modeling, and machine learning. Historically, the techniques that are collectively referred to as machine learning were referred to as statistical learning. Later, similar techniques were developed within a framework of neural networks.
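To make NumPy's place in the analyze component a little more concrete, here is a minimal sketch of a small analysis step. It also touches on the point made about query, where data is adjusted to accommodate missing values. The readings are made-up illustration data, and the variable names are hypothetical, not taken from the course.

```python
import numpy as np

# Hypothetical readings (illustration only); np.nan marks a missing value.
readings = np.array([12.0, 15.5, np.nan, 14.0, 18.5])

# Boolean mask that is True wherever a value is present.
present = ~np.isnan(readings)

# Mean of the values that are present, computed two equivalent ways.
mean_present = readings[present].mean()
mean_nan_aware = np.nanmean(readings)

# Fill the missing entry with the mean of the present values.
# This is one simple choice among many, assumed here for illustration.
filled = np.where(present, readings, mean_present)

# Standardize the filled data: subtract the mean, divide by the standard deviation.
z_scores = (filled - filled.mean()) / filled.std()

print("mean of present values:", mean_present)
print("filled:", filled)
print("z-scores:", z_scores)
```

Filling with the mean is only one way to handle a gap. Depending on the data, dropping the missing entries or interpolating may be more appropriate.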
I separated the machine learning category into libraries that support batch machine learning and interactive machine learning. Within the statistics category we have SPSS, SAS, R, Statsmodels, SciPy, and Pandas. Optimization and mathematical modeling are available through SciPy and other libraries. The techniques within these libraries include: linear, integer, and dynamic programming, as well as gradient and Lagrange methods. Within the batch machine learning category we have H2O, Mahout, and Spark ML.
Within the interactive category, we have Scikit-learn. Visualize refers to transforming data into visually attractive and informative formats. The following are popular visualization tools and libraries: Matplotlib, Seaborn, Bokeh, Pandas, D3, Tableau, Leaflet, Highcharts, and Kibana. This course includes a plotting chapter, which provides an overview of Matplotlib. I've included this chapter because it is often useful to visualize data that you have generated or analyzed with NumPy.
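As a small illustration of visualizing data generated with NumPy, the sketch below plots a sine curve with Matplotlib. The choice of curve, the labels, and the non-interactive "Agg" backend are assumptions made so the example runs as a plain script. In a Jupyter notebook you would normally omit the backend line and let the figure display inline.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Generate data with NumPy: 200 evenly spaced points from 0 to 2*pi.
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)

# Plot the arrays and label the axes.
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.set_title("A NumPy array plotted with Matplotlib")
ax.legend()
```

Matplotlib accepts NumPy arrays directly, so values you compute in an analysis step can be passed to `ax.plot` without conversion.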
- Using Jupyter Notebook
- Creating NumPy arrays from Python structures
- Slicing arrays
- Using Boolean masking and broadcasting techniques
- Plotting in Jupyter notebooks
- Joining and splitting arrays
- Rearranging array elements
- Creating universal functions
- Finding patterns
- Building magic squares and magic cubes with NumPy and Python