Big Data Vietnam: 2016

Sunday, December 11, 2016

Deep Learning in Python for Beginners

Deep learning is all the rage. So what do you need for deep learning?

Step 1: Understand machine learning in general

The math you’ll need
Get your basics in Python down
Make sure you understand the necessary statistics
And then it’s time to master the basics of machine learning

Does your machine have the necessary requirements?

You’ve got to have a good enough GPU and CPU. The baseline is a GPU with 4 GB of memory and a decent CPU.

Step 2: Get started with deep learning

Check out the very informative Hacker’s Guide to Neural Networks by Andrej Karpathy, who learned what he knows at Stanford. Blogs not your thing? Prefer a textbook? Then grab the free online book Neural Networks and Deep Learning, written by Michael Nielsen.

That not your cup of tea either? Then you’ll be glad to know that there is also a video version. It has been conveniently divided into 27 parts so that you can really digest one part before you head on to the next.

But wait, there’s more for you to learn! You’ve also got to get to grips with the different deep learning libraries and software packages. Yup, I never said it was going to be easy. To get a good idea of what’s going on check out the Wikipedia page.

Step 3: Choosing your area

Deep learning has found its way into several fields, including vision, natural language processing, speech and audio, and reinforcement learning. You can pick any one of these areas to really get to grips with the topic.

Deep Learning for Computer Vision

Read the DL for Computer Vision blog, which will give you the basic ideas you need to delve into this particular project. The project itself is called Facial Keypoint Detection. And what of the library, you ask? I’m glad you did. It’s called nolearn.

Deep Learning for Natural Language Processing

Here the primer is called Deep Learning, NLP, and Representations. With it, you’ve got the opportunity to build chatbots, which, as you no doubt know, is an incredibly fast-growing area within the computer science community. A huge number of companies are jumping on board in order to please customers without having to keep a huge staff of human reps on hand to handle the traffic.

Check out these two parts of the project that you will need. The library you’ll be required to use is TensorFlow.

Deep Learning for Speech/Audio

It’s incredible to think that only a few years ago it wasn’t possible for computers to recognize different types of speech. That’s all changed now, with computers no longer calling mom every time you try to dial anybody else.

The blog post you should look into is called Deep Speech: Lessons from deep learning. The project is called Music Generation Using Magenta (TensorFlow). And the necessary library? This will surprise you. It’s called Magenta. A real shocker, don’t you think?

Deep Learning for Reinforcement Learning

Reinforcement learning allows computers to get better through the process of trial and error, which honestly is pretty cool and has made it so that computers can now beat humans at more traditional games. Can you take it to the next level and have computers beat us at the more complex and involved modern games?

If you think you can then you should check out Deep Reinforcement Learning: Pong from Pixels. This will serve you as both the primer and the project. You’ll also be happy to learn that you don’t need any library (was happy too strong a word?).
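To make "trial and error" a little more concrete, here is a minimal sketch in plain Python with no library. It is not the policy-gradient method from the Pong post; it is a much simpler epsilon-greedy agent on a made-up two-armed bandit. The function name, the win probabilities, and the step count are all assumptions chosen for illustration.

```python
import random

def run_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a multi-armed bandit."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)
    values = [0.0] * len(win_probs)  # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(win_probs))
        else:
            # Exploit: pull the arm that has paid best so far.
            arm = max(range(len(win_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values

# Hypothetical bandit: arm 0 pays 20% of the time, arm 1 pays 80%.
estimates = run_bandit([0.2, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_arm, [round(v, 2) for v in estimates])
```

The agent is never told which arm is better. It only sees rewards, and its estimates drift toward the true payout rates as it keeps trying.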

Step 4: Build something simple first

http://www.pyimagesearch.com/2014/09/22/getting-started-deep-learning-python/

Thursday, December 8, 2016

Fast Data: The Next Step After Big Data

Quickly processing the enormous volume of data generated by the Internet, mobile phones, smart devices, and social applications is the next big challenge for businesses.

If applications are to become smarter, they need a faster way of processing data (fast data). Traditional database systems, however, are too slow, while the demand for data analysis and processing calls for real-time response.

An emerging trend is the use of open source tools in the enterprise to process data streams with complex queries and stronger transactional capability. A new data concept that came up frequently in 2014 is NewSQL, regarded as the foundation of fast data. NewSQL is a class of database management systems with scalable performance and high speed, aimed at solving the online transaction processing (OLTP) problems of traditional SQL database management systems.

Deployment and applications?

Open source tools

Nearly 10 years ago, analyzing data streams of up to a petabyte was unthinkable for the hardware of the day. Open source tools such as Apache Hadoop provide a powerful distributed platform for storing and managing big data. The platform runs applications on large hardware clusters, processing petabytes of data across thousands of distributed data management systems and virtualized systems. This lowers costs and strengthens security for businesses, and from it the field of big data was born.

A similar revolution is now under way under the name fast data. Big data is often produced by huge data streams generated at astonishing speed, such as real-time clickstream data, financial information, or sensor data. These events typically occur thousands to tens of thousands of times per second. Experts have nicknamed this kind of data the "fire hose".

When we talk about fire hoses in big data, we do not measure volume in the typical gigabytes but in megabytes per second or terabytes per day. We are talking about velocity as well as volume, and that is the core of the difference between big data and the data warehouse. Big data is not just big; it is also fast.

The benefits of big data are lost unless the data stays fresh and moves quickly from the fire hose into platforms such as the Hadoop Distributed File System (Hadoop's primary storage system), an analytic relational database management system (RDBMS), or even plain data files. The defining characteristics of this new data trend must be that it is always responsive, always available, and capable of batch processing.

New data technology is considered valuable only if it fits its surrounding environment. Data operations should not be too expensive and should suit widely available commodity hardware. As with big data, the value of fast data is tied to what the future is expected to bring from message queue models and streaming systems such as Apache Kafka and Apache Storm.

These needs for the databases of the future led first to NoSQL, which appeared in 1998, and now to NewSQL.

The value of fast data

The best way to capture the value of incoming data is to process it as it arrives. That value shows up as processing time saved without human intervention.

To process data arriving at tens of thousands to millions of events per second, you need two technologies: first, a streaming system capable of delivering events as fast as they come in; and second, a data store capable of processing each item as fast as it arrives.
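The two-part setup described above can be sketched in a few lines of Python. This is only a toy, not Kafka or Storm: an in-memory queue.Queue stands in for the streaming system, and a Counter stands in for the fast data store. The function name and the event shape (a dict with a "page" key) are assumptions made for illustration.

```python
import queue
import threading
from collections import Counter

def run_pipeline(events):
    """Feed events through an in-memory queue and count them as they arrive."""
    stream = queue.Queue()        # stands in for the streaming/delivery system
    clicks_per_page = Counter()   # stands in for the fast data store

    def consume():
        while True:
            event = stream.get()
            if event is None:     # sentinel: the producer has finished
                break
            # Each event is processed the moment it is delivered.
            clicks_per_page[event["page"]] += 1

    worker = threading.Thread(target=consume)
    worker.start()
    for event in events:
        stream.put(event)
    stream.put(None)
    worker.join()
    return clicks_per_page

# Hypothetical click events.
counts = run_pipeline([{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}])
print(counts)
```

The point of the sketch is the shape: the producer never waits for a batch job, and the running totals are up to date as soon as each event has been consumed.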

Delivering fast data

Two popular streaming systems have emerged over the past few years: Apache Storm and Apache Kafka. Storm, developed by engineers at Twitter, reliably processes unbounded streams of data at rates of millions of messages per second. Kafka, developed by engineers at LinkedIn, is a high-throughput distributed message queue system. Both streaming systems address the need to process fast data.

One difference is that Kafka was designed to be a message queue and to solve the problems of existing technologies. It is a kind of high-end queue with unlimited scalability, built on a shared infrastructure model... An organization can deploy a single Kafka cluster to meet all of its message queuing needs.

Processing fast data

Messaging is only part of the solution, and traditional relational databases tend to be limited in performance. Some can store data at high rates, but they slow down when they have to perform additional work such as validating or enriching the data before it is transformed. NoSQL suits data storage models with particular characteristics, such as object-oriented, document-oriented, and XML databases.

However, if you are running complex queries and operations that control the exchange of information between a database and a user's client, a NewSQL solution can deliver both the performance and the transactional complexity required.

NoSQL originally meant non-relational. Today, however, it is usually read as "Not Only SQL". It is the general term for database systems that do not use the relational data model. NoSQL puts particular emphasis on the key-value storage model and on distributed storage. It is especially well suited to very large applications (search services, social networks, and so on). For medium and large applications, an RDBMS is still the better fit.

NewSQL describes systems that offer the scalability of NoSQL while still providing the ACID guarantees (database integrity) of conventional relational databases. NewSQL is widely used at large companies around the world, and the term was first used by the 451 Group in 2011.
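To show what the ACID guarantee buys you in practice, here is a small sketch of atomicity using Python's built-in sqlite3 module. SQLite is not a NewSQL system; it is only a convenient stand-in that ships with Python. The accounts table, the names, and the transfer function are made up for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft, so the credit above
        # was rolled back together with the failed debit.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("an", 100), ("binh", 50)])
conn.commit()

ok = transfer(conn, "an", "binh", 500)  # more than "an" has
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit is deliberately issued before the debit, so a failed transfer would leave money created out of thin air if the database did not roll the whole transaction back.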

Tuesday, December 6, 2016

Industries/Fields where you applied Analytics, Data Science, Data Mining in 2016?

Source: http://vote.sparklit.com/poll.spark/203792

Advertising (45)	4%
Agriculture (12)	1%
Automotive/Self-Driving Cars (19)	2%
Banking (48)	5%
Biotech/Genomics (19)	2%
Credit Scoring (25)	2%
CRM/Consumer analytics (59)	6%
Direct Marketing/ Fundraising (15)	1%
E-commerce (30)	3%
Education (27)	3%
Entertainment/ Music/ TV/Movies (15)	1%
Finance (60)	6%
Fraud Detection (38)	4%
Games (11)	1%
Government/Military (21)	2%
Health care (38)	4%
HR/workforce analytics (15)	1%
Insurance (40)	4%
Investment / Stocks (21)	2%
IT / Network Infrastructure (30)	3%
Junk email / Anti-spam (4)	0%
Manufacturing (19)	2%
Medical/ Pharma (20)	2%
Mining (17)	2%
Mobile apps (12)	1%
Oil / Gas / Energy (31)	3%
Retail (35)	3%
Science (50)	5%
Search / Web content mining (21)	2%
Security / Anti-terrorism (13)	1%
Social Good/Non-profit (9)	1%
Social Media / Social Networks (30)	3%
Social Policy/Survey analysis (6)	1%
Software (28)	3%
Supply Chain (25)	2%
Telecom / Cable (30)	3%
Travel / Hospitality (15)	1%
Web usage/Log mining (32)	3%
Other (24)	2%

Monday, December 5, 2016

HDFS Cheat Sheet

HDFS Guide (File System Shell) Commands

The Hadoop Distributed File System (HDFS) is the heart of storage for Hadoop. There are many ways to interact with HDFS, including Ambari Views, the HDFS Web UI, WebHDFS, and the command line. The first way most people interact with HDFS is via the command line tool called hdfs. This is a runner that runs other commands, including dfs. In newer Hadoop versions, hdfs dfs replaces the older hadoop fs. This guide is for Hadoop 2.7.3 and newer, including HDP 2.5. The HDFS client can be installed on Linux, Windows, and Macintosh and used to access your remote or local Hadoop clusters. The easiest way to install it is onto a jump box, using Ambari to install the Hadoop client. I also recommend installing all the clients it recommends, including Pig and Hive. There is a detailed list of every command and option for each version of Hadoop.

Every day I look at Hadoop clusters of various sizes, and there are various tools for interacting with HDFS files: web UIs, tools, IDEs, SQL, and more. The one universal and fastest way to check things is with the shell or CLI. The following commands are always helpful and are usually hard or slower to do in a graphical interface.

The first command I type every single day is to get a list of directories from the root. This gives you the lay of the land.

To List All the Files in the HDFS Root Directory

Usage: hdfs dfs [generic options] -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]
Example:
hdfs dfs -ls /
Found 35 items
drwxrwxrwx   - yarn   hadoop          0 2016-10-06 16:05 /app-logs
drwxrwxrwx   - hdfs   hdfs            0 2016-11-04 11:56 /apps
drwxr-xr-x   - yarn   hadoop          0 2016-09-15 21:02 /ats
drwxrwxrwx   - hdfs   hdfs            0 2016-10-05 21:07 /banking
...

You can choose any path from the root down, just like a regular Linux file system. -h shows sizes in human-readable form and is recommended. -R is another great one for drilling into subdirectories. Often you won't realize how many files and directories you actually have in HDFS. Many tools, including Hive, Spark history, and BI tools, will create directories and files as logs or for indexing.
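If you want to script against a listing like the one above, a small parser helps. The sketch below assumes the eight-column layout shown in the sample output (run without -h, so the size is a plain integer); the LsEntry and parse_ls names are my own, not part of any Hadoop API.

```python
from typing import NamedTuple

class LsEntry(NamedTuple):
    permissions: str
    replication: str   # "-" for directories
    owner: str
    group: str
    size: int
    modified: str
    path: str

    @property
    def is_dir(self) -> bool:
        return self.permissions.startswith("d")

def parse_ls(output: str) -> list:
    """Parse the text printed by `hdfs dfs -ls` into LsEntry records."""
    entries = []
    for line in output.splitlines():
        parts = line.split(None, 7)
        if len(parts) != 8:
            continue  # skips "Found N items" and any elided lines
        perms, repl, owner, group, size, date, time, path = parts
        entries.append(
            LsEntry(perms, repl, owner, group, int(size), f"{date} {time}", path)
        )
    return entries

sample = """Found 35 items
drwxrwxrwx   - yarn   hadoop          0 2016-10-06 16:05 /app-logs
drwxrwxrwx   - hdfs   hdfs            0 2016-11-04 11:56 /apps
drwxr-xr-x   - yarn   hadoop          0 2016-09-15 21:02 /ats
drwxrwxrwx   - hdfs   hdfs            0 2016-10-05 21:07 /banking
..."""

entries = parse_ls(sample)
print([e.path for e in entries if e.is_dir])
```

In a real script you would feed it the captured standard output of the command rather than a pasted sample.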

Create an empty file in an HDFS Directory

Usage: hadoop fs [generic options] -touchz <path> ...
Example:
hdfs dfs -touchz /test2/file1.txt

This works the same as the Linux touch command. This is useful to initialize a file. Sometimes you want to test a user's permissions and want to quickly do a write. This is the quickest path for you. You can also bulk upload a chunk of files via: hdfs dfs -put *.txt /test1/ The reason I want to do this is so I can show you a very interesting command called getmerge.

Concatenate all the files in a directory into a single file

Usage:  hdfs dfs [generic options] -getmerge [-nl] <src> <localdst>
Example:
hdfs dfs -getmerge -nl /test1 file1.txt

This will create a new file in your local directory that contains all the files from an HDFS directory, concatenated together. The -nl option adds newlines between files. This is often nice when you wish to consolidate a lot of small files into an extract for another system. This is quick and easy and doesn't require using a tool like Apache Flume or Apache NiFi. Of course, for regular production jobs and for larger files or a greater number of files, you will want a more powerful tool like the two mentioned. For a quick extract that someone wants to see in Excel, concatenating a few dozen CSVs from a directory into one file is helpful.

Change the Permissions of a Directory: /new-dir
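To see what getmerge does conceptually, here is a local-only Python sketch that works on an ordinary directory rather than HDFS. The merge_directory function and the sample file names are made up for illustration, and appending a newline after each file is my reading of the -nl behavior, not something taken from the Hadoop source.

```python
import tempfile
from pathlib import Path

def merge_directory(src_dir, dst_file, add_newline=False):
    """Concatenate every regular file in src_dir into dst_file, in name order."""
    sources = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with open(dst_file, "wb") as out:
        for path in sources:
            out.write(path.read_bytes())
            if add_newline:
                out.write(b"\n")  # keeps the last row of one file off the next
    return len(sources)

# Demo on a throwaway local directory with two tiny CSV files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test1"
    src.mkdir()
    (src / "a.csv").write_text("1,2")
    (src / "b.csv").write_text("3,4")
    merged_path = Path(tmp) / "file1.txt"
    merged_count = merge_directory(src, merged_path, add_newline=True)
    merged_text = merged_path.read_text()

print(merged_count, repr(merged_text))
```

Without the newline option, files that do not end in a line break would run together, which is exactly the situation -nl is there to avoid.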

Usage: hdfs dfs [generic options] -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...
Example:
hdfs dfs -chmod -R 777 /new-dir

The chmod patterns follow the standard Linux patterns, where 777 gives read, write, and execute permission to the user, the group, and everyone else.
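If the octal digits are hard to read at a glance, this small Python helper spells them out. It is only an illustration of the standard Unix encoding (4 = read, 2 = write, 1 = execute); the function name is my own.

```python
def octal_to_symbolic(mode: str) -> str:
    """Turn a three-digit octal mode such as '755' into 'rwxr-xr-x'."""
    if len(mode) != 3 or any(c not in "01234567" for c in mode):
        raise ValueError(f"expected three octal digits, got {mode!r}")
    symbolic = ""
    for digit in mode:  # one digit each for user, group, other
        bits = int(digit)
        symbolic += "r" if bits & 4 else "-"
        symbolic += "w" if bits & 2 else "-"
        symbolic += "x" if bits & 1 else "-"
    return symbolic

print(octal_to_symbolic("777"))  # rwxrwxrwx
print(octal_to_symbolic("755"))  # rwxr-xr-x
```

The result matches the permission column you see in the hdfs dfs -ls output, minus the leading d or - that marks directories and files.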

Change the Owner and Group of a New Directory: /new-dir

Usage: hdfs dfs [generic options] -chown [-R] [OWNER][:[GROUP]] PATH...
Example:
hdfs dfs -chown -R admin:hadoop /new-dir

This changes the ownership of a directory to the admin user and the hadoop group. You must have permission to assign ownership to that user and that group. Also, the user and group must exist. For changing permissions, it is best to sudo to the hdfs user, which is the root user for HDFS. The Linux root user is not the root owner of the HDFS file system.

Delete all the ORC files forever, skipping the temporary trash holding.

Usage:  hdfs dfs [generic options] -rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...
Example:
hdfs dfs -rm -R -f -skipTrash /dir/*.orc
Deleted /dir/a.orc

We use -skipTrash to destroy the files immediately and free up our space. Otherwise, they would go to a trash directory and wait for a configured period of time before being deleted. I use -f to force the deletion. I want these files gone!

Move A Directory From Local To HDFS and Delete Local

Usage: hdfs dfs [generic options] -moveFromLocal <localsrc> ... <dst>
Example:
hdfs dfs -moveFromLocal /tmp/tmp2 /tmp2
[hdfs@tspanndev10 /]$ hdfs dfs -ls /tmp2
Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2016-11-18 15:55 /tmp2/a.txt
-rw-r--r--   3 hdfs hdfs          5 2016-11-18 15:55 /tmp2/b.txt
[hdfs@tspanndev10 /]$ ls -lt /tmp/tmp2
ls: cannot access /tmp/tmp2: No such file or directory

If you want to move a local directory up to HDFS and remove the local copy, the command is moveFromLocal.

Show Disk Usage in Human-Readable Units for the Directory: /dir

Usage: hdfs dfs [generic options] -du [-s] [-h] <path> ...
Example:
hdfs dfs -du -s -h /dir
2.1 G  /dir

The -h gives you a human-readable output of size, for example in gigabytes.
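If you capture raw byte counts (for example from -du without -h) and want to print them the same friendly way, something like the following works. It is a rough imitation of the format shown above, assuming binary units (1 K = 1024 bytes) and one decimal place; it is not Hadoop's own formatting code.

```python
def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units and one decimal, e.g. '2.1 G'."""
    units = ["", "K", "M", "G", "T", "P"]
    value = float(num_bytes)
    index = 0
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    if index == 0:
        return str(num_bytes)  # small values are shown as plain bytes
    return f"{value:.1f} {units[index]}"

print(human_readable(2254857830))  # 2.1 G
print(human_readable(1536))        # 1.5 K
```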

When in doubt of what command you want to use or what to do next, just type help. You will also get a detailed list for each individual command.

hdfs dfs -help
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
 ...

You can also use the older format of: hadoop fs. This will work on older Hadoop installations as well.

Since you are logged in as the hdfs super user, you can also use the HDFS Admin commands.

HDFS DFS Administration Overview

Usage: hdfs dfsadmin
Note: Administrative commands can only be run as the HDFS superuser.
[-report [-live] [-dead] [-decommissioning]]
[-safemode <enter | leave | get | wait>]
[-saveNamespace]
[-rollEdits]
[-restoreFailedStorage true|false|check]
[-refreshNodes]
[-setQuota <quota> <dirname>...<dirname>]
[-clrQuota <dirname>...<dirname>]
[-setSpaceQuota <quota> [-storageType <storagetype>] <dirname>...<dirname>]
[-clrSpaceQuota [-storageType <storagetype>] <dirname>...<dirname>]
[-finalizeUpgrade]
[-rollingUpgrade [<query|prepare|finalize>]]
[-refreshServiceAcl]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshCallQueue]
[-refresh <host:ipc_port> <key> [arg1..argn]
[-reconfig <namenode|datanode> <host:ipc_port> <start|status|properties>]
[-printTopology]
[-refreshNamenodes datanode_host:ipc_port]
[-deleteBlockPool datanode_host:ipc_port blockpoolId [force]]
[-setBalancerBandwidth <bandwidth in bytes per second>]
[-fetchImage <local directory>]
[-allowSnapshot <snapshotDir>]
[-disallowSnapshot <snapshotDir>]
[-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-metasave filename]
[-triggerBlockReport [-incremental] <datanode_host:ipc_port>]
[-help [cmd]]
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
hdfs command [genericOptions] [commandOptions]

There are a number of commands that you may need for administering your cluster if you are one of its administrators. If you are running your own personal cluster or Sandbox, these are also good to know and try. Do Not Try These In Production unless you are the owner and fully understand the dire consequences of these actions. These commands affect the entire Hadoop cluster's distributed file system. You can shut down data nodes, add quotas to directories for various users, and use other administrative features.

WARNING: Enter Safemode for Your Cluster

Usage: hdfs dfsadmin [-safemode enter | leave | get | wait | forceExit]
Example:
hdfs dfsadmin -safemode enter
Safe mode is ON

Do not do this unless you need to do cluster maintenance such as adding nodes. You will be entering read-only mode. You need to run hdfs dfsadmin -safemode leave to get out of it. These commands may take time, as they wait for pending writes to finish and for jobs to stop accessing the servers.

To Get a Report of Your Cluster

Usage: hdfs dfsadmin -report
Example:
hdfs dfsadmin -report
Configured Capacity: 75149430272 (69.99 GB)
Present Capacity: 55889761113 (52.05 GB)
DFS Remaining: 26116294782 (24.32 GB)
DFS Used: 29773466331 (27.73 GB)
DFS Used%: 53.27%
Under replicated blocks: 1295883
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (1):
Name: 111.11.111.11:50010 (dataflowdeveloper.com)
Hostname: dataflowdeveloper.com
Decommission Status : Normal
Configured Capacity: 75149430272 (69.99 GB)
DFS Used: 29773466331 (27.73 GB)
Non DFS Used: 19259669159 (17.94 GB)
DFS Remaining: 26116294782 (24.32 GB)
DFS Used%: 39.62%
DFS Remaining%: 34.75%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 8
Last contact: Fri Nov 18 16:28:59 UTC 2016
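When you want to track these numbers over time, it helps to pull them out of the report text. The Python sketch below parses the cluster-wide summary (the lines before the dashed separator) and recomputes DFS Used% from the sample values above. The function names are my own, and it assumes the "Key: value" layout shown in this output.

```python
import re

def parse_report_summary(report: str) -> dict:
    """Collect the 'Key: value' lines that precede the first dashed separator."""
    summary = {}
    for line in report.splitlines():
        if line.startswith("---"):
            break  # per-datanode details start after the separator
        key, sep, value = line.partition(":")
        if sep:
            summary[key.strip()] = value.strip()
    return summary

def leading_int(value: str) -> int:
    """Extract the byte count from a value such as '75149430272 (69.99 GB)'."""
    return int(re.match(r"\d+", value).group())

sample = """Configured Capacity: 75149430272 (69.99 GB)
Present Capacity: 55889761113 (52.05 GB)
DFS Remaining: 26116294782 (24.32 GB)
DFS Used: 29773466331 (27.73 GB)
DFS Used%: 53.27%
-------------------------------------------------
Live datanodes (1):"""

summary = parse_report_summary(sample)
used = leading_int(summary["DFS Used"])
present = leading_int(summary["Present Capacity"])
used_pct = round(100 * used / present, 2)
print(used_pct)
```

Note that the cluster-wide DFS Used% is relative to Present Capacity, while the per-datanode figure further down (39.62%) is relative to Configured Capacity, which is why the two percentages differ for the same number of bytes used.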

For additional administration commands, see the references below. The above list of commands will help you with most uses and analysis you will need to do.

Resources:
