Tuesday 3 May 2016

Day 17 - Neo4j, what is it?

Until yesterday, I had no idea what Neo4j was. Now I know a bit, because I went to a hackathon event named GraphHack @ Graph Connect 2016. It was a pre-event to the main Graph Connect 2016, where people worked in groups (not everyone presented something: I had a group, for example, but we couldn't build a decent model/graph in time, so we didn't present anything, although we did try to organise the data and plot it into a chart). The topic was Travel & Transportation.

Neo4j is, basically, a graph database: it stores data as connected entities and lets you chart the relationships between them.
A graph database, also called a graph-oriented database, is a type of NoSQL database that uses graph theory to store, map and query relationships. A graph database is essentially a collection of nodes and edges.

The picture below is an excellent example of a network analysis:


It's a chart showing an airport terminal, its gates and the venues nearby, like restaurants and other places organised by category. The main benefits of using these graphs are clear visualisation, easier understanding, and reduced time and cost when searching.

As I didn't know how to do this using Neo4j, I tried to code it in R and plot a network graph using the data they provided at the event. Here are the results:


Tcharam! There is a network chart!
It was made using a simple example which I found on a very informative website. You can get the R code below:

setwd("C:/Users/Re/Documents/Neo4j")

library(network) 
library(sna)
library(ndtv)
library(igraph)

nodes <- read.csv("Dataset1-Media-Example-NODES.csv", header=T, as.is=T)
links <- read.csv("Dataset1-Media-Example-EDGES.csv", header=T, as.is=T)

nrow(nodes); length(unique(nodes$id))
nrow(links); nrow(unique(links[,c("from", "to")]))

links <- aggregate(links[,3], links[,-3], sum)
links <- links[order(links$from, links$to),]
colnames(links)[4] <- "weight"
rownames(links) <- NULL

net <- graph.data.frame(links, nodes, directed=T)

plot(net)

net <- simplify(net, remove.multiple = F, remove.loops = T)
plot(net, edge.arrow.size=.4,vertex.label=NA)

This code lets you create the chart above, which is really valuable for visualising the network connections in the data.

Lately, I've been very busy, but I know that writing at least three or four times a week is important to study and practise vocabulary.

That's all! Thank you and see you tomorrow!

Sources:
http://neo4j.com/developer/graph-database/
http://graphconnect.com/
http://kateto.net/network-visualization

Thursday 21 April 2016

Day 16 - Let's start again...

Shame on me!
I had not written on the blog for days!!

Considering I am working part-time, studying in the evenings, attending meetups and having a social life, I really need to manage my time and keep writing here. Writing is a way of thinking, and that is essential to my learning process at the moment.

Now, let's talk about what really matters: R programming!

I've just started a new Coursera web course (on 11th April) about Statistical Inference, which mainly covers Statistics subjects and how to code statistics using R. The best part is that they encourage you to use the SWIRL package for practical exercises. The downside is that you need to pay the course fees to complete the weekly quizzes. But that's okay, since you at least have SWIRL to try, and you're allowed to make some mistakes while you're learning.


If you don't have it installed yet, the first thing to do is install it:

install.packages("swirl")

Then, to run this package, you just need to open R-Studio and write in the console:

library(swirl)

Install the Statistical Inference module:
install_from_swirl("Statistical Inference")

Run SWIRL package:
swirl()

We also had another meetup event yesterday, and it was fantastic and very informative, as usual!
The R-Ladies group is becoming stronger and better known. I think it's a great new project and can achieve high levels of engagement and quality. Take a look: R-Ladies London.


If you still don't know R-Ladies, here's a brief overview: it's a recently created meetup group that brings together women across London who want to learn more about the R programming language. It's free of charge and you just need to sign up for the events. It normally runs on a monthly basis, and you need to rush to book your seat!

That's all, folks.
I'll tell you more about other great ideas tomorrow.

Cya :)




Friday 8 April 2016

Day 15 - LondonR - Mango Solutions

This week has been so busy that I couldn't write anything for the blog every day.
Actually, I got a part-time job to save money for my 2016 plans and to meet new people. By "meeting" I mean leaving home every day and talking with other folks.

So, last Tuesday I attended one of the LondonR events, organised by Mango Solutions, a company that offers analytics solutions, consulting, training and application development for other parties. The event was fantastic, as usual: there were many bright people, most of them already using or learning R, and loads of them trying to improve processes, tools and methods, which is great! If you want to achieve further career growth, you need to spend time networking with the "right people", meaning talented, intelligent, thinking-outside-of-the-box people from whom you can acquire more knowledge and even ask for mentoring.

They also brought news about this year's EARL conference. The best description of EARL is the one written on their website:

EARL is a Conference for users and developers of the open source R programming language. The primary focus of the Conference is the commercial usage of R across a range of industry sectors with the aim of sharing knowledge and applications of the language.

The most interesting subject I heard during the talks was a company named Rosetta, which created a development platform where many languages (like R, Python and even Excel) are linked and work together on the same console. It is amazing! It's like a dream, because you can code in your preferred language and don't need to translate it to another one; you just put all the pieces of code together at the end and everything works as if by magic.

Well, my plan is to study this weekend, so I'll certainly have more subjects to talk about soon.
That's it for today :)

See ya!

Sources:
www.mango-solutions.com
earlconf.com
www.rosettahub.com
www.diegoluiz.com

Monday 4 April 2016

Day 14 - Hacker Rank competition results!!

No posts for the weekend! :(

But I did accomplish loads of tasks during these days, like completing all the Hacker Rank puzzles from the Statistics competition. In the beginning, the puzzles ranged from easy to moderate, but the last ones were an almost impossible mission. This competition website is excellent, since you can practise not only while a contest is running; there are also small puzzles to solve whenever you want, covering several different programming languages, like R, SQL, Python, JavaScript, C++, and others.

Among other topics, the Statistics challenge covered subjects such as:
  • Standard Deviation
  • Basic Probability
  • Normal Distribution
  • The Central Limit Theorem
  • Correlation
  • Linear Regression
  • Correlation and Regression Lines
  • Multiple Linear Regression
  • Predictions
For people who are passionate about statistics, like me, it was a fantastic way to have some fun!
The week's results were not surprising: only 10% of the competition's attendees actually completed it. And I finished 96th in the ranking! Yay! :)

Current Rank: 96

By the way, I am trying to create a three-month study plan, and it's a bit complicated to do, mainly when you have little experience with the subject you're aiming at. But I am sure I'll get there! :)
To finish the day, I'd like to leave here a checklist of the skills a data analyst should have, according to the NCS (National Careers Service) website:
  • a high level of mathematical ability 
  • good IT skills 
  • the ability to analyse, model and interpret data 
  • strong problem-solving skills 
  • a methodical and logical approach 
  • the ability to plan work and meet deadlines 
  • a high level of accuracy and attention to detail 
  • good interpersonal skills to work as part of a team 
  • excellent written and spoken communication skills including report writing
So, these skills will be my primary focus over the next three months.
After all, I don't want to be just a Data Analyst. I want to be the BEST Data Analyst!

Tomorrow I will tell you about the meetup event I attended last Thursday: 
Data Science in Spacecraft Missions & Space Research, hosted by Royal Statistical Society and presented by ASI.

Have good dreams!
See ya!


Thursday 31 March 2016

Day 13 - Hacker Rank Day 2 and Meetup

Hello!

Yesterday was the second Hacker Rank day for Statistics challenge.
But, honestly, all the challenges have been easy so far. Day 2 was about basic probability: nothing involving code or functions, just a piece of paper and a pen.

Task #6
Bag X contains 5 white balls and 4 black balls. Bag Y contains 7 white balls and 6 black balls. You draw 1 ball from bag X and, without observing its color, put it into bag Y. Now, if a ball is drawn from bag Y, find the probability that it is black.

It is not that difficult, but you need to spend some time thinking about the solution. As I hadn't finished the first day's exercises, completing all the puzzles by midnight last night became a personal obligation (2 from Day 1 and 3 from Day 2). I am looking forward to the next tasks!
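Just to make the reasoning concrete, here is how I'd set the calculation up in R using the law of total probability (this is my own working after the fact, not an official solution):

# Probability that the ball moved from bag X is black or white
p_black_moved <- 4 / 9
p_white_moved <- 5 / 9

# Either way, bag Y ends up with 14 balls before the second draw
p_black_if_black_moved <- 7 / 14    # Y becomes 7 white, 7 black
p_black_if_white_moved <- 6 / 14    # Y becomes 8 white, 6 black

p_black_moved * p_black_if_black_moved + p_white_moved * p_black_if_white_moved    # 29/63, about 0.46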

As for last evening, I attended an Outreach Digital meetup event again, and it was fascinating, as usual. The subject was SEO & Content Marketing Workshop - How To Make Sure Your Content is Found, and the speaker was Nichola Stott (@NicholaStott). Nichola has more than 20 years of experience in technology communications, including previous experience at Yahoo!.

You can see the talk content at this link: http://www.slideshare.net/NicholaStott/optimizing-content-with-seo-and-social-media

So, let's talk about SEO, which means Search Engine Optimization: a discipline focused on growing visibility in organic (non-paid) search engine results.



SEO encompasses both the technical and creative elements required to improve rankings, drive traffic, and increase awareness in search engines.

You can find out more on this website: https://moz.com/beginners-guide-to-seo

Monday 28 March 2016

Day 12 - Hacker Rank Competition!

Even though it's a Bank Holiday today in the UK, there's no bad time for studying. So I woke up this morning and started searching for new information and activities on the web.

Then I remembered I had signed up for this Hacker Rank statistics competition a week ago, and that it would start this Monday. I ran to the website www.hackerrank.com and, after reading a couple of announcements, the challenge began.

The first puzzle was kind of difficult for me. When I figured out I could use built-in functions, like sum(), mean() and others, that solved about 80% of the puzzle (before that, I was trying to write everything myself, e.g. computing the mean as the sum of x divided by n). However, there is no ready-made function for the statistical mode, so that part didn't work. While trying to write the code and searching Google for help, I found that someone had written a few lines to calculate the statistical mode (a sketch of the kind of helper I mean is below). I changed it a little bit, merged it into my own code, submitted it, and it worked!
I will post the answer here next Monday; otherwise it'd be against the competition's rules.
They'll release three puzzles per day, resulting in 21 puzzles in total! OMG!
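Without giving away any full solution, here is a minimal sketch of the kind of few-line mode helper I mean (my own version, not the exact code I found online):

# The statistical mode: the value (or values) that appear most often in x
stat_mode <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])    # keep the most frequent one(s)
}

stat_mode(c(1, 2, 2, 3, 3, 3, 4))    # returns 3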

This competition made me realise I need to study Statistics and Maths urgently.

So, I'll post some basic concepts here to further consideration:

Mean: it is just the average of the n values observed.


Sample Standard Deviation: the sample standard deviation, s, is often a more useful measure of spread than the sample variance, s², because s has the same units (inches, pounds, etc.) as the sampled values and their mean, x̄.
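Both of these are one-liners in R. A quick sketch with a toy vector of my own (not from the course), just to see them side by side:

x <- c(5, 7, 8, 9, 12)    # a small made-up sample

mean(x)    # the average of the n observed values
sd(x)      # the sample standard deviation, s (it divides by n - 1)
sqrt(sum((x - mean(x))^2) / (length(x) - 1))    # the same value written out by hand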


One very useful concept is the ANOVA table, where ANOVA means Analysis of Variance: it tests the hypothesis that the means of two or more populations are equal. ANOVAs assess the importance of one or more factors by comparing the response variable means at the different factor levels. 


The picture above shows an ANOVA table built in Excel 2013, which is very simple to do. I'll teach you how later. However, the critical skill is interpreting the numbers in this small table. 

See what each column means in Statistics:

SS = Sum of Squares
df = Degrees of Freedom 
MS = Mean Squares
F = F statistic (the value of the F test)
P-value = the probability of getting an F value at least this large if the group means were actually equal
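The table in the picture came from Excel, but R produces the same columns with aov(). A minimal sketch on a built-in data set (my own example, not the data behind the picture):

fit <- aov(weight ~ group, data = PlantGrowth)    # one-way ANOVA: do mean weights differ between groups?
summary(fit)                                      # prints Df, Sum Sq, Mean Sq, F value and Pr(>F)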

My explanation wasn't exactly statistics-for-dummies; later I'll explain all of these concepts better.

See ya!


Saturday 26 March 2016

Day 11 - R, data.frames and colors

Hello, everyone!

After two days without posting anything, I was gathering loads of information to post here.
Everybody knows that studying by yourself is hard sometimes, because you need to build your plan from tips you read on websites, conversations with professionals and common sense. On the other hand, the best thing about learning by yourself is creating your own agenda: you can pick the key subjects and learn them faster than in a one-year course, for example.

So, taking a look at Quora.com, I found some good answers to questions like "What do I need to learn to become a Data Analyst/Scientist?". The best one, in my opinion, is the answer below:

1. Master Microsoft Excel
2. Learn Basic SQL
3. Learn Basic Web Development
4. Dive into a Concentration

However, I'd add two more technical skills to this list: R and Python.
Since both are free, open-source languages with loads of packages covering almost any functionality, most companies nowadays look for professionals with good skills in them to join their teams.

Also, I was reading a blog post from a website I like very much (http://www.datasciencecentral.com/profiles/blogs/the-professionalization-of-data-science) and found this phrase, which describes the Data Science field well:
"The simple truth is that data science is a vast and complicated field and - like law and medicine - much too big and complex for a person to master in one lifetime." - Michael Walker

Well, now I'll show some new R functions I've learnt these last two days (including today).

margin.table
According to the R documentation: "This is really just apply(x, margin, sum) packaged up for newbies, except that if margin has length zero you get sum(x)." (e.g., margin.table(UCBAdmissions))

as.data.frame.table
It's a function that converts a table object into a data frame. Very useful, by the way. (e.g., admit1 <- as.data.frame.table(UCBAdmissions))
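Both of these can be tried straight away on the built-in UCBAdmissions data set; a quick sketch of what I mean:

margin.table(UCBAdmissions)        # no margin given, so it returns the grand total of applicants
margin.table(UCBAdmissions, 1)     # totals over the first dimension (Admitted vs Rejected)

admit1 <- as.data.frame.table(UCBAdmissions)    # flatten the 3-way table into a data frame
head(admit1)                                    # columns: Admit, Gender, Dept, Freq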

colors()
It shows all the colour names available in R, to use in any code.

palette()
It returns the current colour palette, with the default set of colours.
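A quick way to peek at both from the console (base R, no extra packages needed):

head(colors())    # the first few of the several hundred built-in colour names
palette()         # the colours R uses when you pass col = 1, 2, 3, ...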

Another very helpful tool is a package named "RColorBrewer", whose description is "Creates nice looking color palettes, especially for thematic maps".

display.brewer.all()
If you type the function above (after loading the package), it displays all of its colour palettes, like this:

One way to create a great colourful chart is to use brewer.pal and play with the different palettes. For example:

library(RColorBrewer)    # needed for brewer.pal() and display.brewer.all()

x <- c(12, 4, 21, 17, 13, 9)

# The same bar chart drawn with 6-colour palettes from different Brewer families
barplot(x, col = brewer.pal(6, "Greens"))
barplot(x, col = brewer.pal(6, "YlOrRd"))
barplot(x, col = brewer.pal(6, "RdGy"))
barplot(x, col = brewer.pal(6, "BrBG"))
barplot(x, col = brewer.pal(6, "Dark2"))
barplot(x, col = brewer.pal(6, "Paired"))
barplot(x, col = brewer.pal(6, "Pastel2"))
barplot(x, col = brewer.pal(6, "Set3"))

By the way, I mentioned the R-Ladies event I went to last Wednesday, but I didn't say anything about it.
As I said last time, it's a meetup for beginners, and most of the ladies are novices. We also have some mentors to help with questions at all levels, which is great.

Here is the link to the material we used in the last lesson: http://rpubs.com/crt34/march-workshop-full

In the first half, Chiin explained the basic use of RStudio, reading .csv files, and plotting graphs using ggplot2 and Geo Chart (which lets you plot map graphs).
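In the spirit of that first half, here is a minimal ggplot2 sketch on a built-in data set (my own toy example, not the workshop file):

library(ggplot2)

# A simple scatter plot: car weight against fuel consumption, using the built-in mtcars data
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()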

In the second half, we learnt Clustering + Extension, a Machine Learning technique. Personally, I didn't follow every step, but since studying is part of my every day, I'll read more about it. Clustering is, broadly speaking, grouping the data according to its structure.
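To get a feel for it, here is a tiny k-means sketch on the built-in iris data (my own toy example, not the workshop code):

set.seed(42)    # so the clustering is reproducible
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)

# Colour each flower by the cluster it was assigned to
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster, xlab = "Petal length", ylab = "Petal width")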

It is very useful to learn new concepts in an area where you have only a little knowledge, like the R language.
I am so grateful that people like Chiin exist!

Happy Easter Day!
See ya tomorrow! :)