Mapping of my project
My first breakthrough came when I mapped out the entire project. I had been working on the project for a month before this mapping, and up to that point I could not see where it was headed. After I finished mapping everything out, I realized I had gaps in the data that prevented the project from continuing.
Getting accepted to the DC Web Women’s code(her) conference
My other breakthrough was when I was accepted to the DC Web Women’s code(her) conference, which will be held on September 12. The code(her) conference is for all types of girl coders. Being accepted meant that someone wanted to see my project on predicting the Stanley Cup winner for the 2015-2016 season.
The data you have is not always the data you need
When you have a data science project, or any project in general, you have to locate the data that will best represent the project. Sometimes that data is not enough or not right, but you may not know that until you have worked on the project for a long time. This is a huge lesson that I have come to terms with. One day, while working with the original data I had compiled, I realized that I did not have the right stats to predict who will win the next Stanley Cup. Once I figured that out, I spent some time finding the other data needed to complete this project.
This week I finished loading the data we had looked at so far and began analyzing it to figure out if there was anything missing or wonky.
There are many different types of graphs that can be used to display data. I ran histograms, bar graphs, and scatterplots. Some graphs are not useful for the data you have or for the project you are working on. One thing I graphed was a scatterplot comparing wins to shots per game played. This graph had all the data points plotted horizontally, and sometimes three different points would be stacked on top of each other, which made it hard to decipher. This particular graph did not show a distinct correlation between wins and shots. We were looking at multi-year, cumulative data versus specific seasons, and changing the inputs or point of reference would obviously change the outcome.
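Here is a minimal sketch in R of the kind of wins-versus-shots scatterplot I describe above. The numbers are made up for illustration; the real project uses stats pulled from nhl.com.

```r
# Made-up team stats standing in for the real NHL data.
teams <- data.frame(
  wins  = c(30, 38, 41, 45, 48, 52),
  shots = c(28.1, 29.5, 30.2, 31.0, 31.8, 32.5)  # shots per game played
)

# Plot shots per game against wins; each point is one team-season.
plot(teams$shots, teams$wins,
     xlab = "Shots per game played",
     ylab = "Wins",
     main = "Wins vs. shots per game (sample data)")
```

With real data, stacked or overlapping points like the ones I ran into can sometimes be separated by jittering them slightly.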
Looking over the graphs I made, I noticed that my hockey coaches have been telling my team the truth over the past 6 seasons. They have told my team to shoot when we can because the more we shoot the more likely we are to score and the more we score the more likely we are to win.
This concept also applies to the NHL. For example, Alex Ovechkin (Ovi), the captain of the Washington Capitals, was #1 in goals for three consecutive seasons, and he scored 50 or more goals in back-to-back seasons (2013-2014 and 2014-2015). Since Ovi was #1 in goals for three consecutive seasons, he was also #1 in shots in those same seasons.
I know it is obvious but seeing the data charted out made it easier to see the correlation between shots and wins.
I am working on a project to predict who will win the next Stanley Cup using R. R is a statistical programming language and one of the most popular languages for data science. If you want to learn more about data science, see my first post, an overview of data science for business.
The first step of the project is figuring out who your team is and defining the project goal. For this project, the team is just my employer (aka my mom) and myself, but we are treating the audience as part of the team. I am doing this project because it is teaching me how to code, and it ensures I have skills I can fall back on after graduating from college (in case my interests in college don’t earn me any money). My employer’s goal is to use this internship to get me interested in technology and exposed to other technology opportunities besides just programming. She thinks I sit on the fringe of technology rather than being an active participant. You can see her thoughts on following different paths on her blog.
The second step of the project is to collect and manage the data. I am getting my data from nhl.com, as it seems like the most obvious place and it is pretty complete. I now have the data collected, and I am starting to look for key indicators that will help me predict who will win the next Stanley Cup.
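A sketch of how data like this can be loaded into R. I write a tiny made-up CSV to a temporary file so the example runs on its own; in the real project the file would be stats exported from nhl.com, with its own column names.

```r
# Write a tiny made-up stats file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("team,season,wins,shots_per_game",
             "WSH,2014-2015,45,30.6",
             "NYR,2014-2015,53,30.1"), csv_path)

# Load the CSV into a data frame.
stats <- read.csv(csv_path, stringsAsFactors = FALSE)

# A quick look at structure and summaries helps spot anything
# missing or wonky before analysis starts.
str(stats)
summary(stats)
```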
One struggle I am facing is graphing the data so that it can be better analyzed. Graphing the data gives a visual way to spot missing or invalid values. I am doing this with a graphing plug-in to R that I am still learning. Another challenge is finding the most efficient way to do things. For example, I needed to copy and paste a season attribute to a whole set of data (every record in the dataset). I manually pasted the season into about 900 rows, which took a while, when I could have been done in 5 minutes if I had used the Excel shortcut. The shortcut can be found here.
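For what it's worth, the same "paste one season value onto every row" task is a one-liner in R, because assigning a single value to a column recycles it across all rows. A small sketch with made-up rows:

```r
# A few made-up records standing in for the ~900-row dataset.
stats <- data.frame(team = c("WSH", "NYR", "CHI"),
                    wins = c(45, 53, 48))

# Assigning one value to a column applies it to every row at once.
stats$season <- "2014-2015"
stats
```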
This is where the project is right now; I will share more once the project moves further along.
If it were not for my mom making me do this, I would not have a blog, which means I would not be writing this. Thanks, mom! I am writing about what I have learned and my thoughts on my internship so far. I have interned for 3 weeks now.
I like the internship more than I did at the beginning. It is giving me new skills that are essential in the real world and that will help me be successful in any future job. So far, I have helped compile a list of potential customers for future marketing initiatives, helped run some simple quality tests, and set up this blog site plus wrote the first couple of posts. I also started learning about data science for the big project we’ll start next week.
I have learned a lot about Excel. One thing I have learned is how to run a VLOOKUP. VLOOKUPs help consolidate data from two separate Excel spreadsheets into a single one. I found this video about VLOOKUPs useful, though it is a little on the long side. This made my life easier, especially since I originally thought I would have to copy each email address over manually. I have also relearned formulas and shortcuts. In small Excel files it is easy to find the information you are looking for, but in large files it gets much harder, and formulas and shortcuts help you find it more efficiently. I used find duplicates and merge duplicates quite a bit for the work I was doing. There are many ways to accomplish the same goal; while I followed a slightly different method, here is one video on finding duplicates.
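Since the big project is in R, it is worth noting that R's `merge()` does the same job as a VLOOKUP: it joins two tables on a shared key column. The names and values below are made up:

```r
# Two made-up tables sharing a "name" key, like two spreadsheets.
contacts <- data.frame(name  = c("Ann", "Ben", "Cam"),
                       email = c("ann@x.com", "ben@x.com", "cam@x.com"))
signups  <- data.frame(name = c("Ann", "Cam"),
                       plan = c("basic", "pro"))

# Keep every contact (all.x = TRUE, like a left join); fill in the
# plan wherever a matching name exists, NA otherwise.
merged <- merge(contacts, signups, by = "name", all.x = TRUE)
merged
```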
For another project, I ran unit tests. Unit tests are used to check whether a program is working or not. I had to compile the results on a wiki site. This was pretty boring work, but one lesson I learned is that not every job is going to be fun, and it is still worth it. You need to find a balance between basic, boring tasks and more interesting work.
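The tests I actually ran were for a different program, but the idea of a unit test can be shown in a couple of lines of R: `stopifnot()` stays quiet when a check passes and stops with an error when it fails.

```r
# A toy function and two unit tests for it.
add <- function(a, b) a + b

stopifnot(add(2, 3) == 5)    # passes silently
stopifnot(add(-1, 1) == 0)   # passes silently
```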
Again, I would like to thank my mom for making me write this blog. Not only is she getting me to write about this internship but she is also trying to get me to write my school papers and essays better. I really appreciate it. I know that my internship is going to get harder but I feel like I can do it.
This post summarizes the third chapter of the book Data Science for Business, titled “Introduction to Predictive Models: From Correlation to Supervised Segmentation.”
One fundamental idea of data mining is finding and selecting important, informative “attributes” of entities from data. A model is a simplified representation of reality built to serve a purpose. A predictive model is a formula for estimating an unknown value of interest; the formula can be a hybrid of mathematical and logical rules. A descriptive model is one whose primary purpose is not to estimate a value but to gain insight into the underlying phenomenon or process. I find predictive models interesting because I think better results can come from them.
Supervised learning is model creation where the data describes a relationship between a set of attributes and a target variable, and the model estimates the value of the target variable as a function of the features. Supervised learning is like a teacher giving us information followed by some examples and problems to solve. Unsupervised learning is like being given just the problems and having to figure them out on your own.
The creation of models from data is known as model induction. Induction is a term from philosophy referring to generalizing from specific cases to general rules. The input data for an induction algorithm is the training data. You want each resulting group to be as pure as possible with respect to the target variable. There are complications: attributes rarely split a group perfectly, not all attributes are binary (many have 3 or more values), and some attributes take on numeric values.
The most common splitting criterion is information gain, which is based on a purity measure called entropy. Entropy is a measure of the disorder or uncertainty in a set. Information gain measures how much an attribute decreases entropy when the data is split on it: the original set is known as the parent set, and the sets resulting from splitting on the attribute values are the children sets.
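The entropy and information-gain calculation can be sketched in a few lines of R. The parent and children counts here are made up for illustration (a two-class set of 10 yes / 6 no):

```r
# Entropy of a set, given the counts of each class in it.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# A made-up parent set and the two children an attribute splits it into.
parent <- c(yes = 10, no = 6)
child1 <- c(yes = 9,  no = 1)
child2 <- c(yes = 1,  no = 5)

# Information gain = parent entropy minus the weighted average
# entropy of the children (weighted by each child's size).
n <- sum(parent)
gain <- entropy(parent) -
  (sum(child1) / n * entropy(child1) + sum(child2) / n * entropy(child2))
gain
```

A gain well above zero means the split made the children much purer than the parent, which is exactly what the tree-building procedure is looking for.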
A tree-structured model looks like an upside-down tree. The tree is made of interior and terminal nodes, and branches emanate from each interior node. Each interior node contains a test of an attribute, like balance or age, and each branch represents a distinct value of that attribute. Each data point ends up at one and only one terminal node.
The procedure of classification tree induction is a recursive divide-and-conquer process, where the goal at each step is to select an attribute that splits the current group into subgroups that are as pure as possible. If you trace a single path from the root node to a leaf, collecting the conditions along the path, you get a rule. When the tree fits the training data so closely that a leaf claims, say, a 100% chance of class membership based on just the handful of examples that reached it, the model is “overfitting.”
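A tiny classification-tree example can be built with R's rpart package (which ships with R); the built-in iris data set stands in for real data here:

```r
library(rpart)

# Recursively split on the attributes to predict Species,
# using rpart's classification method.
tree <- rpart(Species ~ ., data = iris, method = "class")

# Printing the tree shows each interior-node test and the leaf each
# path ends at; tracing any root-to-leaf path gives a rule.
print(tree)
```

Letting such a tree grow until every leaf is perfectly pure on the training data is exactly how overfitting happens, which is why tree packages limit growth by default.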
I am reading Data Science for Business by Foster Provost and Tom Fawcett. This post summarizes the first two chapters: Data-Analytic Thinking, and Business Problems and Data Science Solutions. The data science cycle is broken into stages: business understanding, data understanding, data preparation, modeling, and evaluation, followed by deployment.
The ultimate goal of data science is described as improved decision making. After the business question has been determined, a model must be picked, based on the available data, that will best find the solution. Data science needs access to data and benefits from sophisticated data engineering, while data processing is important even for data-oriented businesses that do not involve extracting knowledge or data-driven decision making. Data-driven decision making and big data technologies improve business performance. Data engineering and processing are critical parts of data science.
Let’s say that some phone companies want to be able to tell whether a customer will leave after their contract expires. Customers leaving like this is called churn. A data science team should be able to come up with a model that predicts which customers will leave after their contracts expire.
The process of data science is a staged cycle. The cycle always starts with a business question and eventually ends at deployment. The point of defining the business question is to understand the problem that needs to be solved; this stage is where an analyst’s creativity plays a huge part. The next stage of the cycle is data understanding, where the data science team has to understand the strengths and limitations of the data. Some data is free, some requires effort to obtain, some may need to be purchased, and some data will not exist at all, which can require side projects that have little to do with the main problem just to collect it. The solution path may change in this stage.
The third stage of the cycle is data preparation. Typical types of data preparation include converting data to a tabular format, removing or inferring missing values, and converting data to different types. The general concern of this stage is “leakage”: a leak is a data point that appears in historical data but cannot be found or used at the time a decision actually needs to be made. The fourth stage of the cycle is modeling, which is the primary place where data mining techniques are applied to the data.
The fifth stage of the cycle is evaluation. Its purpose is to assess the data mining results in depth and to gain confidence that they are valid and reliable before moving on. Evaluation helps ensure that the model satisfies the original business goals, and its results include both qualitative and quantitative assessments. Data scientists must think about comprehensive evaluation methods because getting detailed information on the performance of a model once it is in use may be difficult or impossible.
This is the general overview of data science in business. Other relevant terms in data science are statistics, queries, data warehouses, and machine learning. The field of statistics provides a huge amount of the knowledge that underlies analytics. A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system. Data warehouses collect and merge data from across an enterprise, often from multiple transaction-processing systems, each with its own database. Machine learning methods are developed across several overlapping fields, including machine learning itself, applied statistics, and pattern recognition. The field of data mining started as an offshoot of machine learning, and the two remain closely linked, but data mining does not include areas like robotics or computer vision.